Stateful stream processing is where Kafka Streams stops feeling like a lightweight DSL and starts behaving like a real storage system. The code can stay compact, but the operational reality now includes local disk, changelog topics, restore time, and recovery behavior under restart.
Part 1 is about respecting that reality. Before tuning RocksDB, you need a baseline that tells you how much state exists, how long it takes to recover, and how much load your changelog replay imposes during restart.
Why Stateful Topologies Surprise Teams
A simple count or join can hide a lot of machinery:
- local RocksDB state stores
- changelog topics for durability
- restore work during restart or rebalance
- disk usage that grows with key cardinality and retention
flowchart LR
A[Input topic] --> B[Stateful topology]
B --> C[(Local RocksDB store)]
C --> D[Changelog topic]
D --> C
The topology is not just processing a stream anymore. It is continuously maintaining recoverable state.
The Baseline You Need First
For Part 1, the goal is not premature tuning. The goal is to answer:
- how large does the local state become
- how fast does restore complete after restart
- how much broker traffic is generated by the changelog
If you do not know those numbers, RocksDB tuning later is mostly superstition.
A Simple Stateful Topology
Start with a straightforward aggregation:
KTable<String, Long> counts = builder.stream("orders.events")
.groupByKey()
.count(Materialized.as("orders-count-store"));
That is enough to make state visible without introducing windowing or joins too early.
If you want to make restart cost even clearer, move to a windowed count:
builder.stream("orders.events")
.groupByKey()
.windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
.count(Materialized.as("orders-counts"));
The code is still readable, but now the store behavior is concrete enough to inspect.
What to Observe Locally
Local state size
How much disk the application consumes under realistic cardinality.
Restore duration
How long the app spends rebuilding state after restart, rebalance, or local-store deletion.
Changelog pressure
Whether changelog traffic and compaction behavior are sized reasonably for recovery.
These signals matter together. Looking only at local disk misses broker pressure. Looking only at broker traffic misses restore pain.
Local Setup
Prerequisites
- Docker Desktop
- Java 21
- Kafka CLI tools
Local Stack
services:
zookeeper:
image: confluentinc/cp-zookeeper:7.6.1
environment:
ZOOKEEPER_CLIENT_PORT: 2181
kafka:
image: confluentinc/cp-kafka:7.6.1
depends_on: [zookeeper]
ports: ["9092:9092"]
environment:
KAFKA_BROKER_ID: 1
KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
docker compose up -d
The Right Failure Drill
Do not stop at happy-path aggregation. Delete the local state and restart the app while traffic still exists.
That gives you the baseline restore behavior:
ls -lh /tmp/kafka-streams/
What matters is not just whether the topology comes back. It is:
- how long restore takes
- whether traffic backs up during restore
- whether the local disk footprint is predictable
[!important] In production, restart time is part of correctness. A topology that restores too slowly during incident recovery is operationally weaker than its functional tests suggest.
Common Mistakes
Treating state stores as invisible internals
If the team cannot name the state stores and changelog topics that matter, incident response will be slower than it needs to be.
Ignoring restore in capacity planning
A topology can look fine during steady state and still be unacceptable during restart because rebuild cost was never measured.
Over-tuning before baseline
Block cache tweaks and compaction tuning are second-order moves. First establish whether the topology shape itself is reasonable.
What This Part Should Leave You With
After Part 1, the team should be able to explain:
- where the state physically lives
- how it is recovered
- which metrics prove the stateful topology is healthy under restart
That baseline turns RocksDB tuning from guesswork into a real engineering exercise.
Categories
Tags