Kafka makes asynchronous integration easier to scale, but it does not make event-driven architecture automatically clean.
That cleanliness comes from treating topics as owned contracts with explicit replay, versioning, and failure behavior. Without that, Kafka can become a high-throughput way to spread ambiguity.
Topics Are Product Interfaces
For each topic, someone should be able to answer:
- what business fact this event represents
- who owns publishing it
- what the schema compatibility rules are
- what keying and ordering guarantees consumers may rely on
- how replay is expected to behave
That is what makes a topic a contract instead of just a transport channel.
Events Should Be Facts, Not Wishful Read Models
The strongest events usually describe something that happened:
OrderCreatedPaymentCapturedInventoryReserved
They are weaker when they behave more like unstable read models or commands disguised as events.
Consumers tolerate replay and delayed delivery much better when the event represents a durable domain fact rather than a mutable projection.
Partition Keys Define Real Behavior
Key choice is not a minor producer detail. It defines ordering and scaling behavior.
Examples:
- key by
orderIdif per-order ordering matters - key by
customerIdif customer-level ordering matters - avoid random keys if consumers depend on sequence
The trade-off is that low-cardinality or skewed keys can create hot partitions, so keying is always both a correctness and capacity decision.
Producers Need a Reliability Story
ProducerRecord<String, byte[]> record =
new ProducerRecord<>("orders.events.v1", orderId, payloadBytes);
producer.send(record, (metadata, ex) -> {
if (ex != null) {
// persist failure signal, trigger retry pipeline or alert
}
});
For critical flows, a send callback alone is not the whole story. The producer side should also answer:
- what happens if the publish fails after local state changes
- whether an outbox or equivalent pattern is needed
- how publish failure is surfaced operationally
That is why event-driven design often pairs naturally with the outbox pattern.
Consumers Must Be Replay-Tolerant
Even with ordering guarantees, consumers should expect:
- duplicate delivery
- delayed delivery
- replay from an older offset
That means consumer logic should usually be idempotent:
if (processedEventStore.exists(eventId)) {
return;
}
applyBusinessUpdate(event);
processedEventStore.markProcessed(eventId);
Kafka can help with durability and ordering. It does not remove the need for safe reprocessing.
Retry, DLQ, and Replay Need Explicit Rules
Different failure types deserve different handling:
- transient dependency problems: bounded retry with backoff
- malformed payload or schema mismatch: fail fast to DLQ
- repeated business-rule failure: DLQ plus alert and investigation
Infinite retry on poison messages is one of the fastest ways to turn a topic into a stalled partition with mounting lag.
flowchart LR
A[Producer service] --> B[Kafka topic]
B --> C[Consumer group]
C -->|success| D[Business update]
C -->|transient failure| E[Bounded retry]
C -->|poison record| F[DLQ]
A Better Incident Drill
If inventory-service becomes slow and an order consumer depends on it:
- lag should rise visibly
- retries should stay bounded
- poison records should move to DLQ if they are not recoverable
- replay should remain possible after the dependency recovers
That is a healthier model than blocking the whole flow indefinitely while pretending the pipeline is still “real time.”
What to Measure
High-signal operational metrics include:
- consumer lag by topic and partition
- processing latency
- retry counts
- DLQ rate and top failure reasons
- producer error rate and publish latency
- rebalance frequency
These metrics tell you whether the event system is staying accountable under failure, not just whether bytes are still moving.
Key Takeaways
- Kafka works best when topics are treated as owned, versioned contracts.
- Partition keys define both ordering guarantees and scaling trade-offs.
- Consumers should be replay-tolerant and idempotent by design.
- Retry, DLQ, and replay behavior are core architecture decisions, not operational afterthoughts.
Categories
Tags