Multi-Region Kafka Replication and Failover Patterns (Part 1)

Multi-region Kafka is easy to over-simplify. A diagram with arrows between regions can make the plan feel done, but the real work starts when one region is degraded and the system has to decide who owns writes, which copy is authoritative, and how consumers should behave during the transition.

Part 1 is about naming that operating model before any failover drill. Replication alone is not a failover strategy.

The Question Behind the Topology

The first decision is not tooling. It is ownership.

Are you building:

active-passive, where one region normally owns writes
active-active, where write ownership is split or coordinated

The answer determines how complex failover is, how likely duplicate writes are, and how recovery has to work.

flowchart LR
    A[Primary region writes] --> B[Replication]
    B --> C[Secondary region copy]
    C --> D[Failover consumers]
    E[Failover decision] --> F[Shift write authority]

If the team cannot say which region is authoritative during normal operation, failover is already underspecified.
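One way to make that statement concrete is to keep the ownership model as data that can be validated. The sketch below is illustrative only: the `OwnershipModel` class, the mode strings, and the region names are hypothetical and not part of any Kafka API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OwnershipModel:
    """Declares which regions may accept writes during normal operation."""
    mode: str                      # "active-passive" or "active-active"
    regions: tuple[str, ...]       # every region that holds a copy
    write_owners: tuple[str, ...]  # regions allowed to accept producer writes

    def problems(self) -> list[str]:
        """Return human-readable reasons the model is underspecified."""
        found = []
        if self.mode not in ("active-passive", "active-active"):
            found.append(f"unknown mode: {self.mode}")
        if not self.write_owners:
            found.append("no region is named as write owner")
        unknown = [r for r in self.write_owners if r not in self.regions]
        if unknown:
            found.append(f"write owners not in region list: {unknown}")
        if self.mode == "active-passive" and len(self.write_owners) > 1:
            found.append("active-passive allows exactly one write owner")
        return found


# Hypothetical region names, used only for illustration.
model = OwnershipModel(
    mode="active-passive",
    regions=("region-a", "region-b"),
    write_owners=("region-a",),
)
print(model.problems())  # an empty list means the model is fully specified
```

Keeping this as a checked artifact means the answer to "who owns writes" is reviewed like any other configuration rather than living only in a diagram.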

What Failover Actually Has to Solve

A realistic failover plan is not just “send traffic somewhere else.” It has to answer:

when producers stop writing to the primary
when the secondary is considered current enough to trust
how consumers translate or re-establish progress
how failback will avoid duplicate or missing work

That is why multi-region Kafka is as much a data-ownership problem as a routing problem.
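The four questions above can be turned into an explicit gate that must pass before write authority moves. The following is a minimal sketch, not a definitive procedure: the field names and the `max_lag_records` threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverState:
    primary_producers_stopped: bool    # have producers stopped writing to the primary?
    secondary_lag_records: int         # records the secondary is still missing
    consumer_offsets_translated: bool  # can consumers resume at a known position?
    failback_plan_documented: bool     # is the return path written down?


def failover_blockers(state: FailoverState, max_lag_records: int = 0) -> list[str]:
    """Return the reasons failover should not proceed yet (empty means go)."""
    blockers = []
    if not state.primary_producers_stopped:
        blockers.append("producers may still be writing to the primary")
    if state.secondary_lag_records > max_lag_records:
        blockers.append(
            f"secondary is missing {state.secondary_lag_records} records "
            f"(limit {max_lag_records})"
        )
    if not state.consumer_offsets_translated:
        blockers.append("consumer progress has not been re-established")
    if not state.failback_plan_documented:
        blockers.append("no documented failback path")
    return blockers


# Illustrative values, not measurements.
state = FailoverState(
    primary_producers_stopped=True,
    secondary_lag_records=120,
    consumer_offsets_translated=True,
    failback_plan_documented=False,
)
for reason in failover_blockers(state, max_lag_records=50):
    print("BLOCKED:", reason)
```

Whether a nonzero lag limit is acceptable is a business decision about tolerable data loss, so the threshold is a parameter rather than a constant.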

A Safer Baseline: Active-Passive

For Part 1, an active-passive model is the clearest baseline:

primary region owns writes
secondary region receives replicated data
failover occurs only after an explicit trigger

This keeps the initial discussion honest. Active-active can be valid, but it makes a worse teaching baseline: its added coordination complexity obscures the basic failure cases.

Local Representation

Even in a simple local drill, make the ownership model visible:

Primary topic:   orders.events.primary
Secondary topic: orders.events.secondary

That naming is not the important part. The important part is that readers can see there are two copies and only one current writer.

Local Setup

Prerequisites

Docker Desktop
Java 21
Kafka CLI tools

Local Stack

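# Single-broker stack for local drills only; both "regions" are topics on this one broker.
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.6.1
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:7.6.1
    depends_on: [zookeeper]
    ports: ["9092:9092"]
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # Listen on all interfaces inside the container.
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092
      # Address handed back to clients; localhost works for tools run on the host.
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      # Required with one broker: the internal offsets topic cannot be replicated further.
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1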

docker compose up -d

The Most Important Drill in Part 1

Do one controlled failover exercise and write down:

replication lag just before failover
the trigger that declared primary unavailable
whether producers switched cleanly
whether consumers observed missing or duplicate work

kafka-topics --bootstrap-server localhost:9092 --list

The goal is not to prove the topology is perfect. The goal is to expose the gap between “secondary has data” and “secondary can safely take over.”

Warning: Replication lag is not merely a throughput metric in multi-region designs. It is a correctness signal because it defines how much history the failover copy may be missing.
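To treat lag as a correctness signal, it helps to express it as a count of records the secondary does not yet have. The sketch below assumes you have already collected, per partition, the primary's log end offset and the next primary offset the replicator will copy; how you obtain those numbers depends on your replication tooling. The function name and sample values are illustrative.

```python
def records_at_risk(
    primary_end_offsets: dict[int, int],
    replicated_up_to: dict[int, int],
) -> dict[int, int]:
    """Per partition, count primary records not yet copied to the secondary.

    primary_end_offsets: partition -> log end offset (next offset to be written).
    replicated_up_to:    partition -> next primary offset the replicator will copy.
    A partition missing from replicated_up_to is treated as entirely
    unreplicated, which assumes its offsets start at 0.
    """
    return {
        partition: max(0, end - replicated_up_to.get(partition, 0))
        for partition, end in primary_end_offsets.items()
    }


# Illustrative numbers only, not measurements.
gap = records_at_risk({0: 1500, 1: 980, 2: 40}, {0: 1500, 1: 950})
print(gap)                # {0: 0, 1: 30, 2: 40}
print(sum(gap.values()))  # 70 records would be missing after an immediate failover
```

Recording this number just before the drill's failover trigger gives a concrete answer to how much history the secondary lacked at the moment of the switch.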

Common Mistakes

Treating failover as automatic before authority is defined

Automatic failover without an explicit ownership model can move the system into split-brain or duplicate-write territory very quickly.
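One common way to reduce split-brain risk is to attach a monotonically increasing epoch to write authority and reject writes that carry an older epoch. The sketch below models the idea in memory. It is an assumption about how authority could be tracked, not a Kafka feature, and in practice the authority record would need to live in a store that both regions agree on. All names are hypothetical.

```python
class WriteAuthority:
    """In-memory model of epoch-based write authority."""

    def __init__(self, owner: str) -> None:
        self.owner = owner
        self.epoch = 1

    def transfer(self, new_owner: str) -> int:
        """Explicitly move authority; bumping the epoch fences the old owner."""
        self.owner = new_owner
        self.epoch += 1
        return self.epoch

    def accepts(self, region: str, epoch: int) -> bool:
        """A write is valid only from the current owner at the current epoch."""
        return region == self.owner and epoch == self.epoch


authority = WriteAuthority(owner="primary")
primary_epoch = authority.epoch                      # the primary holds epoch 1
secondary_epoch = authority.transfer("secondary")    # explicit failover decision

print(authority.accepts("primary", primary_epoch))      # False: stale epoch, fenced
print(authority.accepts("secondary", secondary_epoch))  # True
```

The point of the epoch is that a primary which comes back unaware of the failover cannot resume writing on the strength of what it believed earlier.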

Ignoring consumer progress

Even if data is present in the secondary, consumer recovery may still be painful if offset translation or restart semantics were never tested.
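With many replication tools, offsets in the secondary copy do not match offsets in the primary, so a committed position has to be translated. The sketch below assumes the replication layer records checkpoint pairs of (primary offset, secondary offset) for a partition; the data shape and function name are hypothetical. It resumes from the nearest checkpoint at or below the committed offset, which may replay some records but does not skip any.

```python
from bisect import bisect_right
from typing import Optional


def translate_offset(
    committed_primary_offset: int,
    checkpoints: list[tuple[int, int]],
) -> Optional[int]:
    """Map a committed primary offset to a safe resume offset on the secondary.

    checkpoints: (primary_offset, secondary_offset) pairs sorted by primary_offset.
    Returns None when no checkpoint is at or below the committed offset, meaning
    the consumer has no known-safe position and needs an explicit reset policy.
    """
    primary_offsets = [primary for primary, _ in checkpoints]
    index = bisect_right(primary_offsets, committed_primary_offset) - 1
    if index < 0:
        return None
    return checkpoints[index][1]


# Illustrative checkpoints for one partition.
checkpoints = [(100, 40), (200, 140), (300, 240)]
print(translate_offset(250, checkpoints))  # 140: resume earlier and accept replay
print(translate_offset(50, checkpoints))   # None: no safe position known
```

Rounding down trades duplicates for completeness, which is usually the safer default when downstream processing can tolerate or deduplicate repeated records.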

Forgetting failback

Many plans explain how to leave the primary; far fewer explain how to return safely.
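One piece of a safe return is making consumers tolerant of records that arrive twice when data is replayed toward the primary. The sketch below is a minimal idempotent handler keyed on an event ID. It assumes each record carries a unique `event_id`, which your events may or may not provide, and it keeps seen IDs in memory purely for illustration.

```python
class IdempotentHandler:
    """Applies each event at most once, keyed by a unique event ID."""

    def __init__(self) -> None:
        self.seen: set[str] = set()
        self.applied: list[str] = []

    def handle(self, event_id: str, payload: str) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        if event_id in self.seen:
            return False
        self.seen.add(event_id)
        self.applied.append(payload)
        return True


handler = IdempotentHandler()
# The same order event arrives once before failover and again during failback replay.
results = [
    handler.handle("evt-1", "order created"),
    handler.handle("evt-2", "order paid"),
    handler.handle("evt-1", "order created"),
]
print(results)          # [True, True, False]
print(handler.applied)  # ['order created', 'order paid']
```

This addresses duplicates only. Missing work still has to be ruled out by confirming that everything written to the secondary during the failover window has reached the primary before authority moves back.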

What This Part Should Leave You With

After Part 1, the team should be able to state:

which region owns writes in the normal case
what signal justifies failover
what data-consistency risk exists during the switch

That is the minimum operating clarity required before replication and failover tooling can be trusted.
