Distributed Systems

Designing a Stock Exchange System

31 min read · Updated Mar 27, 2026

Low-Latency Architecture and Market Integrity

Designing a stock exchange is one of the best system-design exercises because it forces you to care about correctness, scale, recovery, and latency at the same time.

This is not a CRUD system with a trading theme. It is a fairness-critical system where one wrong design decision can break sequencing, replay, or market data under load.

In this article, I will walk through the design step by step as if we are learning system design from scratch:

  1. what we are building
  2. functional requirements
  3. non-functional requirements
  4. high-level architecture
  5. detailed component design
  6. how the system scales and handles huge traffic
  7. why the system is so fast

To keep the discussion concrete, I will use Indian cash-equity examples like RELIANCE, TCS, INFY, and HDFCBANK. Treat the examples as architectural illustrations, not as official NSE or BSE protocol documentation.

Scope of the examples

The symbols and scenarios in this article are teaching devices to make sequencing, matching, and recovery easier to reason about. They should not be read as venue-specific protocol or regulatory guidance.


Problem Statement

We want to design a stock exchange for cash equities.

The exchange should accept orders from member brokers, maintain live order books, match buy and sell orders using price-time priority, publish trade confirmations, and send market data to many brokers at the same time.

If you want to think about it in practical terms, imagine:

  • thousands of listed stocks
  • some symbols like RELIANCE or INFY becoming extremely hot at market open or on news
  • heavy cancel and replace traffic during volatility
  • brokers expecting live acknowledgments and tick-by-tick market data
  • regulators expecting clean audit and replay

So the real problem is not just “place orders.” The real problem is:

  • preserve one clear ordering of events
  • match deterministically
  • recover cleanly after failure
  • distribute market data at scale
  • do all of this with very low latency

What Exactly Are We Building?

Before drawing boxes, define the system boundary clearly.

We are building the exchange core, not the entire trading ecosystem.

The exchange is responsible for:

  • accepting orders from brokers
  • validating venue-owned rules
  • sequencing orders
  • maintaining the order book
  • matching orders
  • publishing execution reports
  • publishing market data
  • storing an auditable event trail

The broker is responsible for:

  • client onboarding and KYC
  • margin and portfolio checks
  • user-facing app and APIs
  • routing customer orders to the exchange

The clearing and settlement system is responsible for:

  • post-trade processing
  • obligations, netting, and settlement
  • money and securities movement

This boundary matters because weak designs push broker logic or clearing logic into the exchange hot path. That makes the system slower and harder to reason about.


Functional Requirements

At a high level, the exchange should support these behaviors:

  1. Accept new, cancel, and replace orders from brokers.
  2. Maintain a live order book for each stock.
  3. Match orders using deterministic price-time priority.
  4. Send order acknowledgments and execution reports back to brokers.
  5. Publish tick-by-tick market data and depth updates.
  6. Support snapshot plus incremental feed recovery for market-data consumers.
  7. Support audit, replay, and operational recovery.

To keep the first version realistic, I would start with:

  • cash equities only
  • limit, market, cancel, and replace first
  • no attempt to solve every asset class in version 1

Version-1 design discipline

A strong exchange design usually becomes clearer when version 1 is intentionally narrow. If we try to solve equities, derivatives, auctions, complex order types, and cross-venue routing all at once, the core invariants get blurry.


Non-Functional Requirements

This is where the design really gets shaped.

The most important NFRs are:

1. Correctness

For one symbol, there must be one authoritative processing order. If two components disagree about order, the design is broken.

2. Determinism

If we replay the same event stream, we must rebuild the same book and the same trades.

3. Low latency

The hot path must stay extremely short and avoid remote dependencies.

4. High throughput

The exchange must handle very large inbound order volume, especially during bursts.

5. Hot-symbol isolation

One extremely active stock should not degrade unrelated symbols.

6. Market-data fan-out

The system must deliver the same committed feed to many brokers without allowing slow consumers to block matching.

7. Recovery

Failover, replay, and session recovery must be explicit and reliable.

The key design lesson is this: the exchange is not primarily a throughput system. It is a correctness system under latency pressure.


High-Level Design

At a high level, the system needs four layers:

  1. ingress and session management
  2. deterministic matching core
  3. market-data distribution
  4. asynchronous downstream systems

Here is the high-level architecture:

flowchart LR
    subgraph Ingress["Ingress and Session Layer"]
        A[Member Brokers]
        B[Traffic Manager]
        C[L4 Load Balancer]
        D[FIX or OUCH Gateways]
        E[Session Manager]
    end

    subgraph Control["Reference and Venue Control Layer"]
        F[Reference Data Service]
        G[Local Reference Cache]
        H[Venue Controls]
    end

    subgraph Core["Deterministic Matching Core"]
        I[Partition Router]
        J[Sequencer]
        K[Matching Engine]
        L[In-Memory Order Book]
        M[Event Journal]
        N[Warm Standby]
    end

    subgraph Distribution["Execution and Market Data"]
        O[Execution Reports]
        P[Market Data Builder]
        Q[Snapshot Store]
        R[Market Data Gateways]
    end

    subgraph Async["Async Consumers"]
        S[Kafka or Internal Bus]
        T[Trade Ledger]
        U[Surveillance]
        V[Clearing Adapter]
        W[Reporting APIs]
    end

    A --> B --> C --> D --> E
    E --> H
    F --> G --> H
    H --> I --> J --> K --> L
    K --> M
    K --> O
    K --> P
    M --> N
    P --> Q --> R
    M --> S
    S --> T
    S --> U
    S --> V
    S --> W

This diagram already tells us a lot:

  • gateways handle connectivity, not matching
  • the sequencer establishes authoritative order
  • the matching engine owns book mutation
  • the journal defines the recovery boundary
  • market-data distribution is isolated from the matcher
  • Kafka and other async systems are downstream, not in the fairness-critical path

Step-by-Step Order Flow

Now let us walk one order through the system.

Suppose a broker sends this order:

  • symbol: RELIANCE
  • side: buy
  • price: INR 2,845.10
  • quantity: 100

The sequence looks like this:

sequenceDiagram
    participant Client as Investor App
    participant Broker as Member Broker
    participant Gateway as Exchange Gateway
    participant Control as Venue Controls
    participant Seq as Sequencer
    participant Match as Matching Engine
    participant Journal as Event Journal
    participant Feed as Market Data Builder

    Client->>Broker: Place order for RELIANCE
    Broker->>Gateway: New order
    Gateway->>Control: Basic validation and venue checks
    Control->>Seq: Route normalized order
    Seq->>Match: RELIANCE seq=91234012
    Match->>Match: Update book and generate trades if crossed
    Match->>Journal: Commit ordered event
    Match-->>Broker: Ack or execution report
    Match->>Feed: Book delta and trade tick

That is the full design in miniature. Everything else in the article exists to make these steps safe and fast.


Component Design in Detail

Now we can go deeper, component by component, in the same order in which an order flows.

For each component, the useful design question is not just “what does it do?” It is:

  • what state does it own
  • what decisions can it make locally
  • what does it emit downstream
  • what failure must it isolate from the rest of the exchange

1. Gateways and Session Manager

The gateway terminates exchange protocols like FIX or a lower-latency binary protocol. It is the exchange’s network edge, but it should not become a mini trading system.

A good gateway usually owns these pieces of state:

  • authenticated session table keyed by member and session id
  • inbound message sequence tracking
  • resend and replay windows for reconnect recovery
  • per-session outbound queue for acknowledgments and execution reports
  • connection-level throttles and heartbeat timers

Its hot-path work should look like this:

  1. parse the incoming frame
  2. authenticate the session
  3. validate protocol syntax and required fields
  4. detect duplicate or out-of-sequence client messages
  5. normalize the order into a canonical internal envelope
  6. attach session metadata like member id, receive timestamp, and protocol sequence number
  7. hand the request to venue controls

That canonical envelope matters a lot. If the matcher later needs to care about FIX tags, binary packet layout, or session recovery details, the layering is already broken.

This is also the right place to separate client identity from exchange identity. In practice an order often carries:

  • a client-provided order id used by the broker session
  • an exchange-assigned internal order id used inside the venue
  • a transport sequence number used for resend and recovery

Those are not the same thing, and mixing them leads to very confusing recovery bugs. The gateway should preserve the client id for reporting, but internal components should rely on stable exchange ids after normalization.
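To make the id separation concrete, here is a small Python sketch of a canonical envelope. The field names such as clOrdId and the process-local id counter are illustrative only, not a real venue protocol; a real exchange would use a partition-scoped, recoverable id scheme.

```python
from dataclasses import dataclass
import itertools

# Hypothetical id source for this sketch; a real venue would use a
# partition-scoped, recoverable id scheme, not a process-local counter.
_exchange_ids = itertools.count(1)

@dataclass(frozen=True)
class OrderEnvelope:
    member_id: str          # authenticated session owner
    client_order_id: str    # broker-chosen id, preserved for reporting
    exchange_order_id: int  # venue-assigned id used internally
    transport_seq: int      # protocol sequence number for resend and recovery
    symbol: str
    side: str
    price: float
    quantity: int

def normalize(member_id: str, transport_seq: int, msg: dict) -> OrderEnvelope:
    """Wrap a parsed protocol message in the canonical internal envelope."""
    return OrderEnvelope(
        member_id=member_id,
        client_order_id=msg["clOrdId"],
        exchange_order_id=next(_exchange_ids),
        transport_seq=transport_seq,
        symbol=msg["symbol"],
        side=msg["side"],
        price=msg["price"],
        quantity=msg["qty"],
    )
```

After this point, internal components key everything on exchange_order_id; the client id only travels back out in reports.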

What the gateway should not do:

  • call a remote database
  • invoke a portfolio service
  • run slow customer workflows
  • wait on downstream consumers to flush

The first important rule of exchange design is: the gateway must be boring, fast, and predictable. If one broker is slow, its outbound session queue can back up or disconnect, but that must not stall the rest of the venue.

2. Reference Data Service and Local Cache

The system needs static or semi-static data like:

  • stock metadata
  • tick-size rules
  • trading session state
  • member permissions
  • venue-level limits

That data should come from a source of truth, but hot-path components should use local in-memory caches.

A complete design usually has three layers:

  • an authoritative control-plane store where operators publish changes
  • a change-distribution mechanism such as snapshots plus versioned deltas
  • local in-memory caches inside gateways, controls, routers, and matchers

The cache design matters almost as much as the data itself. You want atomic reads of a coherent snapshot, not half-updated maps. In practice this often means:

  • immutable snapshot objects
  • version numbers or epochs on configuration updates
  • copy-on-write refresh when new reference data arrives
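A minimal Python sketch of that copy-on-write pattern might look like this. The snapshot fields are illustrative; the point is that readers always dereference one immutable object, never a half-updated map.

```python
import threading
from dataclasses import dataclass, field

@dataclass(frozen=True)
class RefSnapshot:
    version: int
    tick_size: dict = field(default_factory=dict)  # symbol -> price increment

class RefCache:
    """Hot-path readers always see one coherent, immutable snapshot."""
    def __init__(self, snapshot: RefSnapshot):
        self._snapshot = snapshot
        self._lock = threading.Lock()  # serializes writers; reads take no lock

    def current(self) -> RefSnapshot:
        return self._snapshot  # single atomic reference read

    def publish(self, snapshot: RefSnapshot) -> None:
        with self._lock:
            if snapshot.version <= self._snapshot.version:
                return  # drop stale or duplicate updates
            self._snapshot = snapshot  # copy-on-write swap
```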

Why this design is important:

  • a network lookup per order destroys latency consistency
  • reference data changes much more slowly than order traffic
  • venue controls need local decisions
  • failover logic needs the same routing and rule view across components

Typical fields in reference data include:

  • symbol to partition ownership
  • lot-size rules
  • tick-size bands
  • daily price collars
  • session state such as open, halt, or auction
  • member entitlements for instruments or order types

If a cache is stale or unavailable, the venue should fail safely. It is usually better to reject or pause a symbol than to silently make decisions using uncertain rules.

3. Venue Controls

Before an order reaches the sequencer, the exchange should apply venue-owned controls such as:

  • price collars
  • fat-finger checks
  • throttles
  • cancel-on-disconnect
  • trading halts or auction mode

These checks belong to the exchange because they protect the market itself. They must be local and deterministic.

A good venue-controls module typically owns:

  • per-member throttle counters
  • per-symbol trading status
  • per-session disconnect policy flags
  • local views of price bands and venue rules

Just as important as the rules is the evaluation order. For example, a new order might be checked in this sequence:

  1. is the symbol tradable right now
  2. is the member allowed to send this order type
  3. is the price inside the allowed collar
  4. is the quantity inside maximum order size
  5. is the member already beyond rate limits

That ordering should be stable so the same bad input always fails for the same reason. This makes support, audit, replay, and testing much easier.
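As a sketch, the stable evaluation order could look like this in Python. The rule names and reject codes are invented for illustration; what matters is that the checks run in one fixed sequence.

```python
def check_order(order: dict, rules: dict):
    """Return None if accepted, else the first failing reason.
    The evaluation order is fixed so the same bad input always
    fails for the same reason, on live traffic and on replay."""
    if rules["session_state"] != "open":
        return "SYMBOL_NOT_TRADABLE"
    if order["type"] not in rules["allowed_types"]:
        return "ORDER_TYPE_NOT_PERMITTED"
    if not rules["collar_low"] <= order["price"] <= rules["collar_high"]:
        return "PRICE_OUTSIDE_COLLAR"
    if order["qty"] > rules["max_order_qty"]:
        return "QTY_TOO_LARGE"
    if rules["msgs_this_window"] >= rules["rate_limit"]:
        return "RATE_LIMITED"
    return None
```

Note that an order with both a disallowed type and a bad price always fails on the type check first, which is exactly the determinism the audit trail needs.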

Notice that venue controls are not the same thing as a broker’s full portfolio risk engine. A broker may still do its own margin and position checks before sending an order. The exchange only enforces market-owned protections that must apply uniformly to everyone.

Even then, the matcher should still recheck a small set of invariants that directly affect correctness, such as session state or price validity. Early rejection is good for latency, but the final correctness guard still belongs near commit.

Some venues also place self-trade prevention here if it can be evaluated deterministically from local state. Others keep final self-trade prevention inside the matcher because it depends on the exact resting order selected for match. The main rule is consistency: wherever the rule lives, it must behave the same on replay.

4. Partition Router

The exchange cannot process all stocks through one giant matching thread. It needs to partition work.

The partition router maps each stock to one active owner.

Example:

  • RELIANCE and TCS may be on partition P7
  • INFY and HDFCBANK may be on partition P12
  • colder symbols may be grouped on lighter partitions

The important invariant is: one symbol belongs to exactly one active partition at a time.

That sounds simple, but the router is really an ownership directory. It should answer:

  • which partition currently owns this symbol
  • which epoch or ownership version is active
  • where should traffic go after failover or rebalance

So the router usually keeps a local mapping like:

  • symbol -> partition id
  • partition id -> active endpoint
  • symbol -> ownership epoch

That epoch matters during failover. If a gateway still believes RELIANCE belongs to old owner P7@epoch=41 but the control plane has moved it to P19@epoch=42, stale traffic must be rejected or rerouted explicitly.
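Here is a small Python sketch of that epoch-checked ownership table. The partition ids and epoch numbers reuse the RELIANCE example; the API is illustrative.

```python
class PartitionRouter:
    """Ownership directory: one symbol, one active owner, one live epoch."""
    def __init__(self):
        self._owners = {}  # symbol -> (partition_id, epoch)

    def publish_ownership(self, symbol: str, partition_id: str, epoch: int):
        current = self._owners.get(symbol)
        if current is not None and epoch <= current[1]:
            raise ValueError("ownership epoch must advance")
        self._owners[symbol] = (partition_id, epoch)

    def route(self, symbol: str, claimed_epoch: int):
        partition_id, epoch = self._owners[symbol]
        if claimed_epoch != epoch:
            return None  # stale traffic: reject or reroute explicitly
        return partition_id
```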

For hot-symbol migration, a safe transfer usually looks like:

  1. stop accepting new traffic for the symbol on the old owner
  2. drain and commit all already sequenced events
  3. publish the new ownership epoch
  4. activate the new owner
  5. let gateways route future traffic only to the new epoch

The router should not “discover” ownership by trial and error on live traffic. Ownership is a control-plane decision, not a runtime guess.

Behind this router there is usually a small control-plane ownership manager responsible for:

  • planned symbol migration
  • failover promotion
  • fencing an old owner
  • publishing the next valid ownership epoch

That control-plane path is not latency critical, but it is correctness critical. It should be small, explicit, and heavily audited.

5. Sequencer

The sequencer is the heart of fairness.

Its job is simple but extremely important:

  • accept normalized events
  • assign monotonically increasing sequence numbers
  • create one authoritative order of events for a partition

Without a sequencer, two gateways may disagree about which order arrived first. With a sequencer, replay and failover become understandable.

Internally, the sequencer is usually a very small component with very strict discipline. It typically owns:

  • the next sequence number for a partition
  • an ingress queue of normalized commands
  • the current ownership epoch
  • optional ingress timestamps and checksums

The sequencer may batch work for efficiency, but it must not reorder it. That means:

  • it can pull 32 pending orders at once
  • it can stamp them with sequence numbers in a batch
  • it cannot let a later cancel jump ahead of an earlier new order
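A minimal Python sketch of batch stamping that preserves arrival order. The command strings are invented for illustration; the invariant is that drain can batch but never reorder.

```python
from collections import deque

class Sequencer:
    """Stamps commands in arrival order; batching never reorders them."""
    def __init__(self, epoch: int, last_committed_seq: int = 0):
        self.epoch = epoch
        self._next_seq = last_committed_seq + 1
        self._ingress = deque()

    def submit(self, command) -> None:
        self._ingress.append(command)  # FIFO arrival order

    def drain(self, max_batch: int = 32):
        """Pull up to max_batch commands and stamp them in one pass."""
        batch = []
        while self._ingress and len(batch) < max_batch:
            batch.append((self.epoch, self._next_seq, self._ingress.popleft()))
            self._next_seq += 1
        return batch
```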

In many real systems, the sequencer is co-located with the matcher or implemented as the first stage of the same single-thread event loop. That is often a good trade-off because it reduces hops while still preserving the logical boundary: first establish order, then mutate state.

One subtle but important point: the sequencer defines the intended processing order, but the journal defines the committed truth boundary. So if the active owner crashes after stamping sequence 91234012 but before that event is durably committed, recovery resumes from the last committed sequence, not from whatever number happened to be in RAM.

Sequencing is the fairness boundary

Once the partition order is established, every downstream consumer should derive truth from that same order. Acknowledgments, replay, standby rebuild, and market data should all agree on it.

6. Matching Engine and In-Memory Order Book

The matching engine should be treated as a single-writer state machine.

For each sequenced event, it:

  • checks current book state
  • applies new, cancel, or replace logic
  • matches against opposite-side liquidity if the order crosses
  • updates the in-memory book
  • emits execution results and market-data deltas

The order book usually needs:

  • bids ordered by best price first
  • asks ordered by best price first
  • FIFO queue at each price level
  • order-id index for fast cancel lookup

In practice, the engine owns much more than just two sorted sides. For each live symbol book it usually maintains:

  • bid price levels
  • ask price levels
  • a queue of resting orders per price level
  • an order-id index for direct lookup
  • aggregate quantities per level
  • symbol-level metadata like trading mode or auction state

The hot path for a new limit order looks like this:

  1. load the book for the symbol
  2. validate final book-level invariants
  3. if the order crosses, match against the best opposite price level
  4. consume resting liquidity in strict price-time order
  5. generate one or more trade events
  6. remove empty levels or reduce partially filled orders
  7. if quantity remains, rest the residual on the book
  8. emit the final order-state transition and book delta
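As a teaching sketch, here is a minimal price-time book in Python. It handles only limit orders and ignores ticks, collars, and venue rules, but it shows the match-then-rest loop and the order-id index for fast cancels.

```python
from collections import deque

class Book:
    """Minimal price-time book for one symbol.
    Each side maps price -> FIFO queue of resting [order_id, qty]."""
    def __init__(self):
        self.bids = {}
        self.asks = {}
        self.index = {}  # order_id -> (side, price), for O(1) cancel lookup

    def _best(self, side: str):
        levels = self.bids if side == "buy" else self.asks
        if not levels:
            return None
        return max(levels) if side == "buy" else min(levels)

    def submit_limit(self, order_id: str, side: str, price: float, qty: int):
        trades = []  # (maker_id, taker_id, price, qty)
        opposite = "sell" if side == "buy" else "buy"
        while qty > 0:
            best = self._best(opposite)
            crossed = best is not None and (
                price >= best if side == "buy" else price <= best)
            if not crossed:
                break
            levels = self.asks if side == "buy" else self.bids
            queue = levels[best]
            resting = queue[0]  # strict time priority at this price
            fill = min(qty, resting[1])
            trades.append((resting[0], order_id, best, fill))
            resting[1] -= fill
            qty -= fill
            if resting[1] == 0:
                queue.popleft()
                del self.index[resting[0]]
                if not queue:
                    del levels[best]  # drop the now-empty price level
        if qty > 0:  # rest the residual on the book
            own = self.bids if side == "buy" else self.asks
            own.setdefault(price, deque()).append([order_id, qty])
            self.index[order_id] = (side, price)
        return trades
```

Given the same ordered input, this structure produces the same trades and the same residual book every time, which is the deterministic-replay property the design depends on.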

Cancel and replace need equally precise semantics. For example:

  • cancel must find the order by id in O(1) or close to it
  • replace often behaves like cancel-plus-new from a priority perspective
  • quantity reduction may or may not retain priority depending on venue rules

You also need clear semantics for order types and identifiers:

  • market orders usually execute immediately against the book and must not rest unpriced unless the venue explicitly supports that model
  • IOC or FOK orders should be resolved entirely inside one deterministic pass
  • self-trade prevention must define whether to cancel the resting order, the aggressing order, or both
  • trade ids should be generated deterministically from the committed matching path, not by some later async consumer

Timestamps need similar discipline. Many systems capture several of them:

  • gateway receive timestamp
  • sequence assignment timestamp
  • commit timestamp
  • trade publication timestamp

For fairness, sequence order wins. Timestamps help with audit and latency measurement, but they should not override committed ordering.

This is why the system can stay fast: the book is memory-resident and mutated by one owner. There is no database in the decision loop, no shared lock between writers, and no distributed coordination between matchers for one symbol.

This single-writer model also makes correctness testing much easier. Given the same ordered input stream, the engine should produce the same trades, book state, and outbound events every time.

7. Event Journal

The journal is the committed truth boundary.

It answers these questions:

  • what exactly was accepted
  • what exact event order must a standby replay
  • from which point can recovery resume

This should be an ordered committed log, not a vague debug trace.

A robust journal record usually includes at least:

  • partition id
  • ownership epoch
  • sequence number
  • normalized input command
  • resulting order state transition
  • generated fills or trades
  • timestamps and checksums

Some designs persist only commands and rely on deterministic replay to regenerate the outputs. Other designs also persist normalized outcomes for audit and faster downstream consumption. Either approach can work, but the rule is the same: the journal must contain enough information to reconstruct committed truth exactly.

Operationally, the journal is also where durability policy lives. You may batch appends for throughput, but the external acknowledgment boundary should still be tied to a committed journal point.
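A small Python sketch of such a record, using JSON and CRC32 purely for illustration; a production journal would use a compact binary layout, but the shape is the same.

```python
import json
import zlib

def journal_record(partition_id: str, epoch: int, seq: int,
                   command: dict, outcome: dict) -> dict:
    """Build an append-ready record; the checksum lets replay detect corruption."""
    body = {
        "partition": partition_id,
        "epoch": epoch,
        "seq": seq,
        "command": command,
        "outcome": outcome,
    }
    payload = json.dumps(body, sort_keys=True).encode()
    return {"crc32": zlib.crc32(payload), "payload": payload}

def verify_record(record: dict) -> bool:
    """Replay refuses any record whose payload fails its checksum."""
    return zlib.crc32(record["payload"]) == record["crc32"]
```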

The journal also needs boring engineering details that matter in production:

  • segment rotation
  • checksums
  • corruption detection
  • truncation rules
  • retention and archival policy

Without that discipline, replay looks nice in a whiteboard interview and painful in a real incident.

8. Recovery Snapshots and Replay

The journal is the source of truth, but replaying an entire trading day from the first event every time is not operationally pleasant. So most serious designs combine the journal with periodic recovery snapshots.

A recovery snapshot is different from a market-data snapshot. It is not for subscribers. It is for restoring internal matching state quickly.

A good recovery snapshot typically contains:

  • partition id and ownership epoch
  • last included committed sequence number
  • full in-memory book state for owned symbols
  • order-id index
  • level aggregates and symbol metadata
  • checksum of the serialized state

The recovery rule is straightforward:

  1. load the latest valid snapshot
  2. verify its checksum and included sequence
  3. replay journal events after that sequence
  4. resume only when book state is caught up to the desired commit point

This does two important things:

  • reduces restart time for cold recovery
  • gives standby promotion and disaster recovery a bounded replay window
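The snapshot-then-replay rule can be sketched in Python like this. The reducer that applies each event is deliberately toy-sized; a real one would rebuild full book state.

```python
def recover(snapshot: dict, journal: list):
    """Restore from the latest snapshot, then replay only later journal events."""
    state = dict(snapshot["book_state"])  # copy; the snapshot stays immutable
    last_seq = snapshot["last_seq"]
    for event in journal:
        if event["seq"] <= last_seq:
            continue  # already included in the snapshot
        if event["seq"] != last_seq + 1:
            raise RuntimeError("gap in journal replay")
        # Hypothetical reducer: each event sets a symbol's last traded price.
        state[event["symbol"]] = event["price"]
        last_seq = event["seq"]
    return state, last_seq
```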

The snapshot cadence is a trade-off. Very frequent snapshots cost more write bandwidth and CPU. Very infrequent snapshots increase recovery time. Production systems usually tune this by partition heat, memory size, and recovery objectives.

9. Warm Standby

The standby is not another active writer. It is a follower that replays the same committed events and stays ready for takeover.

Good standby design means:

  • it consumes the same ordered journal
  • it builds the same in-memory state
  • it can take over only after ownership is explicitly transferred

In a stronger design, the standby does more than just “listen.” It should actively verify:

  • sequence continuity
  • current journal offset
  • periodic book checksum or state hash
  • readiness to serve the next epoch

That gives operations a way to know whether the standby is truly hot or only “probably fine.”

Promotion should also be explicit. A safe failover sequence is typically:

  1. fence the old primary so it cannot accept more writes
  2. find the last committed sequence in the journal
  3. let the standby replay to that exact point
  4. publish a new ownership epoch
  5. route all future traffic to the promoted node

This is how we avoid split brain. The standby is not there to increase write parallelism. It is there to reduce recovery time while preserving one-writer ownership.

In strong designs, the standby may restore from the latest recovery snapshot first and then tail the journal. That avoids forcing every follower to rebuild only from the first event of the day.

10. Execution Reports

Once an event is committed, the exchange must send the right outcome back to the broker:

  • accepted
  • rejected
  • partially filled
  • fully filled
  • canceled

These reports must come from the same committed engine output that feeds replay and market data.

The execution-report path usually owns:

  • mapping from internal order id back to broker session and client order id
  • per-session outbound message sequence numbers
  • resend history for reconnect recovery
  • outbound queues isolated per broker session

This is important because the exchange is not replying to “the system” in general. It is replying to one particular member session using that protocol’s sequencing rules.

A good flow looks like this:

  1. committed engine result is produced
  2. session mapper resolves the destination broker session
  3. protocol-specific encoder builds the proper response
  4. response is queued on that broker’s outbound session
  5. reconnect and resend logic can replay the same committed outcome if needed

The subtle but critical design point is: the matcher decides order outcome, but the session layer decides delivery mechanics. That separation keeps correctness independent from network recovery behavior.

11. Market Data Builder and Feed Gateways

The market-data builder consumes committed engine outputs and creates:

  • trade ticks
  • top-of-book updates
  • depth updates
  • snapshots for recovery

Then a separate feed-distribution tier fans this out to many brokers.

This separation is extremely important: market-data distribution must not slow matching.

The builder is a derived-state component. It should not inspect the live mutable book directly from inside the matcher thread. Instead it should consume committed engine events and maintain its own derived view such as:

  • best bid and ask
  • top-of-book state
  • depth ladders
  • last trade
  • traded volume and turnover

One committed match event may expand into several feed events:

  • a trade tick
  • best bid changed
  • best ask changed
  • depth level changed
  • session statistics updated

That is exactly why feed generation should be decoupled. Fan-out is amplification, and amplification belongs outside the matcher.

Feed gateways then own subscriber-facing concerns such as:

  • per-subscriber entitlements
  • protocol encoding
  • multicast or unicast delivery mode
  • per-consumer buffer management
  • gap detection and recovery

Sequence numbers are critical here. A subscriber should be able to say: “I have market-data sequence 48190021; give me everything after that or give me a fresh snapshot.”

That leads to the standard repair model:

  1. client detects a gap
  2. fetch snapshot
  3. replay deltas after snapshot sequence
  4. resume live stream
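Here is a subscriber-side Python sketch of that repair loop. The in-memory store and one-field state are stand-ins for a real snapshot service and book view.

```python
class MemorySnapshotStore:
    """Stand-in snapshot-and-replay store for this sketch."""
    def __init__(self, snapshot: dict, deltas: list):
        self._snapshot, self._deltas = snapshot, deltas

    def latest(self) -> dict:
        return self._snapshot

    def deltas_after(self, seq: int) -> list:
        return [d for d in self._deltas if d["seq"] > seq]

class FeedClient:
    """Subscriber-side gap detection with snapshot-plus-replay repair."""
    def __init__(self, store: MemorySnapshotStore):
        self._store = store
        self.last_seq = 0
        self.state = {}

    def on_message(self, msg: dict) -> None:
        if msg["seq"] <= self.last_seq:
            return  # duplicate or stale, ignore
        if msg["seq"] > self.last_seq + 1:
            self._repair()  # gap detected: snapshot, then replay deltas
        if msg["seq"] == self.last_seq + 1:
            self._apply(msg)  # back in sequence, resume live stream

    def _repair(self) -> None:
        snap = self._store.latest()
        self.state = dict(snap["state"])
        self.last_seq = snap["seq"]
        for delta in self._store.deltas_after(self.last_seq):
            self._apply(delta)

    def _apply(self, msg: dict) -> None:
        self.state[msg["symbol"]] = msg["price"]
        self.last_seq = msg["seq"]
```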

The builder is allowed to be expensive in ways the matcher is not

The market-data path can maintain derived statistics, prebuilt snapshots, and replay buffers because it is downstream of the fairness-critical path. The matcher should only emit the minimal committed deltas needed to derive them.

12. Async Downstream Systems

Once the exchange commits an event, many other systems may consume it:

  • trade ledger
  • surveillance and compliance
  • reporting
  • clearing adapter

This is where Kafka or another event bus can be very useful. But this is downstream. It is not where fairness is decided.

These systems are usually interested in different projections of the same committed truth:

  • the ledger cares about economic events like trades, fees, and adjustments
  • surveillance cares about the full lifecycle of orders, cancels, and replaces
  • clearing adapters care about settlement-facing trade records
  • reporting APIs care about queryable history and analytics

The clean design is to publish a committed event stream with stable identifiers such as:

  • partition id
  • sequence number
  • order id
  • trade id
  • event type

That lets downstream consumers be replayable and idempotent. If the ledger sees trade T-884192 twice, it should deduplicate safely. If surveillance is offline for ten minutes, it should catch up from the bus or journal without involving the matcher at all.
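A minimal Python sketch of that deduplication, reusing the trade id from the example above. A real ledger would persist the seen-id set, but the idempotency contract is the same.

```python
class TradeLedger:
    """Replayable, idempotent consumer: duplicate deliveries are ignored."""
    def __init__(self):
        self._seen_trade_ids = set()
        self.turnover = 0.0

    def on_trade(self, trade: dict) -> bool:
        """Return True if applied, False if deduplicated."""
        if trade["trade_id"] in self._seen_trade_ids:
            return False  # already processed: safe under at-least-once delivery
        self._seen_trade_ids.add(trade["trade_id"])
        self.turnover += trade["price"] * trade["qty"]
        return True
```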

This is also the right place for enrichments that would be harmful in the hot path:

  • user-facing metadata joins
  • analytical projections
  • warehouse exports
  • regulatory packaging

Kafka or an internal bus is excellent here because we want decoupling, fan-out, and replay. It is the wrong abstraction for the fairness-critical matching loop, but a very strong abstraction for everything after commit.


How the Exchange Handles Thousands of Stocks

This is a common confusion point in interviews and blog posts.

The exchange does not maintain one huge shared structure for all stocks. It maintains many independent books with explicit ownership.

The model looks like this:

flowchart LR
    A[Symbol Master]
    B[Partition Directory]

    subgraph P7["Partition P7"]
        C[RELIANCE Book]
        D[TCS Book]
    end

    subgraph P12["Partition P12"]
        E[INFY Book]
        F[HDFCBANK Book]
    end

    subgraph P31["Partition P31"]
        G[Other Symbols]
    end

    A --> B
    B --> C
    B --> D
    B --> E
    B --> F
    B --> G

The exchange uses a symbol master that contains:

  • symbol code
  • instrument type
  • tick-size rule
  • session state
  • price bands
  • partition ownership

That symbol master feeds local caches used by gateways, venue controls, and the router.

This is how we support many stocks while keeping deterministic ownership for each one.


How Tick-by-Tick Market Data Is Maintained

When people say “how are so many stock ticks maintained,” they usually mean:

  1. how the system maintains live order books for many stocks
  2. how the system publishes tick-by-tick market-data updates

These are related, but not identical.

Tick-Size Rules

Each instrument may have a minimum price increment rule. That rule belongs in reference data and should be enforced by the venue.

Architecturally:

  • gateway and controls can reject obviously invalid prices early
  • the matcher still enforces the final rule before commit
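The tick check itself is nearly a one-liner if prices are handled as exact decimals rather than binary floats. This sketch assumes prices arrive as strings, which many wire protocols effectively guarantee.

```python
from decimal import Decimal

def price_on_tick(price: str, tick: str) -> bool:
    """A price is valid only if it is an exact multiple of the tick size.
    Decimal arithmetic avoids binary floating-point rounding surprises."""
    return Decimal(price) % Decimal(tick) == 0
```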

Tick-by-Tick Feed

The engine does not scan every possible price for every stock on every event. It updates only the book state that changed.

For a committed change in RELIANCE, the system may publish:

  • best bid changed
  • best ask changed
  • last traded price changed
  • traded volume changed
  • depth changed

That is the key to scale: the feed is incremental. It is derived from committed state transitions, not by polling a database or recomputing the world.


How the Same Feed Reaches Many Brokers

This is another place where many designs stay too shallow.

The exchange is not sending one update to one consumer. It is sending the same committed market event to many brokers at once.

The right flow looks like this:

flowchart LR
    A[Matching Engine]
    B[Market Data Builder]
    C[Sequence Numbered Feed Stream]
    D[Snapshot Store]

    subgraph Fanout["Feed Distribution Tier"]
        E[Gateway 1]
        F[Gateway 2]
        G[Gateway 3]
    end

    subgraph Brokers["Brokers"]
        H[Broker A]
        I[Broker B]
        J[Broker C]
        K[Broker D]
    end

    A --> B --> C
    B --> D
    C --> E
    C --> F
    C --> G
    E --> H
    E --> I
    F --> J
    G --> K
    D --> H
    D --> I
    D --> J
    D --> K

Important design details:

  • feed messages should be sequence numbered
  • brokers should detect gaps
  • missed data should be repaired by snapshot plus replay
  • slow consumers should be isolated from fast ones

The most important rule is simple: the matching engine must never wait for a slow broker feed.

Feed fan-out is a hidden failure path

Many weak designs keep matching correct in isolation but accidentally let feed backpressure leak into the core. Recovery, replay, and slow-consumer handling should stay in the distribution tier, not in the matcher thread.


Scalability: How the System Handles Huge Volume

Now let us address the most important scale question: how does this system handle extreme volume?


1. Partition by Symbol

The exchange scales horizontally by distributing symbols across partitions.

That means:

  • each partition has its own sequencer and matcher
  • symbols are processed independently
  • one hot stock does not automatically block unrelated stocks
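A minimal routing sketch, assuming a hypothetical `SymbolRouter` that hashes each symbol to a fixed partition so every symbol has exactly one owning sequencer and matcher:

```java
// Hypothetical sketch: routing orders to partitions by symbol, so each
// symbol is always processed by the same sequencer and matcher.
public class SymbolRouter {
    private final int partitions;

    SymbolRouter(int partitions) { this.partitions = partitions; }

    // Same symbol always maps to the same partition; different symbols spread.
    int partitionFor(String symbol) {
        return Math.floorMod(symbol.hashCode(), partitions);
    }

    public static void main(String[] args) {
        SymbolRouter router = new SymbolRouter(8);
        for (String s : new String[]{"RELIANCE", "TCS", "INFY", "HDFCBANK"}) {
            System.out.println(s + " -> partition " + router.partitionFor(s));
        }
    }
}
```

Hashing is the simplest version; real venues often use an explicit assignment table instead, precisely so hot symbols can be rebalanced deliberately rather than wherever the hash happens to land.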

2. Keep One Active Owner Per Symbol

We do not scale one book using many active writers. That creates coordination overhead and ambiguity.

For one symbol, single-writer ownership is usually the better trade-off.

3. Isolate Hot Symbols

Not all stocks are equally active.

At market open or during major events:

  • RELIANCE may get heavy order flow
  • INFY may see extreme cancel or replace volume
  • smaller symbols may remain relatively quiet

So capacity planning should focus on:

  • hot-symbol concentration
  • cancel and replace ratio
  • outbound market-data amplification

4. Scale the Feed Tier Separately

Inbound order traffic and outbound market-data traffic are different scaling problems.

The matching engine must stay small. The fan-out tier can scale independently with more gateways, more replay capacity, and more snapshot infrastructure.

5. Keep Replay and Recovery Off the Matcher

After outages or disconnect storms, many brokers may reconnect at once. That recovery load should hit the snapshot and replay tier, not the live matching thread.


Why the System Is So Fast

The exchange is fast because it refuses to do expensive things in the hot path.

The main reasons are:

  1. one active writer per book or partition
  2. in-memory order books
  3. local caches for reference data
  4. no database round-trip to decide a match
  5. no distributed transaction in the matching loop
  6. market-data fan-out moved off the matcher
  7. bounded, predictable handoff between stages

The hot path should look like this:

  • gateway
  • lightweight validation
  • venue controls
  • sequencer
  • single-writer matcher
  • commit boundary
  • ack

That is why the system can stay fast even when overall platform complexity is large.

If we turned this into a chain like:

  • order service
  • risk service
  • book service
  • trade service

we would add network hops, serialization, retries, and latency variance right where determinism matters most.



Failure and Recovery

A stock exchange design is incomplete without failure handling.

The most important scenarios are:

1. One Symbol Becomes Extremely Hot

If INFY suddenly becomes 20 times hotter than the average stock:

  • only its owning partition should be stressed
  • other symbols should keep moving normally
  • venue controls should still be able to throttle or halt if required

2. A Broker Feed Falls Behind

If one broker cannot keep up with tick processing:

  • that broker’s feed path should lag or disconnect
  • the matcher should continue normally
  • recovery should happen through snapshot plus replay

3. Gateway Reconnect Storm

If many sessions reconnect after a network event:

  • session recovery should be explicit
  • reconnect handling should be rate limited
  • cancel-on-disconnect rules should be deterministic

4. Primary Matcher Failure

If the active owner of RELIANCE fails:

  • the standby should know the last committed sequence
  • partition ownership should move explicitly
  • the new primary should continue from the next safe sequence number

The critical recovery rule is: there must be one unambiguous owner of the next event.
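That rule can be sketched with a hypothetical `PartitionOwnership` class: the standby tracks the last committed sequence from the journal, and promotion is an explicit step that yields the first sequence the new primary may assign.

```java
// Hypothetical sketch: standby promotion after primary matcher failure.
// The standby follows committed sequences, and ownership moves explicitly,
// so there is exactly one unambiguous owner of the next event.
public class PartitionOwnership {
    private long lastCommittedSeq;
    private boolean primary;

    PartitionOwnership(long lastCommittedSeq) {
        this.lastCommittedSeq = lastCommittedSeq;
        this.primary = false;
    }

    // Called as the standby observes journal commits from the primary;
    // only in-order commits advance the watermark.
    void observeCommit(long seq) {
        if (seq == lastCommittedSeq + 1) lastCommittedSeq = seq;
    }

    // Explicit ownership transfer: returns the first sequence number the
    // new primary is allowed to assign.
    long promote() {
        primary = true;
        return lastCommittedSeq + 1;
    }

    boolean isPrimary() { return primary; }
}
```

A real failover also needs fencing so the old primary cannot keep writing after the transfer; that coordination piece is deliberately out of scope for this sketch.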

Active-active sounds safer than it is

For one order book, multi-writer ownership usually increases coordination cost, ambiguity, and split-brain risk. That is why single-writer ownership is the simpler and safer default in a low-latency exchange.


Final Design Summary

If I had to explain the design in one short flow, I would say this:

  1. brokers connect through gateways
  2. the exchange applies local venue controls
  3. orders are routed by symbol to one active partition
  4. the sequencer defines authoritative order
  5. the matching engine updates the in-memory book
  6. the journal records committed truth
  7. execution reports go back to brokers
  8. market-data builders and feed gateways distribute the committed feed
  9. downstream systems consume the same truth asynchronously

That is the clean mental model a learner should keep.

If that model is clear, the detailed implementation choices become much easier to understand.


Key Takeaways

  • A stock exchange is a deterministic sequencing problem under latency pressure.
  • The design should begin with requirements and invariants, not with random infrastructure boxes.
  • One active owner per symbol or partition is the simplest scalable model.
  • Matching should stay memory-resident, single-writer, and isolated from slow dependencies.
  • Tick-by-tick market data should be derived incrementally from committed events.
  • The system scales by partitioning symbols, isolating hot books, and scaling market-data fan-out separately.
