Designing a stock exchange is one of the best system-design exercises because it forces you to care about correctness, scale, recovery, and latency at the same time.
This is not a CRUD system with a trading theme. It is a fairness-critical system where one wrong design decision can break sequencing, replay, or market data under load.
In this article, I will walk through the design step by step as if we are learning system design from scratch:
- what we are building
- functional requirements
- non-functional requirements
- high-level architecture
- detailed component design
- how the system scales and handles huge traffic
- why the system is so fast
To keep the discussion concrete, I will use Indian cash-equity examples like RELIANCE, TCS, INFY, and HDFCBANK.
Treat the examples as architectural illustrations, not as official NSE or BSE protocol documentation.
Scope of the examples
The symbols and scenarios in this article are teaching devices to make sequencing, matching, and recovery easier to reason about. They should not be read as venue-specific protocol or regulatory guidance.
Problem Statement
We want to design a stock exchange for cash equities.
The exchange should accept orders from member brokers, maintain live order books, match buy and sell orders using price-time priority, publish trade confirmations, and send market data to many brokers at the same time.
If you want to think about it in practical terms, imagine:
- thousands of listed stocks
- some symbols like RELIANCE or INFY becoming extremely hot at market open or on news
- heavy cancel and replace traffic during volatility
- brokers expecting live acknowledgments and tick-by-tick market data
- regulators expecting clean audit and replay
So the real problem is not just “place orders.” The real problem is:
- preserve one clear ordering of events
- match deterministically
- recover cleanly after failure
- distribute market data at scale
- do all of this with very low latency
What Exactly Are We Building
Before drawing boxes, define the system boundary clearly.
We are building the exchange core, not the entire trading ecosystem.
The exchange is responsible for:
- accepting orders from brokers
- validating venue-owned rules
- sequencing orders
- maintaining the order book
- matching orders
- publishing execution reports
- publishing market data
- storing an auditable event trail
The broker is responsible for:
- client onboarding and KYC
- margin and portfolio checks
- user-facing app and APIs
- routing customer orders to the exchange
The clearing and settlement system is responsible for:
- post-trade processing
- obligations, netting, and settlement
- money and securities movement
This boundary matters because weak designs push broker logic or clearing logic into the exchange hot path. That makes the system slower and harder to reason about.
Functional Requirements
At a high level, the exchange should support these behaviors:
- Accept new, cancel, and replace orders from brokers.
- Maintain a live order book for each stock.
- Match orders using deterministic price-time priority.
- Send order acknowledgments and execution reports back to brokers.
- Publish tick-by-tick market data and depth updates.
- Support snapshot plus incremental feed recovery for market-data consumers.
- Support audit, replay, and operational recovery.
To keep the first version realistic, I would start with:
- cash equities only
- limit, market, cancel, and replace first
- no attempt to solve every asset class in version 1
Version-1 design discipline
A strong exchange design usually becomes clearer when version 1 is intentionally narrow. If we try to solve equities, derivatives, auctions, complex order types, and cross-venue routing all at once, the core invariants get blurry.
Non-Functional Requirements
This is where the design really gets shaped.
The most important NFRs are:
1. Correctness
For one symbol, there must be one authoritative processing order. If two components disagree about order, the design is broken.
2. Determinism
If we replay the same event stream, we must rebuild the same book and the same trades.
3. Low latency
The hot path must stay extremely short and avoid remote dependencies.
4. High throughput
The exchange must handle very large inbound order volume, especially during bursts.
5. Hot-symbol isolation
One extremely active stock should not degrade unrelated symbols.
6. Market-data fan-out
The system must deliver the same committed feed to many brokers without allowing slow consumers to block matching.
7. Recovery
Failover, replay, and session recovery must be explicit and reliable.
The key design lesson is this: the exchange is not primarily a throughput system. It is a correctness system under latency pressure.
High-Level Design
At a high level, the system needs four layers:
- ingress and session management
- deterministic matching core
- market-data distribution
- asynchronous downstream systems
Here is the high-level architecture:
flowchart LR
subgraph Ingress["Ingress and Session Layer"]
A[Member Brokers]
B[Traffic Manager]
C[L4 Load Balancer]
D[FIX or OUCH Gateways]
E[Session Manager]
end
subgraph Control["Reference and Venue Control Layer"]
F[Reference Data Service]
G[Local Reference Cache]
H[Venue Controls]
end
subgraph Core["Deterministic Matching Core"]
I[Partition Router]
J[Sequencer]
K[Matching Engine]
L[In-Memory Order Book]
M[Event Journal]
N[Warm Standby]
end
subgraph Distribution["Execution and Market Data"]
O[Execution Reports]
P[Market Data Builder]
Q[Snapshot Store]
R[Market Data Gateways]
end
subgraph Async["Async Consumers"]
S[Kafka or Internal Bus]
T[Trade Ledger]
U[Surveillance]
V[Clearing Adapter]
W[Reporting APIs]
end
A --> B --> C --> D --> E
E --> H
F --> G --> H
H --> I --> J --> K --> L
K --> M
K --> O
K --> P
M --> N
P --> Q --> R
M --> S
S --> T
S --> U
S --> V
S --> W
This diagram already tells us a lot:
- gateways handle connectivity, not matching
- the sequencer establishes authoritative order
- the matching engine owns book mutation
- the journal defines the recovery boundary
- market-data distribution is isolated from the matcher
- Kafka and other async systems are downstream, not in the fairness-critical path
Step-by-Step Order Flow
Now let us walk one order through the system.
Suppose a broker sends this order:
- symbol: RELIANCE
- side: buy
- price: INR 2,845.10
- quantity: 100
The sequence looks like this:
sequenceDiagram
participant Client as Investor App
participant Broker as Member Broker
participant Gateway as Exchange Gateway
participant Control as Venue Controls
participant Seq as Sequencer
participant Match as Matching Engine
participant Journal as Event Journal
participant Feed as Market Data Builder
Client->>Broker: Place order for RELIANCE
Broker->>Gateway: New order
Gateway->>Control: Basic validation and venue checks
Control->>Seq: Route normalized order
Seq->>Match: RELIANCE seq=91234012
Match->>Match: Update book and generate trades if crossed
Match->>Journal: Commit ordered event
Match-->>Broker: Ack or execution report
Match->>Feed: Book delta and trade tick
That is the full design in miniature. Everything else in the article exists to make these steps safe and fast.
Component Design in Detail
Now we can go deeper, component by component, in the same order in which an order flows.
For each component, the useful design question is not just “what does it do?” It is:
- what state does it own
- what decisions can it make locally
- what does it emit downstream
- what failure must it isolate from the rest of the exchange
1. Gateways and Session Manager
The gateway terminates exchange protocols like FIX or a lower-latency binary protocol. It is the exchange’s network edge, but it should not become a mini trading system.
A good gateway usually owns these pieces of state:
- authenticated session table keyed by member and session id
- inbound message sequence tracking
- resend and replay windows for reconnect recovery
- per-session outbound queue for acknowledgments and execution reports
- connection-level throttles and heartbeat timers
Its hot-path work should look like this:
- parse the incoming frame
- authenticate the session
- validate protocol syntax and required fields
- detect duplicate or out-of-sequence client messages
- normalize the order into a canonical internal envelope
- attach session metadata like member id, receive timestamp, and protocol sequence number
- hand the request to venue controls
That canonical envelope matters a lot. If the matcher later needs to care about FIX tags, binary packet layout, or session recovery details, the layering is already broken.
This is also the right place to separate client identity from exchange identity. In practice an order often carries:
- a client-provided order id used by the broker session
- an exchange-assigned internal order id used inside the venue
- a transport sequence number used for resend and recovery
Those are not the same thing, and mixing them leads to very confusing recovery bugs. The gateway should preserve the client id for reporting, but internal components should rely on stable exchange ids after normalization.
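To make that identity separation concrete, here is a minimal sketch of a gateway normalizer. The class and field names (`GatewayNormalizer`, `Envelope`) are assumptions for illustration, not any real venue protocol:

```java
public final class GatewayNormalizer {
    /** Canonical internal envelope; field names are illustrative only. */
    public static final class Envelope {
        public final String clientOrderId; // broker-session-scoped id, preserved for reporting
        public final long exchangeOrderId; // venue-assigned id used by internal components
        public final long transportSeq;    // session transport sequence, used for resend
        Envelope(String c, long e, long t) { clientOrderId = c; exchangeOrderId = e; transportSeq = t; }
    }

    private long nextExchangeId = 1;

    /** Normalize a raw client message: keep the client id, assign stable exchange identity. */
    public Envelope normalize(String clientOrderId, long sessionTransportSeq) {
        return new Envelope(clientOrderId, nextExchangeId++, sessionTransportSeq);
    }
}
```

The point of the sketch is that the three identifiers are carried side by side rather than conflated, so replay and resend logic can key off the right one.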
What the gateway should not do:
- call a remote database
- invoke a portfolio service
- run slow customer workflows
- wait on downstream consumers to flush
The first important rule of exchange design is: the gateway must be boring, fast, and predictable. If one broker is slow, its outbound session queue can back up or disconnect, but that must not stall the rest of the venue.
2. Reference Data Service and Local Cache
The system needs static or semi-static data like:
- stock metadata
- tick-size rules
- trading session state
- member permissions
- venue-level limits
That data should come from a source of truth, but hot-path components should use local in-memory caches.
A complete design usually has three layers:
- an authoritative control-plane store where operators publish changes
- a change-distribution mechanism such as snapshots plus versioned deltas
- local in-memory caches inside gateways, controls, routers, and matchers
The cache design matters almost as much as the data itself. You want atomic reads of a coherent snapshot, not half-updated maps. In practice this often means:
- immutable snapshot objects
- version numbers or epochs on configuration updates
- copy-on-write refresh when new reference data arrives
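A minimal sketch of such a copy-on-write cache, using an illustrative `ReferenceCache` class and a toy tick-size map; all names here are assumptions, not a real venue API:

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

public final class ReferenceCache {
    /** Immutable snapshot: readers always observe one coherent version. */
    public static final class Snapshot {
        public final long epoch;
        public final Map<String, Long> tickSizePaise; // symbol -> tick size (toy field)
        public Snapshot(long epoch, Map<String, Long> tickSizePaise) {
            this.epoch = epoch;
            this.tickSizePaise = Map.copyOf(tickSizePaise);
        }
    }

    private final AtomicReference<Snapshot> current = new AtomicReference<>();

    /** Copy-on-write refresh: accept only strictly newer epochs. */
    public boolean refresh(Snapshot next) {
        Snapshot prev = current.get();
        if (prev != null && next.epoch <= prev.epoch) return false; // stale update, ignore
        return current.compareAndSet(prev, next);
    }

    /** One volatile read on the hot path: no locks, no half-updated maps. */
    public Snapshot read() { return current.get(); }
}
```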
Why this design is important:
- a network lookup per order destroys latency consistency
- reference data changes much more slowly than order traffic
- venue controls need local decisions
- failover logic needs the same routing and rule view across components
Typical fields in reference data include:
- symbol to partition ownership
- lot-size rules
- tick-size bands
- daily price collars
- session state such as open, halt, or auction
- member entitlements for instruments or order types
If a cache is stale or unavailable, the venue should fail safely. It is usually better to reject or pause a symbol than to silently make decisions using uncertain rules.
3. Venue Controls
Before an order reaches the sequencer, the exchange should apply venue-owned controls such as:
- price collars
- fat-finger checks
- throttles
- cancel-on-disconnect
- trading halts or auction mode
These checks belong to the exchange because they protect the market itself. They must be local and deterministic.
A good venue-controls module typically owns:
- per-member throttle counters
- per-symbol trading status
- per-session disconnect policy flags
- local views of price bands and venue rules
Just as important as the rules is the evaluation order. For example, a new order might be checked in this sequence:
- is the symbol tradable right now
- is the member allowed to send this order type
- is the price inside the allowed collar
- is the quantity inside maximum order size
- is the member already beyond rate limits
That ordering should be stable so the same bad input always fails for the same reason. This makes support, audit, replay, and testing much easier.
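The stable evaluation order above can be sketched like this; thresholds and reject reasons are invented purely for illustration:

```java
public final class VenueControls {
    public enum Reject { NONE, SYMBOL_HALTED, ORDER_TYPE_NOT_ALLOWED, PRICE_OUTSIDE_COLLAR, QTY_TOO_LARGE, RATE_LIMITED }

    // Illustrative local state; real values would come from the reference cache.
    public boolean symbolTradable = true;
    public boolean orderTypeAllowed = true;
    public long collarLowPaise = 2700_00, collarHighPaise = 2990_00;
    public long maxQty = 500_000;
    public boolean overRateLimit = false;

    /** Checks run in one fixed order so the same bad input always fails the same way. */
    public Reject check(long pricePaise, long qty) {
        if (!symbolTradable) return Reject.SYMBOL_HALTED;
        if (!orderTypeAllowed) return Reject.ORDER_TYPE_NOT_ALLOWED;
        if (pricePaise < collarLowPaise || pricePaise > collarHighPaise) return Reject.PRICE_OUTSIDE_COLLAR;
        if (qty > maxQty) return Reject.QTY_TOO_LARGE;
        if (overRateLimit) return Reject.RATE_LIMITED;
        return Reject.NONE;
    }
}
```

An order that is both outside the collar and oversized always rejects with the collar reason, because that check runs first; that determinism is what makes audit and replay tractable.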
Notice that venue controls are not the same thing as a broker’s full portfolio risk engine. A broker may still do its own margin and position checks before sending an order. The exchange only enforces market-owned protections that must apply uniformly to everyone.
Even then, the matcher should still recheck a small set of invariants that directly affect correctness, such as session state or price validity. Early rejection is good for latency, but the final correctness guard still belongs near commit.
Some venues also place self-trade prevention here if it can be evaluated deterministically from local state. Others keep final self-trade prevention inside the matcher because it depends on the exact resting order selected for match. The main rule is consistency: wherever the rule lives, it must behave the same on replay.
4. Partition Router
The exchange cannot process all stocks through one giant matching thread. It needs to partition work.
The partition router maps each stock to one active owner.
Example:
- RELIANCE and TCS may be on partition P7
- INFY and HDFCBANK may be on partition P12
- colder symbols may be grouped on lighter partitions
The important invariant is: one symbol belongs to exactly one active partition at a time.
That sounds simple, but the router is really an ownership directory. It should answer:
- which partition currently owns this symbol
- which epoch or ownership version is active
- where should traffic go after failover or rebalance
So the router usually keeps a local mapping like:
- symbol -> partition id
- partition id -> active endpoint
- symbol -> ownership epoch
That epoch matters during failover.
If a gateway still believes RELIANCE belongs to old owner P7@epoch=41 but the control plane has moved it to P19@epoch=42, stale traffic must be rejected or rerouted explicitly.
For hot-symbol migration, a safe transfer usually looks like:
- stop accepting new traffic for the symbol on the old owner
- drain and commit all already sequenced events
- publish the new ownership epoch
- activate the new owner
- let gateways route future traffic only to the new epoch
The router should not “discover” ownership by trial and error on live traffic. Ownership is a control-plane decision, not a runtime guess.
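The epoch-fencing idea can be sketched like this; `PartitionRouter` and its shape are illustrative, not any real control-plane API:

```java
import java.util.HashMap;
import java.util.Map;

public final class PartitionRouter {
    public static final class Ownership {
        public final int partitionId;
        public final long epoch;
        Ownership(int p, long e) { partitionId = p; epoch = e; }
    }

    private final Map<String, Ownership> directory = new HashMap<>();

    /** The control plane publishes ownership explicitly; it is never guessed from traffic. */
    public void publish(String symbol, int partitionId, long epoch) {
        directory.put(symbol, new Ownership(partitionId, epoch));
    }

    /** Traffic stamped with a stale owner or epoch is rejected, never silently accepted. */
    public boolean accept(String symbol, int claimedPartition, long claimedEpoch) {
        Ownership o = directory.get(symbol);
        return o != null && o.partitionId == claimedPartition && o.epoch == claimedEpoch;
    }
}
```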
Behind this router there is usually a small control-plane ownership manager responsible for:
- planned symbol migration
- failover promotion
- fencing an old owner
- publishing the next valid ownership epoch
That control-plane path is not latency critical, but it is correctness critical. It should be small, explicit, and heavily audited.
5. Sequencer
The sequencer is the heart of fairness.
Its job is simple but extremely important:
- accept normalized events
- assign monotonically increasing sequence numbers
- create one authoritative order of events for a partition
Without a sequencer, two gateways may disagree about which order arrived first. With a sequencer, replay and failover become understandable.
Internally, the sequencer is usually a very small component with very strict discipline. It typically owns:
- the next sequence number for a partition
- an ingress queue of normalized commands
- the current ownership epoch
- optional ingress timestamps and checksums
The sequencer may batch work for efficiency, but it must not reorder it. That means:
- it can pull 32 pending orders at once
- it can stamp them with sequence numbers in a batch
- it cannot let a later cancel jump ahead of an earlier new order
In many real systems, the sequencer is co-located with the matcher or implemented as the first stage of the same single-thread event loop. That is often a good trade-off because it reduces hops while still preserving the logical boundary: first establish order, then mutate state.
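A minimal sketch of batch stamping that preserves FIFO order; the class and command format are assumptions for illustration:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

public final class Sequencer {
    private long nextSeq = 1;
    private final ArrayDeque<String> ingress = new ArrayDeque<>(); // normalized commands, FIFO

    public void submit(String command) { ingress.addLast(command); }

    /** Stamp up to batchMax pending commands; batching amortizes cost but never reorders. */
    public List<String> stampBatch(int batchMax) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < batchMax && !ingress.isEmpty(); i++) {
            out.add(nextSeq++ + ":" + ingress.pollFirst()); // strictly increasing, arrival order
        }
        return out;
    }
}
```

A cancel submitted after a new order can never receive a lower sequence number, which is exactly the invariant the batching rules above demand.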
One subtle but important point:
the sequencer defines the intended processing order, but the journal defines the committed truth boundary.
So if the active owner crashes after stamping sequence 91234012 but before that event is durably committed, recovery resumes from the last committed sequence, not from whatever number happened to be in RAM.
Sequencing is the fairness boundary
Once the partition order is established, every downstream consumer should derive truth from that same order. Acknowledgments, replay, standby rebuild, and market data should all agree on it.
6. Matching Engine and In-Memory Order Book
The matching engine should be treated as a single-writer state machine.
For each sequenced event, it:
- checks current book state
- applies new, cancel, or replace logic
- matches against opposite-side liquidity if the order crosses
- updates the in-memory book
- emits execution results and market-data deltas
The order book usually needs:
- bids ordered by best price first
- asks ordered by best price first
- FIFO queue at each price level
- order-id index for fast cancel lookup
In practice, the engine owns much more than just two sorted sides. For each live symbol book it usually maintains:
- bid price levels
- ask price levels
- a queue of resting orders per price level
- an order-id index for direct lookup
- aggregate quantities per level
- symbol-level metadata like trading mode or auction state
The hot path for a new limit order looks like this:
- load the book for the symbol
- validate final book-level invariants
- if the order crosses, match against the best opposite price level
- consume resting liquidity in strict price-time order
- generate one or more trade events
- remove empty levels or reduce partially filled orders
- if quantity remains, rest the residual on the book
- emit the final order-state transition and book delta
Cancel and replace need equally precise semantics. For example:
- cancel must find the order by id in O(1) or close to it
- replace often behaves like cancel-plus-new from a priority perspective
- quantity reduction may or may not retain priority depending on venue rules
You also need clear semantics for order types and identifiers:
- market orders usually execute immediately against the book and must not rest unpriced unless the venue explicitly supports that model
- IOC or FOK orders should be resolved entirely inside one deterministic pass
- self-trade prevention must define whether to cancel the resting order, the aggressing order, or both
- trade ids should be generated deterministically from the committed matching path, not by some later async consumer
Timestamps need similar discipline. Many systems capture several of them:
- gateway receive timestamp
- sequence assignment timestamp
- commit timestamp
- trade publication timestamp
For fairness, sequence order wins. Timestamps help with audit and latency measurement, but they should not override committed ordering.
This is why the system can stay fast: the book is memory-resident and mutated by one owner. There is no database in the decision loop, no shared lock between writers, and no distributed coordination between matchers for one symbol.
This single-writer model also makes correctness testing much easier. Given the same ordered input stream, the engine should produce the same trades, book state, and outbound events every time.
7. Event Journal
The journal is the committed truth boundary.
It answers these questions:
- what exactly was accepted
- what exact event order must a standby replay
- from which point can recovery resume
This should be an ordered committed log, not a vague debug trace.
A robust journal record usually includes at least:
- partition id
- ownership epoch
- sequence number
- normalized input command
- resulting order state transition
- generated fills or trades
- timestamps and checksums
Some designs persist only commands and rely on deterministic replay to regenerate the outputs. Other designs also persist normalized outcomes for audit and faster downstream consumption. Either approach can work, but the rule is the same: the journal must contain enough information to reconstruct committed truth exactly.
Operationally, the journal is also where durability policy lives. You may batch appends for throughput, but the external acknowledgment boundary should still be tied to a committed journal point.
The journal also needs boring engineering details that matter in production:
- segment rotation
- checksums
- corruption detection
- truncation rules
- retention and archival policy
Without that discipline, replay looks clean in a whiteboard interview but becomes painful in a real incident.
8. Recovery Snapshots and Replay
The journal is the source of truth, but replaying an entire trading day from the first event every time is not operationally pleasant. So most serious designs combine the journal with periodic recovery snapshots.
A recovery snapshot is different from a market-data snapshot. It is not for subscribers. It is for restoring internal matching state quickly.
A good recovery snapshot typically contains:
- partition id and ownership epoch
- last included committed sequence number
- full in-memory book state for owned symbols
- order-id index
- level aggregates and symbol metadata
- checksum of the serialized state
The recovery rule is straightforward:
- load the latest valid snapshot
- verify its checksum and included sequence
- replay journal events after that sequence
- resume only when book state is caught up to the desired commit point
This does two important things:
- reduces restart time for cold recovery
- gives standby promotion and disaster recovery a bounded replay window
The snapshot cadence is a trade-off. Very frequent snapshots cost more write bandwidth and CPU. Very infrequent snapshots increase recovery time. Production systems usually tune this by partition heat, memory size, and recovery objectives.
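The recovery rule can be sketched with a toy book model; the snapshot and journal-entry formats here are invented purely for illustration:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Recovery sketch: restore from snapshot, then replay journal entries after its sequence. */
public final class Recovery {
    public static final class Snapshot {
        public final long lastIncludedSeq;
        public final Map<String, Long> bookQty; // symbol -> total resting qty (toy state)
        public Snapshot(long seq, Map<String, Long> bookQty) {
            this.lastIncludedSeq = seq;
            this.bookQty = new TreeMap<>(bookQty);
        }
    }

    /** Each journal entry is "seq:symbol:qtyDelta" in this toy model. */
    public static Map<String, Long> recover(Snapshot snap, List<String> journal) {
        Map<String, Long> book = new TreeMap<>(snap.bookQty);
        for (String entry : journal) {
            String[] p = entry.split(":");
            long seq = Long.parseLong(p[0]);
            if (seq <= snap.lastIncludedSeq) continue; // already folded into the snapshot
            book.merge(p[1], Long.parseLong(p[2]), Long::sum);
        }
        return book;
    }
}
```

The skip condition is the whole point: events at or before the snapshot's included sequence must not be applied twice.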
9. Warm Standby
The standby is not another active writer. It is a follower that replays the same committed events and stays ready for takeover.
Good standby design means:
- it consumes the same ordered journal
- it builds the same in-memory state
- it can take over only after ownership is explicitly transferred
In a stronger design, the standby does more than just “listen.” It should actively verify:
- sequence continuity
- current journal offset
- periodic book checksum or state hash
- readiness to serve the next epoch
That gives operations a way to know whether the standby is truly hot or only “probably fine.”
Promotion should also be explicit. A safe failover sequence is typically:
- fence the old primary so it cannot accept more writes
- find the last committed sequence in the journal
- let the standby replay to that exact point
- publish a new ownership epoch
- route all future traffic to the promoted node
This is how we avoid split brain. The standby is not there to increase write parallelism. It is there to reduce recovery time while preserving one-writer ownership.
In strong designs, the standby may restore from the latest recovery snapshot first and then tail the journal. That avoids forcing every follower to rebuild only from the first event of the day.
10. Execution Reports
Once an event is committed, the exchange must send the right outcome back to the broker:
- accepted
- rejected
- partially filled
- fully filled
- canceled
These reports must come from the same committed engine output that feeds replay and market data.
The execution-report path usually owns:
- mapping from internal order id back to broker session and client order id
- per-session outbound message sequence numbers
- resend history for reconnect recovery
- outbound queues isolated per broker session
This is important because the exchange is not replying to “the system” in general. It is replying to one particular member session using that protocol’s sequencing rules.
A good flow looks like this:
- committed engine result is produced
- session mapper resolves the destination broker session
- protocol-specific encoder builds the proper response
- response is queued on that broker’s outbound session
- reconnect and resend logic can replay the same committed outcome if needed
The subtle but critical design point is: the matcher decides order outcome, but the session layer decides delivery mechanics. That separation keeps correctness independent from network recovery behavior.
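That separation between outcome and delivery can be sketched as follows, with hypothetical session and identifier names:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: the session layer maps committed outcomes back to one broker session. */
public final class ExecReportSession {
    private final Map<Long, String> orderToSession = new HashMap<>();   // exchange order id -> session
    private final Map<Long, String> orderToClientId = new HashMap<>();  // exchange order id -> client order id
    private final Map<String, List<String>> outbound = new HashMap<>(); // per-session queue and resend history
    private final Map<String, Long> outSeq = new HashMap<>();           // per-session outbound sequence

    public void register(long exchangeOrderId, String session, String clientOrderId) {
        orderToSession.put(exchangeOrderId, session);
        orderToClientId.put(exchangeOrderId, clientOrderId);
    }

    /** The matcher decided the outcome; this layer only decides delivery mechanics. */
    public String publish(long exchangeOrderId, String outcome) {
        String session = orderToSession.get(exchangeOrderId);
        long seq = outSeq.merge(session, 1L, Long::sum);
        String msg = session + "#" + seq + " " + outcome
                + " clOrdId=" + orderToClientId.get(exchangeOrderId);
        outbound.computeIfAbsent(session, s -> new ArrayList<>()).add(msg);
        return msg;
    }

    /** Resend after reconnect replays the same committed outcomes; it never re-decides them. */
    public List<String> resendFrom(String session, int fromIndex) {
        List<String> history = outbound.getOrDefault(session, List.of());
        return history.subList(fromIndex, history.size());
    }
}
```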
11. Market Data Builder and Feed Gateways
The market-data builder consumes committed engine outputs and creates:
- trade ticks
- top-of-book updates
- depth updates
- snapshots for recovery
Then a separate feed-distribution tier fans this out to many brokers.
This separation is extremely important: market-data distribution must not slow matching.
The builder is a derived-state component. It should not inspect the live mutable book directly from inside the matcher thread. Instead it should consume committed engine events and maintain its own derived view such as:
- best bid and ask
- top-of-book state
- depth ladders
- last trade
- traded volume and turnover
One committed match event may expand into several feed events:
- a trade tick
- best bid changed
- best ask changed
- depth level changed
- session statistics updated
That is exactly why feed generation should be decoupled. Fan-out is amplification, and amplification belongs outside the matcher.
Feed gateways then own subscriber-facing concerns such as:
- per-subscriber entitlements
- protocol encoding
- multicast or unicast delivery mode
- per-consumer buffer management
- gap detection and recovery
Sequence numbers are critical here.
A subscriber should be able to say:
“I have market-data sequence 48190021; give me everything after that or give me a fresh snapshot.”
That leads to the standard repair model:
- client detects a gap
- fetch snapshot
- replay deltas after snapshot sequence
- resume live stream
The builder is allowed to be expensive in ways the matcher is not
The market-data path can maintain derived statistics, prebuilt snapshots, and replay buffers because it is downstream of the fairness-critical path. The matcher should only emit the minimal committed deltas needed to derive them.
12. Async Downstream Systems
Once the exchange commits an event, many other systems may consume it:
- trade ledger
- surveillance and compliance
- reporting
- clearing adapter
This is where Kafka or another event bus can be very useful. But this is downstream. It is not where fairness is decided.
These systems are usually interested in different projections of the same committed truth:
- the ledger cares about economic events like trades, fees, and adjustments
- surveillance cares about the full lifecycle of orders, cancels, and replaces
- clearing adapters care about settlement-facing trade records
- reporting APIs care about queryable history and analytics
The clean design is to publish a committed event stream with stable identifiers such as:
- partition id
- sequence number
- order id
- trade id
- event type
That lets downstream consumers be replayable and idempotent.
If the ledger sees trade T-884192 twice, it should deduplicate safely.
If surveillance is offline for ten minutes, it should catch up from the bus or journal without involving the matcher at all.
This is also the right place for enrichments that would be harmful in the hot path:
- user-facing metadata joins
- analytical projections
- warehouse exports
- regulatory packaging
Kafka or an internal bus is excellent here because we want decoupling, fan-out, and replay. It is the wrong abstraction for the fairness-critical matching loop, but a very strong abstraction for everything after commit.
How the Exchange Handles Thousands of Stocks
This is a common confusion point in interviews and blog posts.
The exchange does not maintain one huge shared structure for all stocks. It maintains many independent books with explicit ownership.
The model looks like this:
flowchart LR
A[Symbol Master]
B[Partition Directory]
subgraph P7["Partition P7"]
C[RELIANCE Book]
D[TCS Book]
end
subgraph P12["Partition P12"]
E[INFY Book]
F[HDFCBANK Book]
end
subgraph P31["Partition P31"]
G[Other Symbols]
end
A --> B
B --> C
B --> D
B --> E
B --> F
B --> G
The exchange uses a symbol master that contains:
- symbol code
- instrument type
- tick-size rule
- session state
- price bands
- partition ownership
That symbol master feeds local caches used by gateways, venue controls, and the router.
This is how we support many stocks while keeping deterministic ownership for each one.
How Tick-by-Tick Market Data Is Maintained
When people say “how are so many stock ticks maintained,” they usually mean:
- how the system maintains live order books for many stocks
- how the system publishes tick-by-tick market-data updates
These are related, but not identical.
Tick-Size Rules
Each instrument may have a minimum price increment rule. That rule belongs in reference data and should be enforced by the venue.
Architecturally:
- gateway and controls can reject obviously invalid prices early
- the matcher still enforces the final rule before commit
Tick-by-Tick Feed
The engine does not scan every possible price for every stock on every event. It updates only the book state that changed.
For a committed change in RELIANCE, the system may publish:
- best bid changed
- best ask changed
- last traded price changed
- traded volume changed
- depth changed
That is the key to scale: the feed is incremental. It is derived from committed state transitions, not by polling a database or recomputing the world.
How the Same Feed Reaches Many Brokers
This is another place where many designs stay too shallow.
The exchange is not sending one update to one consumer. It is sending the same committed market event to many brokers at once.
The right flow looks like this:
flowchart LR
A[Matching Engine]
B[Market Data Builder]
C[Sequence Numbered Feed Stream]
D[Snapshot Store]
subgraph Fanout["Feed Distribution Tier"]
E[Gateway 1]
F[Gateway 2]
G[Gateway 3]
end
subgraph Brokers["Brokers"]
H[Broker A]
I[Broker B]
J[Broker C]
K[Broker D]
end
A --> B --> C
B --> D
C --> E
C --> F
C --> G
E --> H
E --> I
F --> J
G --> K
D --> H
D --> I
D --> J
D --> K
Important design details:
- feed messages should be sequence numbered
- brokers should detect gaps
- missed data should be repaired by snapshot plus replay
- slow consumers should be isolated from fast ones
The most important rule is simple: the matching engine must never wait for a slow broker feed.
Feed fan-out is a hidden failure path
Many weak designs keep matching correct in isolation but accidentally let feed backpressure leak into the core. Recovery, replay, and slow-consumer handling should stay in the distribution tier, not in the matcher thread.
Scalability: How the System Handles Huge Volume
Now let us address the most important scale question: how does this system handle too much volume?
1. Partition by Symbol
The exchange scales horizontally by distributing symbols across partitions.
That means:
- each partition has its own sequencer and matcher
- symbols are processed independently
- one hot stock does not automatically block unrelated stocks
2. Keep One Active Owner Per Symbol
We do not scale one book using many active writers. That creates coordination overhead and ambiguity.
For one symbol, single-writer ownership is usually the better trade-off.
3. Isolate Hot Symbols
Not all stocks are equally active.
At market open or during major events:
- RELIANCE may get heavy order flow
- INFY may see extreme cancel or replace volume
- smaller symbols may remain relatively quiet
So capacity planning should focus on:
- hot-symbol concentration
- cancel and replace ratio
- outbound market-data amplification
4. Scale the Feed Tier Separately
Inbound order traffic and outbound market-data traffic are different scaling problems.
The matching engine must stay small. The fan-out tier can scale independently with more gateways, more replay capacity, and more snapshot infrastructure.
5. Keep Replay and Recovery Off the Matcher
After outages or disconnect storms, many brokers may reconnect at once. That recovery load should hit the snapshot and replay tier, not the live matching thread.
Why the System Is So Fast
The exchange is fast because it refuses to do expensive things in the hot path.
The main reasons are:
- one active writer per book or partition
- in-memory order books
- local caches for reference data
- no database round-trip to decide a match
- no distributed transaction in the matching loop
- market-data fan-out moved off the matcher
- bounded, predictable handoff between stages
The hot path should look like this:
- gateway
- lightweight validation
- venue controls
- sequencer
- single-writer matcher
- commit boundary
- ack
That is why the system can stay fast even when overall platform complexity is large.
If we turned this into a chain like:
- order service
- risk service
- book service
- trade service
we would add network hops, serialization, retries, and latency variance right where determinism matters most.
Related reading:
- Shared Memory vs Message Passing in Java Applications
- Contention Collapse Under Load in Java Services
- Bounded Queues and Backpressure in Java Systems
Failure and Recovery
A stock exchange design is incomplete without failure handling.
The most important scenarios are:
1. One Symbol Becomes Extremely Hot
If INFY suddenly becomes 20 times hotter than the average stock:
- only its owning partition should be stressed
- other symbols should keep moving normally
- venue controls should still be able to throttle or halt if required
2. A Broker Feed Falls Behind
If one broker cannot keep up with tick processing:
- that broker’s feed path should lag or disconnect
- the matcher should continue normally
- recovery should happen through snapshot plus replay
3. Gateway Reconnect Storm
If many sessions reconnect after a network event:
- session recovery should be explicit
- reconnect handling should be rate limited
- cancel-on-disconnect rules should be deterministic
4. Primary Matcher Failure
If the active owner of RELIANCE fails:
- the standby should know the last committed sequence
- partition ownership should move explicitly
- the new primary should continue from the next safe sequence number
The critical recovery rule is: there must be one unambiguous owner of the next event.
Active-active sounds safer than it is
For one order book, multi-writer ownership usually increases coordination cost, ambiguity, and split-brain risk. In low-latency exchange systems, single-writer ownership is often the simpler and safer default.
That is also why active-active multi-writer ownership for one book is usually the wrong trade-off in a low-latency exchange.
Final Design Summary
If I had to explain the design in one short flow, I would say this:
- brokers connect through gateways
- the exchange applies local venue controls
- orders are routed by symbol to one active partition
- the sequencer defines authoritative order
- the matching engine updates the in-memory book
- the journal records committed truth
- execution reports go back to brokers
- market-data builders and feed gateways distribute the committed feed
- downstream systems consume the same truth asynchronously
That is the clean mental model a learner should keep.
If that model is clear, the detailed implementation choices become much easier to understand.
Key Takeaways
- A stock exchange is a deterministic sequencing problem under latency pressure.
- The design should begin with requirements and invariants, not with random infrastructure boxes.
- One active owner per symbol or partition is the simplest scalable model.
- Matching should stay memory-resident, single-writer, and isolated from slow dependencies.
- Tick-by-tick market data should be derived incrementally from committed events.
- The system scales by partitioning symbols, isolating hot books, and scaling market-data fan-out separately.