Skip to content
← Back to Notes

Split Brain & Quorum

Split brain occurs when a network partition divides a cluster into isolated groups that can no longer communicate. Each group believes it is the entire cluster and proceeds to elect leaders and accept writes independently. When the partition heals, both sides have conflicting write histories — a state that may require manual intervention to resolve.


The Problem

A three-node database cluster: one primary (P) and two replicas (R1, R2). A network switch fails and splits the cluster:

  • Partition A: P and R1 can communicate. Everything looks normal.
  • Partition B: R2 is isolated and cannot reach P or R1.

From R2's perspective, the primary has gone silent. Its health check fires, decides the primary is dead, and promotes itself. Now there are two primaries:

Partition A:  P  ←→  R1   → accepting writes from clients routed here
Partition B:  R2             → accepting writes from clients routed here

Both are confidently accepting transactions. Neither knows about the other. When the partition heals, both sides carry conflicting write histories with no reliable way to determine which is correct. In financial systems, this means debits and credits that conflict across the same accounts.


The Solution: Quorum

If a partition requires majority agreement before doing anything important — electing a leader, committing a write — then only one partition can ever proceed. You cannot have two groups both holding a majority.

In a 5-node cluster, majority = 3. Two groups of 3 would require 6 nodes. At most one group can have a majority at any time.

Quorum = ⌊n/2⌋ + 1
Cluster size Quorum Failures tolerated
3 2 1
5 3 2
7 4 3

This is why distributed systems run odd numbers of nodes. A 4-node cluster has the same failure tolerance as a 3-node cluster (both tolerate 1 failure) but costs more.

The minority partition does not elect a leader or accept writes — it sits idle until communication is restored.


Read and Write Quorums

Quorum extends to individual read and write operations in replicated databases:

  • n = total replicas
  • W = write quorum (write must succeed on W replicas before returning committed)
  • R = read quorum (read from R replicas, take the most recent)

The constraint W + R > n ensures at least one replica in any read set was also in the write set, guaranteeing you always read the latest committed write.

Strategy W R n=3 Tradeoff
Strong write 3 1 W=3, R=1 Durable writes, fast reads
Strong read 1 3 W=1, R=3 Fast writes, always fresh reads
Balanced 2 2 W=2, R=2 Tolerates 1 failure on each side
Eventual 1 1 W=1, R=1 Fastest, no consistency guarantee

Cassandra exposes this directly: CONSISTENCY QUORUM sets W and R to ⌊n/2⌋ + 1. CONSISTENCY ONE is faster but offers no split-brain protection.


Consensus Algorithms

Quorum is the concept. Raft and Paxos are the algorithms that implement it.

Raft

Raft divides consensus into three sub-problems: leader election, log replication, and safety.

Leader election: If a follower receives no heartbeat from a leader within a randomised timeout (e.g., 150–300ms), it becomes a candidate and requests votes. A candidate wins if it receives votes from a majority — quorum. Randomised timeouts prevent two nodes from becoming candidates simultaneously.

Log replication: The leader appends each write to its log and sends the entry to all followers. Once a majority acknowledge the entry, the leader commits it and notifies followers. Followers commit when informed by the leader.

Term numbers: Each election increments a monotonically increasing term counter. Nodes reject messages from leaders with a lower term, preventing stale leaders from interfering after a partition heals.

Paxos

The original consensus algorithm, mathematically proven correct. More general than Raft — it can reach consensus on a single value without requiring a persistent leader. Notoriously difficult to implement correctly; most production systems use Multi-Paxos, which is structurally similar to Raft. Raft was designed as a more understandable alternative.


Real-World Usage

etcd stores all Kubernetes cluster state and uses Raft internally. A 3-node etcd cluster tolerates 1 node failure. Losing quorum takes down the Kubernetes control plane — which is why etcd high availability is critical in production.

ZooKeeper uses a Paxos-like protocol (Zab). A 3-node ZooKeeper ensemble requires 2 nodes to be available for any write. Kafka (older versions), Hadoop, and HBase use ZooKeeper for distributed coordination.

MongoDB replica sets use a majority write concern ({w: "majority"}). Without it, a primary can acknowledge writes that are later lost during failover. With it, a write is only committed once a majority of nodes have confirmed it.


The Hard Part: Stale Leaders and CAP

This section covers edge cases that quorum alone does not prevent.

The stale leader problem

Quorum prevents two nodes from both believing they are current leaders at election time. It does not prevent a leader from acting on stale information after it has been replaced.

Scenario: Leader L1 is paused by a garbage collection pause. The cluster times out, runs a new election, and elects L2. L1's GC pause ends. L1 still believes it is the leader and attempts to write to the database.

Fencing tokens address this: when a lock service grants leadership, it issues a monotonically increasing token. Every write to shared storage must include the token. Storage rejects any write carrying a token lower than the highest it has seen.

L1 receives token 42
Cluster elects L2 with token 43 (L1 evicted)
L1 resumes, attempts write with token 42 → REJECTED
L2 writes with token 43                  → ACCEPTED

STONITH

In systems where even a single stale write is unacceptable, fencing tokens depend on the storage layer correctly rejecting the stale write. A stronger guarantee: physically power off the stale node via out-of-band hardware management (IPMI, iDRAC). A powered-off node cannot write. This is called STONITH (Shoot The Other Node In The Head) and is used in financial and medical systems.

CAP theorem

Network partitions cannot be avoided in any real distributed system. When a partition occurs, a system must choose:

  • CP (consistency over availability): the minority partition refuses requests. Quorum-based systems (etcd, ZooKeeper, Raft clusters) are CP. You lose availability in the minority partition.
  • AP (availability over consistency): both partitions continue operating, potentially with divergent data. Cassandra with CONSISTENCY ONE, DynamoDB with eventual consistency are AP.

The choice depends entirely on your failure semantics: is returning an error worse, or is returning stale or conflicting data worse?

Flexible quorums

Standard quorum requires any majority. More flexible quorum systems define explicit quorum sets — specific subsets that count as a quorum. For example: "any 2 of {A, B, C} OR any node from {D, E}, as long as every possible write set intersects every possible read set."

This allows read-heavy workloads to use small read quorums (cheap) at the cost of larger write quorums, or geographic quorums that can be satisfied within a single datacenter during normal operation while preserving global consistency guarantees.