Circuit Breaker Pattern
In a distributed system, a slow dependency is more dangerous than a dead one. A dead dependency fails immediately. A slow one holds threads open, exhausts connection pools, and causes the failure to propagate upstream — taking down services that had nothing to do with the original fault.
The circuit breaker pattern addresses this by detecting sustained failures and stopping calls to a failing dependency entirely, giving it time to recover while protecting everything upstream.
The Problem: Cascading Failure
Consider this service chain:
User Request → Service A → Service B → Service C → Database
Service C's database starts struggling. Queries that normally take 10ms are now taking 30 seconds.
- Service B sends requests to C. They hang for 30 seconds waiting for a response.
- Service B has a thread pool of 50 threads. Within seconds, all 50 are blocked on C.
- Service A sends requests to B. B's thread pool is exhausted — requests queue up.
- A's threads start piling up waiting for B.
- The entire system freezes. All user requests time out.
One slow database brought down three services. The failure propagated upstream because each service kept sending calls into an already overloaded dependency and then blocked while waiting on it. The extra calls made recovery harder, not easier.
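The arithmetic behind "within seconds" can be sketched with Little's law. This is a rough illustration using the figures from the scenario above; the class and method names are hypothetical, and the arrival rate of 100 requests per second in the usage note is an assumption, not a figure from the scenario.

```java
/** Back-of-the-envelope capacity math for a blocking thread pool. */
final class PoolCapacity {
    /** Little's law: sustained throughput cannot exceed threads / latency. */
    static double maxRequestsPerSecond(int threads, double latencySeconds) {
        return threads / latencySeconds;
    }

    /**
     * Seconds until every thread is blocked, assuming each call hangs
     * for longer than this time (so no thread is released meanwhile).
     */
    static double secondsUntilExhausted(int threads, double arrivalsPerSecond) {
        return threads / arrivalsPerSecond;
    }
}
```

With 50 threads, `maxRequestsPerSecond(50, 0.010)` gives roughly 5,000 requests per second at 10ms latency, while `maxRequestsPerSecond(50, 30)` gives fewer than 2. At an assumed 100 incoming requests per second, `secondsUntilExhausted(50, 100)` shows the pool filling in half a second.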
The Solution: Three States
A circuit breaker wraps every call to an external dependency and tracks its health. It operates as a state machine with three states.
Closed — normal operation
All requests pass through to the dependency. The breaker monitors the failure rate and the slow-call rate over a rolling window. When either rate reaches a configured threshold, the breaker trips and moves to Open.
Threshold example: trip if at least 50% of the last 100 calls fail, or if at least 50% take longer than 2 seconds
Open — failing fast
All requests are immediately rejected without making a network call. Callers receive an error in microseconds rather than waiting 30 seconds for a timeout. Two things happen:
- Calling services free up their threads immediately
- The failing dependency receives no new traffic and has a chance to recover
After a configured wait period (e.g., 60 seconds), the breaker moves to Half-Open.
Half-Open — probing
A small number of requests are allowed through as probes. If they succeed, the dependency has recovered and the breaker closes. If they fail, the dependency is still unhealthy: the breaker opens again and the wait period starts over.
State transitions:
CLOSED    ──(failure or slow-call rate ≥ threshold)──→ OPEN
OPEN      ──(wait period expires)────────────────────→ HALF-OPEN
HALF-OPEN ──(probes succeed)─────────────────────────→ CLOSED
HALF-OPEN ──(probes fail)────────────────────────────→ OPEN
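The three states and four transitions above can be sketched as a small state machine. This is a minimal, single-threaded illustration, not how any particular library implements it. The class name `SimpleCircuitBreaker`, the injected clock, and the choice of a single probe call in Half-Open are all assumptions made for brevity; slow-call tracking is left out.

```java
import java.util.Arrays;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Minimal count-based breaker. Single-threaded sketch, not production code. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    /** Thrown when a call is rejected without reaching the dependency. */
    static final class OpenCircuitException extends RuntimeException {
        OpenCircuitException() { super("circuit open: call rejected"); }
    }

    private final boolean[] window;   // ring buffer of recent outcomes, true = failure
    private final double threshold;   // e.g. 0.5 trips at >= 50% failures
    private final long waitMillis;    // time to stay OPEN before probing
    private final LongSupplier clock; // injected so callers can control time

    private State state = State.CLOSED;
    private int next, recorded, failures;
    private long openedAt;

    SimpleCircuitBreaker(int windowSize, double threshold, long waitMillis, LongSupplier clock) {
        this.window = new boolean[windowSize];
        this.threshold = threshold;
        this.waitMillis = waitMillis;
        this.clock = clock;
    }

    State state() { return state; }

    <T> T call(Supplier<T> action) {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < waitMillis) {
                throw new OpenCircuitException(); // fail fast, dependency is not called
            }
            state = State.HALF_OPEN;              // wait period over: allow one probe
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            throw e;
        }
    }

    private void onSuccess() {
        if (state == State.HALF_OPEN) {
            Arrays.fill(window, false);           // probe succeeded: start from a clean window
            next = recorded = failures = 0;
            state = State.CLOSED;
        } else {
            record(false);
        }
    }

    private void onFailure() {
        if (state == State.HALF_OPEN) {
            trip();                               // probe failed: reopen and restart the wait
        } else {
            record(true);
        }
    }

    private void record(boolean failed) {
        if (recorded == window.length && window[next]) failures--; // evict oldest outcome
        window[next] = failed;
        if (failed) failures++;
        next = (next + 1) % window.length;
        if (recorded < window.length) recorded++;
        // Evaluate only once the window is full, so one early failure cannot trip it.
        if (recorded == window.length && failures >= threshold * window.length) trip();
    }

    private void trip() {
        state = State.OPEN;
        openedAt = clock.getAsLong();
    }
}
```

A caller would wrap each dependency call, for example `breaker.call(() -> client.fetch(id))`, where `client.fetch` stands in for any remote call. Injecting the clock keeps the wait period testable without sleeping.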
Fallback behaviour
An open circuit does not have to return a bare error. Common fallback strategies:
- Return cached or stale data from the last successful response
- Return a safe default ("feature unavailable")
- Serve a degraded version of the feature
- Enqueue the request for later processing
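The first two strategies in the list above (stale data, then a safe default) might be combined as follows. This is a sketch under assumptions: `CachedFallback` is a hypothetical helper, and it treats any `RuntimeException` from the protected call, including an open-circuit rejection, as a reason to fall back.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Serves the last successful response, or a safe default, when the call fails. */
final class CachedFallback<T> {
    private final AtomicReference<T> lastGood = new AtomicReference<>();
    private final T safeDefault;

    CachedFallback(T safeDefault) {
        this.safeDefault = safeDefault;
    }

    T get(Supplier<T> protectedCall) {
        try {
            T fresh = protectedCall.get();
            lastGood.set(fresh);               // remember the latest good response
            return fresh;
        } catch (RuntimeException rejectedOrFailed) {
            T stale = lastGood.get();          // may be null if nothing has succeeded yet
            return stale != null ? stale : safeDefault;
        }
    }
}
```

Whether stale data is acceptable depends on the feature: a recommendations list can usually tolerate it, an account balance usually cannot.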
Real-World Usage
Netflix Hystrix was the library that proved this pattern at scale. Every inter-service call at Netflix was wrapped in a Hystrix command. At peak, Netflix was making billions of circuit-breaker-protected calls per day. Hystrix is now in maintenance mode, succeeded by more modern alternatives.
Resilience4j is the current standard for JVM services. Key improvement over Hystrix: it trips on both failure rate and slow-call rate independently.
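    import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
    import java.time.Duration;

    CircuitBreakerConfig config = CircuitBreakerConfig.custom()
        .failureRateThreshold(50)                            // trip when >= 50% of calls fail
        .slowCallRateThreshold(50)                           // or when >= 50% of calls are slow
        .slowCallDurationThreshold(Duration.ofSeconds(2))    // a call over 2s counts as slow
        .waitDurationInOpenState(Duration.ofSeconds(60))     // stay Open for 60s before probing
        .permittedNumberOfCallsInHalfOpenState(5)            // allow 5 probe calls in Half-Open
        .slidingWindowSize(100)                              // evaluate the last 100 calls
        .build();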
Istio implements circuit breaking at the infrastructure layer with no application code changes required:
    trafficPolicy:
      outlierDetection:
        consecutive5xxErrors: 5
        interval: 10s
        baseEjectionTime: 30s
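When a pod returns five consecutive 5xx errors, Istio ejects it from the load-balancing pool for at least 30 seconds; the ejection time grows each time the same pod is ejected again. The interval of 10 seconds is how often the ejection analysis runs, not a counting window. Because the policy lives in the mesh, it can be applied to any service regardless of language or framework.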
The Hard Part: Composition and Observability
This section covers how circuit breakers interact with other resilience patterns in production.
Interaction with retries
Circuit breakers and retries must be composed in the right order:
Request → Retry → Circuit Breaker → Timeout → Service Call
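In this ordering the retry wraps the circuit breaker, so every attempt passes through the breaker and counts toward its failure rate. If the breaker is open, the retry should not fire: it would waste retry budget on calls that will immediately fail. Configure the retry to treat the breaker's rejection (CallNotPermittedException in Resilience4j) as non-retryable. Retrying against an open circuit is pointless and burns resources.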
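The ordering in the diagram, with the retry on the outside and the breaker inside it, might look like this in code. It is a hedged sketch: `BreakerOpenException` and `BreakerAwareRetry` are hypothetical names standing in for whatever rejection type and retry helper a real library provides, and backoff between attempts is omitted for brevity.

```java
import java.util.function.Supplier;

/** Hypothetical marker for "the breaker rejected this call without trying it". */
final class BreakerOpenException extends RuntimeException {
    BreakerOpenException() { super("circuit open"); }
}

final class BreakerAwareRetry {
    /**
     * Retries ordinary failures up to maxAttempts, but gives up immediately
     * when the breaker is open, since further attempts would be rejected too.
     */
    static <T> T withRetry(int maxAttempts, Supplier<T> breakerProtectedCall) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return breakerProtectedCall.get();
            } catch (BreakerOpenException open) {
                throw open;   // non-retryable: do not spend retry budget here
            } catch (RuntimeException e) {
                last = e;     // ordinary failure: try again if attempts remain
            }
        }
        throw last;
    }
}
```

Because each attempt goes through the breaker, repeated failures inside one retry loop still push the breaker toward Open, and the loop stops as soon as it trips.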
Bulkhead pattern
The circuit breaker stops calls to a failing dependency. The bulkhead pattern prevents a failing dependency from consuming all threads in the first place by giving each dependency its own isolated thread pool.
Without bulkheads: one slow dependency fills the shared pool, blocking all other dependencies. With bulkheads: each dependency is allocated a fixed pool. A slow dependency blocks only its own threads; others are unaffected.
Combined: the circuit breaker trips and stops calls; the bulkhead limits the blast radius before the breaker trips.
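A bulkhead can be sketched with a semaphore per dependency instead of a dedicated thread pool. This is a lighter-weight variant of the isolation described above, swapped in to keep the example short; `SemaphoreBulkhead` and `BulkheadFullException` are hypothetical names.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Caps the number of concurrent calls to one dependency. */
final class SemaphoreBulkhead {
    static final class BulkheadFullException extends RuntimeException {
        BulkheadFullException() { super("bulkhead full: call rejected"); }
    }

    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T call(Supplier<T> action) {
        if (!permits.tryAcquire()) {
            throw new BulkheadFullException(); // reject instead of queueing behind slow calls
        }
        try {
            return action.get();
        } finally {
            permits.release();                 // always return the permit, even on failure
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Giving each dependency its own instance means a slow dependency can hold at most its own permits; calls to other dependencies draw from separate instances and are unaffected.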
Adaptive thresholds
A fixed 50% failure threshold is blunt because it ignores what is normal for a given dependency. On a high-traffic dependency, 50 failures per second may be normal noise; on a low-traffic one, 5 failures may warrant concern. Adaptive circuit breakers set thresholds dynamically from a historical baseline, using statistical methods such as EWMA (exponentially weighted moving average).
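One way to picture an EWMA-based threshold is shown below. This is an illustrative sketch, not a description of any specific library: the class `EwmaFailureBaseline`, the multiplier rule ("trip when the rate exceeds a multiple of the baseline"), and the floor value are all assumptions.

```java
/** Tracks a smoothed failure-rate baseline and derives a trip threshold from it. */
final class EwmaFailureBaseline {
    private final double alpha;       // smoothing factor in (0, 1]; higher reacts faster
    private final double multiplier;  // how far above baseline counts as anomalous
    private final double floor;       // minimum threshold, so a near-zero baseline is not hair-trigger
    private double baseline;
    private boolean seeded;

    EwmaFailureBaseline(double alpha, double multiplier, double floor) {
        this.alpha = alpha;
        this.multiplier = multiplier;
        this.floor = floor;
    }

    /** Folds one observed failure rate (0.0 to 1.0) into the baseline. */
    void update(double failureRate) {
        baseline = seeded ? alpha * failureRate + (1 - alpha) * baseline : failureRate;
        seeded = true;
    }

    double baseline() { return baseline; }

    double threshold() { return Math.max(floor, baseline * multiplier); }

    boolean isAnomalous(double failureRate) { return failureRate > threshold(); }
}
```

With `alpha = 0.5`, observing rates of 0.10 and then 0.20 yields a baseline of about 0.15; with a multiplier of 3, the breaker would treat rates above roughly 0.45 as anomalous for this dependency.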
Netflix's Concurrency Limits library takes this further — it applies TCP congestion control algorithms to limit in-flight requests based on observed latency, automatically backing off before the circuit trips.
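The idea of borrowing from TCP congestion control can be illustrated with additive-increase, multiplicative-decrease (AIMD). The sketch below is not the Concurrency Limits library's API; `AimdConcurrencyLimit`, its parameters, and the fixed latency target are hypothetical, and the class is single-threaded for clarity.

```java
/** Adjusts the allowed number of in-flight requests based on observed latency. */
final class AimdConcurrencyLimit {
    private final int minLimit;
    private final int maxLimit;
    private final long latencyTargetMillis;
    private final double backoffRatio;   // e.g. 0.5 halves the limit on trouble
    private int limit;
    private int inFlight;

    AimdConcurrencyLimit(int initialLimit, int minLimit, int maxLimit,
                         long latencyTargetMillis, double backoffRatio) {
        this.limit = initialLimit;
        this.minLimit = minLimit;
        this.maxLimit = maxLimit;
        this.latencyTargetMillis = latencyTargetMillis;
        this.backoffRatio = backoffRatio;
    }

    /** Returns false when the call should be rejected to shed load. */
    boolean tryAcquire() {
        if (inFlight >= limit) return false;
        inFlight++;
        return true;
    }

    /** Must be called once for every successful tryAcquire. */
    void onComplete(long latencyMillis, boolean failed) {
        inFlight--;
        if (failed || latencyMillis > latencyTargetMillis) {
            limit = Math.max(minLimit, (int) (limit * backoffRatio)); // multiplicative decrease
        } else {
            limit = Math.min(maxLimit, limit + 1);                    // additive increase
        }
    }

    int limit() { return limit; }
}
```

The limit creeps up while the dependency is fast and drops sharply as soon as latency rises, which sheds load earlier than a failure-rate threshold would.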
Circuit breakers as sensors
A tripped circuit breaker is a signal, not just a safety mechanism. Which services are unhealthy, how often they trip, and how long they stay open are all meaningful operational metrics. Tracking state transitions — open/close events, failure rates per breaker, time spent in open state — provides an accurate picture of system health that complements request latency and error rate dashboards.
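Recording those transitions needs very little code. The sketch below assumes the breaker can invoke a callback on every state change and pass a timestamp; `BreakerTransitionMetrics` and its methods are hypothetical, and a real system would export these values to its metrics backend rather than keep them in memory.

```java
import java.util.EnumMap;
import java.util.Map;

/** Counts state entries and accumulates time spent in the OPEN state. */
final class BreakerTransitionMetrics {
    enum BreakerState { CLOSED, OPEN, HALF_OPEN }

    private final Map<BreakerState, Integer> entered = new EnumMap<>(BreakerState.class);
    private long openSince = -1;     // -1 means "not currently open"
    private long totalOpenMillis;

    void onTransition(BreakerState from, BreakerState to, long nowMillis) {
        entered.merge(to, 1, Integer::sum);
        if (to == BreakerState.OPEN) {
            openSince = nowMillis;
        } else if (from == BreakerState.OPEN && openSince >= 0) {
            totalOpenMillis += nowMillis - openSince;
            openSince = -1;
        }
    }

    int timesEntered(BreakerState state) {
        return entered.getOrDefault(state, 0);
    }

    /** Time spent in completed OPEN periods; an ongoing one is not yet included. */
    long totalOpenMillis() {
        return totalOpenMillis;
    }
}
```

A breaker that enters OPEN many times per hour, or stays there for long stretches, points at a dependency worth investigating even if aggregate error rates look acceptable.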