The Thundering Herd Problem

08 May 2026·6 min read

The thundering herd is a failure mode where a large number of servers simultaneously lose access to cached data and all attempt to regenerate it from the source at the same moment. Each server behaves correctly in isolation. The problem is the synchronization — every server acts at the exact same time, producing a spike that the underlying database was never designed to absorb.

The Problem

A typical caching setup:

1,000 application servers behind a load balancer
A frequently read piece of data cached in Redis with a 5-minute TTL
All servers populated their cache around the same time

When the TTL expires:

t=0s:  Cache key expires
t=0s:  All 1,000 servers simultaneously get a cache miss
t=0s:  All 1,000 servers fire the same expensive database query
       Database handles ~50 queries/sec in steady state
       Database receives 1,000 queries in milliseconds
t=2s:  Database buckles, queries start timing out
t=3s:  Servers retry failed queries → load increases further
t=10s: Database falls over

The cache's purpose is to protect the database, yet its synchronized expiry causes the exact surge it was designed to prevent. Because TTLs are fixed, this repeats on a schedule: every 5 minutes in this case.

The core issue: the cache expiry event is synchronized because all servers wrote the entry at roughly the same time with the same TTL. When one misses, all miss.

Solutions

1. TTL Jitter

The simplest fix. Add random jitter to the TTL so no two servers expire at exactly the same time:

base_ttl = 300
jitter    = random.randint(0, 60)
cache.set(key, value, ex=base_ttl + jitter)

Expiries are now spread across a 60-second window. Instead of 1,000 simultaneous misses, you get roughly 17 per second on average (1,000 / 60), well within normal database capacity.

Jitter does not prevent misses. It converts a synchronized spike into a steady trickle.
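A quick way to see the effect is to count how many entries expire in each one-second bucket, with and without jitter. The sketch below is a simulation only: the function name expiry_histogram, the fixed seed, and the assumption that all 1,000 entries are written at the same instant are illustrative choices, not part of any cache library.

```python
import random
from collections import Counter


def expiry_histogram(n_entries=1000, base_ttl=300, max_jitter=60, seed=42):
    """Count how many entries expire at each second after a simultaneous write."""
    rng = random.Random(seed)  # fixed seed so the simulation is repeatable
    buckets = Counter()
    for _ in range(n_entries):
        # randint is inclusive on both ends, so jitter is 0..max_jitter seconds
        ttl = base_ttl + rng.randint(0, max_jitter)
        buckets[ttl] += 1
    return buckets


no_jitter = expiry_histogram(max_jitter=0)
with_jitter = expiry_histogram(max_jitter=60)

print("peak expiries in one second, no jitter:  ", max(no_jitter.values()))
print("peak expiries in one second, with jitter:", max(with_jitter.values()))
print("expiry window with jitter (seconds):     ", min(with_jitter), "to", max(with_jitter))
```

Without jitter every entry lands in the same one-second bucket. With jitter the peak bucket is a small fraction of the total, although random clumping means the busiest second is somewhat above the average rate.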

2. Mutex Lock

When a cache miss occurs, only one server should query the database. The rest should wait for that result.

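import time
import uuid

def get(key, retries=100):
    value = cache.get(key)
    if value is not None:              # a falsy value (0, "") is still a hit
        return value

    lock_key, token = f"lock:{key}", str(uuid.uuid4())
    if redis.set(lock_key, token, nx=True, ex=5):
        try:
            value = db.query(...)
            cache.set(key, value, ex=300)
            return value
        finally:
            # Release only our own lock; it may have expired and been re-acquired.
            # Assumes decode_responses=True; use a Lua script to make this atomic.
            if redis.get(lock_key) == token:
                redis.delete(lock_key)

    if retries == 0:
        raise TimeoutError(f"cache fill for {key} did not complete")
    time.sleep(0.05)                   # back off, then re-check the cache
    return get(key, retries - 1)       # usually hits the warmed cache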

One server wins the lock and fetches. The rest wait and pick up the result from cache on retry. The time.sleep polling loop can be replaced with a blocking Redis operation (such as BLPOP) so that waiters are woken when the value is ready instead of repeatedly re-checking.

Drawback: all waiting servers are held up for the duration of the database query. If the lock holder crashes, the lock is released only when its TTL expires (5 seconds here), and every waiter stalls until then.

3. Stale-While-Revalidate

Serve the stale cached value immediately to all callers, while triggering a single background refresh:

def get(key):
    entry = cache.get_with_meta(key)  # {value, expires_at, revalidating}

    if entry is None:
        return fetch_and_cache(key)   # cold start — must block once

    if entry.expires_at < now() and not entry.revalidating:
        cache.mark_revalidating(key)
        background_task(lambda: refresh(key))

    return entry.value   # always returns immediately, stale if needed

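No caller waits on the database after the initial cold start. Users may receive data that is a few seconds stale, which is acceptable for most non-critical reads. Two details matter in practice: the entry must stay in the cache past expires_at (a logical expiry stored alongside the value, not the cache's own eviction TTL), and mark_revalidating must be atomic so that only one refresh starts. These are the same semantics as HTTP's Cache-Control: stale-while-revalidate.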
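To make the flow concrete, here is a minimal single-process sketch of the same idea using an in-memory dictionary and a background thread. The class name SWRCache and the loader callback are hypothetical, and the sketch makes simplifying assumptions: it does not coalesce concurrent cold-start loads, and a shared cache such as Redis would need an atomic flag in place of the in-process lock.

```python
import threading
import time


class SWRCache:
    """In-memory stale-while-revalidate cache (single process, illustration only)."""

    def __init__(self, loader, ttl):
        self._loader = loader        # hypothetical callback: key -> fresh value
        self._ttl = ttl
        self._entries = {}           # key -> (value, expires_at)
        self._refreshing = set()     # keys with a refresh currently in flight
        self._lock = threading.Lock()

    def _refresh(self, key):
        try:
            value = self._loader(key)
            with self._lock:
                self._entries[key] = (value, time.monotonic() + self._ttl)
        finally:
            # Clear the flag even if the loader raised, so a later call can retry.
            with self._lock:
                self._refreshing.discard(key)

    def get(self, key):
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None:
                value, expires_at = entry
                stale = time.monotonic() >= expires_at
                # Check and mark under one lock so only one refresh is started.
                if stale and key not in self._refreshing:
                    self._refreshing.add(key)
                    threading.Thread(
                        target=self._refresh, args=(key,), daemon=True
                    ).start()
                return value         # returned immediately, stale or not

        # Cold start: nothing to serve, so this caller blocks on the loader.
        value = self._loader(key)
        with self._lock:
            self._entries[key] = (value, time.monotonic() + self._ttl)
        return value
```

The check of the stale flag and the marking of the key as refreshing happen under the same lock, which is what keeps the number of background refreshes at one per expiry.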

4. Probabilistic Early Expiration (XFetch)

Rather than waiting for the TTL to expire, proactively refresh the cache before expiry — with a probability that increases as expiry approaches.

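import math
import random
import time

def get(key, ttl=300, beta=1.0):
    entry = cache.get_with_meta(key)   # (value, ttl_remaining, delta) or None
    if entry is not None:
        value, ttl_remaining, delta = entry
        # delta = seconds the last recompute took
        # 1 - random() lies in (0, 1], so log() never receives 0
        if -delta * beta * math.log(1.0 - random.random()) <= ttl_remaining:
            return value               # still valid and no early refresh drawn

    start = time.monotonic()
    value = recompute(key)
    delta = time.monotonic() - start
    cache.set_with_meta(key, value, delta, ex=ttl)   # hypothetical: stores delta too
    return value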

math.log(random.random()) is negative, so the left-hand side -delta * beta * math.log(...) is a positive, exponentially distributed random number with mean delta * beta. As ttl_remaining decreases toward zero, the condition becomes increasingly likely to trigger. The first server to draw a qualifying number refreshes the cache early. All other servers continue serving the still-warm value.

The delta term is adaptive: if the last recompute took 5 seconds, the algorithm starts probing for early refresh well before the 5-second boundary. Fast recomputes refresh only slightly early; slow recomputes refresh much earlier to avoid serving expired data.

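No lock is needed, and because the value is normally refreshed before it expires, stale data is not served. The scheme is probabilistic rather than exact: usually only one or a few servers refresh early, but occasional duplicate recomputes are possible.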
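The refresh condition has a closed form that makes the behavior easier to reason about. For U uniform on (0, 1), the event -delta * beta * ln(U) > ttl_remaining is the same as U < exp(-ttl_remaining / (delta * beta)), so each individual request triggers an early refresh with probability exp(-ttl_remaining / (delta * beta)). The sketch below tabulates that probability; the helper names and the 5-second delta are assumptions chosen for illustration.

```python
import math
import random


def early_refresh_probability(ttl_remaining, delta, beta=1.0):
    """Probability that a single request triggers an early refresh."""
    return math.exp(-ttl_remaining / (delta * beta))


def should_refresh(ttl_remaining, delta, beta=1.0, rng=random):
    """One random draw of the refresh condition."""
    # 1 - random() lies in (0, 1], so log() never receives 0.
    return -delta * beta * math.log(1.0 - rng.random()) > ttl_remaining


delta = 5.0  # assumed: the last recompute took 5 seconds
for ttl_remaining in (60, 20, 10, 5, 1, 0):
    p = early_refresh_probability(ttl_remaining, delta)
    print(f"ttl_remaining={ttl_remaining:>2}s  P(refresh per request)={p:.4f}")
```

With delta = 5 and beta = 1, a request arriving 5 seconds before expiry refreshes with probability e^-1, about 0.37, while one arriving a minute before expiry almost never does. Raising beta shifts refreshes earlier; lowering it shifts them closer to expiry.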

5. Request Coalescing at the Proxy

Deduplicate in-flight cache-miss requests at the proxy layer rather than in application code:

1,000 requests arrive for /api/recommendations → cache miss
Proxy queues 999 requests
Proxy forwards 1 request to the origin server
Origin queries the database, returns response
Proxy serves the single response to all 1,000 queued requests

No application code changes. The database sees one query instead of 1,000. Nginx implements this with proxy_cache_lock. Varnish coalesces concurrent misses natively; its grace mode additionally lets it serve a stale object while the refresh is in flight.
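The mechanism itself is small. The following asyncio sketch shows the core idea of sharing one in-flight fetch among all concurrent callers for the same key. It is an illustration of the technique, not how Nginx or Varnish are implemented, and the Coalescer class and fetch callback are hypothetical names.

```python
import asyncio


class Coalescer:
    """Share one in-flight fetch per key among all concurrent callers."""

    def __init__(self, fetch):
        self._fetch = fetch      # hypothetical async callback: key -> response
        self._in_flight = {}     # key -> asyncio.Task for the pending fetch

    async def get(self, key):
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key: start the fetch and publish the task.
            task = asyncio.ensure_future(self._fetch(key))
            self._in_flight[key] = task
            # Forget the task once it finishes so later requests fetch again.
            task.add_done_callback(lambda _: self._in_flight.pop(key, None))
        # Every caller, first or not, waits on the same task.
        return await task


async def demo():
    origin_calls = 0

    async def fetch(key):
        nonlocal origin_calls
        origin_calls += 1
        await asyncio.sleep(0.05)  # stand-in for a slow origin request
        return f"response for {key}"

    proxy = Coalescer(fetch)
    results = await asyncio.gather(
        *(proxy.get("/api/recommendations") for _ in range(1000))
    )
    return origin_calls, len(set(results))


print(asyncio.run(demo()))  # prints (1, 1): one origin call, one shared response
```

This only deduplicates requests that overlap in time. A real proxy pairs it with a cache so that requests arriving after the fetch completes are served without contacting the origin.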

Real-World Usage

Facebook's Memcache uses a lease system: when a server gets a cache miss, Memcache issues it a 64-bit lease token bound to that key and tells other servers that miss on the same key shortly afterward to wait briefly and retry. Only the lease holder queries the database and writes back. Leases also prevent stale writes: if the key is invalidated while a fetch is in flight, the outstanding lease token is invalidated with it, and the late write is rejected.
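The lease idea can be illustrated with a toy in-memory model. This is a simplified sketch and not Facebook's implementation: the LeaseCache class and its method names are invented for illustration, tokens come from a plain counter rather than being 64-bit values, and leases never time out.

```python
import itertools


class LeaseCache:
    """Toy in-memory cache with per-key leases (illustration only)."""

    def __init__(self):
        self._values = {}
        self._leases = {}                  # key -> outstanding lease token
        self._tokens = itertools.count(1)  # stand-in for 64-bit tokens

    def get(self, key):
        """Return (value, token); token is set only for the caller who may refill."""
        if key in self._values:
            return self._values[key], None
        if key in self._leases:
            return None, None              # lease already out: wait and retry
        token = next(self._tokens)
        self._leases[key] = token
        return None, token

    def set_with_lease(self, key, value, token):
        """Store value only if token is still the valid lease for key."""
        if token is None or self._leases.get(key) != token:
            return False                   # lease was invalidated or never issued
        del self._leases[key]
        self._values[key] = value
        return True

    def delete(self, key):
        """Invalidate the key together with any lease issued for it."""
        self._values.pop(key, None)
        self._leases.pop(key, None)


cache = LeaseCache()
_, first = cache.get("user:1")    # first miss receives the lease
_, second = cache.get("user:1")   # concurrent miss receives no token
cache.delete("user:1")            # the key is invalidated mid-fetch
accepted = cache.set_with_lease("user:1", "stale value", first)
print(first is not None, second is None, accepted)  # prints True True False
```

The final write is rejected because the delete invalidated the lease, which is the stale-write protection described above.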

Varnish coalesces requests by default. Multiple simultaneous cache misses for the same URL result in exactly one backend request. The rest receive the response when it arrives — transparent to both clients and the origin.

CDNs (Fastly, Cloudflare) implement stale-while-revalidate at the edge. An origin response with Cache-Control: stale-while-revalidate=60 tells the CDN to serve the cached copy for up to 60 seconds past expiry while asynchronously refreshing. End users see no latency increase; the origin receives one refresh request instead of thousands.

The Hard Part: Hot Keys and Cascading Misses

This section covers variants that require additional solutions.

Hot key problem

Even with a warm cache, a single key receiving 100,000 requests per second can overload the one Redis node that holds it. When that key must refresh, regardless of which mechanism is used, there is a brief window of high concurrency on a single resource.

Mitigations:

Local in-process cache (L1): each application server maintains a small in-memory copy with a short TTL (a few seconds). While the local copy is fresh, reads for hot keys never reach Redis. Only when the local TTL expires does the server fetch from Redis, and only rarely will it miss Redis too. Each server therefore reads a hot key from Redis about once per local TTL regardless of its request rate, which can cut Redis load by orders of magnitude for true hot keys.

Key replication: store the value under N separate keys (key_shard_0 through key_shard_N-1) and read from a randomly selected shard per request. Writes must update all shards. This spreads read load across up to N Redis nodes (as many as the shard keys hash to) and N independent refresh cycles.
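Both mitigations are short to sketch. In the code below, LocalCache is a hypothetical per-process L1 cache whose fetch_from_shared callback stands in for a Redis read, and the shard helpers use a plain dictionary in place of a Redis cluster. The names, the 2-second TTL, and the injectable clock are assumptions made for illustration, and the L1 cache is not thread-safe.

```python
import random
import time


class LocalCache:
    """Tiny per-process cache in front of a shared cache (not thread-safe)."""

    def __init__(self, fetch_from_shared, ttl=2.0, clock=time.monotonic):
        self._fetch = fetch_from_shared  # hypothetical: key -> value (e.g. a Redis GET)
        self._ttl = ttl
        self._clock = clock              # injectable so the TTL can be simulated
        self._entries = {}               # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]              # fresh local copy: no shared-cache read
        value = self._fetch(key)
        self._entries[key] = (value, now + self._ttl)
        return value


def shard_key(key, i):
    """Name of the i-th replica of key."""
    return f"{key}_shard_{i}"


def write_all_shards(store, key, value, n):
    """Writes must update every replica so that all reads agree."""
    for i in range(n):
        store[shard_key(key, i)] = value


def read_random_shard(store, key, n, rng=random):
    """Each read picks one replica at random, spreading the load."""
    return store.get(shard_key(key, rng.randrange(n)))
```

As a rough calculation with the figures used earlier: 1,000 servers with a 2-second local TTL send at most 1,000 / 2 = 500 reads per second to Redis for a hot key, however many requests the servers themselves receive.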

Cascading cache misses

When a cache layer fails entirely (Redis restart, network partition), all traffic falls through to the database simultaneously — a thundering herd at a larger scale than TTL expiry alone.

Mitigations: circuit breakers that return default values when the cache is unavailable (preventing any database fallthrough during the outage), and warm-up traffic shaping that gradually increases load on a restarted cache node rather than directing full traffic immediately.
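A circuit breaker of this kind can be sketched in a few lines. The version below is a minimal illustration under stated assumptions: the CircuitBreaker class, its thresholds, and the injectable clock are hypothetical, it treats any exception from the wrapped call as a cache failure, and it is not thread-safe.

```python
import time


class CircuitBreaker:
    """Return a default instead of calling a failing dependency (not thread-safe)."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, default):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                return default            # open: fail fast, nothing falls through
            # Half-open: allow one trial call. The failure count is left at the
            # threshold, so a single failure re-opens the circuit.
            self._opened_at = None
        try:
            result = func()
        except Exception:                 # sketch: any error counts as a failure
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()
            return default
        self._failures = 0                # success closes the circuit fully
        return result
```

A caller would wrap the cache read, for example breaker.call(lambda: cache.get(key), default=None), and treat the default as "serve a fallback" rather than as a signal to query the database.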

Auto-scaling stampede

When auto-scaling spins up 50 new instances simultaneously, all 50 boot with empty caches and issue cold-start queries at the same moment. The scale-out event itself becomes a spike.

Mitigations: staggered instance startup (bring up batches of 5, wait for each batch to warm before proceeding), health check grace periods that delay traffic until an instance's cache is warm, and circuit breakers on new instances that prevent them from forwarding cache misses to the database during the warm-up window.