Rate Limiting
Rate limiting controls how many requests a client can make in a given time window. Exceed the limit and you get an HTTP 429 Too Many Requests. Without it, a single client — a buggy script, a bot, or a bad actor — can exhaust server resources and take down the service for everyone else.
The mechanism sounds simple. The implementation details are not.
Fixed Window Counter
Divide time into fixed-length windows (e.g., 1 second each). Maintain a counter per client per window. Increment on each request; reject when the counter exceeds the limit. Reset the counter when the window rolls over.
Window: |--- 0s to 1s ---|--- 1s to 2s ---|
Counter: 87 / 100 12 / 100
Drawback: A client can send 100 requests at t=0.99s and another 100 at t=1.01s. Both windows see 100 — within the limit — but 200 requests hit the server in 20 milliseconds. This is the boundary burst problem.
Sliding Window Log
Keep a timestamped log of every request per client. On each new request, discard entries older than the window, then count what remains. If the count is under the limit, allow — otherwise reject.
Limit: 100 req/min
New request at t=60.4s
Discard all entries before t=0.4s
Count remaining entries → 73 → ALLOW
No boundary problem. The window moves continuously.
Drawback: Memory. At 1000 req/min per user across millions of users, storing and scanning individual timestamps becomes expensive. Not viable at scale.
Sliding Window Counter
A hybrid of the two above. Store counters for the current window and the previous window. Estimate the request count in the sliding window using a weighted sum:
estimated_count = prev_window_count × (1 - elapsed_fraction) + curr_window_count
If a request arrives 45 seconds into a 60-second window, the previous window contributes 25% of its count.
Only two integers per client. Cloudflare uses this approach at the edge of their global network.
Drawback: It is an approximation. The worst-case error is small (under 10%) but non-zero. Acceptable for traffic shaping; not acceptable if exact enforcement is required.
Token Bucket
Each client has a bucket with a maximum capacity of b tokens. Tokens are added at a steady rate r. Each request consumes one token. If the bucket is empty, the request is rejected.
capacity: 10 tokens refill: 2/sec
t=0s: 10 tokens available
t=0s: 6 requests → 4 tokens remaining
t=1s: refill → 6 tokens
t=1s: 3 requests → 3 tokens remaining
A client that makes no requests accumulates tokens and can later spend them in a burst — up to the bucket capacity. This is intentional: it models legitimate bursty usage patterns.
AWS API Gateway, Google Cloud, and most CDN-level rate limiters use Token Bucket. The state per client is just (token_count, last_refill_time).
Drawback: A client that has been idle can immediately consume the full bucket, producing a short spike that may still stress downstream systems.
Leaky Bucket
Requests are placed into a queue. They are processed (leaked) at a fixed output rate. If the queue is full, incoming requests are dropped.
Input: [burst of 20 requests]
Queue: [capacity 10] → 10 accepted, 10 dropped
Output: 1 request per 100ms, always steady
The output rate is strictly constant regardless of input burstiness. This makes it useful for protecting downstream services with fixed processing capacity.
Drawback: Unlike Token Bucket, idle time earns no credit. Clients cannot burst even if they have been quiet. Leaky Bucket is common in network-layer traffic shaping; less common for API rate limiting where some burst tolerance is expected.
Comparison
| Algorithm | Memory | Burst Handling | Accuracy | Common Use |
|---|---|---|---|---|
| Fixed Window | Low | Double-burst at boundary | Poor at edges | Simple counters |
| Sliding Log | High | None allowed | Exact | Low-traffic APIs |
| Sliding Counter | Low | Smooth approximation | ~99% | High-scale APIs, CDNs |
| Token Bucket | Low | Allowed up to capacity | Good | APIs, cloud gateways |
| Leaky Bucket | Medium | Queued or dropped | Good | Network traffic shaping |
The Hard Part: Distributed Rate Limiting
This section covers production implementation details. Skip if you're only interested in the algorithms.
The multi-server problem
Single-server implementations keep counters in local memory. In a distributed system, a load balancer routes requests across many servers. Each server maintains its own counter, so a client can exceed the global limit by a factor equal to the number of servers. Local counters do not work.
Centralized Redis
The standard solution is to store counters in a shared Redis instance. All servers read and write the same counter. Redis operates in single-threaded event loop mode and delivers sub-millisecond reads, so the added latency per request is minimal.
The naive pattern — INCR key, then EXPIRE key — has a race condition: two servers reading the same value simultaneously will both allow a request that should have been the one to trigger a reject.
Atomic Lua scripts
Redis supports Lua scripts that execute atomically. The increment, check, and expiry are a single indivisible operation:
local count = redis.call('INCR', KEYS[1])
if count == 1 then
redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
This eliminates the race. Stripe, GitHub, and most serious API providers use this pattern.
Sliding window in Redis
For an exact sliding window, store request timestamps in a Redis sorted set. On each request, prune old entries, count the remainder, then add the new timestamp — all in one Lua script:
ZREMRANGEBYSCORE key 0 (now - window_ms)
ZCARD key ← current count
ZADD key now request_id
Precise but more expensive than a counter. Used when exact enforcement matters.
Failure handling
A centralized Redis is a single point of failure. When it goes down, most systems fail open — allowing all requests — and fall back to local in-memory counters. Rate limiting degrades to approximate enforcement rather than causing a complete outage.
At Scale
Cloudflare handles roughly 50 million requests per second across its edge network. Rate limiting runs locally at each data center with no cross-datacenter coordination. They use the Sliding Window Counter approximation: two counters per client per edge node, eventually synchronized. The small approximation error is acceptable when blocking millions of attack requests per second.
AWS API Gateway exposes Token Bucket semantics directly: separate configuration for burst limit (bucket size) and steady-state rate (refill rate). The backing store is a proprietary distributed counter system.
Stripe uses Redis with Lua-scripted sliding window counters, one key per API key per window. They have documented storing hundreds of millions of such keys on commodity Redis hardware.
Nginx implements Leaky Bucket via its limit_req module at the reverse proxy layer, before requests reach application code.
The algorithms here — particularly Token Bucket and Sliding Window Counter — underlie nearly every API you interact with. Tuning them is a balance: too loose and a single client can degrade the service for everyone; too tight and legitimate traffic gets rejected.