Skip to content
← Back to Notes

Consistent Hashing

Distributing data across a cluster of servers requires a mapping from keys to servers. The naive approach works until the cluster changes size — at which point it becomes catastrophic. Consistent hashing solves this by ensuring that adding or removing a server only remaps the minimum necessary fraction of keys.


The Problem: Modulo Hashing

The obvious key-to-server mapping:

server_index = hash(key) % number_of_servers

Simple, fast, uniform. Works correctly until the number of servers changes.

When you add an 11th server to a 10-server cluster, the result of hash(key) % 10 versus hash(key) % 11 agrees only when hash(key) is a multiple of 110 (the LCM of 10 and 11). For all other values — ~91% of keys — the key remaps to a different server.

In a caching layer, this means:

  • Cache hit rate drops to ~9% instantly
  • 91% of requests simultaneously miss the cache and hit the database
  • The database, sized for normal traffic, collapses under the load
  • The system falls over precisely when you were trying to scale it

The same failure happens in reverse when removing a server. Modulo hashing cannot work in any system where cluster membership changes.


The Solution: Hash Ring

Consistent hashing maps both servers and keys onto the same circular space — a ring with positions 0 to 2³²−1.

Building the ring

Hash each server's identifier (typically IP + port) to a position on the ring:

hash("server-A:6379") → position 120M
hash("server-B:6379") → position 890M
hash("server-C:6379") → position 2.1B
hash("server-D:6379") → position 3.4B

Assigning a key

Hash the key to a position on the ring, then walk clockwise to the first server. That server owns the key.

hash("user:9912") → position 950M
Walk clockwise → hits server-C at 2.1B → server-C owns this key

Each server owns the arc between itself and its predecessor counterclockwise on the ring.

Adding a server

Place the new server on the ring. Only the keys in the arc between the new server and its counterclockwise predecessor need to move — those keys now resolve to the new server instead.

With modulo hashing: adding 1 server to 10 remaps ~91% of keys
With consistent hashing: adding 1 server to 10 remaps ~10% of keys (1/n)

Removing a server

The removed server's keys all move to the next server clockwise. Again, only 1/n keys are affected.


Virtual Nodes

With a small number of physical servers, the ring distribution is uneven by chance. Server A might own 40% of the ring, Server B only 8%, purely depending on where the hash function placed them.

Virtual nodes (vnodes) fix this: each physical server is assigned V positions on the ring, created by hashing "server-A:0", "server-A:1", ..., "server-A:V-1".

With V=150, load distributes evenly. The variance in owned key space drops from O(n) to O(1/√V).

Additional benefits:

  • Heterogeneous hardware: a server with twice the capacity gets twice as many vnodes, naturally owning twice the key space.
  • Graceful removal: a server's vnodes are spread across all remaining servers — no single server absorbs all the displaced load.

Real-World Usage

Amazon DynamoDB uses consistent hashing to map partition keys to storage nodes. When DynamoDB scales, partitions split and keys migrate to new nodes — consistent hashing ensures this is always a 1/n remapping, not a full reshuffle.

Apache Cassandra is built entirely around a token ring. Each node is assigned a range of token values. Data is replicated by placing it on the N nodes clockwise from a key's token position, where N is the replication factor. The nodetool ring command shows live token assignments across the cluster.

Memcached (libketama) was an early adoption case. Before consistent hashing, a single server failure invalidated the entire cache because every key remapped under modulo hashing. With consistent hashing, a failure causes only 1/n keys to miss — the affected server's share — rather than a total cache invalidation.


The Hard Part: Alternative Algorithms

This section covers variants with different tradeoffs used in production systems.

Rendezvous Hashing (HRW)

Instead of a ring, score every server for each key and assign the key to the highest-scoring server:

score(key, server) = hash(key + server_id)
assigned_server    = argmax(score over all servers)

Adding or removing a server only affects keys assigned to that server — still 1/n remapping. Advantages: no ring structure to maintain, simpler implementation, no virtual node complexity. Disadvantage: O(n) lookup time since every server must be scored. Nginx uses rendezvous hashing for upstream selection.

Jump Consistent Hash

A minimal algorithm — the entire implementation is ~10 lines. It uses a mathematical formula rather than a ring to assign a key to a bucket, guaranteeing exactly 1/n remapping when adding one server.

int32_t JumpConsistentHash(uint64_t key, int32_t num_buckets) {
    int64_t b = -1, j = 0;
    while (j < num_buckets) {
        b = j;
        key = key * 2862933555777941757ULL + 1;
        j = (b + 1) * ((double)(1LL << 31) / (double)((key >> 33) + 1));
    }
    return b;
}

O(log n) time, no data structure required. Limitation: servers must be identified by sequential integers, so arbitrary addition and removal of specific servers is not supported.

Maglev Hashing

Google's load balancer precomputes a large lookup table (e.g., 65,537 entries) mapping every possible key to a backend. Lookup is O(1). The table is constructed such that each backend occupies approximately equal entries and minimal entries change when backends are added or removed.

Table rebuild is O(M log M) but infrequent — backend membership changes rarely. The O(1) per-request lookup matters far more at the scale of Google's network.

Replication on the ring

Cassandra and DynamoDB do not store each key on one server. A key at position P is stored on the first N servers clockwise from P, where N is the replication factor. This means any server in the replica set can serve a read.

Combined with quorum reads and writes (R + W > N), this gives tunable consistency without a single master. When a server fails, its replicas are already on adjacent nodes — no data migration is required.