Skip to content
← Back to Notes

Partitioning and Sharding

A single database node has finite storage, memory, and write throughput. When a table grows beyond what one node can serve — either in size, query volume, or write rate — the data must be split across multiple storage units. This is partitioning.

Partitioning splits a table's data across storage units. When those units are separate database nodes (separate machines), the term sharding is typically used. The concepts are the same; the word changes with scope.


Vertical vs Horizontal Partitioning

Vertical partitioning splits by columns. A users table with 50 columns might be split into users_core (frequently accessed columns) and users_profile (rarely accessed columns). This reduces the working set that must fit in memory for common queries. It is a schema design technique, not a scaling technique — the data still lives on one node.

Horizontal partitioning splits by rows. Each partition holds a subset of rows from the full table, with all columns intact. Partitions can be spread across separate nodes, each handling a fraction of the read and write load. This is the technique that enables horizontal scaling, and it is what the rest of this article covers.


Partition Strategies

Range Partitioning

Rows are assigned to partitions based on a range of values in the partition key.

Partition 1: user_id  1 – 1,000,000
Partition 2: user_id  1,000,001 – 2,000,000
Partition 3: user_id  2,000,001 – 3,000,000

Range queries are efficient: WHERE user_id BETWEEN 500000 AND 600000 hits exactly one shard. The database knows the partition boundaries and routes directly.

Drawback — hot spots: if the partition key is monotonically increasing (auto-increment IDs, timestamps), all new writes go to the last partition. Every INSERT hits one shard while the others sit idle. Time-series data partitioned by date exhibits this constantly — today's partition absorbs the entire write load while past partitions are read-only.

Hash Partitioning

The partition key is passed through a hash function; the result modulo the number of partitions determines the shard.

shard = hash(user_id) % num_shards

Write load distributes uniformly across all shards — no hot spots for random or high-cardinality keys.

Drawback — scatter-gather queries: range queries (WHERE user_id BETWEEN 500000 AND 600000) have no locality. The hash function destroys ordering, so the database must query every shard and merge the results. For a cluster of 50 shards, one range query becomes 50 parallel queries. This is called a scatter-gather and is expensive at scale.

Directory-Based Partitioning

A separate lookup table (the directory) records which shard each key or key range lives on.

Key range 000-099 → Shard A
Key range 100-199 → Shard B
Key range 200-249 → Shard A  (moved after rebalance)

The most flexible strategy: shards can be rebalanced arbitrarily by updating the directory, with no constraint from a fixed hash function or partition boundary.

Drawback: the directory becomes a critical dependency. Every query first hits the directory service to determine the target shard. If the directory is slow, all queries are slow. If it fails, the database is unavailable. The directory is both a performance bottleneck and a single point of failure unless it is itself distributed and replicated.


Shard Key Selection

The shard key determines which partition a row belongs to. Choosing it correctly is the most consequential design decision in a sharded database — it is very difficult to change later without resharding the entire dataset.

Cardinality: the key must have enough distinct values to spread data across all shards. Sharding by a boolean column that is 99% true puts 99% of data on one shard regardless of strategy.

Write distribution: the key must not concentrate writes on one shard. Monotonically increasing keys (auto-increment IDs, created_at timestamps) cause write hot spots with range partitioning.

Query alignment: the most frequent queries should be routable to a single shard. If most queries filter on user_id, shard by user_id. A query that must hit all shards (scatter-gather) does not scale linearly — adding shards makes it slower, not faster, because you add more network round-trips.

Join locality: rows that are frequently joined together should live on the same shard. A query joining orders and order_items by user_id should have both tables sharded by user_id, so the join executes locally within one shard.


Real-World Usage

PostgreSQL supports native table partitioning (range, list, hash) using the PARTITION BY clause. Partitions are separate physical tables but are queried transparently as one. Partition pruning eliminates irrelevant partitions at the query planner level. This is single-node partitioning — for multi-node sharding, extensions like Citus (used by Azure Cosmos DB for PostgreSQL) distribute partitions across worker nodes.

MongoDB implements sharding natively. A mongos router sits in front of the cluster, consulting config servers for the chunk map (directory-based). Data is split into chunks (default 128 MB); when a chunk grows too large, it splits. The balancer migrates chunks between shards to maintain even distribution. Shard key selection is permanent — MongoDB does not support resharding after the fact in older versions (support was added in 6.0 with significant limitations).

Vitess shards MySQL horizontally. YouTube and Slack run on Vitess. It interposes a query router that rewrites cross-shard queries, manages connections, and handles resharding without downtime by range-splitting shards and migrating data in the background while serving live traffic.

Cassandra shards automatically using consistent hashing (the token ring). There is no explicit shard key selection step — the partition key in the table's PRIMARY KEY definition serves this role. Consistent hashing means adding nodes requires minimal data migration.


The Hard Part: Cross-Shard Operations and Resharding

This section covers the problems that emerge in production sharded systems.

Cross-shard queries

Any query that cannot be routed to a single shard must scatter to all shards and gather results:

SELECT * FROM orders WHERE amount > 1000 ORDER BY created_at LIMIT 20

If orders is sharded by user_id, this query hits every shard. Each returns its top 20 results ordered by created_at. The router receives up to 20 × num_shards rows and merges them to find the global top 20. Aggregations (COUNT, SUM, AVG) require partial aggregation on each shard and final aggregation at the router.

Cross-shard joins are worse: the router must fetch the join candidates from one shard, then for each row issue a lookup to the appropriate shard of the joined table. At scale this becomes prohibitively expensive. The standard mitigation is denormalization — embed the joined data directly in each row, removing the need for the join entirely.

Hot spots from skewed data

A shard key with high cardinality still produces hot spots if the data distribution is skewed. If 1% of users generate 50% of the write traffic (power users, bot accounts, viral content), the shards holding those users are overwhelmed regardless of the hashing strategy. Approaches:

  • Sub-sharding: split hot user accounts across multiple virtual shards, appending a suffix to the key (user_id:0, user_id:1).
  • Application-layer rate limiting: prevent any single entity from generating disproportionate write volume.
  • Read replicas for hot shards: add read replicas specifically to the overloaded shard.

Resharding

When the current number of shards is insufficient — either due to growth or an initially poor shard count estimate — the dataset must be resharded. With hash partitioning, changing the modulus remaps most rows to different shards. A table of 1 billion rows being resharded from 10 to 11 shards must move ~91% of its data.

Live resharding (without downtime) requires:

  1. Dual-writing all new writes to both the old and new shard locations
  2. Backfilling existing data in the background
  3. Verifying consistency between old and new locations
  4. Cutting over reads to the new layout
  5. Stopping dual-writes and cleaning up old data

This process takes days to weeks on large datasets and carries significant operational risk. It is the main reason shard key selection and initial shard count must be over-provisioned — changing them later is expensive.

Distributed transactions

A transaction that modifies rows on two different shards cannot use a local ACID transaction. It requires a distributed protocol — typically two-phase commit (2PC). 2PC requires a coordinator to:

  1. Ask all participating shards to prepare (lock rows, write to WAL)
  2. If all shards confirm ready, commit to all
  3. If any shard fails to prepare, abort all

2PC is slow (two round trips to every participating shard), and if the coordinator fails between prepare and commit, participating shards are left holding locks indefinitely until the coordinator recovers. For this reason, most sharded architectures avoid cross-shard transactions entirely by choosing a shard key that keeps related data co-located on the same shard.