Bloom Filters
A Bloom filter answers one question: has this element been seen before? It can return false positives — claiming an element is present when it is not — but it will never return a false negative. If it says "not seen," that is guaranteed to be correct.
This asymmetry is the entire point. In many systems, a false positive is cheap to handle (do a second, more expensive check) while a false negative is catastrophic (miss a malicious URL, serve stale data). Bloom filters are built for exactly that asymmetry.
The Problem
Consider Chrome's Safe Browsing: every URL a user visits must be checked against ~650 million known malicious URLs.
- Hash set in memory: 50 bytes per URL × 650M = ~32 GB. Cannot ship in a browser.
- Server lookup on every request: latency on every navigation, privacy issues, billions of RPS on the server.
- Sorted list on disk: disk I/O on every page load, unacceptably slow.
The real requirement is not "is this URL definitely malicious?" but "is this URL possibly malicious, warranting a deeper check?" A data structure that occasionally cries wolf is fine. One that misses a real threat is not.
The Structure
A Bloom filter has two components:
- A bit array of
mbits, all initialized to0 kindependent hash functions, each mapping any input to a position[0, m)
Inserting an element
Run the element through all k hash functions. Set each of the k resulting bit positions to 1.
Insert "apple" with k=3 hash functions:
h1("apple") → 2 set bit 2
h2("apple") → 7 set bit 7
h3("apple") → 11 set bit 11
Querying an element
Run the element through all k hash functions. Check each position:
- If any bit is
0→ element was never inserted. Return no (definitive). - If all bits are
1→ element was probably inserted. Return maybe (could be a false positive).
Query "banana":
h1("banana") → 2 bit is 1 ✓
h2("banana") → 5 bit is 0 ✗ → DEFINITELY NOT in set
A 0 bit is absolute proof of absence: if the element had been inserted, that bit would have been set to 1. This is why false negatives are impossible.
False positives occur when all k bits for a queried element happen to be 1 due to other insertions, even though the element itself was never inserted.
The Math
False positive probability given m bits, n inserted elements, and k hash functions:
p ≈ (1 − e^(−kn/m))^k
Optimal number of hash functions to minimise this:
k = (m/n) × ln 2 ≈ 0.693 × (m/n)
Bits needed to achieve a target false positive rate p for n elements:
m = −n × ln(p) / (ln 2)²
| False Positive Rate | Bits per element | Hash functions (k) |
|---|---|---|
| 10% | 4.8 | 3 |
| 1% | 9.6 | 7 |
| 0.1% | 14.4 | 10 |
| 0.01% | 19.2 | 13 |
At 1% FPR: 9.6 bits per element versus ~64 bits for a hash set pointer. A 6.6× space reduction for a 1-in-100 false alarm rate.
Real-World Usage
Google Chrome ships a local Bloom filter of known malicious URLs. The filter runs in microseconds on every navigation. Only if it returns "possibly dangerous" does Chrome make a network call to Google's servers for verification. Nearly all safe URLs never trigger a server call; no malicious URL is missed locally.
Apache Cassandra stores data in SSTables (immutable files on disk). Reading a key might require scanning many SSTables — each needing disk I/O. Each SSTable maintains a Bloom filter over its keys. Before touching disk, Cassandra checks the filter. A definitive "not here" skips that SSTable entirely, turning O(n) disk seeks into O(1) for most read paths.
Akamai CDN found that roughly 75% of cached objects are requested only once — caching them wastes space. Akamai uses a Bloom filter as a admission policy: only promote an object to cache if it has been requested before. The filter answers "have I seen this URL before?" in O(k) time with negligible memory.
The Hard Part: Deletion, Scaling, and Variants
This section covers advanced variants and production considerations.
Deletion
Standard Bloom filters do not support deletion. Setting a bit back to 0 is unsafe — that position may have been set by a different element. Deleting one element invalidates membership for any other element that shares that bit.
Counting Bloom Filter: Replace each bit with a small counter (typically 4 bits). Increment on insert, decrement on delete. Deletion is now safe, at the cost of 4× more memory.
Cuckoo Filter: Stores short fingerprints (a few bits per element) in a cuckoo hash table. Supports deletion natively, uses less space than a Counting Bloom Filter for false positive rates below ~3%, and has better cache performance.
Unknown element count
The standard Bloom filter requires knowing n upfront to configure m and k. As more elements are inserted beyond the design capacity, the false positive rate degrades silently.
Scalable Bloom Filters chain multiple filters: when the current filter approaches capacity, a new, larger filter is added. New insertions go into the latest filter; queries check all filters. The total FPR is bounded by the sum of individual FPRs across the chain.
Distributed merging
Two Bloom filters with the same m and k can be merged with a bitwise OR. This property enables eventually-consistent set membership across a distributed cluster with no coordination required — each node maintains its own filter, and filters can be gossiped and merged freely.