Skip to content
← Back to Notes

Bloom Filters

A Bloom filter answers one question: has this element been seen before? It can return false positives — claiming an element is present when it is not — but it will never return a false negative. If it says "not seen," that is guaranteed to be correct.

This asymmetry is the entire point. In many systems, a false positive is cheap to handle (do a second, more expensive check) while a false negative is catastrophic (miss a malicious URL, serve stale data). Bloom filters are built for exactly that asymmetry.


The Problem

Consider Chrome's Safe Browsing: every URL a user visits must be checked against ~650 million known malicious URLs.

  • Hash set in memory: 50 bytes per URL × 650M = ~32 GB. Cannot ship in a browser.
  • Server lookup on every request: latency on every navigation, privacy issues, billions of RPS on the server.
  • Sorted list on disk: disk I/O on every page load, unacceptably slow.

The real requirement is not "is this URL definitely malicious?" but "is this URL possibly malicious, warranting a deeper check?" A data structure that occasionally cries wolf is fine. One that misses a real threat is not.


The Structure

A Bloom filter has two components:

  1. A bit array of m bits, all initialized to 0
  2. k independent hash functions, each mapping any input to a position [0, m)

Inserting an element

Run the element through all k hash functions. Set each of the k resulting bit positions to 1.

Insert "apple" with k=3 hash functions:
  h1("apple") → 2    set bit 2
  h2("apple") → 7    set bit 7
  h3("apple") → 11   set bit 11

Querying an element

Run the element through all k hash functions. Check each position:

  • If any bit is 0 → element was never inserted. Return no (definitive).
  • If all bits are 1 → element was probably inserted. Return maybe (could be a false positive).
Query "banana":
  h1("banana") → 2   bit is 1 ✓
  h2("banana") → 5   bit is 0 ✗ → DEFINITELY NOT in set

A 0 bit is absolute proof of absence: if the element had been inserted, that bit would have been set to 1. This is why false negatives are impossible.

False positives occur when all k bits for a queried element happen to be 1 due to other insertions, even though the element itself was never inserted.


The Math

False positive probability given m bits, n inserted elements, and k hash functions:

p ≈ (1 − e^(−kn/m))^k

Optimal number of hash functions to minimise this:

k = (m/n) × ln 2  ≈  0.693 × (m/n)

Bits needed to achieve a target false positive rate p for n elements:

m = −n × ln(p) / (ln 2)²
False Positive Rate Bits per element Hash functions (k)
10% 4.8 3
1% 9.6 7
0.1% 14.4 10
0.01% 19.2 13

At 1% FPR: 9.6 bits per element versus ~64 bits for a hash set pointer. A 6.6× space reduction for a 1-in-100 false alarm rate.


Real-World Usage

Google Chrome ships a local Bloom filter of known malicious URLs. The filter runs in microseconds on every navigation. Only if it returns "possibly dangerous" does Chrome make a network call to Google's servers for verification. Nearly all safe URLs never trigger a server call; no malicious URL is missed locally.

Apache Cassandra stores data in SSTables (immutable files on disk). Reading a key might require scanning many SSTables — each needing disk I/O. Each SSTable maintains a Bloom filter over its keys. Before touching disk, Cassandra checks the filter. A definitive "not here" skips that SSTable entirely, turning O(n) disk seeks into O(1) for most read paths.

Akamai CDN found that roughly 75% of cached objects are requested only once — caching them wastes space. Akamai uses a Bloom filter as a admission policy: only promote an object to cache if it has been requested before. The filter answers "have I seen this URL before?" in O(k) time with negligible memory.


The Hard Part: Deletion, Scaling, and Variants

This section covers advanced variants and production considerations.

Deletion

Standard Bloom filters do not support deletion. Setting a bit back to 0 is unsafe — that position may have been set by a different element. Deleting one element invalidates membership for any other element that shares that bit.

Counting Bloom Filter: Replace each bit with a small counter (typically 4 bits). Increment on insert, decrement on delete. Deletion is now safe, at the cost of 4× more memory.

Cuckoo Filter: Stores short fingerprints (a few bits per element) in a cuckoo hash table. Supports deletion natively, uses less space than a Counting Bloom Filter for false positive rates below ~3%, and has better cache performance.

Unknown element count

The standard Bloom filter requires knowing n upfront to configure m and k. As more elements are inserted beyond the design capacity, the false positive rate degrades silently.

Scalable Bloom Filters chain multiple filters: when the current filter approaches capacity, a new, larger filter is added. New insertions go into the latest filter; queries check all filters. The total FPR is bounded by the sum of individual FPRs across the chain.

Distributed merging

Two Bloom filters with the same m and k can be merged with a bitwise OR. This property enables eventually-consistent set membership across a distributed cluster with no coordination required — each node maintains its own filter, and filters can be gossiped and merged freely.