Skip to content
← Back to Notes

Write-Ahead Logging

Disks are not atomic. Writing an 8 KB database page requires multiple sector writes. A power failure between those writes leaves a page half-written — partially old, partially new. The database cannot distinguish this corruption from a valid page.

Write-Ahead Logging (WAL) is the mechanism that makes database writes durable and recoverable despite this. The principle: write a log record describing a modification before applying the modification itself. If a crash occurs, the log is the source of truth for recovery.


The Problem

Torn writes

A database page is typically 4–16 KB. A disk sector is 512 bytes or 4 KB. Writing a 16 KB page requires 4 sector writes. A crash after the second sector write leaves a page with the first 8 KB updated and the last 8 KB stale. The page is silently corrupted — there is no checksum failure, just wrong data.

Partial transactions

A money transfer modifies two rows: debit account A, credit account B. Even if each individual write is durable, what ensures both happen or neither does? Without a record of "this transaction was in progress," a crash between the two writes leaves the database in an inconsistent state with no indication of what needs to be rolled back.

Performance

Calling fsync after every single write is correct but too slow. Each fsync forces the OS to flush to physical media. On a modern SSD, fsync latency is in the hundreds of microseconds to milliseconds range — capping throughput at a few hundred transactions per second at best. Databases need to commit thousands of transactions per second.

WAL addresses all three problems.


How WAL Works

Before modifying any data page, the database writes a log record to an append-only log file. The log is flushed to disk. Only then may the data page be modified.

1. Transaction writes modification to WAL buffer (in memory)
2. WAL buffer is flushed to WAL file on disk (fsync)
3. "Committed" is returned to the client       ← durability point
4. Data pages are updated asynchronously        (background)

If a crash occurs after step 2 but before step 4, the log contains the complete record of what was committed. On restart, the database replays the log to bring data pages up to date.

If a crash occurs before step 2, the transaction was never committed. The partially written data page can be ignored — the log says this transaction never finished.

Log record structure

Each log record contains:

Field Purpose
LSN (Log Sequence Number) Monotonically increasing position in the log
Transaction ID Which transaction made this change
Before image Data as it was before the change (for UNDO)
After image Data as it should be after the change (for REDO)
Page ID Which data page this applies to

The WAL rule

A data page must never be written to disk unless its corresponding log record has already been flushed. Formally: if page.pageLSN > log.flushedLSN, the page cannot go to disk. The log goes first, always.


Crash Recovery

Recovery runs in three phases:

Analysis: Scan the log from the last checkpoint forward. Build a table of which transactions were active at crash time and which pages were dirty (modified in memory but not yet written to disk).

REDO: Starting from the earliest dirty page's log record, replay every log record in order. This reconstructs the exact database state at the moment of crash — including uncommitted writes that were in flight.

UNDO: Roll back every transaction that was active at crash time (never committed). Apply their before images in reverse log order, writing Compensation Log Records (CLRs) as you go. CLRs ensure that if the database crashes again during recovery, the undo work is not repeated.


Real-World Usage

PostgreSQL stores WAL in pg_wal/. Every data modification generates WAL records used for both crash recovery and replication: standbys receive WAL records from the primary and apply them continuously. wal_level controls verbosity — replica enables streaming replication, logical enables logical decoding to replicate to heterogeneous targets.

MySQL InnoDB maintains two logs: the redo log (WAL for crash recovery) and the binary log (binlog for replication and point-in-time recovery). The redo log is circular — old records are overwritten once the corresponding pages are confirmed checkpointed to disk.

SQLite WAL mode: the default rollback journal copies original pages before modifying them. WAL mode instead appends writes to a separate WAL file, allowing readers and writers to operate without blocking each other. Writers never block readers; readers never block writers. Substantially faster for read-heavy workloads.

RocksDB and LevelDB: both use an LSM tree architecture where writes go simultaneously to the WAL (on disk, sequential — for durability) and to an in-memory memtable (for fast reads). When the memtable fills, it flushes to an SSTable. The corresponding WAL segment can then be discarded. Crash recovery replays the WAL to reconstruct any memtable that was not yet flushed.

Kafka is architecturally a distributed WAL with a publish-subscribe API. Messages are written to an append-only log on disk. Followers replicate by replaying the leader's log. Consumers read at their own pace by seeking to any offset in the log. The log is the source of truth; the consumer's position is just a pointer into it.


The Hard Part: Group Commit, Logical Replication, and CDC

This section covers optimizations and advanced uses built on top of WAL.

Group commit

Fsyncing for every individual transaction is still expensive. Group commit batches multiple transactions' log records into a single fsync:

T1 writes to WAL buffer, signals ready
T2 writes to WAL buffer, signals ready   ← while T1 is waiting for fsync
T3 writes to WAL buffer, signals ready
One fsync flushes all three records
All three transactions commit simultaneously

At 10,000 transactions per second, group commit can reduce 10,000 fsyncs to ~100. PostgreSQL's commit_delay and commit_siblings parameters control the grouping window. Throughput improves by orders of magnitude with minimal added latency per transaction.

Physical vs logical replication

Physical replication sends raw WAL records — byte-for-byte page changes. Fast, but requires the replica to have the same storage layout and major version as the primary.

Logical replication decodes WAL records back into logical operations (INSERT/UPDATE/DELETE on specific tables) before sending. This enables:

  • Replicating to a different database version
  • Replicating only specific tables
  • Replicating to non-PostgreSQL databases
  • Zero-downtime major version upgrades

Tools like Debezium consume PostgreSQL's logical decoding API to stream row-level changes into Kafka as a CDC (Change Data Capture) pipeline.

Checkpointing

Without checkpointing, crash recovery would require replaying the entire history of the log. A checkpoint flushes all dirty pages to disk and records the checkpoint position in the log. All log records before that position can be discarded — recovery only needs to replay from the last checkpoint.

PostgreSQL's checkpoint_timeout (default: 5 minutes) and max_wal_size (default: 1 GB) control checkpoint frequency. More frequent checkpoints mean faster recovery but more background I/O. For replication, WAL segments must be retained until all standbys have consumed them — wal_keep_size and WAL archiving control this.

WAL as an event log

The deepest insight about WAL is architectural: an append-only, ordered, durable log of every state change is the primary source of truth, and the current database state is just a materialization of replaying that log. This is identical to what event sourcing systems build explicitly.

The WAL can be consumed as:

  • A change data capture stream: every row modification, delivered in order, to downstream systems
  • An audit log: full history of every write, who made it, old and new values
  • A time machine: replay the log to reconstruct the database at any past point in time

This connection between databases (WAL), message queues (Kafka's log), and event sourcing (event log) is not coincidental — all three are implementations of the same fundamental idea.