If you draw the durability story of an etcd cluster, you get this:
A client request arrives → gRPC handler → Raft proposes the entry → Raft WAL fsync → quorum acknowledges → apply loop writes it to the backend → backend fsync → client gets a response.
Two fsyncs. The first is non-negotiable: Raft’s safety proof requires it. The second is bbolt’s commit. If we run Pebble with its WAL enabled, we replace the second fsync with a different second fsync — same number, same cost, same place.
Pebble has an option for this. pebble.Options.DisableWAL = true removes the second fsync. Memtable writes are not durable until they flush to L0; flushes are still fsync’d, but no longer once per commit. On the etcd workload — small writes, high commit rate — this is meaningful headroom.
It also creates a problem.
What we lose
When the Pebble WAL is on, the recovery contract is simple: a Commit() that returned nil is durable. After a crash, Pebble replays its WAL and the memtable comes back.
With the Pebble WAL off, the picture is different. A Commit() that returned nil is in the memtable. The memtable is in RAM. A kill -9 loses it. Everything not yet flushed to L0 is gone.
That is fine for etcd, because Raft’s WAL still has those entries. But it means recovery has a job to do: figure out which Raft entries are not yet reflected in Pebble’s on-disk state, and replay them.
That job hinges on a single number: the last Raft index that has been durably flushed to Pebble. Call it lastFlushedIndex. After a crash, etcd reads it from disk, asks Raft “give me everything after that index,” and re-applies. If the number is right, recovery is correct. If the number is wrong by even one in either direction, you get either silent state divergence (one too high) or duplicate apply (one too low).
This is the single most dangerous integer in the milestone.
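To make the off-by-one consequences concrete, here is a minimal in-memory sketch of the recovery rule. Everything in it (Entry, Store, Recover, scenario) is a hypothetical stand-in written for illustration, not etcd or Pebble code: the "flushed" map plays the role of SSTables, and the slice plays the role of the Raft WAL.

```go
package main

import "fmt"

// Entry is a hypothetical stand-in for an applied Raft entry.
type Entry struct {
	Index uint64
	Key   string
	Value string
}

// Store models a backend with its WAL disabled: writes land in an
// in-memory memtable and only become durable once flushed.
type Store struct {
	memtable map[string]string // lost on crash
	flushed  map[string]string // survives a crash (stands in for SSTables)
}

func NewStore() *Store {
	return &Store{memtable: map[string]string{}, flushed: map[string]string{}}
}

func (s *Store) Apply(e Entry) { s.memtable[e.Key] = e.Value }

// Flush moves everything in the memtable into durable state.
func (s *Store) Flush() {
	for k, v := range s.memtable {
		s.flushed[k] = v
	}
	s.memtable = map[string]string{}
}

// Crash simulates kill -9: the memtable is gone, flushed state remains.
func (s *Store) Crash() { s.memtable = map[string]string{} }

func (s *Store) Get(k string) (string, bool) {
	if v, ok := s.memtable[k]; ok {
		return v, true
	}
	v, ok := s.flushed[k]
	return v, ok
}

// Recover re-applies every log entry with Index > lastFlushedIndex and
// returns how many entries it replayed.
func Recover(s *Store, log []Entry, lastFlushedIndex uint64) int {
	replayed := 0
	for _, e := range log {
		if e.Index > lastFlushedIndex {
			s.Apply(e)
			replayed++
		}
	}
	return replayed
}

// scenario applies entries 1..3, flushes, applies 4..5, crashes, and then
// recovers using the given lastFlushedIndex. The correct value is 3.
func scenario(lastFlushedIndex uint64) (replayed int, hasK4 bool) {
	log := []Entry{{1, "k1", "a"}, {2, "k2", "b"}, {3, "k3", "c"}, {4, "k4", "d"}, {5, "k5", "e"}}
	s := NewStore()
	for _, e := range log[:3] {
		s.Apply(e)
	}
	s.Flush()
	for _, e := range log[3:] {
		s.Apply(e)
	}
	s.Crash()
	replayed = Recover(s, log, lastFlushedIndex)
	_, hasK4 = s.Get("k4")
	return replayed, hasK4
}

func main() {
	for _, idx := range []uint64{3, 4, 2} {
		replayed, hasK4 := scenario(idx)
		fmt.Printf("lastFlushedIndex=%d replayed=%d k4 present=%v\n", idx, replayed, hasK4)
	}
}
```

With the correct value (3), entries 4 and 5 are replayed and nothing is lost. One too high (4) silently drops entry 4; one too low (2) re-applies entry 3, which was already durable.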
The protocol
The protocol we settled on:
- Capture the Raft index before issuing the flush. Pebble’s EventListener.FlushBegin callback fires synchronously before the flush starts. At that moment, every committed write in the memtable has a Raft index ≤ the highest applied entry so far. Snapshot that integer.
- Issue the flush, wait for FlushEnd. When FlushEnd fires, every key in that flush is durable in an L0 SSTable, which has been fsync’d as part of the flush.
- Persist the captured index after the flush returns. Write it to ${data-dir}/member/snap/last-flushed-index. Fsync the file. Fsync the parent directory.
- Never store the index inside Pebble. That would be a chicken-and-egg problem: the index would itself live in the memtable, lost in the same crash it is meant to recover from.
The order matters. Capture before, persist after, in a separately-fsynced file outside Pebble. Anything else — including “capture and persist atomically inside the same batch as the data writes” — fails on at least one crash schedule.
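The capture, flush, persist ordering can be sketched with the standard library alone. This is an illustration under stated assumptions, not the actual implementation: the flusher interface and fakeEngine stand in for Pebble, the 8-byte big-endian encoding is arbitrary, and the temp-file-plus-rename step is an added assumption for atomic replacement (the protocol above only specifies fsyncing the file and its parent directory). Directory fsync as written assumes a POSIX-style filesystem.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
	"path/filepath"
)

// flusher is a hypothetical stand-in for the storage engine: Flush blocks
// until the memtable contents are durable on disk.
type flusher interface {
	Flush() error
}

// persistIndex durably writes idx to dir/last-flushed-index.
// Assumption: write to a temp file and rename, so a crash mid-write never
// leaves a torn file behind.
func persistIndex(dir string, idx uint64) error {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], idx)

	tmp := filepath.Join(dir, "last-flushed-index.tmp")
	f, err := os.OpenFile(tmp, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o600)
	if err != nil {
		return err
	}
	if _, err := f.Write(buf[:]); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil { // fsync the file
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	if err := os.Rename(tmp, filepath.Join(dir, "last-flushed-index")); err != nil {
		return err
	}
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync() // fsync the parent directory
}

func readIndex(dir string) (uint64, error) {
	b, err := os.ReadFile(filepath.Join(dir, "last-flushed-index"))
	if err != nil {
		return 0, err
	}
	if len(b) != 8 {
		return 0, fmt.Errorf("last-flushed-index: want 8 bytes, got %d", len(b))
	}
	return binary.BigEndian.Uint64(b), nil
}

// flushAndRecord runs the three steps in order: capture, flush, persist.
func flushAndRecord(fl flusher, dir string, appliedIndex func() uint64) (uint64, error) {
	captured := appliedIndex() // 1. capture before the flush
	if err := fl.Flush(); err != nil { // 2. flush and wait for it to finish
		return 0, err
	}
	if err := persistIndex(dir, captured); err != nil { // 3. persist after
		return 0, err
	}
	return captured, nil
}

// fakeEngine simulates writes that keep arriving while the flush runs.
type fakeEngine struct{ applied *uint64 }

func (e fakeEngine) Flush() error {
	*e.applied += 10
	return nil
}

// roundTrip runs the protocol in a temp directory and reads the file back.
func roundTrip(start uint64) (uint64, error) {
	dir, err := os.MkdirTemp("", "lfi-")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	applied := start
	if _, err := flushAndRecord(fakeEngine{&applied}, dir, func() uint64 { return applied }); err != nil {
		return 0, err
	}
	return readIndex(dir)
}

func main() {
	got, err := roundTrip(42)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("persisted index:", got)
}
```

The fake engine advances the applied index during the flush on purpose: the file ends up holding the value captured before the flush, which is conservative. Entries that arrived mid-flush are simply replayed after a crash.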
The invariant the recovery path checks: last-flushed-index <= max(applied-index reflected in SSTables). If the file ever exceeds what the SSTables prove was flushed, recovery would skip replay for entries Pebble never saw — the silent-divergence failure mode. We make this a runtime panic, not a soft warning: better to fail to start than to start in an inconsistent state.
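A minimal sketch of that startup check follows. How the "index proven by the SSTables" is derived is not described here, so sstIndex is a hypothetical input, and checkFlushedIndex and violates are illustrative names.

```go
package main

import "fmt"

// checkFlushedIndex enforces fileIndex <= sstIndex and panics otherwise.
// fileIndex is the value read from last-flushed-index; sstIndex is the
// highest applied index the on-disk SSTables can prove (hypothetical input).
func checkFlushedIndex(fileIndex, sstIndex uint64) {
	if fileIndex > sstIndex {
		panic(fmt.Sprintf(
			"last-flushed-index %d exceeds index proven by SSTables %d: refusing to start",
			fileIndex, sstIndex))
	}
}

// violates reports whether the check panics, so the demo can keep running.
func violates(fileIndex, sstIndex uint64) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	checkFlushedIndex(fileIndex, sstIndex)
	return false
}

func main() {
	fmt.Println("file=4 sst=5 violates:", violates(4, 5))
	fmt.Println("file=5 sst=5 violates:", violates(5, 5))
	fmt.Println("file=6 sst=5 violates:", violates(6, 5))
}
```

A file value below the SSTable value is allowed: it only costs extra replay. Only the too-high case is fatal.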
We are not the first to chew on this. The closest published precedent is CockroachDB’s own work, which has #16624 (open since 2017) and #38322 (open since 2019) tracking essentially the same protocol question. Both have been open long enough that their age is itself the lesson: the integer’s location is the easy part; the ordering around it is the hard part.
The gate
The Phase 4 exit gate is exactly one test — TEST-04:
One thousand random kill -9 injections under a continuous write load, against a Pebble-backed cluster. After each kill, restart the member, let it rejoin, and compare its post-recovery state to the rest of the cluster across three oracles: a canonical hash of every bucket, the consistentIndex, and a linearizability check.
Pass = exactly zero divergences across all 1,000 iterations.
No statistical pass. No “this one looks like a test harness artifact.” Zero.
The reason for zero, not 0.1%, is structural: an etcd cluster’s CORRUPT alarm trips on a single hash mismatch. A failure rate of one in a thousand kills, in production, would page someone every few hours on a large fleet. The whole point of the gate is to make the protocol problem visible as a protocol problem, not a flaky test.
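As an illustration of the first oracle and the zero-tolerance rule, here is a small sketch. It is not etcd's hashing code: canonicalHash, diverged, and gatePasses are hypothetical helpers, a bucket is modeled as a plain map, and the length-prefixed SHA-256 encoding is an assumption chosen so that the hash depends only on contents, never on iteration order.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"sort"
)

// canonicalHash hashes a bucket's key/value pairs in sorted key order.
// Each key and value is length-prefixed so that {"ab":"c"} and {"a":"bc"}
// cannot collide by concatenation.
func canonicalHash(bucket map[string]string) string {
	keys := make([]string, 0, len(bucket))
	for k := range bucket {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	var n [8]byte
	for _, k := range keys {
		v := bucket[k]
		binary.BigEndian.PutUint64(n[:], uint64(len(k)))
		h.Write(n[:])
		h.Write([]byte(k))
		binary.BigEndian.PutUint64(n[:], uint64(len(v)))
		h.Write(n[:])
		h.Write([]byte(v))
	}
	return hex.EncodeToString(h.Sum(nil))
}

// diverged reports whether a recovered member's bucket differs from the
// reference bucket taken from the rest of the cluster.
func diverged(recovered, reference map[string]string) bool {
	return canonicalHash(recovered) != canonicalHash(reference)
}

// gatePasses applies the zero-tolerance rule: any divergence fails the gate.
func gatePasses(divergences int) bool {
	return divergences == 0
}

func main() {
	reference := map[string]string{"a": "1", "b": "2"}
	recovered := map[string]string{"b": "2", "a": "1"}
	missing := map[string]string{"a": "1"}

	divergences := 0
	for _, m := range []map[string]string{recovered, missing} {
		if diverged(m, reference) {
			divergences++
		}
	}
	fmt.Println("divergences:", divergences, "gate passes:", gatePasses(divergences))
}
```

The other two oracles (consistentIndex comparison and the linearizability check) are out of scope for this sketch.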
Where we are
We ran the gate. It found three bugs — two of them in code we hadn’t touched, both surfaced by the protocol being correct enough to expose them. That story is the next post.
The protocol itself, once written down, comes out to roughly forty lines of Go and one extra fsync per flush. The implementation was small; getting the ordering right was the work.