What the chaos gate surfaced

The last post described the Phase 4 gate: a thousand random kill -9 injections under write load, comparing post-recovery state across the cluster, zero divergences allowed.

We ran it. It did not pass.

The interesting part is not that it failed — chaos tests are supposed to find things. The interesting part is what it found. The bugs were not in the WAL-disabled protocol from the previous post. They were in surrounding code that had been correct enough under bbolt’s behaviour and was no longer correct under Pebble’s.

Three classes, labelled in our notes as A, B, and C.

Bug A: a missing parent-directory fsync

The first failure came from snapshot installation on a follower. When a follower falls behind, the leader sends a snapshot — an entire database image. The follower receives the bytes, writes them to a temp file, fsyncs the file, and renames it into place.

The fsync is on the file. The rename is on the directory. On ext4 with data=ordered, the rename is durable only after the directory entry is fsync’d. Skip that, and a crash between the rename and the next implicit flush can leave you with the new file content but the old directory entry — or the new directory entry pointing at an empty file.

Under bbolt, this almost never mattered: bbolt’s read path was tolerant enough of intermediate states that recovery would refetch. Under Pebble, the same state could leave a directory marked “Pebble database” with no SSTables in it, and Pebble would happily open it as empty.

The fix is one line — fsync the parent directory after the rename. The bug had been theoretically present in etcd for years; Pebble’s stricter open-time behaviour just exposed it.
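The write, fsync, rename, fsync-the-directory sequence can be sketched as follows. This is a minimal illustration, not etcd's snapshot installer: installAtomically, syncDir, and demo are hypothetical names, and the sketch assumes a POSIX filesystem where a directory can be opened and fsync'd.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// installAtomically is a hypothetical helper: it writes data to a temp file
// next to dst, fsyncs the file, renames it into place, and then fsyncs the
// parent directory so the rename itself survives a crash.
func installAtomically(dst string, data []byte) error {
	dir := filepath.Dir(dst)

	tmp, err := os.CreateTemp(dir, ".snap-*.tmp")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	defer os.Remove(tmpName) // fails harmlessly once the rename has succeeded

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // makes the file contents durable
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if err := os.Rename(tmpName, dst); err != nil {
		return err
	}
	return syncDir(dir) // the step that was missing in Bug A
}

// syncDir fsyncs a directory so that entries created or renamed in it are durable.
func syncDir(dir string) error {
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}

// demo installs a small payload into a scratch directory and reads it back.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "snap-demo-*")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	dst := filepath.Join(dir, "db.snap")
	if err := installAtomically(dst, []byte("snapshot bytes")); err != nil {
		return "", err
	}
	got, err := os.ReadFile(dst)
	return string(got), err
}

func main() {
	got, err := demo()
	if err != nil {
		fmt.Println("install failed:", err)
		return
	}
	fmt.Printf("installed %d bytes\n", len(got))
}
```

Creating the temp file in the destination directory keeps the rename on a single filesystem, which is what makes it atomic in the first place.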

Bug B: a wire-size mismatch

The second class showed up under a different injection pattern: snapshots that arrived truncated.

A Pebble snapshot on the wire is header + body + trailer: a 17-byte engine-tagged header, the tar of a Pebble checkpoint, and a SHA-256 trailer. We had a Size() method that returned the body bytes and a WireSize() method that returned the full count. Both were correct. One call site, deep in snapshot_merge.go, used Size() where it should have used WireSize(): a 49-byte underestimate per snapshot (the 17-byte header plus the 32-byte SHA-256 trailer). Most of the time this was harmless. Under torn connections, the receiver’s pre-allocation check could trip between the size check and the trailer arrival, depending on TCP scheduling.

The fix is also one line. The regression test we’re more pleased with asserts that for every snapshot, WireSize() matches the actual bytes WriteTo produces:

n, err := snap.WriteTo(&counter)
assert.NoError(t, err)
assert.Equal(t, snap.WireSize(), n)

If they ever drift, the test fails, so future code can’t silently reintroduce the mismatch.
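To make the Size() versus WireSize() distinction concrete, here is a self-contained sketch of the invariant. The snapshot type, countingWriter, and checkWireSize are hypothetical stand-ins for our real types. The 17-byte header and the SHA-256 trailer come from the description above; having the trailer hash only the body is an assumption made for the example.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
)

const headerLen = 17 // engine-tagged header length

// snapshot is a hypothetical stand-in for the real snapshot type.
type snapshot struct {
	header [headerLen]byte
	body   []byte
}

// Size returns the body bytes only.
func (s *snapshot) Size() int64 { return int64(len(s.body)) }

// WireSize returns everything that goes on the wire: header + body + trailer.
func (s *snapshot) WireSize() int64 { return headerLen + s.Size() + sha256.Size }

// WriteTo writes the header, the body, and a SHA-256 trailer.
// Assumption: the trailer covers the body only.
func (s *snapshot) WriteTo(w io.Writer) (int64, error) {
	var total int64
	sum := sha256.Sum256(s.body)
	for _, part := range [][]byte{s.header[:], s.body, sum[:]} {
		n, err := w.Write(part)
		total += int64(n)
		if err != nil {
			return total, err
		}
	}
	return total, nil
}

// countingWriter discards its input and records how many bytes it saw.
type countingWriter struct{ n int64 }

func (c *countingWriter) Write(p []byte) (int, error) {
	c.n += int64(len(p))
	return len(p), nil
}

// checkWireSize is the invariant: WireSize must equal what WriteTo emits.
func checkWireSize(s *snapshot) error {
	var counter countingWriter
	n, err := s.WriteTo(&counter)
	if err != nil {
		return err
	}
	if n != s.WireSize() || counter.n != s.WireSize() {
		return fmt.Errorf("WireSize()=%d, WriteTo returned %d, writer saw %d",
			s.WireSize(), n, counter.n)
	}
	return nil
}

func main() {
	s := &snapshot{body: []byte("hello")}
	if err := checkWireSize(s); err != nil {
		fmt.Println("invariant broken:", err)
		return
	}
	// prints "size=5 wire=54 gap=49"
	fmt.Printf("size=%d wire=%d gap=%d\n", s.Size(), s.WireSize(), s.WireSize()-s.Size())
}
```

The gap between the two methods is constant, which is why the wrong call site stayed quiet until a connection tore at exactly the wrong moment.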

Bug C: a cascade

The third class was a consequence of B. When the snapshot-merge size check tripped, the leader would re-send the snapshot. Under rapid-retry conditions — multiple failed installations within seconds — the leader would have two or three in-flight snapshot streams to the same follower, each holding open TCP connections, each holding a Pebble snapshot, each preventing SSTable compaction.

On smaller members, compaction debt spiked briefly. On larger ones, the follower received interleaved byte streams from two senders and our (now-correct) WireSize check would fail them — producing what looked like a fourth bug class until we traced the chain back.

The fix is structural: a per-destination single-flight gate on snapshot send. While one snapshot is in flight to a given peer, additional send requests for that peer are short-circuited. Idle connections to peers that recently failed are closed eagerly. The regression test opens N parallel Send() calls to the same destination and asserts only one produces wire traffic at a time.
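A per-destination gate of this kind might look roughly like the sketch below. The names (sendGate, errInFlight, runParallel) are hypothetical, peers are identified by a plain uint64 for illustration, and the eager closing of idle connections is left out. The sketch shows only the short-circuit behaviour and a deterministic version of the N-parallel-senders check.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errInFlight is returned when a send to the same peer is already running.
var errInFlight = errors.New("snapshot send already in flight for peer")

// sendGate allows at most one in-flight snapshot send per destination peer.
type sendGate struct {
	mu       sync.Mutex
	inFlight map[uint64]struct{}
}

func newSendGate() *sendGate {
	return &sendGate{inFlight: make(map[uint64]struct{})}
}

// Send runs send for the given peer unless another send to that peer is
// still running, in which case it returns errInFlight without calling send.
func (g *sendGate) Send(peer uint64, send func() error) error {
	g.mu.Lock()
	if _, busy := g.inFlight[peer]; busy {
		g.mu.Unlock()
		return errInFlight
	}
	g.inFlight[peer] = struct{}{}
	g.mu.Unlock()

	defer func() {
		g.mu.Lock()
		delete(g.inFlight, peer)
		g.mu.Unlock()
	}()
	return send()
}

// runParallel starts n concurrent sends to one peer. The winner blocks until
// every other caller has been turned away, so the counts are deterministic.
func runParallel(n int) (sent, skipped int) {
	g := newSendGate()
	release := make(chan struct{})
	results := make(chan error, n)

	for i := 0; i < n; i++ {
		go func() {
			results <- g.Send(1, func() error {
				<-release // hold the slot until the test releases it
				return nil
			})
		}()
	}

	// The n-1 losers return immediately; the winner cannot finish yet.
	for i := 0; i < n-1; i++ {
		if err := <-results; errors.Is(err, errInFlight) {
			skipped++
		}
	}
	close(release)
	if err := <-results; err == nil {
		sent++
	}
	return sent, skipped
}

func main() {
	sent, skipped := runParallel(5)
	// prints "sent=1 skipped=4"
	fmt.Printf("sent=%d skipped=%d\n", sent, skipped)
}
```

Rejecting the extra callers rather than queueing them is a deliberate choice in this sketch: a queued sender would still need to pin a Pebble snapshot while it waits, which is the resource pressure the gate exists to avoid.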

What this taught us

Two things, mostly.

Chaos tests are forensic instruments, not pass/fail checks. The point isn’t to ship green; it’s to make the failure modes legible. Each bug class was diagnosed by stepping backwards from a specific divergence (at iterations 612, 738, and 944). The chaos run produced enough breadcrumbs that the post-mortem was tractable.

Quiet fixes earn their keep. A parent-directory fsync and a single-flight gate are not the kind of code one writes a blog post about; they are the kind of code that quietly prevents pages a year from now. Each got a targeted unit test alongside the fix — not because the chaos suite isn’t enough, but because the chaos suite is slow, and we want failures to fail fast next time they try to come back.

After all three fixes landed, we re-ran the smoke harness, a 50-iteration variant we use to gate the gate, and it passed clean. The full 1,000-iteration re-run is scheduled and not yet complete. The phase doesn’t close until we see zero divergences across the whole thousand; the next post reports that result.