An engineering notebook

Porting etcd's storage engine from bbolt to Pebble.

etcd has used a B+ tree (bbolt) as its on-disk store since the 3.x line. It has carried Kubernetes for a decade, but at the scale of modern control planes the seams show: an 8 GB practical ceiling, stop-the-world defragmentation, and an mmap memory model that depends on kernel quirks. This is a notebook of an in-progress effort to add a Pebble (LSM-tree) backend alongside bbolt — what we are doing, why, and what we are learning.

Status: in progress
Scope: opt-in --backend=pebble, bbolt stays default
Target: Linux + SSD/NVMe

Posts

  • 01

    The case for a second storage engine

    Why etcd is getting a Pebble (LSM-tree) backend alongside bbolt — what hurts at Kubernetes scale, and what we explicitly are not changing.

  • 02

    Sealing the bbolt leaks

    Before any Pebble code can land, four bbolt-typed APIs have to leave the public Backend interface. Boring work, load-bearing outcome.

  • 03

    How bbolt and Pebble differ

    A short primer on B+ trees and LSM-trees, and why the difference between them shows up at exactly the operational seams etcd operators care about.

  • 04

    Disabling Pebble's WAL

    etcd already has a Raft write-ahead log. Running Pebble's WAL on top is wasted fsyncs. The trade-off is that we have to be exactly right about a single integer.

  • 05

    What the chaos gate surfaced

    We ran the kill -9 gate. It surfaced three bugs in code we hadn't touched. Two were structural; one cascaded from another.

  • 06

    What we're explicitly not doing

    Big migrations succeed by what they refuse to take on. A list of things this milestone deliberately punts — and why each refusal earns its keep.

Where we are

Phase 1   Seal bbolt leaks                            Done
Phase 2   Engine factory + skeleton                   Done
Phase 3   Read / write / iterator / snapshot parity   Done
Phase 4   WAL-disabled durability gate                In progress
Phase 5   Conformance + chaos tests                   Done
Phase 6   Migration tool                              Up next
Phase 7   Benchmarks & operator docs                  Planned

Phase 4 is the hardest correctness gate of the milestone — 1,000 random kill -9 injections under write load must produce zero state divergences. See the posts for the journey.
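
For readers who want the shape of the Phase 4 problem before reading post 04, here is a minimal sketch in Go. It is a toy model, not etcd or Pebble code: `Entry`, `Store`, and every method on it are hypothetical names invented for illustration, and it assumes the "single integer" is the applied Raft index persisted atomically with the data. The idea it shows is that with the engine's own WAL off (in Pebble, the `DisableWAL` option), anything not yet flushed is lost on kill -9, so recovery has to replay the Raft log from exactly the durable applied index.

```go
package main

import "fmt"

// Entry is one committed Raft log entry (hypothetical shape).
type Entry struct {
	Index uint64
	Key   string
	Value string
}

// Store models a backend with its own WAL disabled: writes land in memory
// and become durable only at Flush, together with the applied index.
type Store struct {
	mem            map[string]string
	memApplied     uint64
	durable        map[string]string
	durableApplied uint64
}

func NewStore() *Store {
	return &Store{mem: map[string]string{}, durable: map[string]string{}}
}

func copyMap(m map[string]string) map[string]string {
	out := make(map[string]string, len(m))
	for k, v := range m {
		out[k] = v
	}
	return out
}

// Apply applies an entry in memory. Entries at or below the applied index
// are skipped, which makes replay idempotent.
func (s *Store) Apply(e Entry) {
	if e.Index <= s.memApplied {
		return
	}
	s.mem[e.Key] = e.Value
	s.memApplied = e.Index
}

// Flush makes the data and the applied index durable in one step.
// (Assumption: the real engine can persist both atomically.)
func (s *Store) Flush() {
	s.durable = copyMap(s.mem)
	s.durableApplied = s.memApplied
}

// Crash simulates kill -9: everything not flushed is gone.
func (s *Store) Crash() {
	s.mem = copyMap(s.durable)
	s.memApplied = s.durableApplied
}

// Recover replays the Raft log; Apply skips what is already durable.
func (s *Store) Recover(raftLog []Entry) {
	for _, e := range raftLog {
		s.Apply(e)
	}
}

func main() {
	raftLog := []Entry{{1, "a", "1"}, {2, "b", "2"}, {3, "a", "3"}}
	s := NewStore()
	s.Apply(raftLog[0])
	s.Flush()           // durable: a=1, applied index 1
	s.Apply(raftLog[1]) // in memory only
	s.Crash()           // b=2 is lost
	s.Recover(raftLog)  // replays entries 2 and 3
	fmt.Println(s.mem, s.memApplied)
}
```

If the durable applied index were ever ahead of the durable data, replay would skip entries that were never persisted; if it lagged, replay would redo entries, which is harmless only when applying them is idempotent. That asymmetry is why the gate checks for zero divergences rather than just a clean restart.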