The case for a second storage engine
Why etcd is getting a Pebble (LSM-tree) backend alongside bbolt — what hurts at Kubernetes scale, and what we explicitly are not changing.
An engineering notebook
etcd has used a B+ tree (bbolt) as its on-disk store since the 3.x line. It has carried Kubernetes for a decade, but at the scale of modern control planes the seams show: an 8 GB practical ceiling on database size, stop-the-world defragmentation, and an mmap memory model that depends on kernel quirks. This notebook documents an in-progress effort to add a Pebble (LSM-tree) backend alongside bbolt: what we are doing, why, and what we are learning.
Posts
Before any Pebble code can land, four bbolt-typed APIs have to leave the public Backend interface. Boring work, load-bearing outcome.
A short primer on B+ trees and LSM-trees, and why the difference between them shows up at exactly the operational seams etcd operators care about.
etcd already has a Raft write-ahead log. Running Pebble's WAL on top is wasted fsyncs. The trade-off is that we have to be exactly right about a single integer.
We ran the kill -9 gate. It surfaced three bugs in code we hadn't touched. Two were structural; one cascaded from another.
Big migrations succeed by what they refuse to take on. A list of things this milestone deliberately punts — and why each refusal earns its keep.
Where we are
Phase 4 is the hardest correctness gate of the milestone — 1,000 random kill -9 injections under write load must produce zero state divergences. See the posts for the journey.