etcd has run on the same storage engine for the entire life of the v3 line: bbolt, a single-file mmap’d B+ tree. A few thousand lines of Go, copy-on-write semantics, a clean transactional interface. Kubernetes, OpenShift, Cilium, and a long tail of other systems trust it every day.
It is also pushing against limits that did not exist ten years ago.
Where bbolt strains
Three operational stories come up at every scale that matters:
The 8 GiB ceiling. Officially the supported maximum is 8 GiB. In practice, large Kubernetes clusters approach it with revision history, leases, and large CRDs. Going past it works, until it doesn’t: single-writer concurrency in bbolt means tail latency from one slow operation spreads across the whole cluster, and the failure modes are uneven enough to make capacity planning a guessing game.
Stop-the-world defragmentation. bbolt’s free-list strategy is excellent at avoiding fragmentation in the steady state, but MVCC compaction can leave it holding a lot of free pages. The fix is etcdctl defrag — and online defrag locks the backend for the duration of the rewrite. On a busy member, that means watch lag, election risk, and pager noise. Operators have built whole runbooks around the defrag window.
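To make the defrag trade-off concrete, here is a minimal sketch of the arithmetic an operator might run before scheduling a defrag window. It assumes two inputs: the on-disk size of the backend file and the size logically in use, which etcd's endpoint status reports as the database size and the in-use size. The function names and the threshold policy are hypothetical, not part of etcd.

```go
package main

import "fmt"

// reclaimableBytes returns how many bytes a defrag could hand back,
// given the on-disk file size and the logically in-use size.
func reclaimableBytes(dbSize, dbSizeInUse int64) int64 {
	if dbSizeInUse >= dbSize {
		return 0
	}
	return dbSize - dbSizeInUse
}

// fragmentationRatio is the reclaimable fraction of the file, in [0, 1].
func fragmentationRatio(dbSize, dbSizeInUse int64) float64 {
	if dbSize <= 0 {
		return 0
	}
	return float64(reclaimableBytes(dbSize, dbSizeInUse)) / float64(dbSize)
}

// shouldDefrag applies a hypothetical policy: only pay for the
// stop-the-world rewrite when both the ratio and the absolute
// number of reclaimable bytes clear a threshold.
func shouldDefrag(dbSize, dbSizeInUse int64, minRatio float64, minBytes int64) bool {
	return fragmentationRatio(dbSize, dbSizeInUse) >= minRatio &&
		reclaimableBytes(dbSize, dbSizeInUse) >= minBytes
}

func main() {
	const gib = int64(1) << 30

	// Illustrative numbers only: a 6 GiB file with 2 GiB of live data.
	dbSize, inUse := 6*gib, 2*gib

	fmt.Printf("reclaimable: %d GiB (%.0f%% of file)\n",
		reclaimableBytes(dbSize, inUse)/gib,
		100*fragmentationRatio(dbSize, inUse))
	fmt.Println("defrag worthwhile:", shouldDefrag(dbSize, inUse, 0.5, gib))
}
```

Requiring both thresholds keeps a small, mostly empty database from triggering a lock that buys back almost nothing.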
An mmap memory model that depends on kernel quirks. bbolt uses mmap for reads and advises the kernel with MADV_RANDOM. The Linux 6.4 readahead changes that triggered bbolt#939 were a reminder that the storage layer’s working-set behaviour is one kernel release away from looking different. How an mmap’d database behaves is not a portable assumption.
None of this means bbolt is broken. It means bbolt was built for a problem that is no longer the only one we have.
Why Pebble specifically
The natural alternative is an LSM-tree. There are good ones — RocksDB, LevelDB, BadgerDB. We chose Pebble for four reasons:
- Pure Go. No CGo, no second toolchain, no language boundary in profiles. etcd’s build, deploy, and observability story stays one language.
- Production-proven at the right scale. CockroachDB has been running on Pebble since 2020 — at TB scale, inside a Raft-based system. That is the closest precedent to etcd’s architecture that exists in open source.
- The API maps cleanly onto our Backend interface. Indexed batches, snapshots, and iterators are the same primitives etcd already speaks. The integration is mostly translation, not invention.
- Tuning surface that fits the etcd workload. L0 sublevels, ClockPro block cache, per-level Zstd compression: these are well-matched to monotonically-increasing revision keys and watch range scans.
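Why monotonically increasing revision keys suit a sorted key-value engine can be shown in a few lines. The sketch below assumes a key layout in the spirit of etcd's revision encoding: a main revision and a sub revision, each written as 8 big-endian bytes and separated by an underscore. The helper names are hypothetical, and a sorted slice stands in for the engine; the point is only that big-endian encoding makes byte order equal numeric order, so new writes land at the tail of the keyspace and a watch catch-up is a contiguous range scan.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"sort"
)

// revKey encodes (mainRev, subRev) as 8 big-endian bytes, '_',
// then 8 more big-endian bytes: 17 bytes in total.
func revKey(mainRev, subRev int64) []byte {
	b := make([]byte, 17)
	binary.BigEndian.PutUint64(b[0:8], uint64(mainRev))
	b[8] = '_'
	binary.BigEndian.PutUint64(b[9:17], uint64(subRev))
	return b
}

// revLess compares two encoded keys bytewise, the way a sorted
// key-value engine would.
func revLess(a, b []byte) bool {
	return bytes.Compare(a, b) < 0
}

// rangeScan returns the keys in [start, end) from a slice that is
// already sorted bytewise, using binary search for both bounds.
func rangeScan(sorted [][]byte, start, end []byte) [][]byte {
	lo := sort.Search(len(sorted), func(i int) bool { return !revLess(sorted[i], start) })
	hi := sort.Search(len(sorted), func(i int) bool { return !revLess(sorted[i], end) })
	return sorted[lo:hi]
}

func main() {
	// Append keys in revision order, as a write path would.
	var keys [][]byte
	for rev := int64(1); rev <= 1000; rev++ {
		keys = append(keys, revKey(rev, 0))
	}

	// Big-endian keeps numeric order across byte boundaries (255 -> 256),
	// so the append order is already the engine's sort order.
	inOrder := sort.SliceIsSorted(keys, func(i, j int) bool { return revLess(keys[i], keys[j]) })
	fmt.Println("append order is byte order:", inOrder)

	// A watch catching up from revision 255 reads one contiguous range.
	got := rangeScan(keys, revKey(255, 0), revKey(258, 0))
	fmt.Println("keys in [255, 258):", len(got))
}
```

A little-endian encoding would break this property at every byte boundary, which is why the choice of encoding matters more than it first appears.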
The risk profile has also dropped. Pebble v2.1.5 (released a few days before this writing) is the first release where DisableWAL=true — the mode etcd needs, because etcd already has a Raft WAL — is a first-class, well-exercised configuration.
What this is not
The non-goals deserve equal time. A separate post on what we’re not doing comes later in the series; the short version:
- bbolt stays the default. This entire effort is opt-in via --backend=pebble. Existing deployments must keep working with no behaviour change. We are not pulling the rug.
- bbolt is not being removed. It remains a fully supported backend. We are adding an option, not replacing one.
- No dual-write or shadow modes. The validation strategy is the existing test suite parameterized by engine, plus chaos tests, plus operator-driven canary deploys. Not running both engines side-by-side in production.
- No bidirectional migration. Forward only, bbolt → Pebble. Rollback is restoring the bbolt backup directory the migration tool leaves behind. Cutting the surface in half cuts the test matrix more than in half.
- No client-visible changes. The wire protocol, gRPC API, and etcdctl semantics are unchanged. Operators see new flags and a new directory format. Clients see nothing.
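"The existing test suite parameterized by engine" has a simple shape: one set of assertions, run unchanged against every implementation. The sketch below shows that shape under heavy assumptions. The Engine interface and both in-memory implementations are hypothetical stand-ins, not etcd's backend interface and not bbolt or Pebble.

```go
package main

import (
	"fmt"
	"sort"
)

// Engine is a hypothetical, minimal storage contract for illustration.
type Engine interface {
	Put(key, value string)
	Get(key string) (string, bool)
	Range(start, end string) []string // keys in [start, end), sorted
}

// mapEngine stores data in a hash map and sorts on read.
type mapEngine struct{ m map[string]string }

func (e *mapEngine) Put(key, value string) { e.m[key] = value }
func (e *mapEngine) Get(key string) (string, bool) {
	v, ok := e.m[key]
	return v, ok
}
func (e *mapEngine) Range(start, end string) []string {
	var out []string
	for k := range e.m {
		if k >= start && k < end {
			out = append(out, k)
		}
	}
	sort.Strings(out)
	return out
}

// sliceEngine keeps entries sorted by key at all times.
type entry struct{ k, v string }
type sliceEngine struct{ entries []entry }

func (e *sliceEngine) find(key string) int {
	return sort.Search(len(e.entries), func(i int) bool { return e.entries[i].k >= key })
}
func (e *sliceEngine) Put(key, value string) {
	i := e.find(key)
	if i < len(e.entries) && e.entries[i].k == key {
		e.entries[i].v = value
		return
	}
	e.entries = append(e.entries, entry{})
	copy(e.entries[i+1:], e.entries[i:])
	e.entries[i] = entry{key, value}
}
func (e *sliceEngine) Get(key string) (string, bool) {
	i := e.find(key)
	if i < len(e.entries) && e.entries[i].k == key {
		return e.entries[i].v, true
	}
	return "", false
}
func (e *sliceEngine) Range(start, end string) []string {
	var out []string
	for i := e.find(start); i < len(e.entries) && e.entries[i].k < end; i++ {
		out = append(out, e.entries[i].k)
	}
	return out
}

// checkConformance is the shared suite: it knows nothing about which
// engine it is exercising.
func checkConformance(e Engine) error {
	e.Put("b", "1")
	e.Put("a", "2")
	e.Put("c", "3")
	e.Put("b", "4") // overwrite
	if v, ok := e.Get("b"); !ok || v != "4" {
		return fmt.Errorf("Get(b) = %q, %v; want \"4\", true", v, ok)
	}
	if _, ok := e.Get("z"); ok {
		return fmt.Errorf("Get(z) found a key that was never written")
	}
	got := e.Range("a", "c")
	if len(got) != 2 || got[0] != "a" || got[1] != "b" {
		return fmt.Errorf("Range(a, c) = %v; want [a b]", got)
	}
	return nil
}

func main() {
	engines := []struct {
		name string
		make func() Engine
	}{
		{"map", func() Engine { return &mapEngine{m: map[string]string{}} }},
		{"slice", func() Engine { return &sliceEngine{} }},
	}
	for _, eng := range engines {
		fmt.Printf("%s: conformance error = %v\n", eng.name, checkConformance(eng.make()))
	}
}
```

The design choice this illustrates is that the suite never branches on engine identity, which is what makes a "no test gets skipped" rule enforceable.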
What we are committing to
The bar is the same one bbolt set:
- The full etcd test suite (unit, integration, e2e, robustness, linearizability) passes against Pebble. No test gets skipped because “it’s hard on Pebble.”
- 1,000 random kill -9 injections under write load produce zero state divergences. This is a hard gate; no statistical pass.
- A populated bbolt data directory can be migrated to Pebble end-to-end with an integrity check that compares a canonical hash on both sides.
- An operator can read four guides — operator, tuning, migration runbook, troubleshooting — and successfully bring up a Pebble-mode cluster from scratch.
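The migration integrity check in the list above hinges on what "canonical" means. One plausible reading, sketched here as an assumption rather than the actual migration tool, is a hash computed over key-value pairs in sorted key order, with each field length-prefixed so that field boundaries cannot be confused. Plain maps stand in for the two data directories, and canonicalHash is a hypothetical name.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"sort"
)

// canonicalHash hashes all pairs in sorted key order. Each key and
// value is preceded by its length as 8 big-endian bytes, so that
// ("ab", "c") and ("a", "bc") produce different digests.
func canonicalHash(kvs map[string][]byte) string {
	keys := make([]string, 0, len(kvs))
	for k := range kvs {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	var lenBuf [8]byte
	writeField := func(b []byte) {
		binary.BigEndian.PutUint64(lenBuf[:], uint64(len(b)))
		h.Write(lenBuf[:])
		h.Write(b)
	}
	for _, k := range keys {
		writeField([]byte(k))
		writeField(kvs[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Same logical content, different insertion order.
	source := map[string][]byte{"a": []byte("1"), "b": []byte("2")}
	migrated := map[string][]byte{"b": []byte("2"), "a": []byte("1")}

	// One value changed during a (simulated) bad migration.
	corrupted := map[string][]byte{"a": []byte("1"), "b": []byte("3")}

	fmt.Println("migrated matches source:", canonicalHash(source) == canonicalHash(migrated))
	fmt.Println("corruption detected:", canonicalHash(source) != canonicalHash(corrupted))
}
```

Hashing logical content rather than file bytes is what lets two engines with entirely different on-disk layouts be compared at all.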
This notebook is the public record of getting from here to there. The next post is about the unglamorous work that has to come first: an interface with four leaks in it, and what it took to seal them.