How bbolt and Pebble differ

Pebble and bbolt look interchangeable from above the Backend interface — same Put, same Range, same Snapshot. Underneath, they are structurally different. To understand why we are going through this whole effort, it helps to know how they differ.

bbolt, briefly

bbolt is a single-file B+ tree. When you write a key, it traverses the tree, finds the right leaf page, and writes a modified copy of that page (copy-on-write) rather than overwriting it, so the on-disk image is never half-written. Reads mmap the file, so the OS page cache is the read cache. There is one writer at a time; readers don’t block writers, but a long-running read pins the page revisions it observed.

The design has real virtues: the code is small, recovery is essentially free because the file is always meta-page consistent, point reads are usually a single page fault, and range scans are sequential on disk.

Three structural costs show up when you push it hard:

Free-page accumulation. MVCC compaction deletes lots of keys. Their pages become free, but the file does not shrink. etcdctl defrag rewrites it to reclaim them — and locks the backend while it runs.
Single-writer concurrency. One write transaction at a time. Throughput is the speed of one writer.
Memory the kernel owns. mmap means the working set is whatever the kernel decides to keep resident. Usually fine. Sometimes not.
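
The first cost is easier to see in a toy model. The sketch below is not bbolt's allocator; pageFile, alloc, free, and defrag are hypothetical names invented for illustration. It only shows the shape of the problem: freeing pages makes them reusable but never shrinks the file, and only a full rewrite reclaims the space.

```go
package main

import "fmt"

// pageFile is a toy model of a single-file page store. It is NOT bbolt's
// allocator; it only shows why freeing pages does not shrink the file.
type pageFile struct {
	pages    int          // total pages in the file (stands in for file size)
	freelist []int        // page IDs available for reuse
	live     map[int]bool // page IDs currently holding data
}

func newPageFile() *pageFile { return &pageFile{live: map[int]bool{}} }

// alloc reuses a free page if one exists, otherwise grows the file.
func (f *pageFile) alloc() int {
	if n := len(f.freelist); n > 0 {
		id := f.freelist[n-1]
		f.freelist = f.freelist[:n-1]
		f.live[id] = true
		return id
	}
	id := f.pages
	f.pages++
	f.live[id] = true
	return id
}

// free returns a live page to the freelist; the file keeps its size.
func (f *pageFile) free(id int) {
	if f.live[id] {
		delete(f.live, id)
		f.freelist = append(f.freelist, id)
	}
}

// defrag models a rewrite into a fresh file that holds only live pages.
func (f *pageFile) defrag() *pageFile {
	out := newPageFile()
	for range f.live {
		out.alloc()
	}
	return out
}

func main() {
	f := newPageFile()
	for i := 0; i < 1000; i++ {
		f.alloc()
	}
	for id := 0; id < 900; id++ { // e.g. an MVCC compaction deleting old revisions
		f.free(id)
	}
	fmt.Println("before defrag: pages =", f.pages, "live =", len(f.live))
	g := f.defrag()
	fmt.Println("after defrag:  pages =", g.pages, "live =", len(g.live))
}
```

In this model the only way to get a smaller file is the rewrite in defrag, which is why the real operation has to hold the backend while it runs.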

Pebble, briefly

Pebble is a Log-Structured Merge-tree (LSM). Writes go to an in-memory memtable and an on-disk write-ahead log. When the memtable fills up, it is flushed to a Sorted String Table (SSTable) on disk: an immutable file containing keys in sorted order with a block index and a Bloom filter on top.

SSTables live in numbered levels. L0 holds the most-recently-flushed SSTables; L1 through L6 are larger, more compressed, more thoroughly merged. A background compaction process picks overlapping SSTables and rewrites them as a smaller number of larger SSTables one level down. Deletes are encoded as tombstones; compactions garbage-collect them.

So a key’s lifecycle is: WAL and memtable → flush to L0 → compact down through L1..L6 → eventually overwritten or tombstoned away. Reads check the memtable, then each level, with Bloom filters letting a point read skip SSTables that cannot contain the key.
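
A toy version of that lifecycle fits in a few dozen lines. This is a sketch under heavy simplifying assumptions, not Pebble's implementation: toyLSM and its methods are hypothetical names, each level is a single map rather than a set of SSTables, and the WAL and Bloom filters are left out entirely. What it does show is the newest-first read order and how a tombstone hides older values until compaction drops it.

```go
package main

import "fmt"

// entry is either a value or a tombstone (deleted == true).
type entry struct {
	value   string
	deleted bool
}

// toyLSM is a simplified model: one memtable plus a stack of flushed
// levels, newest first. Real levels hold many SSTables; here each is a map.
type toyLSM struct {
	memtable map[string]entry
	levels   []map[string]entry // levels[0] is the most recently flushed
	limit    int                // flush once the memtable holds this many keys
}

func newToyLSM(limit int) *toyLSM {
	return &toyLSM{memtable: map[string]entry{}, limit: limit}
}

func (db *toyLSM) put(k, v string) { db.write(k, entry{value: v}) }
func (db *toyLSM) del(k string)    { db.write(k, entry{deleted: true}) }

func (db *toyLSM) write(k string, e entry) {
	db.memtable[k] = e
	if len(db.memtable) >= db.limit {
		db.flush()
	}
}

// flush freezes the memtable and pushes it on top as the newest level.
func (db *toyLSM) flush() {
	if len(db.memtable) == 0 {
		return
	}
	db.levels = append([]map[string]entry{db.memtable}, db.levels...)
	db.memtable = map[string]entry{}
}

// get checks the memtable, then each level from newest to oldest.
// The first hit wins, so a tombstone hides any older value.
func (db *toyLSM) get(k string) (string, bool) {
	if e, ok := db.memtable[k]; ok {
		return e.value, !e.deleted
	}
	for _, lvl := range db.levels {
		if e, ok := lvl[k]; ok {
			return e.value, !e.deleted
		}
	}
	return "", false
}

// compact merges every level into one. Tombstones can be dropped because
// nothing older remains underneath the merged output.
func (db *toyLSM) compact() {
	merged := map[string]entry{}
	for i := len(db.levels) - 1; i >= 0; i-- { // oldest first; newer overwrites
		for k, e := range db.levels[i] {
			merged[k] = e
		}
	}
	for k, e := range merged {
		if e.deleted {
			delete(merged, k)
		}
	}
	db.levels = []map[string]entry{merged}
}

func main() {
	db := newToyLSM(2)
	db.put("a", "1")
	db.put("b", "2") // memtable full: flushed
	db.del("a")
	db.put("c", "3") // flushed again; the tombstone for "a" is now on disk
	_, found := db.get("a")
	fmt.Println("a visible:", found, "levels:", len(db.levels))
	db.compact()
	fmt.Println("levels after compact:", len(db.levels), "keys:", len(db.levels[0]))
}
```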

The virtues that matter for etcd:

Compaction is continuous and online. No “defrag window” — space is reclaimed by the same machinery that organizes the levels.
Compression is per-level. L0–L4 use Snappy/MinLZ for cheap CPU; L5–L6 use Zstd-1 for the ratio. Protobuf-heavy Kubernetes payloads compress 3-5×.
Larger working sets behave gracefully. The block cache is sized explicitly, not at the mercy of the kernel’s mmap heuristics.
TB-scale precedent. CockroachDB has been on Pebble since 2020, inside a Raft-based system, at sizes etcd does not currently target.

The costs:

Write amplification. Keys are rewritten several times as they migrate down levels; per-level tuning controls how aggressively.
Read amplification. A worst-case point read touches the memtable plus every level — mitigated by Bloom filters, but real on cold caches.
More moving parts. Compaction debt, level sizes, sublevel counts, block-cache pressure — new metrics, new runbooks.
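
Both amplification figures come down to simple ratios and counts. The sketch below uses the usual definitions as I understand them: write amplification is bytes physically written divided by bytes the application asked to write, and read amplification is the number of on-disk sorted runs a point read may have to consult (L0 sublevels plus non-empty lower levels). The type ioCounters and both functions are hypothetical, and the numbers in main are made up for illustration, not measurements.

```go
package main

import "fmt"

// ioCounters holds hypothetical byte counters for one observation window.
type ioCounters struct {
	userBytes    uint64 // bytes the application asked to write
	walBytes     uint64 // bytes appended to the write-ahead log
	flushBytes   uint64 // bytes written when memtables were flushed
	compactBytes uint64 // bytes rewritten by compactions
}

// writeAmp is physical bytes written divided by logical bytes written.
func (c ioCounters) writeAmp() float64 {
	if c.userBytes == 0 {
		return 0
	}
	return float64(c.walBytes+c.flushBytes+c.compactBytes) / float64(c.userBytes)
}

// readAmp counts the on-disk sorted runs a point read may consult:
// every L0 sublevel plus every non-empty level below L0. The memtable
// lookup is in memory and is not counted here.
func readAmp(l0Sublevels int, lowerLevelNonEmpty []bool) int {
	n := l0Sublevels
	for _, nonEmpty := range lowerLevelNonEmpty {
		if nonEmpty {
			n++
		}
	}
	return n
}

func main() {
	// Illustrative numbers only.
	c := ioCounters{userBytes: 100, walBytes: 100, flushBytes: 100, compactBytes: 300}
	fmt.Printf("write amplification: %.1f\n", c.writeAmp())
	fmt.Println("read amplification:", readAmp(3, []bool{true, false, true}))
}
```

Tracking the two numbers side by side is useful because tuning tends to trade one for the other: compacting more aggressively lowers read amplification at the price of more rewritten bytes.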

What changes at the operator seams

Most of the surface stays the same: same etcdctl, same gRPC, same on-disk location (although ${data-dir}/member/snap/db flips from a single file to a directory). But four operational seams shift visibly:

Seam	bbolt	Pebble
Defrag	Stop-the-world rewrite; locks the backend	Non-blocking background compaction; `etcdctl defrag` triggers a manual `db.Compact`
Memory	mmap’d, kernel-managed, opaque	Explicit block cache + memtable budget, cgroup-aware
Compression	None	Per-level (configurable), default `fast` profile
Quota	File size vs `--quota-backend-bytes`	`SizeInUse()` vs `--quota-backend-bytes` (compaction debt accounted)

There is one operationally subtle case: a long-running pebble.Snapshot — the LSM analogue of bbolt’s stable read view — pins SSTables and prevents them from being compacted away. A 4-hour stale watch on a write-heavy cluster could pin hundreds of gigabytes. The mitigation is structural: snapshots are not held across RPC boundaries. The watch path in Pebble mode re-iterates per notification with LowerBound = lastEmittedRev + 1. Different code path; same observable behaviour.
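
The re-iteration pattern can be sketched without any storage engine at all. In the toy below, revStore stands in for a revision-keyed backend and scanFrom for a short-lived bounded iteration; all of these names are hypothetical, and key-range filtering is omitted. The point is that the watcher carries only lastEmittedRev between notifications, so nothing in the store stays pinned while it is idle.

```go
package main

import (
	"fmt"
	"sort"
)

// event is one revision-keyed change.
type event struct {
	rev int64
	key string
}

// revStore is a stand-in for a revision-keyed backend.
// events is kept sorted by rev in ascending order.
type revStore struct{ events []event }

// append records a change and returns the revision assigned to it.
func (s *revStore) append(key string) int64 {
	rev := int64(len(s.events)) + 1
	s.events = append(s.events, event{rev, key})
	return rev
}

// scanFrom models a short-lived bounded iteration: it copies out every
// event with rev >= lowerBound and holds no reference to the store afterwards.
func (s *revStore) scanFrom(lowerBound int64) []event {
	i := sort.Search(len(s.events), func(i int) bool {
		return s.events[i].rev >= lowerBound
	})
	return append([]event(nil), s.events[i:]...)
}

// watcher keeps only a revision cursor, never a long-lived snapshot.
type watcher struct{ lastEmittedRev int64 }

// poll re-iterates from lastEmittedRev + 1 and advances the cursor.
func (w *watcher) poll(s *revStore) []event {
	out := s.scanFrom(w.lastEmittedRev + 1)
	if n := len(out); n > 0 {
		w.lastEmittedRev = out[n-1].rev
	}
	return out
}

func main() {
	s := &revStore{}
	w := &watcher{}
	s.append("/registry/pods/a")
	s.append("/registry/pods/b")
	fmt.Println("first poll:", len(w.poll(s)), "events")
	fmt.Println("idle poll:", len(w.poll(s)), "events")
	s.append("/registry/pods/c")
	fmt.Println("next poll:", len(w.poll(s)), "events, cursor at", w.lastEmittedRev)
}
```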

What stays the same

The shape of correctness does not change. Raft consensus and the WAL are still etcd’s durability story. MVCC’s revision-keyed history works on either backend. Watches see the same key-range subscriptions. The auth store, lease manager, and snapshot transfer pipeline see different bytes underneath, but the same semantics on top.

You can think of the whole effort as: the abstraction was waiting for a second implementation. It just took a decade and a Kubernetes-scale forcing function to actually go build one.

The next post is about the first place where “same semantics” gets genuinely hard: durability. Pebble has its own write-ahead log. etcd already has a Raft write-ahead log. Running both is wasteful. Running only one requires being very careful about what “flushed” means.