There is a certain kind of engineering trap where you build something sophisticated, and the sophistication itself becomes the product. You stop asking "does this solve the problem?" and start asking "isn't this impressive?"
The trap is seductive because the sophisticated thing looks right. It uses the correct vocabulary. It references the right papers. The README sounds like something a senior engineer at a FAANG company would write. You show it to people and they nod slowly, impressed by terms they don't fully understand — which means you don't have to fully understand them either.
I fell into that trap building SherifDB. Then a friend called me out — in Swahili — and I climbed out of it.
This is that story. It is also a story about what a real storage engine looks like when you care more about correctness than about sounding smart.
I wanted to understand storage engines from the inside. Not use one — build one. The kind of project where you read a paper, implement it badly, break it, read the paper again, and finally understand what the authors were actually saying.
The paper I chose was the Bitcask design from Basho Technologies. Bitcask is elegant in its simplicity: append every write to a log file, and keep an in-memory hash table (the keydir) that maps each key to the offset of its latest record on disk.
That is the whole design. Append-only log plus in-memory index. The simplicity is the point — sequential writes are fast on every storage medium, reads are a single seek, and the data model is easy to reason about.
I understood the design. What I built did not reflect that understanding.
SherifDB started with the right bones: append-only log, in-memory hash index, CRC32 checksums on every record. The data format was clean:
```
[ crc32(4) | timestamp(8) | kLen(4) | vLen(4) | key | value ]
```
Four bytes of checksum. Eight bytes of timestamp. Four bytes each for key and value lengths. Then the raw bytes. Every record self-describing. You could hexdump the file and read it without running any application code.
But I wanted it to be fast. And "fast" in my head at the time meant "lock-free."
So I added a ring buffer — 1024 pre-allocated 4KB slots, managed with atomic Compare-And-Swap operations. No GC pressure. No mutex contention. Goroutines would pop a buffer slot, encode their record, write to disk, push the slot back — all without blocking on a lock.
```go
type RingBuffer struct {
	storage [][]byte
	size    uint32
	head    uint32 // read from here
	tail    uint32 // write to here
}
```
I also added runtime.Pinner to pin Go memory during syscalls, the idea being that the garbage collector should not move our buffer mid-write. I had read about Pinner in the Go internals documentation and wanted to use it.
The README talked about "kernel-safe interaction," "mechanical sympathy," and "deterministic ordering." It sounded like systems engineering.
It was broken in three separate ways, and it solved zero real problems.
I shared the code with a friend. He looked at ring.go for a few
minutes and sent back this message:
"Ninja — kwa the ring buffer, method Push, unaweka data kwa slot you don't own. Is you need to own it first ndio you place the data."
For those who do not speak Swahili — and this translation matters, because the criticism was precise:
"Man — for the ring buffer, method Push, you are placing data in a slot you do not own. You need to own it first, then you place the data."
He was exactly right. Look at Push():
```go
func (rb *RingBuffer) Push(b []byte) error {
	h := atomic.LoadUint32(&rb.head)
	t := atomic.LoadUint32(&rb.tail)
	if t-h == rb.size {
		return ErrBufferFull
	}
	rb.storage[t%rb.size] = b // ← write BEFORE claiming ownership
	if atomic.CompareAndSwapUint32(&rb.tail, t, t+1) {
		return nil
	}
	return rb.Push(b)
}
```
The write happens before the CAS. Two goroutines can both read the same value of
t, both pass the full-check, both write to
storage[t%rb.size]. One CAS succeeds, one retries. The winner thinks
it successfully pushed its buffer — but it is holding whatever the loser wrote last.
Silent data corruption. The kind that does not panic. It just quietly gives you the
wrong answer.
The fix is not to swap those two lines. You cannot CAS-then-write atomically for a slice assignment in Go without a more complex ownership protocol. The real fix is to ask the question my friend's correction was implying:
Why do we need a ring buffer at all?
Lock-free data structures exist to solve a specific problem: contention on a shared resource across many concurrent goroutines, where mutex overhead becomes a measurable bottleneck. You profile your code, find the mutex is the hot spot, and reach for atomics as a targeted optimization.
I had not done any of that. I had reached for atomics because they sounded fast.
Look at the actual write path:
caller → encode record into buffer → seek to end of file → write → fsync → update index
There is one file. Writes must be serialized because seek + write is
not atomic — if two goroutines interleave a seek and a write, their records overlap
on disk and both are corrupt. So I already needed a mutex protecting the file handle.
And once you have a mutex protecting the file, only one goroutine is ever in
Write() at a time. Which means you only ever need one scratch
buffer. Not 1024.
The ring buffer was a pool of 1024 buffers solving a problem that required exactly one.
There were two more bugs hiding inside it:
**runtime.Pinner does nothing here.** Pinner is designed
for passing Go pointers to C code via cgo, to prevent the GC from moving memory
that C is reading. os.File.Write() is a pure Go function — the runtime
handles any necessary pinning internally during the syscall. My use of Pinner was
cargo-culted from cgo documentation that had nothing to do with my code. It added
lines, added complexity, and had exactly zero effect on correctness or performance.
**Head and tail are initialized backwards.** The ring buffer starts
with tail = uint32(capacity), not 0. In a standard ring buffer, both
start at 0 — empty when head == tail, full when
tail - head == capacity. My initialization started tail at
capacity positions ahead of head, meaning the empty check would never
trigger on a fresh buffer. Empty and full were logically swapped from the start.
Three bugs. Zero useful problems solved. The most sophisticated piece of code in the repository was the most wrong piece of code in the repository.
Before I tell you what I deleted and what I built instead, I want to be clear about something: this is not a story about how simple is always better. It is a story about how every line of code must justify its existence against the problem it claims to solve.
Some tradeoffs I made deliberately and would make again:
**fsync on every write, no exceptions.** I was asked whether to add a
SyncOnWrite bool option — let the caller skip fsync for performance.
I said no. The moment you add SyncOnWrite: false, someone sets it,
forgets what it means, loses data on a crash, and the library is "unreliable."
SherifDB's identity is that every write is durable, always. That is a stronger
guarantee than "durable if you configured it correctly." The fsync is the cost of
the contract. You do not get durability for free.
**No compaction in the hot path.** The append-only log grows
indefinitely with overwrites — every time you write a key that already exists, the
old record stays on disk. This is the same tradeoff Bitcask makes. Disk is cheap.
Sequential write throughput matters. I considered auto-compaction triggered by file
size, but rejected it: Set() would sometimes take milliseconds and
sometimes take seconds, depending on whether compaction triggered. Unpredictable
latency is worse than a slightly larger file. Compaction is a deliberate
Compact() call the caller makes on their own schedule.
**No parallel compaction.** The bottleneck in compaction is disk I/O, not CPU. Parallel readers on a single disk turn sequential reads into random reads and destroy the kernel's readahead. Parallel compaction would add ~125 LOC, introduce goroutine coordination complexity, and run slower on any single-disk machine. Rejected.
**No Options struct.** Options structs grow. Today it is
SyncOnWrite. Tomorrow it is MaxFileSize,
CompactionThreshold, ReadOnly, BufferPool.
Each option is a new way to misconfigure the engine. SherifDB has one correct
configuration. The six public functions are the entire API surface. That is
intentional.
These were not lazy decisions. They were the result of asking, for each feature: what problem does this solve, who has that problem, and what does the code cost?
- ring.go → deleted entirely (-60 LOC, 3 bugs removed)
- runtime.Pinner → deleted (-5 LOC, 0 functionality lost)
- restore() → inlined, was 3 lines (-4 LOC, clearer call site)
- ErrDeleted → declared, never used (-1 LOC, dead code)
- layout comment → moved to README (-6 LOC, docs belong in docs)
Net deletion: 76 lines. Net improvement in correctness: significant.
Deleting code improved the correctness of the system. That is the thing nobody
tells you when they are writing excited blog posts about lock-free data structures.
A sync.Mutex is 5 lines any engineer can reason about in 30 seconds.
An atomic ring buffer is 60 lines that experts argue about. When the mutex is
correct and the ring buffer is broken, you delete the ring buffer.
A single file. 488 LOC. Six public functions:
```go
func Open(path string) (*DB, error)
func (db *DB) Set(key, value []byte) error
func (db *DB) Get(key []byte) ([]byte, error)
func (db *DB) Delete(key []byte) error
func (db *DB) Compact() error
func (db *DB) Close() error
```
The DB struct:
```go
type DB struct {
	mu      sync.Mutex // serializes all writes
	file    *os.File   // the append-only log
	lock    *os.File   // the exclusive lockfile
	index   *keyDir    // in-memory key → offset mapping
	scratch []byte     // one reusable encode buffer
	path    string
}
```
One mutex. One file handle. One scratch buffer that grows lazily if a record exceeds its current capacity. That replaces 1024 pre-allocated 4KB slots with one slice that allocates exactly once per size class ever seen.
**Durability — fsync after every write.**
os.File.Write() returning success means the data is in the OS page
cache. It does not mean it is on disk. The kernel can hold data in memory for
seconds before flushing. If power cuts in that window, the write is gone — but your
keydir thinks it succeeded. file.Sync() is the only syscall that gives
you the actual durability guarantee. Every write in SherifDB ends with it.
**Crash recovery — CRC32 with torn write truncation.** Every record carries a 4-byte checksum over the entire record payload. On startup, log replay verifies each checksum. On failure — torn write, partial record, power loss mid-write — the loop breaks and the file is truncated to the last known good offset. The database wakes up at the last fully committed record. Without this, a crash leaves a corrupt tail. Next startup reads garbage lengths, tries to allocate gigabytes, and either OOMs or panics.
**Tombstone deletes — vLen = 0xFFFFFFFF.**
A delete cannot erase a record from an append-only log. So we append a tombstone:
same header format, vLen set to 0xFFFFFFFF, no value
bytes. Log replay removes the key from the keydir when it encounters a tombstone.
Without tombstones, a delete means "remove from the in-memory keydir." Next
restart, the log replay encounters the original write record and adds the key back.
Deleted keys resurrect. This happens on the very first restart after a delete.
Tombstones are not optional in an append-only design.
**Single writer — lockfile with O_EXCL.**
```go
lf, err := os.OpenFile(path+".lock", os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0600)
```
O_EXCL tells the kernel: create this file, but fail if it already
exists. This is atomic at the filesystem level. Two processes racing to open the
same database will both attempt to create the lockfile — exactly one will succeed.
Without this, two processes can interleave appends. Individual records may be
perfectly valid — correct CRC, correct lengths — but the log as a sequence is
unreadable. No amount of checksumming fixes interleaved writes.
**Fast restarts — hint file.** Log replay is O(data) — you scan every byte in the file to rebuild the keydir. For a database with 10,000 keys and 500MB of value data, that is 500MB of reads on every startup just to extract 10,000 offsets. The hint file is a compact index snapshot written on clean close:
```
[ kLen(4) | offset(8) | size(4) | key bytes ]
```
No values. Just the metadata the keydir needs. Startup reads the hint file in O(keys) and skips the log entirely. If the hint file is missing or corrupt, we fall back to full log replay. The hint file accelerates the common case; the log is the safety net.
You can read the actual data file with hexdump -C. That is a feature,
not an accident.
```
00000000  4b de 8e f2 02 19 c7 b0  98 26 a9 18 08 00 00 00  |K........&......|
00000010  10 00 00 00 65 6d 6d 61  6e 75 65 6c 73 79 73 74  |....emmanuelsyst|
00000020  65 6d 73 5f 65 6e 67 69  6e 65 65 72              |ems_engineer|
```
Breaking it down:
```
4b de 8e f2               → CRC32 checksum
02 19 c7 b0 98 26 a9 18   → timestamp (Unix nanoseconds)
08 00 00 00               → kLen = 8 (little-endian)
10 00 00 00               → vLen = 16 (little-endian)
65 6d 6d 61 6e 75 65 6c   → key = "emmanuel"
73 79 73 74 65 6d 73 5f
65 6e 67 69 6e 65 65 72   → value = "systems_engineer"
```
A tombstone looks identical except vLen = ff ff ff ff and there are
no value bytes after the key. The checksum covers the ff ff ff ff
too — a corrupt tombstone is caught the same way as any corrupt record.
You do not need the source code to read this file. You need hexdump
and this format spec. When something goes wrong — and something will — you want to
be able to inspect the raw data without trusting the application layer.
Write "emmanuel" → "engineer" ten times and you have ten records on disk, but only the last one is live. The other nine are dead — not referenced by the keydir, not readable by Get, just occupying space.
Compact() reclaims that space. The sequence:
1. Copy every live record to a .compact temp file.
2. fsync the temp file.
3. os.Rename over the old file — atomic at the kernel level.

The rename is the critical step. On POSIX systems, rename is atomic — the destination path atomically points to the new file. There is no window where the path does not exist, or points to a partially-written file. If the process dies during compaction, the old data file is intact. If the rename succeeds, the compacted file is fully written and synced. You get one or the other. Never a half-compacted mess.
Compact() is a caller decision, not automatic. The engine does not
decide when your disk is too full. You do. Call it during a maintenance window,
on a schedule, after a bulk delete. This is what "library not application" means
in practice: the engine provides the mechanism, the caller provides the policy.
| Metric | Clever version | Final version |
|---|---|---|
| LOC | ~195 | 488 |
| Ring buffer | 60 LOC, 3 bugs | deleted |
| runtime.Pinner | present, useless | deleted |
| Durability | no fsync | fsync every write |
| Crash recovery | none | CRC + truncation |
| Tombstone deletes | no | yes |
| Single writer | no | lockfile |
| Fast restarts | no | hint file |
| Compaction | no | Compact() |
| Public API | none exported | 6 functions |
| Dependencies | stdlib | stdlib |
The final version is bigger because it actually does things. The clever version was smaller because it was mostly impressive-sounding code that did nothing useful and several things incorrectly.
**Lock-free is a last resort, not a starting point.**
The correctness surface of atomic operations in Go is brutal. The memory model is
precise and unforgiving. A sync.Mutex gives you an explicit
happens-before relationship any engineer can reason about in 30 seconds. Reach for
atomics only when you have proven the mutex is the bottleneck — with a profiler,
not intuition.
**Complexity must justify itself against the actual problem.**
The ring buffer would earn its place if I had multiple goroutines encoding records
in parallel before a batch flush, or a dedicated I/O goroutine draining a
pre-encoded queue. I had neither. I had a write path where every operation ends
with fsync. The mutex overhead on top of an fsync is unmeasurable
noise. The ring buffer's complexity was unjustified by the architecture.
**Deleting code is engineering.** The ring buffer deletion was not giving up. It was recognizing that code has a cost — bugs, maintenance burden, cognitive load for every future reader — and that cost must be paid back by the problem the code solves. The ring buffer's cost was real. Its benefit was zero.
**Small is a feature with consequences.** 488 LOC means the entire engine fits in one sitting. It means I can audit every guarantee the code makes. It means when something breaks I know exactly where to look. That is not a consolation prize for not having more features. It is a design goal that constrained every decision.
**Your friends will catch what you miss.** I stared at that ring buffer for days. I was too close to it, too invested in the idea that lock-free meant fast meant good. It took someone who did not care about my cleverness to read it fresh and say, in Swahili, "you are writing to a slot you do not own." Correctness review from someone who has not read the design doc is worth more than an hour of self-review.
The code is at github.com/Emmanuel326/sheriffDB. Single file. No dependencies. Six functions. Every guarantee documented and justified.
If you find a bug, open an issue. If you think a tradeoff is wrong, argue it — but bring a reason, not just an instinct. Every decision in this codebase was made deliberately and can be defended.
Built during late nights in Nairobi. The Swahili code review is real and lives in the conversation history forever. Some of the best technical feedback I have received came in a language I grew up with, about a bug in a language I am still learning. That detail felt worth keeping.