I Deleted My Clever Code and the Database Got Better

There is a certain kind of engineering trap where you build something sophisticated, and the sophistication itself becomes the product. You stop asking "does this solve the problem?" and start asking "isn't this impressive?"

The trap is seductive because the sophisticated thing looks right. It uses the correct vocabulary. It references the right papers. The README sounds like something a senior engineer at a FAANG company would write. You show it to people and they nod slowly, impressed by terms they don't fully understand — which means you don't have to fully understand them either.

I fell into that trap building SherifDB. Then a friend called me out — in Swahili — and I climbed out of it.

This is that story. It is also a story about what a real storage engine looks like when you care more about correctness than about sounding smart.


Why I built this

I wanted to understand storage engines from the inside. Not use one — build one. The kind of project where you read a paper, implement it badly, break it, read the paper again, and finally understand what the authors were actually saying.

The paper I chose was the Bitcask design from Basho Technologies. Bitcask is elegant in its simplicity: every write is appended to a log file, and an in-memory hash table maps each key to the file offset of its latest record.

That is the whole design. Append-only log plus in-memory index. The simplicity is the point — sequential writes are fast on every storage medium, reads are a single seek, and the data model is easy to reason about.

I understood the design. What I built did not reflect that understanding.


The original design (the clever one)

SherifDB started with the right bones: append-only log, in-memory hash index, CRC32 checksums on every record. The data format was clean:

[ crc32(4) | timestamp(8) | kLen(4) | vLen(4) | key | value ]

Four bytes of checksum. Eight bytes of timestamp. Four bytes each for key and value lengths. Then the raw bytes. Every record self-describing. You could hexdump the file and read it without running any application code.

But I wanted it to be fast. And "fast" in my head at the time meant "lock-free."

So I added a ring buffer — 1024 pre-allocated 4KB slots, managed with atomic Compare-And-Swap operations. No GC pressure. No mutex contention. Goroutines would pop a buffer slot, encode their record, write to disk, push the slot back — all without blocking on a lock.

type RingBuffer struct {
    storage [][]byte
    size    uint32
    head    uint32  // read from here
    tail    uint32  // write to here
}

I also added runtime.Pinner to pin Go memory during syscalls. The idea was that the garbage collector should not move the buffer mid-write. I had read about this in the Go internals documentation and wanted to use it.

The README talked about "kernel-safe interaction," "mechanical sympathy," and "deterministic ordering." It sounded like systems engineering.

It was broken in three separate ways, and it solved zero real problems.


A friend read it — in Swahili

I shared the code with a friend. He looked at ring.go for a few minutes and sent back this message:

"Ninja — kwa the ring buffer, method Push, unaweka data kwa slot you don't own. Is you need to own it first ndio you place the data."

For those who do not speak Swahili — and this translation matters, because the criticism was precise:

"Man — for the ring buffer, method Push, you are placing data in a slot you do not own. You need to own it first, then you place the data."

He was exactly right. Look at Push():

func (rb *RingBuffer) Push(b []byte) error {
    h := atomic.LoadUint32(&rb.head)
    t := atomic.LoadUint32(&rb.tail)

    if t-h == rb.size {
        return ErrBufferFull
    }

    rb.storage[t%rb.size] = b   // ← write BEFORE claiming ownership

    if atomic.CompareAndSwapUint32(&rb.tail, t, t+1) {
        return nil
    }
    return rb.Push(b)
}

The write happens before the CAS. Two goroutines can both read the same value of t, both pass the full-check, both write to storage[t%rb.size]. One CAS succeeds, one retries. The winner thinks it successfully pushed its buffer — but it is holding whatever the loser wrote last. Silent data corruption. The kind that does not panic. It just quietly gives you the wrong answer.

The fix is not to swap those two lines. You cannot CAS-then-write atomically for a slice assignment in Go without a more complex ownership protocol. The real fix is to ask the question my friend's correction was implying:

Why do we need a ring buffer at all?


The harder question

Lock-free data structures exist to solve a specific problem: contention on a shared resource across many concurrent goroutines, where mutex overhead becomes a measurable bottleneck. You profile your code, find the mutex is the hot spot, and reach for atomics as a targeted optimization.

I had not done any of that. I had reached for atomics because they sounded fast.

Look at the actual write path:

caller → encode record into buffer → seek to end of file → write → fsync → update index

There is one file. Writes must be serialized because seek + write is not atomic — if two goroutines interleave a seek and a write, their records overlap on disk and both are corrupt. So I already needed a mutex protecting the file handle.

And once you have a mutex protecting the file, only one goroutine is ever in Write() at a time. Which means you only ever need one scratch buffer. Not 1024.

The ring buffer was a pool of 1024 buffers solving a problem that required exactly one.

There were two more bugs hiding inside it:

runtime.Pinner does nothing here. Pinner is designed for passing Go pointers to C code via cgo, to prevent the GC from moving memory that C is reading. os.File.Write() is a pure Go function — the runtime handles any necessary pinning internally during the syscall. My use of Pinner was cargo-culted from cgo documentation that had nothing to do with my code. It added lines, added complexity, and had exactly zero effect on correctness or performance.

Head and tail are initialized backwards. The ring buffer starts with tail = uint32(capacity), not 0. In a standard ring buffer, both start at 0 — empty when head == tail, full when tail - head == capacity. My initialization started tail at capacity positions ahead of head, meaning the empty check would never trigger on a fresh buffer. Empty and full were logically swapped from the start.

Three bugs. Zero useful problems solved. The most sophisticated piece of code in the repository was the most wrong piece of code in the repository.


Every tradeoff was a decision

Before I tell you what I deleted and what I built instead, I want to be clear about something: this is not a story about how simple is always better. It is a story about how every line of code must justify its existence against the problem it claims to solve.

Some tradeoffs I made deliberately and would make again:

fsync on every write, no exceptions. I was asked whether to add a SyncOnWrite bool option — let the caller skip fsync for performance. I said no. The moment you add SyncOnWrite: false, someone sets it, forgets what it means, loses data on a crash, and the library is "unreliable." SherifDB's identity is that every write is durable, always. That is a stronger guarantee than "durable if you configured it correctly." The fsync is the cost of the contract. You do not get durability for free.

No compaction in the hot path. The append-only log grows indefinitely with overwrites — every time you write a key that already exists, the old record stays on disk. This is the same tradeoff Bitcask makes. Disk is cheap. Sequential write throughput matters. I considered auto-compaction triggered by file size, but rejected it: Set() would sometimes take milliseconds and sometimes take seconds, depending on whether compaction triggered. Unpredictable latency is worse than a slightly larger file. Compaction is a deliberate Compact() call the caller makes on their own schedule.

No parallel compaction. The bottleneck in compaction is disk I/O, not CPU. Parallel readers on a single disk turn sequential reads into random reads and destroy the kernel's readahead. Parallel compaction would add ~125 LOC, introduce goroutine coordination complexity, and run slower on any single-disk machine. Rejected.

No Options struct. Options structs grow. Today it is SyncOnWrite. Tomorrow it is MaxFileSize, CompactionThreshold, ReadOnly, BufferPool. Each option is a new way to misconfigure the engine. SherifDB has one correct configuration. The six public functions are the entire API surface. That is intentional.

These were not lazy decisions. They were the result of asking, for each feature: what problem does this solve, who has that problem, and what does the code cost?


What I deleted

ring.go          → deleted entirely        (-60 LOC, 3 bugs removed)
runtime.Pinner   → deleted                (-5 LOC, 0 functionality lost)
restore()        → inlined, was 3 lines   (-4 LOC, clearer call site)
ErrDeleted       → declared, never used   (-1 LOC, dead code)
layout comment   → moved to README        (-6 LOC, docs belong in docs)

Net deletion: 76 lines. Net improvement in correctness: significant.

Deleting code improved the correctness of the system. That is the thing nobody tells you when they are writing excited blog posts about lock-free data structures. A sync.Mutex is 5 lines any engineer can reason about in 30 seconds. An atomic ring buffer is 60 lines that experts argue about. When the mutex is correct and the ring buffer is broken, you delete the ring buffer.


What I built instead

A single file. 488 LOC. Six public functions:

func Open(path string) (*DB, error)
func (db *DB) Set(key, value []byte) error
func (db *DB) Get(key []byte) ([]byte, error)
func (db *DB) Delete(key []byte) error
func (db *DB) Compact() error
func (db *DB) Close() error

The DB struct:

type DB struct {
    mu      sync.Mutex  // serializes all writes
    file    *os.File    // the append-only log
    lock    *os.File    // the exclusive lockfile
    index   *keyDir     // in-memory key → offset mapping
    scratch []byte      // one reusable encode buffer
    path    string
}

One mutex. One file handle. One scratch buffer that grows lazily when a record exceeds its current capacity — it reallocates only when a record is larger than anything seen before. That replaces 1024 pre-allocated 4KB slots with a single reusable slice.


The five guarantees

Durability — fsync after every write. os.File.Write() returning success means the data is in the OS page cache. It does not mean it is on disk. The kernel can hold data in memory for seconds before flushing. If power cuts in that window, the write is gone — but your keydir thinks it succeeded. file.Sync() is the only syscall that gives you the actual durability guarantee. Every write in SherifDB ends with it.

Crash recovery — CRC32 with torn write truncation. Every record carries a 4-byte checksum over the entire record payload. On startup, log replay verifies each checksum. On failure — torn write, partial record, power loss mid-write — the loop breaks and the file is truncated to the last known good offset. The database wakes up at the last fully committed record. Without this, a crash leaves a corrupt tail. Next startup reads garbage lengths, tries to allocate gigabytes, and either OOMs or panics.

Tombstone deletes — vLen = 0xFFFFFFFF. A delete cannot erase a record from an append-only log. So we append a tombstone: same header format, vLen set to 0xFFFFFFFF, no value bytes. Log replay removes the key from the keydir when it encounters a tombstone. Without tombstones, a delete means "remove from the in-memory keydir." Next restart, the log replay encounters the original write record and adds the key back. Deleted keys resurrect. This happens on the very first restart after a delete. Tombstones are not optional in an append-only design.
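What replay does with a tombstone can be sketched in a handful of lines (function names are mine, not the repository's):

```go
package main

import "fmt"

// tombstoneVLen marks a delete record: same header, no value bytes.
const tombstoneVLen = 0xFFFFFFFF

func isTombstone(vLen uint32) bool {
	return vLen == tombstoneVLen
}

// applyToKeydir sketches what log replay does per record: a tombstone
// removes the key from the keydir; anything else records its offset.
func applyToKeydir(keydir map[string]int64, key string, vLen uint32, offset int64) {
	if isTombstone(vLen) {
		delete(keydir, key) // the key stays gone across restarts
		return
	}
	keydir[key] = offset
}

func main() {
	keydir := map[string]int64{}
	applyToKeydir(keydir, "emmanuel", 16, 0)          // original write
	applyToKeydir(keydir, "emmanuel", 0xFFFFFFFF, 44) // later tombstone
	_, alive := keydir["emmanuel"]
	fmt.Println(alive) // false: no resurrection on replay
}
```

Because replay processes records in log order, the tombstone is always seen after the write it cancels, so the key ends up deleted no matter how many times the database restarts.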

Single writer — lockfile with O_EXCL.

lf, err := os.OpenFile(path+".lock", os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0600)

O_EXCL tells the kernel: create this file, but fail if it already exists. This is atomic at the filesystem level. Two processes racing to open the same database will both attempt to create the lockfile — exactly one will succeed. Without this, two processes can interleave appends. Individual records may be perfectly valid — correct CRC, correct lengths — but the log as a sequence is unreadable. No amount of checksumming fixes interleaved writes.

Fast restarts — hint file. Log replay is O(data) — you scan every byte in the file to rebuild the keydir. For a database with 10,000 keys and 500MB of value data, that is 500MB of reads on every startup just to extract 10,000 offsets. The hint file is a compact index snapshot written on clean close:

[ kLen(4) | offset(8) | size(4) | key bytes ]

No values. Just the metadata the keydir needs. Startup reads the hint file in O(keys) and skips the log entirely. If the hint file is missing or corrupt, we fall back to full log replay. The hint file accelerates the common case; the log is the safety net.
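A round-trip for one hint entry, following the layout above (a sketch with names of my own; little-endian is assumed, matching the data file format):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// hintEntry mirrors one record in the hint file:
// [ kLen(4) | offset(8) | size(4) | key bytes ]
type hintEntry struct {
	key    string
	offset int64
	size   uint32
}

func encodeHint(e hintEntry) []byte {
	buf := make([]byte, 16+len(e.key))
	binary.LittleEndian.PutUint32(buf[0:4], uint32(len(e.key)))
	binary.LittleEndian.PutUint64(buf[4:12], uint64(e.offset))
	binary.LittleEndian.PutUint32(buf[12:16], e.size)
	copy(buf[16:], e.key)
	return buf
}

func decodeHint(buf []byte) hintEntry {
	kLen := binary.LittleEndian.Uint32(buf[0:4])
	return hintEntry{
		offset: int64(binary.LittleEndian.Uint64(buf[4:12])),
		size:   binary.LittleEndian.Uint32(buf[12:16]),
		key:    string(buf[16 : 16+kLen]),
	}
}

func main() {
	e := hintEntry{key: "emmanuel", offset: 44, size: 44}
	fmt.Printf("%+v\n", decodeHint(encodeHint(e)))
}
```

Reading a hint file is just this decode in a loop: no values pass through memory, which is why startup becomes O(keys) instead of O(data).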


Reading the hexdump

You can read the actual data file with hexdump -C. That is a feature, not an accident.

00000000  4b de 8e f2 02 19 c7 b0  98 26 a9 18 08 00 00 00  |K........&......|
00000010  10 00 00 00 65 6d 6d 61  6e 75 65 6c 73 79 73 74  |....emmanuelsyst|
00000020  65 6d 73 5f 65 6e 67 69  6e 65 65 72              |ems_engineer    |

Breaking it down:

4b de 8e f2              → CRC32 checksum
02 19 c7 b0 98 26 a9 18  → timestamp (Unix nanoseconds)
08 00 00 00              → kLen = 8  (little-endian)
10 00 00 00              → vLen = 16 (little-endian)
65 6d 6d 61 6e 75 65 6c  → key   = "emmanuel"
73 79 73 74 65 6d 73 5f
65 6e 67 69 6e 65 65 72  → value = "systems_engineer"

A tombstone looks identical except vLen = ff ff ff ff and there are no value bytes after the key. The checksum covers the ff ff ff ff too — a corrupt tombstone is caught the same way as any corrupt record.

You do not need the source code to read this file. You need hexdump and this format spec. When something goes wrong — and something will — you want to be able to inspect the raw data without trusting the application layer.


Compaction

Write "emmanuel" → "engineer" ten times and you have ten records on disk, but only the last one is live. The other nine are dead — not referenced by the keydir, not readable by Get, just occupying space.

Compact() reclaims that space. The sequence:

  1. Lock the database — no reads or writes during compaction
  2. Snapshot the live keydir — these are the only keys that matter
  3. Read each live record from disk, verify CRC
  4. Write clean records sequentially into a .compact temp file
  5. fsync the temp file
  6. os.Rename over the old file — atomic at the kernel level
  7. Reopen, swap index, rewrite hint file

The rename is the critical step. On POSIX systems, rename is atomic — the destination path atomically points to the new file. There is no window where the path does not exist, or points to a partially-written file. If the process dies during compaction, the old data file is intact. If the rename succeeds, the compacted file is fully written and synced. You get one or the other. Never a half-compacted mess.

Compact() is a caller decision, not automatic. The engine does not decide when your disk is too full. You do. Call it during a maintenance window, on a schedule, after a bulk delete. This is what "library not application" means in practice: the engine provides the mechanism, the caller provides the policy.


The numbers

Metric              Clever version       Final version
------              --------------       -------------
LOC                 ~195                 488
Ring buffer         60 LOC, 3 bugs       deleted
runtime.Pinner      present, useless     deleted
Durability          no fsync             fsync every write
Crash recovery      none                 CRC + truncation
Tombstone deletes   no                   yes
Single writer       no                   lockfile
Fast restarts       no                   hint file
Compaction          no                   Compact()
Public API          none exported        6 functions
Dependencies        stdlib               stdlib

The final version is bigger because it actually does things. The clever version was smaller because it was mostly impressive-sounding code that did nothing useful and several things incorrectly.


What I learned

Lock-free is a last resort, not a starting point. The correctness surface of atomic operations in Go is brutal. The memory model is precise and unforgiving. A sync.Mutex gives you an explicit happens-before relationship any engineer can reason about in 30 seconds. Reach for atomics only when you have proven the mutex is the bottleneck — with a profiler, not intuition.

Complexity must justify itself against the actual problem. The ring buffer would earn its place if I had multiple goroutines encoding records in parallel before a batch flush, or a dedicated I/O goroutine draining a pre-encoded queue. I had neither. I had a write path where every operation ends with fsync. The mutex overhead on top of an fsync is unmeasurable noise. The ring buffer's complexity was unjustified by the architecture.

Deleting code is engineering. The ring buffer deletion was not giving up. It was recognizing that code has a cost — bugs, maintenance burden, cognitive load for every future reader — and that cost must be paid back by the problem the code solves. The ring buffer's cost was real. Its benefit was zero.

Small is a feature with consequences. 488 LOC means the entire engine fits in one sitting. It means I can audit every guarantee the code makes. It means when something breaks I know exactly where to look. That is not a consolation prize for not having more features. It is a design goal that constrained every decision.

Your friends will catch what you miss. I stared at that ring buffer for days. I was too close to it, too invested in the idea that lock-free meant fast meant good. It took someone who did not care about my cleverness to read it fresh and say, in Swahili, "you are writing to a slot you do not own." Correctness review from someone who has not read the design doc is worth more than an hour of self-review.


The code is at github.com/Emmanuel326/sheriffDB. Single file. No dependencies. Six functions. Every guarantee documented and justified.

If you find a bug, open an issue. If you think a tradeoff is wrong, argue it — but bring a reason, not just an instinct. Every decision in this codebase was made deliberately and can be defended.


Built during late nights in Nairobi. The Swahili code review is real and lives in the conversation history forever. Some of the best technical feedback I have received came in a language I grew up with, about a bug in a language I am still learning. That detail felt worth keeping.