False sharing is one of those performance problems that everyone has heard of, most people can define, and very few have actually seen.
Sometimes it appears to do nothing at all. Other times it destroys throughput. The confusing part is that the code can look nearly identical in both cases.
Why does false sharing sometimes look harmless — and sometimes catastrophic?
The short answer is that it depends on whether cache coherence latency ends up on the critical path. This post exists to make that distinction concrete.
This is not a realistic workload, and it is not intended to predict application performance. It is a controlled experiment designed to isolate a single effect.
There are four tiny programs, arranged along two independent axes:

- **Memory layout:** false sharing (both counters on one cache line) vs. padded (one counter per cache line)
- **Access pattern:** store-only vs. read–modify–write
Nothing else varies. Thread placement, timing logic, iteration counts, and compiler flags are identical across all cases.
In the false-sharing cases, two threads update different variables that reside on the same cache line. In the padded cases, each thread updates a variable that occupies its own cache line.
This isolates false sharing from true sharing. No data is actually shared. Only the cache line is.
In the store-only variants, the hot loop is effectively:

```c
(*local)++;
```
On modern CPUs this often compiles to a buffered store that can retire without waiting for cache coherence traffic. The store is real, but its cost may be hidden.
In the read–modify–write variants, the loop becomes:

```c
uint64_t v = *local;
v++;
*local = v;
```
Now the core must obtain exclusive ownership of the cache line, wait for invalidations on other cores, and complete the load before the store can retire. Coherence latency is no longer optional.
This difference — whether the CPU is forced to wait — is the entire point of the experiment.
The following terminal recording shows the automated benchmark running all four variants back to back. Progress output is intentionally included so long-running behavior is visible rather than silent.
Exact timings are not the important part. They vary by CPU model, frequency scaling, thermal state, and background load.
What matters is the qualitative ordering.
| Case | Expected behavior |
|---|---|
| Store-only, padded | Fastest |
| Store-only, false sharing | Slightly slower |
| Read–modify–write, padded | Slower |
| Read–modify–write, false sharing | Much slower |
If false sharing shows little impact in the store-only case, that does not mean it is harmless. It means its cost is being hidden by the memory system.
When the code forces coherence latency onto the critical path, the cost becomes impossible to ignore.
Thread migration makes cache ownership unstable and introduces noise that obscures cause and effect. Each worker thread is pinned to a specific CPU so that cache lines have consistent owners and coherence traffic is real rather than incidental.
This is a teaching benchmark. Stability matters more than realism.
Large benchmarks obscure causality. Abstractions pile up. Effects blur together.
Here, when performance changes, there is only one place to look.
The goal is not to impress. The goal is to remove excuses.
False sharing is not binary. It can exist quietly, hidden behind store buffers and out-of-order execution, or it can dominate runtime when coherence latency becomes unavoidable.
Whether you see it depends less on the data layout than on whether your code forces the hardware to wait.
This experiment exists to make that distinction obvious.