How Long Does pread() Really Take?

At some point, every performance investigation collapses into a deceptively simple question:
“How long does this actually take?”
This page documents one such question, stripped down as far as I know how to strip it:

How long does a single pread() system call take, and what role does the page cache play?
Not throughput. Not benchmarks. Not “real-world workloads”. Just one syscall, measured carefully.
The code for this experiment lives here:
https://github.com/Emmanuel326/syslat
Most performance discussions mix several effects into one number: syscall overhead, page faults, storage latency, filesystem behavior, CPU scheduling, and sometimes sheer luck.
Once mixed, cause and effect become hard to separate.
Rather than arguing about numbers, this experiment backs up and asks a deliberately narrow question under explicit conditions.
If we can’t reason about one syscall in isolation, we have no business reasoning about systems built on top of thousands of them.
Each data point is the elapsed time of one pread() call, measured from userspace entry to userspace return. Each iteration performs a single pread() call, nothing more.

There is no averaging, no warm-up phase, and no filtering. Every syscall stands on its own.
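The actual measurement program lives in the repository linked above. As an illustration of the structure only, here is a minimal Python stand-in: `os.pread` wraps the same syscall, but the interpreter adds its own overhead, so the absolute numbers are not comparable to the compiled tool's.

```python
# Sketch of the measurement loop: one elapsed time per pread(),
# no averaging, no warm-up, no filtering.  Python stand-in for the
# compiled tool; absolute numbers include interpreter overhead.
import os
import tempfile
import time

def time_one_pread(fd, length=4096, offset=0):
    """Return the elapsed nanoseconds of a single pread() call."""
    t0 = time.perf_counter_ns()      # userspace entry
    os.pread(fd, length, offset)     # the syscall under test
    t1 = time.perf_counter_ns()      # userspace return
    return t1 - t0

if __name__ == "__main__":
    # A scratch file stands in for "testfile" so the sketch is self-contained.
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"\0" * 4096)
        f.flush()
        for _ in range(10):
            print(time_one_pread(f.fileno()))   # one raw number per line
```

Every iteration reads the same offset through the same descriptor; the only thing that varies between runs is the initial page cache state.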
The output is intentionally raw. Interpretation is a separate step.
If the experiment does not fit in one file, it is not yet simple enough to trust.
CPU pinning (via taskset) is treated as environmental control, not part of the program itself.
Page cache state is treated as an initial condition, not a runtime toggle.
Cold cache:

sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
This forces the page cache to start empty. Early iterations will include page faults and real I/O.
Warm cache:

cat testfile > /dev/null
This ensures the file is resident in the page cache before measurement.
The binary itself is unchanged between runs. Only the initial conditions differ.
Why pread()?
pread() avoids shared file offset state.
Each iteration reads the same offset and follows the same kernel path.
This removes lseek() noise and keeps the syscall behavior
as consistent as possible across iterations.
The goal is not realism, but repeatability.
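The offset-state point is easy to see directly. pread() takes an explicit offset and leaves the shared file offset untouched, so repeated iterations need no lseek() between them. A quick demonstration via Python's os wrappers (illustrative only, not part of the experiment):

```python
# pread() reads at an explicit offset and does not move the shared
# file offset, so repeated calls need no lseek() in between.
import os
import tempfile

with tempfile.TemporaryFile() as f:
    f.write(b"abcdefgh")
    f.flush()
    fd = f.fileno()
    os.lseek(fd, 0, os.SEEK_SET)

    data = os.pread(fd, 4, 2)             # read 4 bytes at offset 2
    pos = os.lseek(fd, 0, os.SEEK_CUR)    # where is the file offset now?
    print(data, pos)                      # b'cdef' 0
```

A plain read() would have advanced the offset to 4; pread() leaves it at 0.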
Questions about throughput, other block sizes, offset patterns, or real-world workloads are all valid. They just require different experiments.
The program emits one number per line:
latency_ns
latency_ns
latency_ns
There are no summaries and no opinions in the output. The measurement path is kept as short and transparent as possible.
Any aggregation or visualization happens after the fact.
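For example, a hypothetical downstream script (the names here are illustrative, not part of the tool) can turn the raw one-number-per-line output into percentiles:

```python
# Hypothetical downstream step: percentiles from one-latency-per-line
# output.  The measurement program itself emits no summaries.
import statistics

def summarize(lines):
    """Return (p50, p99) in nanoseconds from one-latency-per-line text."""
    samples = sorted(int(line) for line in lines if line.strip())
    cuts = statistics.quantiles(samples, n=100)  # 99 percentile cut points
    return cuts[49], cuts[98]                    # 50th and 99th percentile

# Synthetic stand-in for real output:
p50, p99 = summarize(str(n) for n in range(1, 101))
print(f"p50={p50} p99={p99}")
```

Keeping this outside the measurement binary means the measurement path never grows to accommodate analysis.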
This page documents the baseline.
Future work will modify exactly one variable at a time — block size, offset patterns, storage media, CPU isolation — while preserving the existing structure.
As the experiment evolves, this page will grow with it.