The Weird Heuristics Behind Linux TCP's RACK Loss Detection (and When It Lies)

For decades, TCP loss detection followed a simple rule: three duplicate ACKs meant a segment was missing. It worked because networks were predictable. Multi-path routing, high-speed NICs, and aggressive buffering changed that. Packets now arrive out of order as a normal event. Treating duplicate ACKs as a strict loss signal means retransmitting perfectly healthy data far too often.

RACK — Recent ACKnowledgment — was designed to fix this. Instead of counting duplicate ACKs, it looks at time: if a later packet was delivered, any earlier packet that falls outside a configurable reordering window is marked as lost.

That reordering window is not derived from any physical constant. No RFC pins it to a fixed formula. It is a heuristic — a tunable estimate of how much disorder the network might introduce — and it lives deep in the TCP stack, quietly deciding whether a connection triggers recovery or simply waits.

Sometimes, it gets that decision wrong.


What the kernel exposes

On my ThinkPad T470s running Linux 6.18.21 LTS:

$ cat /proc/sys/net/ipv4/tcp_recovery
1

The value is a bitmask, and the low bit turns on RACK loss detection. With it set, TCP recovery no longer relies purely on the classic duplicate-ACK rule. The old contract has been replaced by inference from timing rather than direct observation.
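A small sketch of how that value decodes, using the bit meanings given in the kernel's ip-sysctl documentation: 0x1 enables RACK loss detection, 0x2 keeps the reordering window static at min_rtt/4, and 0x4 disables the DUPACK-threshold heuristic. The struct and function names are invented for this example and are not kernel identifiers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; bit values follow the ip-sysctl docs. */
struct recovery_flags {
    bool rack_enabled;      /* 0x1: RACK loss detection */
    bool static_reo_wnd;    /* 0x2: reordering window fixed at min_rtt/4 */
    bool no_dupthresh;      /* 0x4: DUPACK-threshold heuristic disabled */
};

static struct recovery_flags decode_tcp_recovery(int value)
{
    struct recovery_flags f;

    /* The sysctl is a small non-negative bitmask. */
    assert(value >= 0);
    f.rack_enabled   = (value & 0x1) != 0;
    f.static_reo_wnd = (value & 0x2) != 0;
    f.no_dupthresh   = (value & 0x4) != 0;
    return f;
}
```

For the value 1 shown above, only rack_enabled comes back true: RACK is on, the window is allowed to adapt, and the DUPACK threshold is still consulted.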

The core logic lives in net/ipv4/tcp_recovery.c. In simplified form:

static void tcp_rack_detect_loss(struct sock *sk, u32 *reo_timeout)
{
    struct tcp_sock *tp = tcp_sk(sk);
    u32 min_rtt = tcp_min_rtt(tp);
    struct sk_buff *skb;
    u32 reo_wnd;

    *reo_timeout = 0;

    /* To be more reordering resilient, allow min_rtt/4 settling delay
     * (lower-bounded to 1000uS).
     */
    reo_wnd = 1000;
    if ((tp->rack.reord || !tp->lost_out) && min_rtt != ~0U) {
        reo_wnd = max((min_rtt >> 2) * tp->rack.reo_wnd_steps, reo_wnd);
        reo_wnd = min(reo_wnd, tp->srtt_us >> 3);
    }

    /* then iterate through send queue... */
}

The reo_wnd variable is the reordering window. It is not a constant. RFC 8985 recommends the quarter-of-min-RTT starting point but does not derive it from anything physical. It is a heuristic, and when it is wrong, the connection pays for it.


Three numbers that matter

After 2 hours and 59 minutes of uptime — 205,861 incoming packets, 46,307 outgoing, 176 TCP connections opened — this is what nstat shows:

$ nstat | grep -E "TCPLostRetransmit|TCPSpuriousRtx|TCPDSACKRecv"
TcpExtTCPLostRetransmit         7                  0.0
TcpExtTCPSpuriousRtxHostQueues  1                  0.0
TcpExtTCPDSACKRecv              2                  0.0

Seven packets marked lost and retransmitted. Two DSACKs from the receiver confirming duplicated data. One confirmed spurious retransmission.

RACK lied once. Not from malice — from heuristics.


How RACK sizes the reordering window

The window is built from two ingredients: min_rtt and a hard cap derived from srtt_us.

From a live connection on this machine:

$ ss -ti | grep minrtt
cubic wscale:10,10 rto:350 rtt:148.4/22.227 minrtt:130.452 dsack_dups:1

The minimum observed RTT is 130.452ms. RACK divides that by four:

min_rtt >> 2
= 130452 µs >> 2
= 32613 µs
= 32.6 ms

That 32.6ms is the base reordering window. Any packet sent more than 32.6ms before the most recently delivered packet gets marked as lost.

But the kernel also applies a hard cap:

reo_wnd = min(reo_wnd, tp->srtt_us >> 3)

With srtt_us = 148400:

srtt_us >> 3
= 148400 >> 3
= 18550 µs
= 18.55 ms

The base window was 32.6ms. The cap reduces it to 18.55ms. RACK loses 14ms of reordering tolerance.

There is also a multiplier, reo_wnd_steps, that the kernel increments when the receiver reports a DSACK. It is designed to widen the window adaptively:

reo_wnd = max(1000, (min_rtt >> 2) * tp->rack.reo_wnd_steps)

reo_wnd_steps   Window before cap   Window after cap
1               32.6 ms             18.55 ms
2               65.2 ms             18.55 ms
3               97.8 ms             18.55 ms

The multiplier is irrelevant. The cap wins every time, regardless of how much reordering the kernel has observed. The two mechanisms — adaptive expansion and hard contraction — directly contradict each other, and on this connection, contraction dominates.
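The arithmetic in this section is easy to replay in user space. The sketch below mirrors the order of operations in the listing (multiply, floor at 1000 µs, then cap). The function name and the srtt_field_us parameter are mine, not the kernel's. One assumption deserves a loud label: the worked example treats 148400 as the raw value of srtt_us, whereas the comment on that field in include/linux/tcp.h describes it as the smoothed RTT left-shifted by 3. The function therefore takes the field value as an input, so either reading can be tried.

```c
#include <assert.h>
#include <stdint.h>

/* User-space replay of the window arithmetic; not kernel code.
 * srtt_field_us is whatever value the srtt_us field holds. The cap is that
 * value shifted right by 3, exactly as in the listing.
 */
static uint32_t reo_wnd_us(uint32_t min_rtt_us, uint32_t srtt_field_us,
                           uint32_t steps)
{
    uint32_t wnd;
    uint32_t cap = srtt_field_us >> 3;

    assert(steps >= 1);                 /* the multiplier starts at 1 */
    wnd = (min_rtt_us >> 2) * steps;    /* min_rtt/4 per step */
    if (wnd < 1000)
        wnd = 1000;                     /* 1000 us floor */
    return wnd < cap ? wnd : cap;       /* cap applied last */
}
```

With min_rtt_us = 130452 and srtt_field_us = 148400, as in the text, steps 1 through 3 all return 18550. If the field instead holds 8 × 148400 = 1187200, the cap becomes 148400 and the same calls return 32613, 65226, and 97839, so which reading applies decides whether the cap binds at all.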


Reconstructing the false positive

The decision that triggered the spurious retransmission comes down to one timestamp comparison. In simplified form:

static void tcp_rack_mark_lost(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb;
    u32 reo_wnd = tcp_rack_reo_wnd(sk);
    u32 rack_mstamp = tp->rack.mstamp;

    skb_queue_walk(&sk->sk_write_queue, skb) {
        u32 send_time = tcp_skb_timestamp_us(skb);

        if (send_time <= rack_mstamp - reo_wnd) {
            TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
        }
    }
}

For the packet that produced the spurious retransmission, the comparison evaluated like this:

send_time      <= rack_mstamp - reo_wnd
140 ms ago     <= 138 ms ago  - 18.55 ms
140 ms ago     <= 119.45 ms ago   →  TRUE

The kernel marked the packet lost. The receiver already had it — delayed by roughly 2ms of normal network jitter, not dropped. The reordering window of 18.55ms was tight enough that a 2ms delay fell outside it.

The DSACK in ss -ti output confirms it:

dsack_dups:1

The receiver sent a Duplicate SACK saying: you already sent this. I have it. The retransmission was unnecessary.
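RFC 8985 phrases the same decision as a time budget rather than a single comparison: a packet sent before the most recently delivered one gets the latest RTT plus the reordering window to be acknowledged, and is marked lost once that budget is spent. A minimal sketch of that form follows. The function names are hypothetical, and the caller is assumed to have already checked that the packet was sent before the most recently delivered one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Time left, in microseconds, before a packet is declared lost.
 * A result <= 0 means the budget is exhausted.
 */
static int64_t rack_remaining_us(uint64_t pkt_sent_us, uint32_t rack_rtt_us,
                                 uint32_t reo_wnd_us, uint64_t now_us)
{
    assert(now_us >= pkt_sent_us);      /* a packet cannot be sent in the future */
    return (int64_t)rack_rtt_us + reo_wnd_us - (int64_t)(now_us - pkt_sent_us);
}

/* True once the packet has used up its RTT plus reordering-window budget. */
static bool rack_marks_lost(uint64_t pkt_sent_us, uint32_t rack_rtt_us,
                            uint32_t reo_wnd_us, uint64_t now_us)
{
    return rack_remaining_us(pkt_sent_us, rack_rtt_us, reo_wnd_us, now_us) <= 0;
}
```

With a 100 ms RTT and a 20 ms window (made-up values, not measurements from this connection), a packet sent at time zero still has 10 ms left at the 110 ms mark and is marked lost at 120 ms. A wider window buys exactly that much extra patience.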


Why the cap is the real culprit

Without the srtt_us >> 3 cap, the reordering window would have been 32.6ms — enough to absorb the 2ms jitter. The cap, designed to keep loss detection responsive, is what made the window too tight.

The irony is structural: the mechanism intended to reduce detection latency increased false positives.

Each of the three heuristics contributes to this:

Heuristic          Value on this connection   Effect
min_rtt >> 2       32.6 ms                    Uses the most optimistic RTT sample, not the average
srtt_us >> 3 cap   18.55 ms                   Cuts base window by 43%, making expansion irrelevant
reo_wnd_steps      1 → 2                      Multiplier nullified by the cap on every increment

Individually, each heuristic has a reasonable justification. Together, they fight each other. The cap wins. The packet loses.


The full picture from nstat

$ nstat | grep -E "TCPLostRetransmit|TCPSpuriousRtx|TCPDSACK|TCPLossProbe|TCPTimeouts"
TcpExtTCPLostRetransmit         7                  0.0
TcpExtTCPSpuriousRtxHostQueues  1                  0.0
TcpExtTCPDSACKRecv              2                  0.0
TcpExtTCPLossProbes             4                  0.0
TcpExtTCPLossProbeRecovery      1                  0.0
TcpExtTCPTimeouts               9                  0.0

Counter                    Value   What it means
TCPLostRetransmit          7       Seven retransmitted segments were themselves detected as lost
TCPSpuriousRtxHostQueues   1       One retransmit was attempted while the original copy was still queued on this host
TCPDSACKRecv               2       Receiver reported 2 cases of already-held data
TCPLossProbes              4       TLP sent 4 probes when the connection went quiet
TCPLossProbeRecovery       1       One probe recovered a loss before RACK had to act
TCPTimeouts                9       Neither RACK nor TLP could prevent a full RTO wait

Seven loss decisions, one confirmed false positive. That is one wrong retransmission for every seven — a number too small to draw statistical conclusions from, but large enough to ask why it happened. The answer, as traced above, is not an edge case. It is the core heuristic behaving exactly as designed, under conditions the design did not anticipate well.

The nine timeouts are a separate problem. RACK and TLP together still failed to prevent a full RTO wait nine times in three hours of normal desktop use. That failure mode is not in the heuristics explored here — but it is worth noting that the machinery is more fragile than the counter names suggest.
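To watch these counters over time rather than reading them by eye, the nstat lines can be parsed mechanically. A minimal sketch, assuming the three-column layout shown above (name, value, rate). The struct and function names are invented for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct nstat_counter {
    char name[64];
    unsigned long long value;
};

/* Parses one line of nstat output: "<name> <value> <rate>".
 * Returns 1 on success, 0 if the line does not carry a name and a value.
 */
static int parse_nstat_line(const char *line, struct nstat_counter *out)
{
    assert(line != NULL && out != NULL);
    /* %63s stops at whitespace and leaves room for the terminating NUL. */
    return sscanf(line, "%63s %llu", out->name, &out->value) == 2;
}

/* Convenience check for a specific counter name. */
static int counter_is(const struct nstat_counter *c, const char *name)
{
    return strcmp(c->name, name) == 0;
}
```

Feeding it the TcpExtTCPLostRetransmit line from the output above yields the name and the value 7. Lines without a numeric second column, such as nstat's header, are rejected.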


What would have prevented the lie

Under the arithmetic above, these are the changes that would have widened the window:

Change                                     Resulting window
Remove the cap entirely                    32.6 ms, enough to absorb the 2 ms delay
Use srtt instead of min_rtt for the base   37.1 ms before the cap, so it helps only if the cap is also lifted
Change cap to srtt_us >> 2                 32.6 ms, since a 37.1 ms cap no longer binds

None of these changes are free. Each one slows loss detection in other scenarios. The existing constants were tuned against data center workloads, where RTTs are low, links are stable, and reordering is rare. A ThinkPad on WiFi is not a data center.


The point

RACK replaced a counting rule with a timing rule. That was progress — tail loss detection improved, unnecessary timeout waits decreased. But timing rules require thresholds, and thresholds require assumptions about the network.

The assumptions encoded in min_rtt >> 2, srtt_us >> 3, and reo_wnd_steps are not arbitrary, but they are not universal either. They represent a specific trade-off: faster detection at the cost of higher false-positive rates under reordering.

On this machine, that trade-off produced one spurious retransmission in three hours. One DSACK from the receiver saying: I already had that packet.

RACK is not broken. It is better than what it replaced. But it is not a precise algorithm — it is a collection of heuristics dressed as one, and the heuristics contradict each other in ways the kernel silently resolves with a bit shift and a min() call.

That bit shift is the reordering window. The min() is the cap. And somewhere in the last three hours, they conspired to retransmit a packet that the receiver was already holding.