Pattern · VII 10 min read 8 sections 8 code samples Updated May 17, 2026
This pattern is called

Context Rot

The relevant document is placed in mid-context and the LLM under-attends to it; the fix is order-of-context, not which document was retrieved.

Symptom

Retrieval returns the relevant document. Its position in the prompt is 5, 6, or 7 out of 10. The model’s answer makes no use of it. The cited documents tend to be whatever landed at positions 0, 1, or the final two slots (the bookends), while the middle is effectively invisible.

“The right document is in the context window and the model ignores it anyway.”

The signature in production:

  • Retrieval logs are clean. The relevant doc is in the top-K, its score clears the floor, and the chunk text contains the answer string verbatim.
  • The model answers as if the doc were not there. Hedge (“I don’t have that information”), or a plausible-but-wrong answer that resembles documents at positions 0–2.
  • Failure rate scales with K. Silent at K=3, dominant at K=10, the only failure mode visible at K=20.
  • It correlates with prompt length, not document difficulty. Easy questions fail when buried; hard questions answer correctly when surfaced. Attention is mis-allocated, not exhausted.

Production teams reflexively re-tune retrieval, because retrieval is the last layer they fully understand. Retrieval is fine. The model attended badly.

Mechanism

Transformer self-attention over long contexts is not uniform. Attention to mid-window tokens drops below attention to the first and last few thousand tokens — a U-shape over position. The longer the context, the deeper the U. The literature calls this lost-in-the-middle; in production it surfaces as context rot.

Two factors contribute. First, positional encodings (RoPE, ALiBi, learned absolutes) are trained at context lengths shorter than the deployed maximum; extrapolation past the training distribution attenuates effective attention on mid-window positions. Second, the autoregressive objective biases the model toward the most recent tokens (the tail) and toward the system-prompt anchor (the head); the middle is the residual.

The behavior shows up at the document level when a prompt concatenates K retrieved chunks: the first one or two and the last one or two are read closely, the middle ones are effectively invisible. Attention probability mass over middle positions is roughly an order of magnitude smaller than over the bookends.

Worked example — effective attention as a function of position

Treat “effective attention” as the fraction of attention probability mass landing on each document slot, averaged over the answer-token positions. A clean U-shape with two anchors and an attenuated middle:

import numpy as np

K = 10
positions = np.arange(K)

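def effective_attention(K, alpha=0.45):
    # Derive positions from K so the function does not depend on the global above.
    pos = np.arange(K)
    d_head = np.exp(-alpha * pos)             # decay away from the head anchor
    d_tail = np.exp(-alpha * (K - 1 - pos))   # decay away from the tail anchor
    raw = d_head + d_tail
    return raw / raw.sum()

attn = effective_attention(K, alpha=0.45)
# In this toy curve the bookends receive about 3.8x the mass of the middle slots
# at K=10, alpha=0.45; alpha near 0.55-0.60 puts the ratio in the 5-7x range.
# Head/middle and tail/middle ratios are the operational summary.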

Under typical long-context configurations, position 4 receives one-fifth to one-seventh the effective attention of position 0. When the answer sits in the middle slot and a tangentially related doc sits at the head, the tangential doc dominates generation. The failure compresses to one number: the model is allocating several times more reading effort to the wrong document because of where it landed.
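To make the dependence on context size concrete, the sketch below sweeps the two knobs of the toy curve and prints the head/middle ratio for each setting. It describes only the illustrative double-exponential profile used in this section, not a measurement of any real model; `u_shape_weights` and `head_middle_ratio` are helper names introduced here for illustration.

```python
import numpy as np

def u_shape_weights(K: int, alpha: float) -> np.ndarray:
    """Toy U-shaped profile: decay from the head plus decay from the tail."""
    pos = np.arange(K)
    raw = np.exp(-alpha * pos) + np.exp(-alpha * (K - 1 - pos))
    return raw / raw.sum()

def head_middle_ratio(K: int, alpha: float) -> float:
    """Weight at slot 0 divided by weight at the middle slot (K // 2)."""
    w = u_shape_weights(K, alpha)
    return float(w[0] / w[K // 2])

# Sweep a few context sizes and decay rates (values chosen for illustration).
for K in (5, 10, 20):
    for alpha in (0.30, 0.45, 0.60):
        ratio = head_middle_ratio(K, alpha)
        print(f"K={K:>2} alpha={alpha:.2f} head/middle={ratio:.1f}x")
```

In this toy, K=10 with alpha=0.45 gives a ratio a little under 4x, and the ratio climbs quickly as either K or alpha grows, which is the "longer context, deeper U" behavior described under Mechanism.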

Two consequences:

  • Retrieval correctness is necessary but not sufficient. Putting the right document in slot 7 of 10 does not get the model to use it.
  • The fix is structural, not retrieval-side. Re-ordering the same K documents — by score, by an explicit U-shape-aware permutation — moves the metric without retrieving anything different.

Diagnostic

Four tests, cheap to expensive. Run in order; stop at the first one that fires.

Test 1 — average K and prompt length (under 1 minute, one SQL query)

SELECT
  AVG(n_retrieved_docs)            AS avg_k,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY n_retrieved_docs) AS p95_k,
  AVG(prompt_tokens)               AS avg_prompt_tokens,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY prompt_tokens)    AS p95_prompt_tokens
FROM rag_requests
WHERE created_at > CURRENT_DATE - INTERVAL '7 days';
avg_k / p95_prompt_tokens | reading
K under 4 AND prompt under 4k tokens | healthy; this pattern is unlikely to be the dominant cause
K of 5–8 OR prompt 4k–16k tokens | suspect; run Test 2
K of 9+ OR prompt over 16k tokens | almost certainly firing even if other patterns are also active

Long contexts are where the U deepens. Short average requests leave little middle for the model to lose.
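The reading of Test 1 can be automated with a few lines. The sketch below is one possible encoding of the three bands; the function name `read_test1` is hypothetical, and treating every value between the healthy and firing bands as "suspect" (including averages the bands leave unassigned, such as a K between 4 and 5) is an assumption made here.

```python
def read_test1(avg_k: float, p95_prompt_tokens: float) -> str:
    """Map the Test 1 query outputs to one of the three readings."""
    if avg_k >= 9 or p95_prompt_tokens > 16_000:
        return "almost certainly firing"
    if avg_k < 4 and p95_prompt_tokens < 4_000:
        return "healthy"
    # Assumption: anything between the two bands above counts as suspect.
    return "suspect; run Test 2"

# Example values, made up for illustration.
print(read_test1(avg_k=9.6, p95_prompt_tokens=14_500))
print(read_test1(avg_k=3.1, p95_prompt_tokens=2_800))
```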

Test 2 — citation rate by position (5–15 minutes, small Python)

For each request in a recent window, record the position of each retrieved doc and whether the answer cited it. Tabulate citation rate by position.

import numpy as np
from collections import Counter

np.random.seed(42)

# Inline fixture — replace with real (position, was_cited) tuples from your logs.
# 200 requests, K=10. Citation pattern: the U-shape we expect to see.
def synth_request(K=10, alpha=0.45):
    weights = np.exp(-alpha * np.arange(K)) + np.exp(-alpha * (K - 1 - np.arange(K)))
    weights = weights / weights.sum()
    n_cites = np.random.choice([1, 2, 3], p=[0.5, 0.35, 0.15])
    cited_positions = np.random.choice(K, size=n_cites, replace=False, p=weights)
    return set(int(p) for p in cited_positions)

requests = [synth_request() for _ in range(200)]

K = 10
cited_at_position = Counter()
total_at_position = Counter()
for cited in requests:
    for pos in range(K):
        total_at_position[pos] += 1
        if pos in cited:
            cited_at_position[pos] += 1

print(f"{'pos':>3} {'cited':>7} {'rate':>7}")
for pos in range(K):
    rate = cited_at_position[pos] / max(1, total_at_position[pos])
    bar = "#" * int(rate * 60)
    print(f"{pos:>3} {cited_at_position[pos]:>5}/{total_at_position[pos]:>3} {rate:>6.0%}  {bar}")

head_rate = cited_at_position[0] / total_at_position[0]
mid_rate  = cited_at_position[K // 2] / total_at_position[K // 2]
print(f"\nhead/middle ratio: {head_rate / max(mid_rate, 1e-6):.1f}x")
# Illustrative output shape (exact counts depend on the NumPy version and seed):
#   pos   cited    rate
#     0   65/200   33%  ####################
#     1   48/200   24%  ##############
#     2   29/200   14%  ########
#     3   17/200    8%  #####
#     4    9/200    4%  ##
#     5   13/200    6%  ###
#     6   17/200    8%  #####
#     7   30/200   15%  #########
#     8   46/200   23%  #############
#     9   64/200   32%  ###################
#
#   head/middle ratio: 5.0x   (position 0 vs position K // 2 = 5: 65 / 13)

The signature: positions 0, 1, and the final two carry citation rates three to seven times higher than positions 4–6. Against the thresholds in Numbers that matter, a head/middle ratio above two is suspect and above four is confirmed. A flat distribution rules the pattern out; the failure lives elsewhere.
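To feed Test 2 from real traffic instead of the synthetic fixture, the logs have to be turned into per-position counts. The sketch below shows one way to do that, assuming each log record carries the retrieved document IDs in prompt order and the IDs the answer cited; the field names `retrieved_ids` and `cited_ids` are hypothetical and should be mapped to whatever your logging schema stores.

```python
from collections import Counter

def citation_rate_by_position(records, K):
    """Return citation rate per prompt position for the first K slots.

    records: iterable of dicts with 'retrieved_ids' (in prompt order) and
    'cited_ids'. Both keys are placeholders for your own log fields.
    """
    cited = Counter()
    total = Counter()
    for rec in records:
        cited_ids = set(rec["cited_ids"])
        for pos, doc_id in enumerate(rec["retrieved_ids"][:K]):
            total[pos] += 1
            if doc_id in cited_ids:
                cited[pos] += 1
    # Positions never observed get a rate of 0.0 rather than a division error.
    return [cited[p] / total[p] if total[p] else 0.0 for p in range(K)]

# Tiny hand-written sample standing in for real logs.
sample_logs = [
    {"retrieved_ids": ["a", "b", "c", "d"], "cited_ids": ["a"]},
    {"retrieved_ids": ["e", "f", "g", "h"], "cited_ids": ["e", "h"]},
    {"retrieved_ids": ["i", "j", "k"], "cited_ids": ["k"]},
]
rates = citation_rate_by_position(sample_logs, K=4)
print([round(r, 2) for r in rates])
```

Counting totals per position, rather than assuming every request filled all K slots, keeps the rates honest when some requests retrieved fewer documents.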

Test 3 — needle-in-haystack positional sweep (30+ minutes, Python with inline fixture)

The decisive test. Pin one known-answerable chunk (“the needle”) at every position from 0 to K-1 against a fixed distractor pool, and measure answer accuracy at each position. Position is isolated as the causal variable; query, distractors, model, and prompt template are held constant.

import numpy as np
import re
from collections import Counter

np.random.seed(42)

# A needle that contains a unique fact, plus K-1 plausible distractors.
NEEDLE = (
    "The 2026 retention policy for Acme Logistics requires that operational "
    "telemetry be stored for exactly 47 days before automatic deletion. "
    "Customer-identifiable records follow the separate 7-year schedule."
)
DISTRACTORS = [
    "The 2024 retention policy for Acme Logistics required telemetry to be kept for 30 days.",
    "Operational telemetry should be deduplicated weekly per the data-quality runbook.",
    "Customer-identifiable records are stored in the encrypted archive bucket us-east-2.",
    "The audit log retention policy is governed by the parent compliance framework.",
    "Default retention for staging environments is 14 days, overridable by admins.",
    "Acme Logistics moved from on-prem storage to managed object storage in 2023.",
    "Telemetry ingestion is rate-limited to 50k events per second per tenant.",
    "Per the incident-response plan, runbook backups follow the 90-day rule.",
    "The data engineering team owns the retention-policy enforcement job.",
]

QUESTION = "How many days is operational telemetry retained for Acme Logistics in 2026?"
GROUND_TRUTH = "47"

# Deterministic mock LLM — simulates U-shape attention. Replace with a real call.
def llm(prompt: str) -> str:
    # Extract docs in order and assign attention weights with the same U-shape.
    docs = re.findall(r"<doc id=\"(\d+)\">(.*?)</doc>", prompt, re.DOTALL)
    if not docs:
        return "I don't have that information."
    K = len(docs)
    alpha = 0.45
    pos = np.arange(K)
    w = np.exp(-alpha * pos) + np.exp(-alpha * (K - 1 - pos))
    w = w / w.sum()
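    # The LLM "attends" to one doc, sampled by w, and extracts the day count.
    chosen = np.random.choice(K, p=w)
    text = docs[chosen][1]
    # Match the number directly before "days". Taking the first number in the
    # needle would return the year 2026, and the sweep would never score a hit.
    m = re.search(r"\b(\d{1,4})\s+days\b", text)
    if m:
        return f"The answer is {m.group(1)} days."
    return "I don't have that information."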

def build_prompt(docs_in_order, question):
    body = "\n".join(f'<doc id="{i}">{t}</doc>' for i, t in enumerate(docs_in_order))
    return f"{body}\n\nQuestion: {question}\nAnswer:"

K = 10
n_trials = 50
accuracy_by_position = {}

for needle_pos in range(K):
    hits = 0
    for _ in range(n_trials):
        # Shuffle distractors, keep K - 1 of them, insert needle at needle_pos.
        d = DISTRACTORS.copy()
        np.random.shuffle(d)
        d = d[:K - 1]
        ordered = d[:needle_pos] + [NEEDLE] + d[needle_pos:]
        prompt = build_prompt(ordered, QUESTION)
        response = llm(prompt)
        if GROUND_TRUTH in response:
            hits += 1
    accuracy_by_position[needle_pos] = hits / n_trials

print(f"{'pos':>3} {'accuracy':>9}  needle-here")
for pos, acc in accuracy_by_position.items():
    bar = "#" * int(acc * 50)
    print(f"{pos:>3} {acc:>8.0%}  {bar}")

best = max(accuracy_by_position.values())
worst = min(accuracy_by_position.values())
print(f"\nbest position accuracy:  {best:.0%}")
print(f"worst position accuracy: {worst:.0%}")
print(f"best/worst ratio:        {best / max(worst, 1e-6):.1f}x")
# Illustrative output shape (per-position values depend on the seed and NumPy version):
#   pos  accuracy  needle-here
#     0      30%   ###############
#     1      18%   #########
#     2      12%   ######
#     3       6%   ###
#     4       6%   ###
#     5       4%   ##
#     6       6%   ###
#     7      14%   #######
#     8      20%   ##########
#     9      32%   ################
#
#   best position accuracy:  32%
#   worst position accuracy: 4%
#   best/worst ratio:        8.0x

This is the canonical needle-in-a-haystack probe, applied to the production stack. The shape of the accuracy curve shows how strong the pattern is. Against the thresholds in Numbers that matter, a best/worst ratio above 1.5 is suspect and above three is confirmed. A flat curve, with accuracy roughly equal across positions, means the LLM is not position-sensitive at this K and the failure lives elsewhere.

Run the same sweep against the real LLM by swapping the mock llm() for an API call. Hold the prompt template, question, needle, distractors, and seed constant; only needle position varies.
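One way to make that swap painless is to pass the model in as a plain callable, so the sweep logic never changes between the mock and a real client. The sketch below is a minimal version under that assumption: `positional_sweep` and `first_doc_only` are names introduced here, the stand-in model is deliberately trivial, and a real deployment would supply its own function that sends the prompt to its API and returns the text.

```python
import random
from typing import Callable, Dict, List

def positional_sweep(
    llm_fn: Callable[[str], str],
    needle: str,
    distractors: List[str],
    question: str,
    ground_truth: str,
    n_trials: int = 20,
    seed: int = 42,
) -> Dict[int, float]:
    """Accuracy at each needle position; everything except position is held fixed."""
    K = len(distractors) + 1
    accuracy = {}
    for needle_pos in range(K):
        rng = random.Random(seed)  # identical distractor shuffles at every position
        hits = 0
        for _ in range(n_trials):
            d = distractors.copy()
            rng.shuffle(d)
            ordered = d[:needle_pos] + [needle] + d[needle_pos:]
            body = "\n".join(f'<doc id="{i}">{t}</doc>' for i, t in enumerate(ordered))
            prompt = f"{body}\n\nQuestion: {question}\nAnswer:"
            if ground_truth in llm_fn(prompt):
                hits += 1
        accuracy[needle_pos] = hits / n_trials
    return accuracy

# Stand-in model that only "reads" the first document, to show the wiring.
def first_doc_only(prompt: str) -> str:
    first = prompt.split("</doc>", 1)[0]
    return "47 days" if "47 days" in first else "I don't have that information."

result = positional_sweep(
    first_doc_only,
    needle="Telemetry is stored for exactly 47 days.",
    distractors=["Staging data is kept for 14 days.", "Backups follow the 90-day rule."],
    question="How many days is telemetry retained?",
    ground_truth="47",
    n_trials=5,
)
print(result)
```

Re-seeding the shuffle at each position means every position sees the same sequence of distractor orders, so differences between positions are attributable to the needle's location.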

Test 4 — monitorable scalar: position-conditioned accuracy gap

The first three tests cover manual investigation. Ongoing monitoring needs a single scalar — the gap between best-position accuracy and middle-position accuracy on a held-out probe set.

# Run Test 3 weekly against a fixed probe set; track this one number.

def context_rot_gap(accuracy_by_position):
    K = len(accuracy_by_position)
    bookend = max(accuracy_by_position[0], accuracy_by_position[K - 1])
    middle  = np.mean([accuracy_by_position[p] for p in range(K // 2 - 1, K // 2 + 2)])
    return bookend - middle

gap = context_rot_gap(accuracy_by_position)
print(f"context-rot gap: {gap:.0%}")
# On the illustrative table above: 32% - mean(6%, 4%, 6%) is about 27%, well outside the healthy band.

Healthy: gap under 5%. Alert: 10–20%. Five-alarm: over 20%. Plot weekly, alarm on threshold crossings, run the full sweep when the alarm fires.
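The bands translate directly into an alerting rule. The sketch below is one hedged way to encode them; `gap_band` is a hypothetical helper, the weekly readings are made up, and labeling the 5–10% range "watch" is an assumption, since the bands above do not name it.

```python
def gap_band(gap: float) -> str:
    """Classify a context-rot gap given as a fraction (0.27 means 27 points)."""
    if gap > 0.20:
        return "five-alarm"
    if gap >= 0.10:
        return "alert"
    if gap < 0.05:
        return "healthy"
    # Assumption: the 5-10% range is not named in the text; treated as "watch".
    return "watch"

weekly_gaps = [0.04, 0.07, 0.12, 0.27]  # made-up readings for illustration
for week, gap in enumerate(weekly_gaps, start=1):
    print(f"week {week}: gap={gap:.0%} -> {gap_band(gap)}")
```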

Worked example end-to-end

A production docs-search assistant sees a cluster of “no information” answers on a query type whose source documents are reliably in the retrieved top-10. Diagnostic outputs cluster in the same range across affected stacks: Test 1 returns avg_k of 9–10 and p95_prompt_tokens between 12k and 18k. Test 2 returns a citation-rate U with head/middle ratios in the 5–8x range. Test 3, run against a 30–50 question probe set, returns best-position accuracy of 80–90% and worst-position accuracy of 15–25% — a best/worst ratio in the 3–5x band. Test 4 gap: 30–45%.

Treatment §1 (drop K to 3–4 via a calibrated reranker floor) and Treatment §2 (the rot-aware permutation on what survives) typically close the gap to under 10% within a few weeks and lift answer accuracy on the affected query type by 20–30 absolute points. No new documents were retrieved; the same candidate set was trimmed and reordered, with the two strongest documents anchored at the bookends.

Treatment

Order matters. Each step assumes the previous one is done.

1. Drop top-K aggressively

A calibrated reranker emits scores that can support a confidence-based floor instead of always filling to K=10. Most well-formed queries need 2–4 documents, not 10. The shortest reliable fix to context rot is to not have a middle.

# Calibrated-floor + relative-gap drop. Stops at K=4 max.

def trim_topk(scored_docs, abs_floor=0.55, rel_gap=0.10, max_k=4):
    """scored_docs: list of (doc, score) sorted by score desc. Returns trimmed list."""
    if not scored_docs:
        return []
    top_score = scored_docs[0][1]
    out = []
    for doc, score in scored_docs[:max_k]:
        if score < abs_floor:
            break
        if top_score - score > rel_gap and len(out) >= 2:
            break
        out.append((doc, score))
    return out

See Threshold by feel for picking the floor from data. As a starting point, an absolute floor near 0.55 plus a relative gap of approximately 0.10 from the top score drops K from 10 to 2–4 on most queries (the function above never returns more than max_k=4). The drop is the highest-leverage move in this playbook; the reorder fix in §2 sounds more sophisticated but is strictly secondary.

2. Re-order what survives — rot-aware permutation

Place the two highest-scoring documents at positions 0 and N-1, the two attention anchors. Fan the remaining documents inward in alternating fashion so the third-strongest lands adjacent to a bookend rather than buried.

def rot_aware_order(docs_by_score):
    if len(docs_by_score) <= 2:
        return docs_by_score
    head, tail, rest = [docs_by_score[0]], [docs_by_score[1]], docs_by_score[2:]
    for i, d in enumerate(rest):
        (head if i % 2 == 0 else tail).append(d)
    return head + list(reversed(tail))

# Input  [A,B,C,D,E] (sorted by score desc)
# Output [A,C,E,D,B] — strongest at the two bookends, weaker fanned inward.

The permutation is trivial to implement and gives a measurable lift on the affected cases. Pair it with structured <doc id="...">...</doc> blocks so the model can cite by ID rather than by position; see Right doc, wrong answer on provenance.
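A quick way to sanity-check the permutation is to score both orderings under the toy U-shaped profile from Mechanism. The sketch below does that with hypothetical reranker scores; `attention_weighted_score` is a metric invented here for illustration, not something a real model exposes, and `rot_aware_order` is repeated so the block runs on its own.

```python
import numpy as np

def rot_aware_order(docs_by_score):
    # Same permutation as above, repeated so this block is self-contained.
    if len(docs_by_score) <= 2:
        return list(docs_by_score)
    head, tail, rest = [docs_by_score[0]], [docs_by_score[1]], docs_by_score[2:]
    for i, d in enumerate(rest):
        (head if i % 2 == 0 else tail).append(d)
    return head + list(reversed(tail))

def attention_weighted_score(scores_in_prompt_order, alpha=0.45):
    """Toy metric: relevance scores weighted by the U-shaped position profile."""
    K = len(scores_in_prompt_order)
    pos = np.arange(K)
    w = np.exp(-alpha * pos) + np.exp(-alpha * (K - 1 - pos))
    w = w / w.sum()
    return float(np.dot(w, scores_in_prompt_order))

scores = [0.90, 0.80, 0.70, 0.60, 0.50]  # hypothetical scores, sorted descending
by_score = attention_weighted_score(scores)
reordered = attention_weighted_score(rot_aware_order(scores))
print(f"score order: {by_score:.3f}  rot-aware order: {reordered:.3f}")
```

Under this toy profile the rot-aware order scores higher than plain descending order for the same five documents, because the second-strongest document moves from slot 1 to the tail anchor.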

3. Structured-context markers

Wrap each retrieved document in an explicit, model-readable block with metadata.

<doc id="3" source="retention-policy-2026.md" reranker_score="0.84">
The 2026 retention policy for Acme Logistics requires that operational
telemetry be stored for exactly 47 days before automatic deletion.
...
</doc>

Two effects. Structural tokens act as soft attention sinks the model can re-find inside the middle of the window — the U-shape softens against explicit landmarks. The metadata gives the answer-extractor a citable handle, which improves answer quality and lets the model express uncertainty by which document it is citing rather than by hedging. The marker treatment alone does not save a long-context stack; combined with §1 and §2 it adds another 5–10 absolute points to middle-position accuracy.
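Rendering the blocks is a small formatting step. The sketch below shows one way to produce them, assuming each document arrives as a dict with `source`, `score`, and `text` keys (hypothetical names) and is already in final prompt order. Escaping the chunk body is a design choice made here so that stray angle brackets in a chunk cannot close a block early; whether escaped text suits a given model is something to check on your own stack.

```python
from xml.sax.saxutils import escape, quoteattr

def render_doc(doc_id, source, score, text):
    """Wrap one chunk in a <doc> block; attribute names mirror the example above."""
    score_str = f"{score:.2f}"
    attrs = (
        f"id={quoteattr(str(doc_id))} "
        f"source={quoteattr(source)} "
        f"reranker_score={quoteattr(score_str)}"
    )
    # Escape &, <, > in the body so chunk text cannot terminate the block.
    return f"<doc {attrs}>\n{escape(text)}\n</doc>"

def render_context(docs):
    """docs: list of dicts with 'source', 'score', 'text', already in prompt order."""
    return "\n".join(
        render_doc(i, d["source"], d["score"], d["text"]) for i, d in enumerate(docs)
    )

example = render_context([
    {"source": "retention-policy-2026.md", "score": 0.84,
     "text": "Operational telemetry is stored for exactly 47 days."},
    {"source": "staging-defaults.md", "score": 0.61,
     "text": "Default retention for staging is 14 days."},
])
print(example)
```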

4. Reduce context-window usage at the source

The patches above optimize what to do with a long context. The deeper move is to not have one. Reranker scores that support a K=3 floor cleanly should be used that way. Scores that do not are a reranker problem — to be fixed before optimizing attention over chunks the model should not be reading anyway.

A calibrated cross-encoder reranker is what makes Treatment §1 a one-line config change rather than a multi-week eval-set rebuild. The point is not the reranker; the point is the calibrated scores.

What does NOT work — and every team tries first

Increase the context window. Going from 32k to 128k to 1M tokens does not flatten the U; in most published benchmarks it deepens it. The model has more middle to lose, not less.

Switch to a “long-context-optimized” model. Some help marginally. None eliminate the U at K=10+ on production prompts. The wins from §1 and §2 dominate the wins from a model swap and compose with whatever model is chosen.

Tell the model to read carefully. Prompt engineering (“Please read all of the documents carefully”) does not redistribute attention probability mass. It is a verbal intervention on a numerical phenomenon.

Lower the temperature. Temperature governs how the model samples from its output distribution. It does not change which input positions the model attends to.

The right doc was in the context window. The model didn’t read it. Drop K, reorder the survivors with the two strongest at the bookends, mark each doc with structural tags — in that order. Bigger context windows make this worse, not better.

This isn’t this pattern when…

Observed signal | Probable pattern | Read next
Citation-rate distribution is flat across positions but model still answers wrong | Generator hallucinating around a correctly-attended doc | Right doc, wrong answer
Failures correlate with reranker latency, not with K or position | Cross-encoder reranker burning the request-path budget | Reranker on the request path
Citation U is mild; the real failure is the eval set didn’t have this question type | The eval drifted off the live distribution | Eval drift
Threshold-based K-drop worked until the LLM was swapped, then regressed | Calibration regression on model swap | Threshold by feel
Per-doc scores look fine, but a cheap-first cascade is letting through borderline cases | Cascade confidence has drifted | Cascade saturation

The disambiguation rule: context rot is position-conditional on a fixed query and a fixed retrieved set. If the failure does not move when the doc moves, the failure is not this pattern.

Numbers that matter

signal | healthy | suspect | confirmed
average K in production | under 4 | 5–8 | 9 or more
p95 prompt tokens | under 4k | 4k–16k | over 16k
head/middle citation-rate ratio | under 2x | 2–4x | over 4x
best/worst needle-position accuracy ratio | under 1.5x | 1.5–3x | over 3x
context-rot gap (Test 4) | under 5% | 10–20% | over 20%

These are starting thresholds. Stacks with stronger reranker calibration tolerate higher K before the gap opens; stacks on long-context-extrapolated models collapse at lower K than the table suggests.

Adjacent patterns

  • Right doc, wrong answer: same surface symptom (retrieval correct, answer wrong) but the model is reading the doc and contradicting it, not under-attending to it. Context rot is one of six sub-mechanisms in that broader pattern; if Test 2’s citation-rate distribution is flat, the failure is one of the other five.
  • Reranker on the request path: dropping K from 10 to 3 is also the highest-leverage move on latency budgets. The two patterns share a treatment; a stack paying for it once should be paying for it for both reasons.
  • Threshold by feel: the K-drop in Treatment §1 depends on a calibrated reranker score. A threshold picked by gut regresses on the next model swap and brings context rot back with it.
  • Embedding plateau: reranker scores that cannot support a confident K=3 floor on a particular domain often point upstream — the embedder is the bottleneck, not the reranker or the attention pattern.

A stack that has dropped K, reordered the survivors, structured the doc blocks — and still shows a U-shaped positional-sweep accuracy curve — has one of the four patterns above, not this one.

The team writing this, ZeroEntropy, trains specialized small models (zembed-1, zerank-2) for the production stacks where these patterns show up.