Right Doc, Wrong Answer
Retrieval surfaces the relevant passage and generation contradicts or hand-waves past it; the failure is in the LLM-doc interaction, not in retrieval.
Symptom
Retrieval dashboards report health. NDCG@10 sits in the 0.85-0.95 band, Recall@5 often clears 0.95, and hand audits confirm the answer-bearing document landed in the top-3. Engineers reading the retrieved context can answer the question in seconds. Retrieval is honest.
“The retriever finds the right document and the model still gives wrong or hand-wavy answers.”
User-facing signal disagrees. Concretely:
- Thumbs-down rate is flat or rising as retrieval metrics climb. The curves separate over weeks with no internal explanation.
- Support tickets cluster around answer quality, not relevance. “The bot said X but the docs say Y.” “It refused to commit to a version number that is in the doc.” “It hedged on a yes/no question the policy answers definitively.”
- The same query replayed against the same context produces a wrong model answer where a human reads it correctly in 20 seconds. The gap between human-from-context and model-from-context is the gap this playbook names.
- Topical-relevance judges score the retrieved doc as on-topic because it is. The user’s complaint is fidelity, not topicality, and the judge does not measure fidelity.
- The team’s first instinct is to fix retrieval because retrieval is the visible layer. Retrieval is not the problem.
Retrieval is honest about retrieval. Users are honest about answers. The end-to-end metric is silent on the link between the two — the generation step that turns context into words.
Mechanism
The relevant document reached the LLM. Something between document-arrives and final-token-out went wrong. That space is a single black box on most dashboards, and it contains at least six distinct failure modes that look identical from the outside.
The six sub-mechanisms:
- Context rot. The relevant document lands mid-window. LLMs under-attend to mid-window content — the well-documented U-shape in attention over long contexts — and effectively ignore the doc that is right there. See Context rot for the dedicated diagnosis.
- Conflicting docs, no resolution. The top-3 contains both a 2019 policy and a 2024 update. With no explicit recency metadata in the context, the model picks one without recognizing the conflict and produces a fluent answer from the stale one.
- Over-generalization. A system prompt that reads “be helpful and concise” steers the model toward summarizing across docs rather than grounding in one. Output is topical, vague, and not citably grounded.
- Chunks too short to carry the answer. A small chunk size splits a definition across two chunks; the model receives one half. Retrieval is correct at the chunk level; the unit of meaning sits below the chunk size.
- Confabulation on absent info. The question is not answered by any retrieved doc — even when some doc is topically near it. The model defaults to producing fluent text and confabulates rather than refusing.
- No provenance structure. Multiple docs are stuffed into the prompt as one undifferentiated blob. The model cannot delimit sources, cannot attribute claims, and cannot reason about which source it is using.
Notice what is missing from this list: retrieval errors. By the definition of the pattern, retrieval already found the doc. The fix is downstream.
On a typical docs-search workload the metric gap is the giveaway. Recall@5 in the 0.92-0.96 band coexists with user-correct rates of 0.65-0.75 — a 20-30 point spread between “the right doc was there” and “the user got a right answer.” Decomposing the wrong answers across audited samples produces a recurring distribution: over-generalization 30-45%, conflicting docs 15-25%, context rot 10-20%, confabulation 8-15%, chunks too short 3-8%, no provenance 2-6%. The shares shift by domain — but one sub-mechanism almost always dominates.
The retrieval metric is honest about retrieval and silent on every one of those six rows. The end-to-end metric is honest about user experience and silent on which row dominates. Until the team decomposes, “improve the system” has no actionable target. Patching the dominant sub-mechanism moves the user-facing number fastest; patching all six at once produces a Frankenstein prompt nobody can reason about a month later.
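The gap itself is cheap to compute once per-query logs carry both signals. A minimal sketch, assuming each logged query has two boolean fields, `answer_doc_in_top5` and `user_correct`; both field names and the sample rows are hypothetical, not a real log schema. The bands are the ones listed under Numbers that matter.

```python
# Hypothetical per-query log rows; field names are illustrative assumptions.
rows = (
    [{"answer_doc_in_top5": True, "user_correct": True}] * 14
    + [{"answer_doc_in_top5": True, "user_correct": False}] * 5
    + [{"answer_doc_in_top5": False, "user_correct": False}] * 1
)


def recall_vs_correct(rows):
    """Return (recall_at_5, user_correct_rate, gap) over logged queries."""
    n = len(rows)
    if n == 0:
        raise ValueError("no rows to score")
    recall = sum(r["answer_doc_in_top5"] for r in rows) / n
    correct = sum(r["user_correct"] for r in rows) / n
    return recall, correct, recall - correct


def read_gap(gap):
    """Map the gap onto the healthy / suspect / confirmed bands."""
    if gap > 0.15:
        return "confirmed"
    if gap >= 0.05:
        return "suspect"
    return "healthy"


recall, correct, gap = recall_vs_correct(rows)
print(f"Recall@5={recall:.2f} user-correct={correct:.2f} "
      f"gap={gap:.2f} -> {read_gap(gap)}")
```

The number says only that a gap exists; attributing it to a sub-mechanism is the job of the diagnostic tests.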
Diagnostic
Four tests, ordered cheap-to-expensive. The goal is to attribute the user-correct gap to one specific sub-mechanism and confirm before patching. Stop at the first test that fires hard enough to act on.
Test 1 — human-from-context check (90 seconds, no code)
The cheapest discrimination between retrieval failure and generation failure. Pull 20 wrong answers from the last 24 hours. For each, read the retrieved context and answer the question without the model’s output in view. Tally:
- Could a human answer correctly from the context? Yes / No.
- Was the answer-bearing passage present? Yes / No.
| both yes count | reading |
|---|---|
| 0-4 of 20 | retrieval is failing; this is not this pattern, read Eval drift |
| 5-13 of 20 | mixed; both layers are contributing, continue diagnosis |
| 14-20 of 20 | the pattern is firing hard; retrieval is healthy, the failure is downstream |
No tooling required, at ~90 seconds per example, or about 30 minutes for the full set of 20. It is the highest-information signal producible in an afternoon, and it short-circuits the most common misdiagnosis: blaming retrieval for a generation failure.
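If the tally is kept in a spreadsheet or a list, the reading can be made mechanical. A minimal sketch, assuming each audited example is recorded as a pair of booleans; the function name `read_test1` and the sample audit are hypothetical. The cut-offs scale the 0-4 / 5-13 / 14-20 bands from the table to any sample size.

```python
def read_test1(audits):
    """audits: list of (human_can_answer, passage_present) booleans,
    one pair per audited wrong answer."""
    if not audits:
        raise ValueError("audit at least one example")
    n = len(audits)
    both_yes = sum(1 for can_answer, present in audits if can_answer and present)
    # Integer comparisons mirror the 0-4 / 5-13 / 14-20 bands for n = 20.
    if both_yes * 20 >= 14 * n:
        reading = "pattern firing: retrieval healthy, failure is downstream"
    elif both_yes * 20 >= 5 * n:
        reading = "mixed: both layers contributing, continue diagnosis"
    else:
        reading = "retrieval failing: not this pattern"
    return both_yes, reading


# Illustrative audit of 20 wrong answers (made-up tallies).
audits = [(True, True)] * 16 + [(False, True)] * 2 + [(False, False)] * 2
count, reading = read_test1(audits)
print(f"{count} of {len(audits)} both-yes -> {reading}")
```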
Test 2 — citation-overlap audit (10 minutes, small Python)
Once Test 1 confirms the failure is downstream, the next question is whether the answer’s content is actually grounded in the retrieved doc or paraphrased from pretraining priors. Lexical overlap between answer and retrieved doc (TF-IDF cosine over unigrams and bigrams in the snippet below) is a noisy per-example signal and a sharp aggregate one. Production swaps the TF-IDF stand-in for a semantic-similarity check; the shape of the test is the same.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
np.random.seed(42)
# Inline fixture — 8 (query, retrieved_doc, model_answer) triples.
# In production this is your last 200 wrong answers.
triples = [
    ("what is the retention period for audit logs",
     "Audit logs are retained for 90 days on the standard tier and 365 days on enterprise.",
     "Audit logs are typically kept for around three months in most systems."),
    ("does SAML support SCIM",
     "SAML SSO can be paired with SCIM 2.0 for automated user provisioning.",
     "SAML SSO can be paired with SCIM 2.0 for automated user provisioning."),
    ("what's the rate limit on free tier",
     "Free tier is limited to 60 requests per minute with bursts up to 120.",
     "Free tier has generous rate limits suitable for development."),
    ("which regions support VPC peering",
     "VPC peering is available in us-east-1, us-west-2, and eu-west-1.",
     "VPC peering is available in us-east-1, us-west-2, and eu-west-1."),
    ("how do I rotate API keys",
     "Rotate keys via Settings > API > Rotate. Old key remains valid for 24 hours.",
     "You can rotate keys in the dashboard settings."),
    ("what models support function calling",
     "Function calling is supported by zembed-3 and the cross-encoder reranker v2.",
     "Most modern models support function calling these days."),
    ("encryption at rest algorithm",
     "Data at rest is encrypted with AES-256-GCM using customer-managed keys via KMS.",
     "Data is encrypted at rest using industry-standard encryption."),
    ("retention on deleted records",
     "Deleted records are purged after 30 days; recovery is not possible after.",
     "Deleted records are retained briefly then removed permanently."),
]
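vec = TfidfVectorizer(min_df=1, ngram_range=(1, 2))
overlaps = []
for q, doc, ans in triples:
    # Refit per pair so IDF weights reflect only this doc and this answer.
    vec.fit([doc, ans])
    d = vec.transform([doc])
    a = vec.transform([ans])
    overlaps.append(float(cosine_similarity(d, a)[0, 0]))

print(f"{'overlap':>8} answer")
# Sort on the overlap score only, lowest first.
for o, (_, _, a) in sorted(zip(overlaps, triples), key=lambda pair: pair[0]):
    print(f"{o:>8.2f} {a[:60]}")
print(f"\nmedian overlap: {np.median(overlaps):.2f}")
print(f"frac under 0.20: {(np.array(overlaps) < 0.20).mean():.0%}")

# Expected output on this fixture (approximate; exact values depend on
# scikit-learn's default tokenizer and IDF smoothing):
#   median overlap: 0.15
#   frac under 0.20: 62%   (5 of 8)
# The two verbatim answers score 1.00. The low-overlap answers are the
# generalized / hand-wavy ones.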
| median overlap | frac under 0.20 | reading |
|---|---|---|
| over 0.45 | under 15% | model is grounding in retrieved text |
| 0.25-0.45 | 15-35% | drift toward over-generalization, suspect |
| under 0.25 | over 35% | model is producing answers largely from prior, confirmed |
Single-example overlap is noisy — a correct paraphrase scores low — but the distribution shape across 100+ examples is robust. Plot it weekly.
Test 3 — sub-mechanism bucketing on 50 wrong answers (45 minutes, conclusive)
The decisive test. Pull 50 wrong answers from the last week and bucket each into one of the six sub-mechanisms with deterministic rules. Code-driven bucketing buys reproducibility: two engineers should produce the same buckets on the same examples. The snippet below uses heuristics in place of a judge call so it runs standalone; production swaps the heuristics for LLM-as-judge calls.
from collections import Counter
# Inline fixture: each example carries enough metadata to bucket it.
# In production these come from your log store.
examples = [
    # (query, ranked_docs, answer_doc_position, model_answer, top_score, second_score)
    ("audit log retention",
     ["The retention period is 90 days on standard tier.",
      "Compliance overview discusses logging at a high level.",
      "Pricing tiers and limits."],
     0, "Audit logs are typically kept around three months.", 0.81, 0.42),
    ("which regions support VPC peering",
     ["Pricing info.", "Tier comparison.", "Onboarding guide.",
      "Networking guide: VPC peering in us-east-1, us-west-2, eu-west-1.",
      "Account settings.", "FAQ.", "Changelog.", "Support."],
     3, "VPC peering availability depends on your plan.", 0.62, 0.55),
    ("SAML SCIM support",
     ["SAML supports SCIM 2.0.", "Older guide: SAML only, no SCIM."],
     0, "SAML may or may not support SCIM depending on configuration.", 0.74, 0.71),
    ("free tier rate limit",
     ["Free tier: 60 requests per minute."],
     0, "Free tier has generous limits.", 0.83, 0.20),
    ("how do I rotate keys",
     ["Rotate keys via Settings > API > Rotate."],
     0, "Use the dashboard to rotate keys.", 0.79, 0.30),
    ("does enterprise tier include SSO",
     ["Pricing FAQ mentions SSO available.",
      "Older pricing page predates SSO release."],
     0, "Enterprise tier may include SSO; check your contract.", 0.70, 0.66),
    ("encryption algorithm at rest",
     ["Data is encrypted with"],  # truncated chunk
     0, "Data is encrypted using strong encryption.", 0.77, 0.30),
    ("how to file a tax form in our system",
     ["General compliance overview."],
     None, "Submit form via the compliance dashboard.", 0.41, 0.30),
]
def bucket(ex):
    _query, docs, pos, ans, s1, s2 = ex
    # Confabulation: no answer-bearing doc retrieved, model answered confidently.
    if pos is None:
        return "confabulation"
    # Context rot: answer doc in mid-window (pos 3-7 of >=8 docs).
    if len(docs) >= 8 and 3 <= pos <= 7:
        return "context_rot"
    # Conflict: top-2 close in score. A judge would also check that the two
    # docs are topically similar; this stand-in uses the score gap alone.
    if s1 - s2 < 0.05:
        return "conflict_no_resolution"
    # Chunks too short: the answer-bearing chunk ends mid-sentence. The fixture
    # docs are one-liners, so a raw length cutoff (e.g. under 60 chars) would
    # flag every one of them; add a length floor only on real chunk sizes.
    if not docs[pos].rstrip().endswith((".", "!", "?")):
        return "chunks_too_short"
    # Over-generalization: doc has a specific fact, answer is hedged.
    hedges = ("typically", "around", "may", "depending", "generous", "usually")
    if any(h in ans.lower() for h in hedges):
        return "over_generalization"
    # Fallback: provenance. Answer doesn't cite, multiple docs in context.
    if len(docs) > 1:
        return "no_provenance"
    return "unknown"


buckets = Counter(bucket(ex) for ex in examples)
total = sum(buckets.values())
print(f"{'sub-mechanism':<28} {'count':>6} {'share':>8}")
for k, v in buckets.most_common():
    print(f"{k:<28} {v:>6} {v/total:>7.0%}")

# Expected output on this 8-example fixture:
# over_generalization               2     25%
# conflict_no_resolution            2     25%
# context_rot                       1     12%
# unknown                           1     12%
# chunks_too_short                  1     12%
# confabulation                     1     12%
# The "unknown" row (the rotate-keys example) is the kind of case the
# heuristics cannot place and a judge call would.
Confirmation: one sub-mechanism holds at least 30% of the 50. A flat distribution (every bucket 10-20%) means the bucketing rule is not discriminating — swap heuristics for a proper judge call and re-run. Flat is “the rule is lossy”, not “all six are firing equally.”
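The confirmation rule can be applied to the bucket counts directly. A minimal sketch, assuming the counts arrive as a `Counter`; the function name `read_distribution` and both sets of counts are hypothetical, chosen only to show one dominant and one flat distribution over 50 examples.

```python
from collections import Counter


def read_distribution(buckets, confirm_share=0.30):
    """Return (dominant_bucket, share, verdict) for a Counter of bucket counts."""
    total = sum(buckets.values())
    if total == 0:
        raise ValueError("no examples bucketed")
    name, count = buckets.most_common(1)[0]
    share = count / total
    # At or above the confirmation share: act on the dominant bucket.
    # Below it: treat the bucketing rule as lossy, not the buckets as equal.
    verdict = "confirmed" if share >= confirm_share else "not discriminating"
    return name, share, verdict


# Illustrative counts over 50 wrong answers (made up for the example).
dominant = Counter({"over_generalization": 19, "conflict_no_resolution": 10,
                    "context_rot": 7, "confabulation": 6,
                    "chunks_too_short": 5, "no_provenance": 3})
flat = Counter({"over_generalization": 9, "conflict_no_resolution": 9,
                "context_rot": 8, "confabulation": 8,
                "chunks_too_short": 8, "no_provenance": 8})

for label, counts in (("dominant", dominant), ("flat", flat)):
    name, share, verdict = read_distribution(counts)
    print(f"{label:<9} {name:<22} {share:.0%} {verdict}")
```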
Test 4 — grounded-ness score as a monitorable scalar
Manual investigation needs the three tests above. Ongoing monitoring needs a single scalar to plot and alert on. Average grounded-ness (judge-scored on a daily sample of 200 answers) is the right shape: it summarizes how often the model’s claims trace back to the retrieved context.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
np.random.seed(42)
# Inline fixture: (retrieved_context, model_answer) on a daily sample.
# Production: replace the overlap function with a judge call that
# returns 0-3 on "is every claim in the answer supported by context".
daily_samples = [
    ("Audit logs retained 90 days standard, 365 days enterprise.",
     "Audit logs are kept for 90 days on standard tier."),
    ("Free tier is 60 RPM with bursts to 120.",
     "Free tier has generous limits."),
    ("VPC peering: us-east-1, us-west-2, eu-west-1.",
     "VPC peering is in us-east-1, us-west-2, eu-west-1."),
    ("AES-256-GCM with customer-managed keys via KMS.",
     "Industry-standard encryption is used at rest."),
    ("Rotate via Settings > API > Rotate.",
     "Rotate keys in dashboard settings."),
    ("Function calling: zembed-3 and reranker v2.",
     "Many models support function calling."),
    ("Deleted records purge after 30 days.",
     "Deleted records purge after 30 days."),
    ("SCIM 2.0 paired with SAML SSO.",
     "SCIM 2.0 pairs with SAML for provisioning."),
]
def grounded_proxy(context, answer):
    """Overlap-based stand-in for a judge call."""
    vec = TfidfVectorizer(min_df=1, ngram_range=(1, 2)).fit([context, answer])
    sim = cosine_similarity(vec.transform([context]),
                            vec.transform([answer]))[0, 0]
    # Map cosine to 0-3 bins.
    if sim > 0.55:
        return 3
    if sim > 0.35:
        return 2
    if sim > 0.15:
        return 1
    return 0


scores = np.array([grounded_proxy(c, a) for c, a in daily_samples])
print(f"mean grounded-ness: {scores.mean():.2f} / 3")
print(f"frac at 0 or 1: {(scores <= 1).mean():.0%}")

# Expected output on this fixture (approximate; exact values depend on
# scikit-learn's default tokenizer and IDF smoothing):
#   mean grounded-ness: 1.25 / 3
#   frac at 0 or 1: 75%
# Only the verbatim and near-verbatim answers reach 3; short paraphrases
# land at 0 or 1, which is why production uses a judge rather than overlap.
Healthy: mean over 2.4 AND the fraction of answers scoring 0 or 1 below 15%. Suspect: mean 2.0-2.4. Confirmed: mean under 2.0. Plot weekly, alarm on a drop of 0.3 or more from the rolling baseline, and pair the alarm with an auto-pull of 50 wrong answers so the on-call engineer can re-run Test 3 within the hour.
Worked example end-to-end
A docs-Q&A workload sits at Recall@5 around 0.95 and user-correct around 0.70 — the canonical signature of the pattern. Running the four tests in sequence produces a typical trajectory.
Test 1 lands at 15-18 of 20 answerable from context. The failure is downstream of retrieval; diagnosis continues only to identify which sub-mechanism dominates.
Test 2 reports median overlap in the 0.18-0.25 band with 40-55% of answers under 0.20. The model is producing fluent answers that bear little textual relationship to the retrieved doc — the signature of over-generalization or confabulation.
Test 3 returns a familiar distribution on this kind of workload: over-generalization 35-45%, conflict-no-resolution 15-25%, confabulation 10-15%, context rot 10-15%, chunks too short 5-10%, no provenance 3-8%. The dominant bucket is over-generalization, and the system prompt is the suspect.
The prompt usually reads something close to “You are a helpful AI assistant. Answer the user’s question clearly and concisely using the provided context.” Three words are doing the damage: helpful rewards saying something over saying “I don’t know”; clearly rewards smoothing over conflicts; concisely rewards summarizing-across over grounding-in-one.
Shipping the Treatment §1 prompt — cite-document-ID-required, refuse-if-not-in-docs, one-shot example of the citation pattern — and adding the Test 4 monitor with an alarm at mean < 2.1 is the typical first move.
Three days later the over-generalization bucket drops from 35-45% to under 15% of the (smaller) wrong-answer set. User-correct moves into the 0.80-0.85 range on the next week’s traffic. The new dominant bucket is usually conflict-no-resolution, now sitting around 25-35% — and the next fix is mechanically obvious.
The aggregate metric never identified the load-bearing word in the prompt. The decomposition did.
Treatment
Each sub-mechanism gets its own fix. Apply only the one the dominant bucket indicates. Ordering below is roughly by frequency-of-dominance, not by completeness.
1. Over-generalization — fix the prompt, not the retrieval layer
The most common dominant bucket and the cheapest fix. Replace “be helpful and concise” with explicit grounding instructions and a citation requirement.
You answer questions using ONLY the provided documents.
Rules:
1. Every factual claim must be followed by a citation: [doc_id].
2. If the documents do not contain the answer, reply exactly:
"The provided documents do not answer this question."
3. Do not summarize across documents that disagree. Pick one
and explain why (recency, specificity, or scope).
4. Do not paraphrase numbers, dates, or names. Quote them
verbatim from the document.
Example:
Q: What is the audit-log retention period on enterprise?
DOCS: [doc_3] "Enterprise tier: audit logs retained 365 days."
A: Audit logs are retained for 365 days on enterprise [doc_3].
Pair with a judge prompt that penalizes uncited claims (see Test 4) and render citations as clickable inline pills in the UI so the visual signal reinforces the instruction.
WHY: the model’s default is fluent prose; explicit constraints redirect that default toward grounded prose. ALTERNATIVE: fine-tune on cited completions. An order of magnitude more expensive and rarely beats a well-written prompt on this specific failure mode. Don’t reach for fine-tuning before exhausting the prompt edit.
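Whether the citation rule is being followed can be checked mechanically before any judge is involved. A minimal sketch, assuming citations use the `[doc_id]` form from the prompt above and sit inside the sentence they support; the regex, the sentence splitter, and the function name `citation_report` are illustrative simplifications, not a production parser.

```python
import re

CITATION = re.compile(r"\[([A-Za-z0-9_\-]+)\]")
REFUSAL = "The provided documents do not answer this question."


def citation_report(answer, context_ids):
    """Share of sentences carrying a citation, plus cited ids that were
    never in the context (a sign of invented provenance)."""
    text = answer.strip()
    if text == REFUSAL:
        # The sanctioned refusal needs no citation.
        return {"cited_share": 1.0, "unknown_ids": []}
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    if not sentences:
        return {"cited_share": 0.0, "unknown_ids": []}
    cited = [s for s in sentences if CITATION.search(s)]
    unknown = sorted(set(CITATION.findall(text)) - set(context_ids))
    return {"cited_share": len(cited) / len(sentences), "unknown_ids": unknown}


answer = ("Audit logs are retained for 365 days on enterprise [doc_3]. "
          "Standard tier keeps them for 90 days.")
print(citation_report(answer, {"doc_3"}))
```

Averaged over a sample of answers, `cited_share` gives a rough proxy for the citation-rate row under Numbers that matter.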
2. Conflict-no-resolution — add recency or authority metadata to the context envelope
Surface an effective_date or authority field into the context envelope and instruct the model to prefer the most recent or highest-authority source explicitly.
def envelope(doc, query):
    # `query` is not used in the envelope itself; the signature is unchanged.
    return (
        f'<doc id="{doc.id}" '
        f'effective_date="{doc.effective_date}" '
        f'authority="{doc.authority}" '
        f'section="{doc.section}">'
        f"{doc.text}"
        f"</doc>"
    )
When effective_date is missing, backfilling from git history or filesystem mtime is the cheapest metadata addition with the largest cleanup effect on this sub-mechanism.
WHY: the model cannot resolve a conflict it cannot see. Making the conflict resolvable in-context shifts the failure mode from silent-wrong-pick to explicit-reasoning. ALTERNATIVE: filter stale docs out of retrieval entirely. Works for clear deprecation, loses the ability to answer “what was the policy as of 2023” — a real user need on some products.
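The mtime backfill mentioned above might look like the following. A minimal sketch, assuming document metadata is a plain dict and the source file is on local disk; the function name, the `effective_date_source` field, and the demo file are hypothetical. File mtime is a weak signal (a fresh checkout resets it), so the source is recorded and git history is the better fallback where it exists.

```python
import os
import tempfile
from datetime import datetime, timezone


def backfill_effective_date(metadata, path):
    """Return a copy of metadata with effective_date filled from the file's
    modification time when the field is missing or empty."""
    if metadata.get("effective_date"):
        return dict(metadata)
    modified = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
    return {
        **metadata,
        "effective_date": modified.date().isoformat(),
        # Record provenance of the date so it can be overridden later.
        "effective_date_source": "mtime",
    }


# Demo on a throwaway file whose mtime is set to a known day.
with tempfile.NamedTemporaryFile("w", suffix=".md", delete=False) as handle:
    handle.write("Audit logs are retained for 90 days on all tiers.")
    demo_path = handle.name
stamp = datetime(2022, 3, 4, 12, tzinfo=timezone.utc).timestamp()
os.utime(demo_path, (stamp, stamp))

filled = backfill_effective_date({"id": "kb_039"}, demo_path)
kept = backfill_effective_date({"id": "kb_142", "effective_date": "2024-08-12"}, demo_path)
os.remove(demo_path)
print(filled)
```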
3. Context rot — reorder so highest-confidence docs sit at the edges
Place the highest-confidence docs at the start and end of the window. Drop top-K from 10 to 5 if reranker calibration supports it.
def lost_in_middle_order(docs_by_score):
    """Highest score first, second-highest last, alternate inward."""
    docs = list(docs_by_score)
    front = docs[0::2]  # ranks 1, 3, 5, ... fill from the start of the window
    back = docs[1::2]   # ranks 2, 4, 6, ... fill from the end of the window
    # The weakest docs end up mid-window, where attention is lowest.
    return front + back[::-1]
WHY: the U-shape in long-context attention is empirically well-documented across model families. Placement is a free lever. ALTERNATIVE: switch to a long-context-tuned model. More expensive, only partially fixes it, doesn’t survive the next model swap. See Context rot for the dedicated diagnosis.
4. Chunks too short — parent-document expansion
Index small chunks for retrieval precision, expand each retrieved chunk to its parent section before adding it to the context window.
chunks = first_pass(query, k=20)
reranked = rerank(query, chunks, k=5)
parents = [doc_index.get_section(c.metadata["parent_section"]) for c in reranked]
context = dedupe_and_order(parents)
WHY: retrieval precision wants small units (less noise per match); generation correctness wants surrounding context to resolve pronouns and complete definitions. Decoupling the retrieval unit from the generation unit gets both. ALTERNATIVE: enlarge every chunk. Tanks retrieval precision; the model now has more irrelevant content to ignore within each match.
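The expansion step leaves `dedupe_and_order` undefined; one plausible reading is sketched below. It assumes chunks are dicts carrying a `parent_section` key and that sections live in an in-memory dict; the function name `expand_to_parents`, both data structures, and the sample text are illustrative, not the author's index API.

```python
def expand_to_parents(reranked_chunks, sections):
    """Swap each reranked chunk for its parent section, keeping the first
    (best-ranked) occurrence when several chunks share a parent."""
    seen = set()
    context = []
    for chunk in reranked_chunks:  # assumed already in reranker order
        section_id = chunk["parent_section"]
        if section_id in seen:
            continue  # a higher-ranked sibling already pulled this section in
        seen.add(section_id)
        context.append((section_id, sections[section_id]))
    return context


# Illustrative data: two chunks split one definition across a chunk boundary.
sections = {
    "sec_encryption": "Data at rest is encrypted with AES-256-GCM using "
                      "customer-managed keys via KMS.",
    "sec_retention": "Audit logs are retained for 90 days on the standard tier.",
}
reranked = [
    {"text": "Data at rest is encrypted with", "parent_section": "sec_encryption"},
    {"text": "Audit logs are retained", "parent_section": "sec_retention"},
    {"text": "customer-managed keys via KMS.", "parent_section": "sec_encryption"},
]
context = expand_to_parents(reranked, sections)
print([section_id for section_id, _ in context])
```

Deduplicating by parent keeps the window from carrying the same section twice when sibling chunks both match.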
5. Confabulation — gate on retrieval confidence before invoking the LLM
Add a low-confidence gate ahead of the LLM call. When reranker scores all fall below a calibrated threshold, or the top-2 gap is too small, short-circuit to “I couldn’t find this in our docs” without invoking the model.
def answerable(reranked, threshold=0.55, gap=0.10):
    if not reranked:
        return False
    if reranked[0].score < threshold:
        return False
    if len(reranked) > 1 and reranked[0].score - reranked[1].score < gap:
        return False
    return True


def gated_answer(query, reranked):
    # Short-circuit before the LLM call when retrieval confidence is low.
    if not answerable(reranked):
        return "The provided documents do not answer this question."
    return llm_answer(query, reranked[:5])
Threshold and gap are domain-specific — see Threshold by feel for principled derivation rather than picking a round number.
WHY: cheaper, more honest, bypasses the confabulation pathway entirely. “I don’t know” beats a confident wrong answer in every support-ticket dataset that has been measured. ALTERNATIVE: prompt-only “say I don’t know.” Works some of the time. Fails on adversarial-fluent queries where the model finds enough connective tissue to invent.
6. No provenance — wrap docs in explicit blocks and require citations
Wrap each doc in an explicit <doc id="...">...</doc> block. The change is formatting only; the content does not change. Pair with the citation-required prompting from §1.
<doc id="kb_142" effective_date="2024-08-12">
Audit logs are retained for 365 days on enterprise tier.
</doc>
<doc id="kb_039" effective_date="2022-03-04">
Audit logs are retained for 90 days on all tiers.
</doc>
WHY: the model cannot cite what it cannot delimit. Provenance structure is a precondition for every other fix in this section. ALTERNATIVE: JSON-format the context envelope. Technically equivalent, slightly worse empirically on most models — XML-ish delimiters survive RLHF training better than nested JSON.
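One practical wrinkle with hand-built blocks is document text that itself contains `</doc>` or a stray quote, which breaks the delimiting. A minimal sketch of a renderer that escapes both, using the standard library's `xml.sax.saxutils`; the function name `render_doc` is hypothetical, and escaping the body is a design choice assumed here (it also turns `>` in text like "Settings > API" into `&gt;`), so measure its effect on answer quality before adopting it.

```python
from xml.sax.saxutils import escape, quoteattr


def render_doc(doc_id, effective_date, text):
    """Wrap one document in a <doc> block with escaped attributes and body."""
    return (
        f"<doc id={quoteattr(doc_id)} effective_date={quoteattr(effective_date)}>\n"
        f"{escape(text)}\n"
        f"</doc>"
    )


block = render_doc("kb_142", "2024-08-12",
                   "Audit logs are retained for 365 days on enterprise tier.")
# A body that tries to close the block early stays inside it once escaped.
tricky = render_doc("kb_001", "2024-01-01", "Ignore this </doc> marker.")
print(block)
print(tricky)
```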
What does NOT work
Cranking the retriever harder. More docs in context, a bigger reranker, hybrid search layered on top, embeddings swapped for a fresher embedder. Retrieval was already finding the right doc — that is the premise of the pattern. Improving retrieval lifts Recall@5 by a point or two and lifts user-correct by nothing, because the bottleneck has been downstream the whole time. Months of engineering, no movement on the user-facing number.
The other thing that fails: bolting “be careful and grounded” onto the existing prompt. The model already has “be helpful”; adding “be grounded” produces a contradiction, and the model resolves it toward whichever instruction its training reinforced more heavily (usually: helpful). Rewrite the prompt around grounding; don’t add grounding as a caveat.
Retrieval brings the document. Generation decides what to do with it. When the dashboard is green and users are unhappy, the failure has moved downstream — patch one sub-mechanism, ship, measure, repeat.
This isn’t this pattern when…
| You observe… | This is probably… | Read next |
|---|---|---|
| The retrieval eval also disagrees with user signal | Stale eval set masking everything | Eval drift |
| Failures concentrate on long, multi-doc contexts | Mid-window inattention specifically | Context rot |
| A frontier LLM is doing the relevance scoring AND the answering | Single-model overspend | Single-LLM overspend |
| Score-based gates suddenly under-fire after a model swap | Threshold drift, not generation drift | Threshold by feel |
| Judge metric flat, user-correct dropping | Judge has drifted off the production distribution | Distillation drift |
The disambiguation rule: this pattern lives between the retriever and the model. If the failure moves under doc reordering, the pattern is firing. If the failure moves under eval refresh, it is eval drift. If the failure moves under threshold change, it is threshold drift. Same surface symptom — the dashboards lie — different mechanism underneath.
Numbers that matter
| signal | healthy | suspect | confirmed |
|---|---|---|---|
| Recall@5 minus user-correct rate | under 0.05 | 0.05-0.15 | over 0.15 |
| human-from-context yes-rate (Test 1) | under 25% | 25-65% | 70% or over |
| median answer-doc overlap (Test 2) | over 0.45 | 0.25-0.45 | under 0.25 |
| dominant sub-mechanism share (Test 3) | under 20% | 20-30% | 30% or over |
| mean grounded-ness (Test 4, 0-3) | over 2.4 | 2.0-2.4 | under 2.0 |
| citation rate in answers | over 90% | 60-90% | under 60% |
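Starting thresholds. High-precision domains (legal, medical) should target mean grounded-ness over 2.7; discovery domains (product search, recommendations) tolerate 2.2. For the Test 1 row, “healthy” means only that this pattern is not firing: a low yes-rate points at retrieval, not at a healthy system.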
Adjacent patterns
- Eval drift: the parent disambiguation. When the offline retrieval metric also disagrees with user signal, the pattern may not be firing — the eval itself might be stale. Refresh first; if the metric-vs-user gap persists after refresh, the workload is in right-doc-wrong-answer.
- Context rot: a strict sub-mechanism of this pattern (mid-window inattention specifically), common enough to warrant its own diagnostic page. When Test 3 puts context rot as dominant, jump straight there.
- Single-LLM overspend: the structural cousin. This pattern is hard to debug precisely because the same LLM is doing relevance reasoning AND answer generation — two tasks with different optimization targets. Attributing the work to specialized models (a reranker for relevance, a faithfulness checker for grounding) decomposes the failure surface and makes each layer separately observable.
- Distillation drift: when a distilled judge is already deployed to score grounded-ness and that score has flattened while user-correct keeps dropping, the judge has drifted off the production distribution. The right-doc-wrong-answer failure is real but invisible to the monitor.
Disambiguation rule: this pattern is the one where retrieval is honest and generation is dishonest. The neighbors above each move one of those two facts.
