Hallucination

Also known as: confabulation, ungrounded generation, factual error

TL;DR

Hallucination is when an LLM generates a confident-sounding statement that's factually wrong or unsupported by the input. It's the load-bearing failure mode of LLMs in production.

Hallucination is when an LLM generates content that’s plausible-sounding but factually wrong, fabricated, or unsupported by the input. The LLM doesn’t know it’s hallucinating: its tone is identical to when it’s correct, which is what makes the failure mode dangerous in production.

Why it happens

Pre-training optimizes the model to predict likely next tokens given everything seen so far. The objective rewards plausibility, not factual accuracy. Several specific failure modes:

  • Memory degradation. The model knows something approximately from training but can’t recall the exact detail, so it confabulates a plausible-looking version. Made-up citations, slightly wrong dates, and fabricated quotes all fall in this category.
  • Beyond-training extrapolation. Asked about something post-cutoff or outside the training distribution, the model still produces an answer because saying “I don’t know” wasn’t well-rewarded during training.
  • Context misreading. Even when the right information is in the prompt, the model can miss it (especially when it sits mid-context) and fall back to training memory.
  • Reasoning failures. Multi-step inference goes off the rails; the model commits to an early wrong claim and elaborates from there.

Concrete hallucination patterns to watch for

  • Citation fabrication: a paper, book, or court case that does not exist, with a plausible author and year
  • Quantitative drift: a real statistic remembered approximately (“about 30%”) when the source said “27.4%”
  • Person-conflation: facts from one author or executive attributed to another with a similar name
  • API hallucination: invoking a method or argument that doesn’t exist in the named library
  • Confident date errors: a specific year asserted for an event whose date the model is uncertain about
  • Mid-context blindness: ignoring evidence at position 4000 of an 8000-token context, falling back to training memory
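
Some of these patterns can be checked mechanically. As a minimal sketch for the API hallucination case, the helper below (the name `api_exists` is hypothetical, not from any library) tests whether a dotted name that a model mentioned actually resolves in the installed environment. It assumes the library is installed locally and that importing it is safe, since importing runs the module's top-level code.

```python
import importlib


def api_exists(dotted_name: str) -> bool:
    """Return True if `package.module.attr` resolves in the current environment.

    Hypothetical helper for illustration: it only proves that a name exists,
    not that the model used the right arguments or semantics.
    """
    parts = dotted_name.split(".")
    # Find the longest importable module prefix, then walk the remaining attributes.
    for i in range(len(parts), 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:i]))
        except (ImportError, ValueError):
            continue
        for attr in parts[i:]:
            if not hasattr(obj, attr):
                return False
            obj = getattr(obj, attr)
        return True
    return False


if __name__ == "__main__":
    for name in ["json.dumps", "json.dump_pretty", "os.path.join"]:
        print(name, api_exists(name))
```

A check like this catches invented methods but not invented keyword arguments; for those, inspecting the signature or running the generated code in a sandbox would be the next step.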

Why it’s so hard to fix at the model level

The bias toward plausibility is structural. Pre-training data overwhelmingly contains confident, declarative text; very little of it is “I don’t know” or “I’m uncertain about X”. The model learns the distribution, which is the problem. RLHF/DPO can push toward more cautious outputs but at the cost of helpfulness — over-aligned models refuse too often.
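
The plausibility bias can be read off the standard autoregressive training objective. The notation below is generic and not taken from this article: $\theta$ are the model parameters, $x_t$ is the token at position $t$, and $T$ is the sequence length.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```

The loss is minimized by matching the distribution of the training text. No term in it refers to whether a continuation is true, only to how likely it is given the preceding tokens.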

Calibration improvements (the model’s stated confidence matching its accuracy) help downstream systems decide when to trust the output. But calibration is hard to achieve via training alone; it usually requires explicit supervision signal that doesn’t exist at scale.
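
One common way to quantify calibration is expected calibration error (ECE): bucket predictions by stated confidence, then compare each bucket's average confidence with its actual accuracy. The sketch below is a plain-Python illustration under assumptions: confidences are numbers in [0, 1], correctness labels already exist (which is exactly the supervision that is hard to get at scale), and the function name and bin count are choices made here, not something this article prescribes.

```python
from typing import List, Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    bins: List[list] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # An overconfident model: claims 0.9 confidence but is right 1 time in 4.
    print(expected_calibration_error([0.9] * 4, [True, False, False, False]))
```

A value near zero means stated confidence can be used directly as a trust threshold downstream; a large value means the scores need recalibration before anything is gated on them.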

RAG changes what the model is hallucinating from. Without retrieval, the model draws on parametric memory (every fact in its training data plus every plausible-looking interpolation), which is a vast space, and most failures are confident recalls of slightly wrong facts. With retrieval, the bulk of the model’s evidence is grounded in the prompt, which sharply reduces the prior on confabulating from training memory.

But two failure modes survive: (1) the retrieved context doesn’t contain the answer, and the model falls back on parametric memory anyway; (2) the retrieved context contains the answer, but the model synthesizes a claim that goes beyond what the context says (over-generalization, aggregation drift, hedge inversion). The first is fixed by better retrieval; the second is fixed by faithfulness checking on top. Neither is a model-level intervention; both are architectural.

How production systems work around it

The dominant pattern in production AI is to never trust LLM-generated facts at all. Instead:

  1. Ground the model in retrieved evidence (RAG). Send the relevant documents in the prompt; instruct the model to answer only from them. Reduces hallucination dramatically but doesn’t eliminate it.
  2. Improve retrieval quality. Stronger first-pass retrieval plus a reranker means the retrieved context is more likely to contain the actual answer, so the model has less reason to fall back on memory.
  3. Verify generated claims. A small fine-tuned faithfulness model checks whether each output sentence is entailed by the retrieved context. Failed claims trigger regeneration or filtering.
  4. Cite and surface sources. Even when hallucination slips through, showing users the source documents lets them sanity-check.
  5. Use reranker scores as a confidence signal. A reranker that says “0.3” on every retrieved doc is a signal to fall back to a clarifying-question prompt instead of generating a confident-sounding answer over weak evidence.
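
The last step can be sketched as a simple gate in front of generation. Everything named here is an assumption for illustration: `ScoredDoc`, `choose_action`, the 0.5 threshold, and the premise that the reranker returns scores in [0, 1] that are calibrated well enough to threshold on.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class ScoredDoc:
    text: str
    score: float  # hypothetical reranker relevance score in [0, 1]


def choose_action(
    docs: Sequence[ScoredDoc],
    min_score: float = 0.5,  # assumed threshold; tune on your own data
    top_k: int = 3,
) -> Tuple[str, List[ScoredDoc]]:
    """Return ("answer", supporting_docs) or ("clarify", []) if evidence is weak."""
    strong = sorted(
        (d for d in docs if d.score >= min_score),
        key=lambda d: d.score,
        reverse=True,
    )
    if not strong:
        # No document clears the bar: ask a clarifying question instead of answering.
        return "clarify", []
    return "answer", strong[:top_k]


if __name__ == "__main__":
    weak = [ScoredDoc("doc a", 0.3), ScoredDoc("doc b", 0.3)]
    print(choose_action(weak))
```

Only the documents that clear the threshold are passed on to the prompt, so weak passages never become material for a confident-sounding answer.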

Hallucination is the central reliability problem in production LLM applications, not an edge case. The whole architecture of modern RAG (retrieve, rerank, ground, verify) is defense in depth against this single failure mode: retrieval sharpens the prior, the reranker surfaces the supporting passage, and faithfulness checks verify the output. Each layer assumes the others won’t be perfect.

Go further

Does RAG eliminate hallucination?

It reduces it dramatically but doesn't eliminate it. The model can still synthesize claims that go beyond what the retrieved context actually says, especially when retrieval is incomplete. RAG turns 'hallucinating from training memory' into 'hallucinating from a smaller, grounded prompt' — better, but the failure mode persists. Citation extraction and answer-grounding checks are the next layer of defense.

Why are LLMs so confidently wrong?

Pre-training rewards plausibility, not truth. The model is trained to produce text that looks like its training distribution; saying 'I don't know' is statistically rare in the training data, so the model is biased toward generating something that sounds right. Calibration tuning helps, but the underlying bias survives.

How do production systems detect hallucination at runtime?

The robust pattern is to verify each claim in the LLM's output against the retrieved context — typically with a small fine-tuned model that does sentence-level entailment ('does this claim follow from the retrieved doc?'). Outputs that fail entailment get flagged or regenerated. Specialized faithfulness models are increasingly the right tool for this; a frontier LLM doing the same job is slow and expensive.
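
The control flow of such a check can be sketched as follows. The entailment step here is a deliberately crude token-overlap stand-in, not a faithfulness model: in a real system the `entails` argument would wrap a fine-tuned entailment model. All function names, the sentence splitter, and the 0.8 threshold are assumptions made for this illustration.

```python
import re
from typing import Callable, List, Tuple

_TOKEN = re.compile(r"\d+(?:\.\d+)?%?|[a-z]+")


def split_sentences(text: str) -> List[str]:
    # Naive splitter: breaks on whitespace that follows ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def token_overlap_entails(context: str, claim: str, threshold: float = 0.8) -> bool:
    """Toy stand-in for entailment: share of claim tokens that appear in the context."""
    claim_tokens = set(_TOKEN.findall(claim.lower()))
    if not claim_tokens:
        return False  # nothing checkable is treated as unsupported
    context_tokens = set(_TOKEN.findall(context.lower()))
    return len(claim_tokens & context_tokens) / len(claim_tokens) >= threshold


def verify_answer(
    answer: str,
    context: str,
    entails: Callable[[str, str], bool] = token_overlap_entails,
) -> List[Tuple[str, bool]]:
    """Label each sentence of the answer as supported or not by the context."""
    return [(s, entails(context, s)) for s in split_sentences(answer)]


if __name__ == "__main__":
    context = "The report says 27.4% of users churned in 2023."
    answer = "27.4% of users churned in 2023. Roughly 30% of users churned."
    for sentence, supported in verify_answer(answer, context):
        print(supported, sentence)
```

Sentences labeled unsupported would be dropped, flagged, or sent back for regeneration. Token overlap misses paraphrase and negation entirely, which is why the check is pluggable.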
