RAG (Retrieval-Augmented Generation)

Also known as: retrieval-augmented generation, retrieval augmented generation

TL;DR

RAG is the pattern of retrieving relevant documents and feeding them into an LLM as context, so the LLM can answer with grounded, citeable information instead of guessing from its training data.

RAG (Retrieval-Augmented Generation) describes any system where an LLM’s response is conditioned on documents fetched at query time. Rather than rely solely on what the LLM memorized during training (which is stale, lossy, and uncitable), RAG augments the model’s context with fresh, source-attributable text retrieved from a corpus you control.

[Figure: the retrieval-augmented generation pipeline, Query → Retrieve → Assemble → Generate → Cite. A user query ("what is RAG?") is run against a vector store (index of N ≈ 10⁶ chunks); the top-K scored chunks (d₄₂ 0.94, d₁₇ 0.88, d₈₁ 0.81, d₂₃ 0.76) are assembled into a prompt, and the LLM generates a grounded answer that cites them.]

The canonical pipeline:

  1. User query arrives. Raw user input enters the system, optionally with conversation history.
  2. Retrieval. Typically hybrid or dense search over a vector index returns N candidate documents.
  3. Reranking (optional but recommended). A reranker reorders the candidate set so that the actually relevant docs are at the top.
  4. Context assembly. The top-K reranked documents are formatted into the LLM’s prompt, often with citations.
  5. Generation. The LLM produces an answer grounded in those documents.
  6. Cite-and-verify (optional). The response is checked against the retrieved context.
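The six steps above can be sketched end to end in a few lines. This is a toy illustration, not a production design: the corpus, the token-overlap retriever, the Jaccard "reranker", and the `fake_llm` stub are all stand-ins invented here for a real vector index, a real reranking model, and a real LLM call.

```python
import re
from collections import Counter

# Toy corpus; ids and texts are made up for illustration.
CORPUS = {
    "d1": "RAG retrieves relevant documents and feeds them to an LLM as context.",
    "d2": "A reranker reorders candidate documents by relevance to the query.",
    "d3": "Bananas are rich in potassium.",
}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(query, corpus, n=3):
    """Step 2: return the top-N doc ids by raw token overlap (stand-in for vector search)."""
    q = Counter(tokenize(query))
    scores = {}
    for doc_id, text in corpus.items():
        d = Counter(tokenize(text))
        scores[doc_id] = sum(min(q[t], d[t]) for t in q)
    return sorted(scores, key=lambda i: (-scores[i], i))[:n]

def rerank(query, doc_ids, corpus, k=2):
    """Step 3: reorder candidates by Jaccard similarity (stand-in for a reranking model)."""
    q = set(tokenize(query))
    def score(doc_id):
        d = set(tokenize(corpus[doc_id]))
        return len(q & d) / len(q | d) if q | d else 0.0
    return sorted(doc_ids, key=lambda i: (-score(i), i))[:k]

def assemble(query, doc_ids, corpus):
    """Step 4: format the top-K docs into a prompt, each tagged with its id for citation."""
    context = "\n".join(f"[{i}] {corpus[i]}" for i in doc_ids)
    return (
        "Answer using only the context and cite ids in brackets.\n\n"
        f"CONTEXT:\n{context}\n\nQUERY: {query}"
    )

def fake_llm(prompt):
    """Step 5 placeholder: a real system would call an LLM here."""
    first_id = re.search(r"\[(\w+)\]", prompt).group(1)
    return f"RAG feeds retrieved documents to an LLM as context [{first_id}]."

def verify(answer, doc_ids):
    """Step 6: every citation in the answer must point at a doc that was in the context."""
    cited = set(re.findall(r"\[(\w+)\]", answer))
    return bool(cited) and cited <= set(doc_ids)

def answer_query(query, corpus, llm=fake_llm):
    candidates = retrieve(query, corpus, n=3)        # step 2
    top_k = rerank(query, candidates, corpus, k=2)   # step 3
    prompt = assemble(query, top_k, corpus)          # step 4
    answer = llm(prompt)                             # step 5
    return answer, top_k, verify(answer, top_k)      # step 6

answer, context_ids, verified = answer_query("what is RAG?", CORPUS)
print(answer, context_ids, verified)
```

The `verify` check here only confirms that cited ids exist in the context; checking that the cited text actually supports each claim is a harder problem and is not attempted in this sketch.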

Why every step matters

The famous failure mode of RAG is hallucination, but hallucination is downstream of retrieval failure. If step 2 returns the wrong documents, the LLM will dutifully synthesize an answer from the wrong context — and sound confident. That’s why retrieval quality (Recall@K) and reranking quality (NDCG@10) matter so much for end-to-end RAG quality, even though the user never sees those metrics directly.
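For readers who have not computed these metrics before, here is a minimal sketch of Recall@K and NDCG@K. It assumes linear gain in the DCG sum (some evaluations use `2**rel - 1` instead), and the document ids and relevance labels in the usage lines are invented for illustration.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant docs that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def dcg_at_k(gains, k):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, relevance, k=10):
    """DCG of the actual ranking divided by DCG of the ideal ranking."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

# Hypothetical ranking: "a" and "c" are relevant, "b" is not.
ranked = ["a", "b", "c"]
relevance = {"a": 1, "c": 1}
print(recall_at_k(ranked, relevance.keys(), k=2))  # only "a" is in the top 2
print(ndcg_at_k(ranked, relevance, k=10))          # below 1.0 because "c" sits under "b"
```

Recall@K asks whether the relevant documents made it into the candidate set at all; NDCG@K additionally rewards putting them near the top, which is why it is the usual metric for the reranking stage.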

A common production rule of thumb: a 1-point NDCG@10 improvement on the reranker maps to a measurable lift in user-perceived answer quality. The signal compounds.

Frame the system as a pipeline. End-to-end answer quality is bounded by the product of two terms: retrieval quality (did the right context reach the prompt?) and generation quality (did the model answer faithfully from that context?). Modern frontier LLMs handle the second term remarkably well: given clean, grounded context, GPT-class generators produce faithful, citeable answers most of the time. The first term, however, varies widely: a baseline embedder might hit 0.75 Recall@100, while a SoTA embedder + reranker stack might hit 0.95. That 20-point swing in retrieval recall translates almost 1:1 into end-to-end accuracy, because no matter how strong the generator is, it can’t synthesize an answer from a context that doesn’t contain it. The empirical rule of thumb in production: a 1-point NDCG@10 improvement on the reranker is usually worth more in user-perceived quality than a generation-model upgrade, and at a fraction of the latency and cost.
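A back-of-the-envelope version of that argument, under the simplifying assumption that the bound is a plain product of retrieval recall and generator faithfulness. The 0.75 and 0.95 recall figures come from the paragraph above; the 0.95 faithfulness value is an illustrative guess, not a measured number.

```python
def end_to_end_upper_bound(retrieval_recall, generator_faithfulness):
    """Upper bound on answer accuracy if the generator can only succeed
    when the needed context was retrieved (assumed independence)."""
    return retrieval_recall * generator_faithfulness

GENERATOR_FAITHFULNESS = 0.95  # assumed for illustration

baseline = end_to_end_upper_bound(0.75, GENERATOR_FAITHFULNESS)
improved = end_to_end_upper_bound(0.95, GENERATOR_FAITHFULNESS)

print(f"baseline bound: {baseline:.4f}")
print(f"improved bound: {improved:.4f}")
print(f"gain from retrieval alone: {improved - baseline:.4f}")
```

Under these assumptions the 20-point recall improvement moves the bound by 19 points, which is the "almost 1:1" relationship described above; a perfect generator (faithfulness 1.0) would make it exactly 20.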

What “augmented” actually buys you

  • Freshness: the LLM can answer questions about events that happened after its training cutoff, as long as your retrieval index is up to date.
  • Domain authority: answers are grounded in your private docs, your customers’ data, your knowledge base.
  • Citations: you can show the user which documents the answer came from. This is critical for legal, medical, and financial use cases.
  • Editability: to change what the LLM “knows”, you update your index, not your model weights.
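To make the freshness and editability points concrete, here is a minimal sketch of an index whose contents can be changed at any time. `TinyIndex` is a hypothetical in-memory class written for this example and uses plain keyword overlap; a real deployment would use a vector database with its own upsert and delete operations.

```python
import re

def _tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class TinyIndex:
    """Hypothetical in-memory index: editing it changes what the LLM can be shown."""

    def __init__(self):
        self.docs = {}

    def upsert(self, doc_id, text):
        # Insert a new document or overwrite an existing one.
        self.docs[doc_id] = text

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)

    def search(self, query, k=3):
        # Rank documents by how many query tokens they share; drop non-matches.
        q = _tokens(query)
        scored = [(len(q & _tokens(text)), doc_id) for doc_id, text in self.docs.items()]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [doc_id for _, doc_id in scored[:k]]

index = TinyIndex()
index.upsert("pricing", "The Pro plan costs 20 dollars per month.")
# The price changes: update the index, not the model weights.
index.upsert("pricing", "The Pro plan costs 25 dollars per month.")
top = index.search("pro plan price")
print(index.docs[top[0]])
```

The next query that retrieves `pricing` sees the new text immediately, with no retraining step involved.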

Where RAG breaks

  • Bad retrieval: the relevant doc isn’t in the candidate set, and nothing the LLM does can recover it.
  • Context overflow: too many retrieved tokens, so the LLM ignores the middle of the prompt (reranking helps here).
  • Aggregation queries: “how many products did we ship last quarter?” doesn’t fit retrieval; it needs structured data, not RAG.
  • Outdated index: the index drifts away from the source data. Re-embed or re-index regularly.
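One inexpensive way to catch the index drift mentioned in the last bullet is to store a content hash next to each indexed document and compare it against the source of truth. The sketch below assumes you can enumerate both sides as dictionaries; the function and variable names are illustrative, not part of any particular library.

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_stale(source_docs, indexed_hashes):
    """Compare source documents against the hashes stored at indexing time.

    Returns (to_reembed, to_delete):
      to_reembed: ids that are new or whose text changed since indexing
      to_delete:  ids still in the index but gone from the source
    """
    to_reembed = sorted(
        doc_id for doc_id, text in source_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in source_docs)
    return to_reembed, to_delete

# Hypothetical state: "a" was edited, "c" is new, "z" was removed at the source.
source = {"a": "version 2", "b": "unchanged", "c": "brand new"}
indexed = {
    "a": content_hash("version 1"),
    "b": content_hash("unchanged"),
    "z": content_hash("deleted upstream"),
}
print(find_stale(source, indexed))
```

Running a check like this on a schedule limits re-embedding to the documents that actually changed instead of rebuilding the whole index.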

Hallucination in RAG is almost always a retrieval bug. Fix the retrieval — the generator is rarely the bottleneck.

Go further

How do I keep the LLM from ignoring the retrieved context?

Two levers: shrink the prompt with a cross-encoder reranker so only top-K passages survive, then compress further if you're still pushing the model's effective context window. Long, redundant context is the main driver of [context rot](/concepts/context-rot/) failures.
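A minimal sketch of those two levers, assuming each passage already carries a relevance score from a reranker (the scores below are invented). Whitespace word counts stand in for real model tokens, and exact-duplicate removal stands in for more serious compression.

```python
def select_context(scored_passages, k=4, max_tokens=50):
    """Keep the k highest-scoring passages, skip duplicates, and stop at the
    first passage that would push the context over the token budget.

    scored_passages: list of (score, text) pairs, e.g. from a reranker.
    """
    kept, seen, used = [], set(), 0
    top_k = sorted(scored_passages, key=lambda pair: -pair[0])[:k]
    for _, text in top_k:
        normalized = " ".join(text.lower().split())
        if normalized in seen:
            continue  # redundant passage: adds tokens but no information
        cost = len(text.split())  # crude proxy for the model's tokenizer
        if used + cost > max_tokens:
            break
        seen.add(normalized)
        kept.append(text)
        used += cost
    return kept

passages = [
    (0.9, "alpha beta gamma"),
    (0.8, "Alpha  beta gamma"),   # duplicate after normalization
    (0.7, "delta epsilon"),
    (0.1, "zeta eta theta iota"),
]
print(select_context(passages, k=3, max_tokens=5))
```

In practice the budget would be measured with the target model's own tokenizer, and the cut-off `k` tuned against answer quality rather than fixed up front.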

When is RAG the wrong tool?

RAG is a poor fit for aggregation queries ('how many', 'what's the total'), tasks that require executing code or calling an API, and questions whose answer is implicit across thousands of documents. For those you want structured queries, tool use, or summarization pipelines, not similarity search.
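One way to act on this is to route queries before they ever reach retrieval. The sketch below is a deliberately naive keyword heuristic with made-up patterns and route names; a production router would more likely use a trained classifier or an LLM call, and it would also need a path for the corpus-wide summarization case, which keywords alone cannot detect.

```python
import re

# Illustrative patterns only; real traffic needs a far more robust classifier.
AGGREGATION = re.compile(
    r"\b(how many|how much|total|average|sum of|count of)\b", re.IGNORECASE
)
TOOL_USE = re.compile(r"\b(run|execute|call the api|send|schedule)\b", re.IGNORECASE)

def route(query):
    """Return the name of the pipeline that should handle the query."""
    if AGGREGATION.search(query):
        return "structured_query"   # e.g. SQL over a warehouse
    if TOOL_USE.search(query):
        return "tool_use"           # e.g. function calling
    return "rag"                    # default: similarity search + generation

for q in [
    "How many products did we ship last quarter?",
    "Run the nightly export script",
    "what is RAG?",
]:
    print(q, "->", route(q))
```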
