RAG (Retrieval-Augmented Generation)

Also known as: retrieval-augmented generation, retrieval augmented generation

TL;DR

RAG is the pattern of retrieving relevant documents and feeding them into an LLM as context, so the LLM can answer with grounded, citeable information instead of guessing from its training data.

RAG (Retrieval-Augmented Generation) describes any system where an LLM’s response is conditioned on documents fetched at query time. Rather than relying solely on what the LLM memorized during training (which is stale, lossy, and uncitable), RAG augments the model’s context with fresh, source-attributable text retrieved from a corpus you control.

The canonical pipeline:

1. User query arrives

Raw user input enters the system, optionally with conversation history.

2. Retrieval

Typically, hybrid search or dense-embedding search over a vector index returns N candidate documents.

3. Reranking (optional but recommended)

A cross-encoder reorders the candidate set so the documents that are actually relevant sit at the top.

4. Context assembly

The top-K reranked documents are formatted into the LLM’s prompt, often with citations.

5. Generation

The LLM produces an answer grounded in those documents.

6. Cite-and-verify (optional)

The response is checked against the retrieved context.
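The pipeline steps above can be sketched end to end in plain Python. This is a toy illustration under loud assumptions: the corpus, the term-overlap retriever, the cosine "reranker", and the `generate` placeholder are all hypothetical stand-ins for a real index, a real cross-encoder, and a real LLM call.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus; ids and texts are made up for illustration.
CORPUS = {
    "doc-1": "Rerankers reorder retrieved candidates with a cross-encoder.",
    "doc-2": "Hybrid search combines keyword matching with dense embeddings.",
    "doc-3": "Our office cafeteria serves lunch from noon until two.",
}


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, corpus, n=2):
    """First-pass retrieval: rank by term overlap (stand-in for hybrid/dense search)."""
    q = set(tokenize(query))
    scored = [(len(q & set(tokenize(text))), doc_id) for doc_id, text in corpus.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:n] if score > 0]


def rerank(query, doc_ids, corpus):
    """Stand-in for a cross-encoder: cosine similarity of term-count vectors."""
    q = Counter(tokenize(query))
    q_norm = math.sqrt(sum(v * v for v in q.values()))

    def cosine(doc_id):
        d = Counter(tokenize(corpus[doc_id]))
        d_norm = math.sqrt(sum(v * v for v in d.values()))
        dot = sum(q[t] * d[t] for t in q)
        return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0

    return sorted(doc_ids, key=cosine, reverse=True)


def assemble_context(query, doc_ids, corpus, k=2):
    """Format the top-k documents into a prompt with numbered citations."""
    sources = [
        f"[{i}] ({doc_id}) {corpus[doc_id]}"
        for i, doc_id in enumerate(doc_ids[:k], start=1)
    ]
    header = "Answer using only the sources below and cite them by number."
    return header + "\n\n" + "\n".join(sources) + f"\n\nQuestion: {query}"


def generate(prompt):
    """Placeholder for an LLM call; no real model is invoked here."""
    return f"(LLM answer would go here; prompt was {len(prompt)} characters)"


def answer(query):
    """Run retrieval, reranking, context assembly, and generation in order."""
    candidates = retrieve(query, CORPUS, n=2)
    ranked = rerank(query, candidates, CORPUS)
    prompt = assemble_context(query, ranked, CORPUS, k=2)
    return ranked, prompt, generate(prompt)


ranked, prompt, reply = answer("What does a reranker do with retrieved candidates?")
print(prompt)
```

The cite-and-verify step is left out of the sketch; in practice it would compare claims in the reply against the numbered sources in the prompt.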

Why every step matters

The famous failure mode of RAG is hallucination, but hallucination is downstream of retrieval failure. If step 2 returns the wrong documents, the LLM will dutifully synthesize an answer from the wrong context — and sound confident. That’s why retrieval quality (Recall@K) and reranking quality (NDCG@10) matter so much for end-to-end RAG quality, even though the user never sees those metrics directly.
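For readers who want to measure the two metrics named above, here is a minimal sketch of Recall@K and NDCG@K. It assumes graded relevance labels stored in a dict and uses the linear-gain form of DCG; some evaluation suites use an exponential gain instead, so scores may not match other tools exactly. The function names are illustrative, not from any particular library.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def dcg_at_k(gains, k):
    """Discounted cumulative gain with a log2 position discount (linear gain)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, relevance, k):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical labels: documents "b" and "d" are the relevant ones.
relevance = {"b": 1, "d": 1}
ranking = ["a", "b", "c", "d"]

print(recall_at_k(ranking, relevance.keys(), k=2))
print(ndcg_at_k(ranking, relevance, k=10))
```

Recall@K only asks whether the relevant documents made it into the candidate set, while NDCG@K also rewards placing them near the top, which is why the first is the usual retrieval metric and the second the usual reranking metric.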

A common production rule of thumb: a 1-point NDCG@10 improvement on the reranker maps to a measurable lift in user-perceived answer quality. The signal compounds.

Frame the system as a two-stage pipeline. End-to-end answer quality is bounded by the product of two terms: P(relevant context is retrieved) × P(answer is faithful | relevant context is retrieved). Modern frontier LLMs handle the second term remarkably well: given clean, grounded context, GPT-class generators produce faithful, citeable answers most of the time.

The first term, however, varies widely: a baseline embedder might hit 0.75 Recall@100, while a SoTA embedder + reranker stack might hit 0.95. That 20-point swing in retrieval recall translates almost 1:1 into end-to-end accuracy, because no matter how strong the generator is, it can’t synthesize an answer from a context that doesn’t contain it. The empirical rule of thumb in production: a 1-point NDCG@10 improvement on the reranker is usually worth more in user-perceived quality than a generation-model upgrade, at a fraction of the latency and cost.
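The recall figures above can be plugged into a small worked example. It assumes a simple product model (retrieval recall times generator faithfulness) and a hypothetical generator faithfulness of 0.90; that number is an illustration, not a measurement from this article.

```python
def end_to_end_bound(retrieval_recall, generator_faithfulness):
    """Upper bound on answer accuracy under a two-stage product model (an assumption)."""
    return retrieval_recall * generator_faithfulness


GENERATOR_FAITHFULNESS = 0.90  # hypothetical value chosen for illustration

baseline = end_to_end_bound(0.75, GENERATOR_FAITHFULNESS)  # baseline embedder
improved = end_to_end_bound(0.95, GENERATOR_FAITHFULNESS)  # embedder + reranker stack

print(f"baseline bound: {baseline:.3f}")
print(f"improved bound: {improved:.3f}")
print(f"lift from retrieval alone: {improved - baseline:.3f}")
```

Under these assumptions the 20-point recall swing yields roughly an 18-point lift in the end-to-end bound, whereas raising the generator from 0.90 to a perfect 1.00 on top of 0.75 recall could add at most 7.5 points.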

What RAG actually buys you

Freshness — the LLM can answer about events that happened after its training cutoff, as long as your retrieval index is up to date.
Domain authority — answers ground in your private docs, your customer’s data, your knowledge base.
Citations — you can show the user which documents the answer came from. Critical for legal, medical, financial use cases.
Editability — to change what the LLM “knows”, you update your index, not your model weights.

Where RAG breaks

Bad retrieval — relevant doc isn’t in the candidate set; nothing the LLM does can recover it.
Context overflow — too many retrieved tokens; the LLM ignores the middle (context compression helps here).
Aggregation queries — “how many products did we ship last quarter?” doesn’t fit retrieval; needs structured data, not RAG.
Outdated index — index drift. Re-embed or re-index regularly.
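For the outdated-index failure in particular, one common way to avoid re-embedding everything is to compare content hashes. The sketch below assumes you stored a hash per document at index time; `find_stale` and the sample data are hypothetical, and a real system would also need to handle chunk-level changes.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's current text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(current_docs, indexed_hashes):
    """Compare live documents against hashes recorded at index time.

    Returns (to_reindex, to_delete): ids that are new or changed, and ids
    that are still in the index but no longer exist in the source corpus.
    """
    to_reindex = sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in current_docs)
    return to_reindex, to_delete


# Hypothetical state: "pricing" changed, "faq" is new, "old-policy" was removed.
indexed = {
    "pricing": content_hash("Plan A costs 10 dollars."),
    "terms": content_hash("Terms of service v1."),
    "old-policy": content_hash("Deprecated policy."),
}
current = {
    "pricing": "Plan A costs 12 dollars.",
    "terms": "Terms of service v1.",
    "faq": "How do I reset my password?",
}

print(find_stale(current, indexed))
```

Running a check like this on a schedule keeps the re-embedding cost proportional to what actually changed.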

Hallucination in RAG is almost always a retrieval bug. Fix the retrieval — the generator is rarely the bottleneck.

Go further

Where in the pipeline should I invest first?

Almost always retrieval quality, not the LLM. A 1-point NDCG@10 lift on the reranker compounds into measurable answer-quality lift; a bigger generator on top of bad context just produces confidently-wrong answers faster.

Related: Reranker, NDCG@K, Evaluating a reranker on your own data (playbook)

How do I keep the LLM from ignoring the retrieved context?

Two levers: shrink the prompt with a cross-encoder reranker so only top-K passages survive, then compress further if you're still pushing the model's effective context window. Long, redundant context is the main driver of [context rot](/concepts/context-rot/) failures.
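The first lever can be sketched as a small packing function that runs after reranking. It assumes the passages arrive already sorted by reranker score and uses whitespace word count as a crude token proxy; a real system would use the generator's own tokenizer. The function name and parameters are illustrative.

```python
def pack_context(ranked_passages, max_tokens, k=5):
    """Keep up to k passages in rank order, skipping duplicates, within a token budget."""
    kept, seen, used = [], set(), 0
    for passage in ranked_passages:
        if len(kept) == k:
            break
        # Normalize case and whitespace so trivially repeated passages are dropped.
        key = " ".join(passage.lower().split())
        if key in seen:
            continue
        cost = len(passage.split())  # crude token estimate (assumption)
        if used + cost > max_tokens:
            continue  # too long for the remaining budget; a shorter one may still fit
        seen.add(key)
        kept.append(passage)
        used += cost
    return kept


passages = [
    "alpha beta gamma",
    "Alpha  beta gamma",        # duplicate after normalization
    "delta epsilon zeta eta",   # does not fit in the remaining budget
    "theta",
]
print(pack_context(passages, max_tokens=5, k=3))
```

Dropping redundant passages before the prompt is built is cheaper than compressing them afterwards, so it is usually worth doing first.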

Related: Context compression, Chunking, Reranker

When is RAG the wrong tool?

Aggregation queries (“how many”, “what’s the total”), tasks that require executing code or hitting an API, and questions whose answer is implicit across thousands of documents. For those you want structured queries, tool use, or summarization pipelines, not similarity search.
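One way to act on this is to route queries before they reach the retriever. The sketch below is a deliberately naive keyword heuristic, assuming English queries; the pattern list and the route names are hypothetical, and production routers often use a classifier or the LLM itself.

```python
import re

# Hypothetical phrases that hint at aggregation rather than lookup.
AGGREGATION_PATTERN = re.compile(
    r"\b(how many|how much|what(?:'s| is) the total|average|sum of|count of)\b",
    re.IGNORECASE,
)


def route(query):
    """Send aggregation-style questions to a structured backend, everything else to RAG."""
    return "structured_query" if AGGREGATION_PATTERN.search(query) else "rag"


print(route("How many products did we ship last quarter?"))
print(route("What does our refund policy say about digital goods?"))
```

A heuristic like this will misroute some queries, so it is best treated as a cheap first filter with a fallback to RAG.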

Related: Query rewriting, First-pass retrieval, Hybrid search


