Also known as: retrieval-augmented generation, retrieval augmented generation
TL;DR
RAG is the pattern of retrieving relevant documents and feeding them into an LLM as context, so the LLM can answer with grounded, citeable information instead of guessing from its training data.
RAG (Retrieval-Augmented Generation) describes any system where an LLM’s response is conditioned on documents fetched at query time. Rather than rely solely on what the LLM memorized during training (which is stale, lossy, and uncitable), RAG augments the model’s context with fresh, source-attributable text retrieved from a corpus you control.
The canonical pipeline:
User query arrives
Raw user input enters the system, optionally with conversation history.
Retrieval
The system fetches a candidate set of documents from the corpus for the query.
Rerank
A cross-encoder reorders the candidate set so the most relevant documents are at the top.
Context assembly
The top-K reranked documents are formatted into the LLM’s prompt, often with citations.
Generation
The LLM produces an answer grounded in those documents.
Cite-and-verify (optional)
The response is checked against the retrieved context.
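The stages above can be sketched end to end in a few dozen lines. This is a toy illustration, not a production design: word overlap stands in for an embedding retriever, Jaccard similarity stands in for a cross-encoder, and `fake_llm` is a hypothetical placeholder for a real model call. All names here (`Doc`, `retrieve`, `rerank`, `assemble_prompt`, `citations_valid`, `answer_query`) are assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str


def words(text: str) -> set[str]:
    # Crude tokenizer: lowercase alphanumeric runs.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: list[Doc], k: int) -> list[Doc]:
    # First-pass retrieval: rank by number of words shared with the query.
    q = words(query)
    scored = [(len(q & words(d.text)), d) for d in corpus]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in hits[:k]]


def rerank(query: str, candidates: list[Doc]) -> list[Doc]:
    # Stand-in for a cross-encoder: Jaccard similarity of query and document.
    q = words(query)

    def score(d: Doc) -> float:
        t = words(d.text)
        return len(q & t) / len(q | t) if (q | t) else 0.0

    return sorted(candidates, key=score, reverse=True)


def assemble_prompt(query: str, docs: list[Doc]) -> str:
    # Context assembly: number each source so the answer can cite it.
    sources = "\n".join(
        f"[{i}] ({d.doc_id}) {d.text}" for i, d in enumerate(docs, start=1)
    )
    return (
        "Answer using only the numbered sources and cite them like [1].\n"
        f"{sources}\n\nQuestion: {query}"
    )


def citations_valid(answer: str, n_sources: int) -> bool:
    # Minimal cite-and-verify: every cited number must refer to a real source.
    # This checks citation numbers only, not whether the claim is supported.
    cited = [int(n) for n in re.findall(r"\[(\d+)\]", answer)]
    return bool(cited) and all(1 <= n <= n_sources for n in cited)


def answer_query(
    query: str,
    corpus: list[Doc],
    llm: Callable[[str], str],
    candidates_k: int = 10,
    top_k: int = 2,
) -> tuple[str, list[Doc], bool]:
    candidates = retrieve(query, corpus, candidates_k)
    top_docs = rerank(query, candidates)[:top_k]
    answer = llm(assemble_prompt(query, top_docs))
    return answer, top_docs, citations_valid(answer, len(top_docs))


def fake_llm(prompt: str) -> str:
    # Hypothetical generator: repeats the first source and cites it.
    first = next(line for line in prompt.splitlines() if line.startswith("[1]"))
    return first.split(") ", 1)[1] + " [1]"


corpus = [
    Doc("refunds", "Refunds are issued within 14 days of a return request."),
    Doc("shipping", "Standard shipping takes 3 to 5 business days."),
    Doc("returns", "A return request must be filed within 30 days of delivery."),
]
answer, sources, verified = answer_query(
    "How many days do refunds take?", corpus, fake_llm
)
```

Passing the generator in as a callable keeps the retrieval stages testable without a model, which matters because those stages are where most quality problems start.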
Why every step matters
The famous failure mode of RAG is hallucination, but hallucination is downstream of retrieval failure. If step 2 returns the wrong documents, the LLM will dutifully synthesize an answer from the wrong context — and sound confident. That’s why retrieval quality (Recall@K) and reranking quality (NDCG@10) matter so much for end-to-end RAG quality, even though the user never sees those metrics directly.
A common production rule of thumb: a 1-point NDCG@10 improvement on the reranker maps to a measurable lift in user-perceived answer quality. The signal compounds.
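For readers who have not computed these metrics by hand, here is a minimal sketch of Recall@K and NDCG@K. It assumes the linear-gain form of DCG (relevance divided by log2(rank + 1)); some toolkits use an exponential gain instead, and the two agree when relevance labels are binary. The document IDs and labels are made up for the example.

```python
import math


def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    # Fraction of all relevant documents that appear in the top k results.
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def dcg_at_k(gains: list[float], k: int) -> float:
    # Discounted cumulative gain: later ranks contribute less.
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_ids: list[str], relevance: dict[str, float], k: int) -> float:
    # DCG of the actual ranking divided by the DCG of the ideal ranking.
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical ranking: two of the three relevant docs are found, at ranks 2 and 4.
ranked = ["d3", "d1", "d7", "d2"]
relevance = {"d1": 1.0, "d2": 1.0, "d9": 1.0}

recall_3 = recall_at_k(ranked, set(relevance), k=3)
recall_4 = recall_at_k(ranked, set(relevance), k=4)
ndcg_10 = ndcg_at_k(ranked, relevance, k=10)
```

Recall@K only asks whether the relevant documents made it into the candidate set, while NDCG also rewards putting them near the top, which is why the first is usually reported for the retriever and the second for the reranker.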
Frame the system as a pipeline. End-to-end answer quality is bounded by $\text{retrieval quality} \times \text{generation quality}$: the chance that the right documents reach the context, times the chance that the generator answers faithfully given them. Modern frontier LLMs handle the second term remarkably well: given clean grounded context, GPT-class generators produce faithful, citeable answers most of the time. The first term, however, varies widely: a baseline embedder might hit 0.75 Recall@100; a SoTA embedder + reranker stack might hit 0.95. That 20-point swing in retrieval recall translates almost 1:1 into end-to-end accuracy, because no matter how strong the generator, it can’t synthesize an answer from a context that doesn’t contain it. The empirical rule of thumb in production: a 1-point NDCG@10 improvement on the reranker is usually worth more user-perceived quality than a generation-model upgrade, and at a fraction of the latency and cost.
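A quick worked version of that arithmetic, assuming the two-term bound is a simple product of retrieval recall and generator faithfulness. The recall figures come from the paragraph above; the 0.95 faithfulness value is a hypothetical number chosen only to make the calculation concrete.

```python
def end_to_end_bound(retrieval_recall: float, generator_faithfulness: float) -> float:
    # Upper bound on answer accuracy under the assumed product model:
    # the answer can only be right if the evidence was retrieved AND used faithfully.
    return retrieval_recall * generator_faithfulness


GENERATOR_FAITHFULNESS = 0.95  # hypothetical value, not taken from the text

baseline_bound = end_to_end_bound(0.75, GENERATOR_FAITHFULNESS)  # baseline embedder
stack_bound = end_to_end_bound(0.95, GENERATOR_FAITHFULNESS)     # embedder + reranker
lift = stack_bound - baseline_bound

print(f"baseline: {baseline_bound:.4f}, stack: {stack_bound:.4f}, lift: {lift:.4f}")
```

Under these assumptions the 20-point recall swing shows up as roughly a 19-point swing in the bound, which is the "almost 1:1" relationship described above.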
What RAG actually buys you
Freshness: the LLM can answer about events that happened after its training cutoff, as long as your retrieval index is up to date.
Domain authority: answers are grounded in your private docs, your customer’s data, your knowledge base.
Citations: you can show the user which documents the answer came from. This is critical for legal, medical, and financial use cases.
Editability: to change what the LLM “knows”, you update your index, not your model weights.
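Freshness and editability both come down to the same mechanic: what the generator sees is whatever the index returns at query time. A minimal sketch, assuming a hypothetical in-memory `KeywordIndex` with word-overlap search in place of a real vector store:

```python
import re


def words(text: str) -> set[str]:
    # Crude tokenizer: lowercase alphanumeric runs.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class KeywordIndex:
    """Toy index: stores documents by ID and searches by word overlap."""

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}

    def upsert(self, doc_id: str, text: str) -> None:
        # Insert a new document or overwrite the existing one with the same ID.
        self._docs[doc_id] = text

    def delete(self, doc_id: str) -> None:
        self._docs.pop(doc_id, None)

    def search(self, query: str, k: int = 1) -> list[tuple[str, str]]:
        q = words(query)
        scored = [
            (len(q & words(text)), doc_id, text)
            for doc_id, text in self._docs.items()
        ]
        hits = [item for item in scored if item[0] > 0]
        hits.sort(key=lambda item: item[0], reverse=True)
        return [(doc_id, text) for _, doc_id, text in hits[:k]]


index = KeywordIndex()
index.upsert("pricing", "The Pro plan costs 20 dollars per month.")
before = index.search("What does the Pro plan cost?")

# The price changes: update the index, not the model.
index.upsert("pricing", "The Pro plan costs 25 dollars per month.")
after = index.search("What does the Pro plan cost?")
```

The same query returns different context before and after the upsert, so the next generated answer changes without any retraining.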
Where RAG breaks
Bad retrieval: the relevant doc isn’t in the candidate set, and nothing the LLM does can recover it.
Context overflow: too many retrieved tokens, so the LLM ignores the middle of the prompt (context compression helps here).
Aggregation queries: “how many products did we ship last quarter?” doesn’t fit retrieval; it needs structured data, not RAG.
Outdated index: the index drifts away from the source documents. Re-embed or re-index regularly.
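One simple guard against context overflow is to cap the assembled context at a fixed budget. The sketch below assumes passages arrive already ranked best-first and uses whitespace word count as a crude stand-in for a real tokenizer; `pack_context` and the budget value are hypothetical.

```python
def pack_context(ranked_passages: list[str], max_tokens: int) -> list[str]:
    # Greedily keep passages in rank order while they fit in the budget.
    packed: list[str] = []
    used = 0
    for passage in ranked_passages:
        cost = len(passage.split())  # word count as a rough token proxy
        if used + cost > max_tokens:
            # Skip passages that do not fit; a shorter, lower-ranked one still might.
            continue
        packed.append(passage)
        used += cost
    return packed


ranked_passages = [
    "refunds take fourteen days",
    "shipping usually takes between three and five business days",
    "returns need a receipt",
]
context = pack_context(ranked_passages, max_tokens=8)
```

Skipping rather than stopping at the first passage that does not fit is a design choice: it fills the budget more completely but can admit lower-ranked material, so a stricter variant would `break` instead.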
Hallucination in RAG is almost always a retrieval bug. Fix the retrieval — the generator is rarely the bottleneck.
Go further
Where in the pipeline should I invest first?
Almost always retrieval quality, not the LLM. A 1-point NDCG@10 lift on the reranker compounds into measurable answer-quality lift; a bigger generator on top of bad context just produces confidently-wrong answers faster.
How do I keep the LLM from ignoring the retrieved context?
Two levers: shrink the prompt with a cross-encoder reranker so only top-K passages survive, then compress further if you're still pushing the model's effective context window. Long, redundant context is the main driver of [context rot](/concepts/context-rot/) failures.
When is RAG the wrong tool?
Aggregation queries ('how many', 'what's the total'), tasks that require executing code or hitting an API, and questions whose answer is implicit across thousands of documents. For those you want structured query, tool-use, or summarization pipelines, not similarity search.
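In practice this often means putting a router in front of the retriever. The sketch below is a deliberately naive illustration: the cue list, the `route` function, and the route names are all assumptions, and a real system would more likely use a trained classifier or the LLM itself to make this decision.

```python
import re

# Hypothetical surface cues that suggest an aggregation over structured data.
AGGREGATION_CUES = re.compile(
    r"\b(how many|how much|total|average|sum of|count of)\b",
    re.IGNORECASE,
)


def route(query: str) -> str:
    # Send aggregation-style questions to a structured query path,
    # everything else to the retrieval pipeline.
    if AGGREGATION_CUES.search(query):
        return "structured_query"
    return "rag"


routes = {
    q: route(q)
    for q in [
        "How many products did we ship last quarter?",
        "What's the total revenue for March?",
        "What is our refund policy?",
    ]
}
```

A keyword router like this will misfire on phrasing it has not seen, so treat it as a starting point for measuring how much traffic is aggregation-shaped rather than as a finished component.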