Also known as: candidate generation, first-stage retrieval, recall stage
TL;DR
First-pass retrieval is the initial wide-net stage of a production search pipeline that surfaces a few hundred candidate documents per query out of millions. It optimizes for recall and speed; precision-at-the-top is left to a reranker downstream.
First-pass retrieval is the first stage of a two-stage retrieval pipeline. Its job: out of every document in the corpus, hand the next stage a few hundred candidates that probably contain the relevant one.
What first-pass methods have in common: the ability to scan millions or billions of documents per query in milliseconds. What they trade away: per-pair accuracy. The first pass runs on a tight latency budget, so slower, more accurate per-pair scorers such as cross-encoder rerankers come later in the pipeline.
The metric to optimize
First-pass retrieval is judged by Recall@K: the fraction of queries whose relevant document appears anywhere in the top-K candidates. The reranker can fix ordering within those K, but it can’t recover a document that wasn’t surfaced. So Recall@K, at whatever K you hand to the reranker (often 100), is the silent ceiling on a RAG pipeline.
This is why you’ll see embedding models marketed on Recall@100 specifically. NDCG matters at the top of the result list, but Recall@100 is what determines whether the right answer can possibly reach the top.
Recall@K is the ceiling; everything downstream lives below it. A reranker can only reorder what first-pass surfaces — if the right document isn’t in the top-K, no amount of cross-attention recovers it.
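As a minimal sketch of the metric described above, the function below counts a query as a hit when at least one of its labeled relevant documents appears in the top-K candidates. The function name, the dictionary layout, and the query and document IDs are illustrative assumptions, not part of any particular evaluation library.

```python
from typing import Dict, List, Set


def recall_at_k(
    retrieved: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of labeled queries with at least one relevant doc in the top-k."""
    if not relevant:
        raise ValueError("need at least one labeled query")
    hits = 0
    for query_id, gold_ids in relevant.items():
        # A query with no retrieved list at all simply counts as a miss.
        top_k = retrieved.get(query_id, [])[:k]
        if gold_ids.intersection(top_k):
            hits += 1
    return hits / len(relevant)


# Toy data: hypothetical query IDs mapped to ranked candidate lists.
retrieved = {
    "q1": ["d7", "d2", "d9"],
    "q2": ["d4", "d1", "d3"],
    "q3": ["d8", "d6", "d5"],
}
# Hypothetical relevance labels; "d0" is never retrieved for q3.
relevant = {"q1": {"d2"}, "q2": {"d3"}, "q3": {"d0"}}

for k in (1, 2, 3):
    print(k, recall_at_k(retrieved, relevant, k))
```

In this toy set, q3 stays a miss at every K because its relevant document was never surfaced, which is exactly the case no reranker can repair.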
How wide should the candidate set be?
K is a knob. Wider candidate sets give more headroom for the reranker but cost more — both in network bandwidth and in reranker compute. Common production values: K = 50, 100, or 200. Going past 500 rarely helps unless your first-pass is genuinely poor; the marginal recall gain doesn’t justify the extra reranker cost.
The diagnostic is the Recall@K curve: sweep K from 10 up to several hundred and plot the recall on a held-out eval set. If recall is climbing steeply at K=200, your first-pass is leaving signal on the table — widen until the curve flattens. If recall plateaus early (say, gains under 1 percentage point between K=50 and K=500), the candidate set is saturated and the rare misses are pathological cases the first-pass fundamentally can’t catch. At that point the right move is hybrid retrieval (add BM25 if dense-only, add dense if BM25-only), domain-adapted embeddings, or query rewriting — not just more K. The cost asymmetry helps the decision: doubling K doubles reranker spend, while adding BM25 to a dense-only stack is a fixed engineering cost that pays off across every query.
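The sweep above can be sketched in a few lines, assuming you have already recorded, for each eval query, the 1-based rank at which the first relevant document appeared (or None if it never did). The helper names, the 1-percentage-point threshold default, and the toy ranks are assumptions for illustration only.

```python
from typing import Dict, Optional, Sequence


def recall_curve(
    first_hit_ranks: Sequence[Optional[int]], ks: Sequence[int]
) -> Dict[int, float]:
    """Recall@K for each K, given the rank of the first relevant doc per query."""
    n = len(first_hit_ranks)
    if n == 0:
        raise ValueError("need at least one eval query")
    return {
        k: sum(1 for r in first_hit_ranks if r is not None and r <= k) / n
        for k in ks
    }


def saturation_k(curve: Dict[int, float], min_gain: float = 0.01) -> int:
    """Smallest swept K whose recall is within min_gain of the largest K's recall."""
    ks = sorted(curve)
    best = curve[ks[-1]]
    for k in ks:
        if best - curve[k] < min_gain:
            return k
    return ks[-1]


# Hypothetical eval set of 10 queries; None means the doc was never retrieved.
ranks = [1, 3, 12, 40, 45, 120, None, 7, 2, 60]
curve = recall_curve(ranks, ks=[10, 50, 100, 200, 500])
for k in sorted(curve):
    print(f"Recall@{k} = {curve[k]:.2f}")
print("curve flattens at K =", saturation_k(curve))
```

On real data the threshold is a judgment call: a looser `min_gain` favors a cheaper reranker, a tighter one favors recall.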
Typical production first-pass shapes
Pure BM25 over an inverted index — fast, exact, no model serving cost
Dense embeddings + an ANN index (HNSW, IVF, or ScaNN, typically served through a library such as FAISS) — semantic, requires GPU/CPU embedding inference at query time
Hybrid (BM25 + dense, fused with RRF) — the most common production default
BM25 + dense + ColBERT late-interaction — three legs, expensive, used at the high end
Coarse-then-fine: cheap first-pass over binary-quantized embeddings, re-scored by full-precision dense, then reranker
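For the hybrid shape in the list above, reciprocal rank fusion (RRF) combines ranked lists using only rank positions: each document scores the sum of 1 / (k + rank) over the lists it appears in. The sketch below assumes the commonly used constant k = 60; the function name, the `top_n` parameter, and the document IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(
    rankings: Sequence[List[str]], k: int = 60, top_n: int = 100
) -> List[str]:
    """Fuse ranked doc-ID lists with reciprocal rank fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents missing from a list contribute nothing for that list.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


# Hypothetical top-3 lists from a BM25 leg and a dense leg.
bm25_ranking = ["d3", "d1", "d7"]
dense_ranking = ["d1", "d9", "d3"]

fused = rrf_fuse([bm25_ranking, dense_ranking], top_n=3)
print(fused)
```

Because RRF ignores raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on different scales.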
Go further
How do I know if my first-pass recall is good enough?
Sweep K from 10 up to a few hundred, plot Recall@K, and look for the knee. If recall is still climbing fast at K=200, your first-pass is leaving signal on the table — fix that before tuning the reranker.
Why not skip first-pass and just rerank everything?
A cross-encoder needs a fresh forward pass per (query, doc) pair, so reranking a million-doc corpus means a million forward passes per query, thousands of times the cost of reranking a few hundred candidates. First-pass exists precisely to compress 'all docs' down to a few hundred the cross-encoder can afford.
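A back-of-the-envelope version of that argument, assuming a purely hypothetical 5 ms per (query, doc) forward pass with no batching; real latency depends on the model, hardware, and batch size.

```python
def rerank_seconds(num_docs: int, ms_per_pair: float) -> float:
    """Sequential cross-encoder time: one forward pass per (query, doc) pair."""
    return num_docs * ms_per_pair / 1000.0


MS_PER_PAIR = 5.0  # assumed figure for illustration, not a benchmark

full_corpus = rerank_seconds(1_000_000, MS_PER_PAIR)  # rerank every document
candidates = rerank_seconds(200, MS_PER_PAIR)         # rerank a K=200 candidate set

print(f"full corpus: {full_corpus:.0f} s per query")
print(f"K=200:       {candidates:.0f} s per query")
print(f"ratio:       {full_corpus / candidates:.0f}x")
```

The ratio depends only on corpus size divided by K, so it holds whatever the true per-pair latency turns out to be.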
What changes when the corpus reaches billions of documents?
ANN index choice (HNSW vs IVF vs ScaNN) starts to dominate cost and recall, embedding quantization becomes mandatory to fit memory, and you typically shard by topic or tenant to keep per-query latency flat.
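To see why quantization becomes mandatory at that scale, here is a rough memory estimate for the raw vectors alone, assuming 768-dimensional embeddings (a hypothetical but common size) and ignoring index overhead such as graph links or ID maps.

```python
def vector_bytes(num_vectors: int, dims: int, bits_per_dim: int) -> int:
    """Raw storage for the embeddings only, excluding any ANN index structures."""
    return num_vectors * dims * bits_per_dim // 8


NUM_DOCS = 1_000_000_000  # one billion documents, one vector each
DIMS = 768                # assumed embedding dimensionality

for label, bits in (("float32", 32), ("int8", 8), ("binary", 1)):
    gb = vector_bytes(NUM_DOCS, DIMS, bits) / 1e9
    print(f"{label:>8}: {gb:,.0f} GB")
```

Under these assumptions full-precision vectors need about 3 TB, while binary quantization brings the same corpus under 100 GB.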