First-Pass Retrieval

Also known as: candidate generation, first-stage retrieval, recall stage

TL;DR

First-pass retrieval is the initial wide-net stage of a production search pipeline that surfaces a few hundred candidate documents per query out of millions. It optimizes for recall and speed; precision-at-the-top is left to a reranker downstream.

First-pass retrieval is the first stage of a two-stage retrieval pipeline. Its job: out of every document in the corpus, hand the next stage a few hundred candidates that probably contain the relevant one.

[Figure: First-pass retrieval, the bridge from millions to hundreds. Corpus (everything indexed, N = 10M), then first-pass retrieval (BM25 + ANN over embeddings, k = 1000, recall@1000 = 0.94), then stage 2 rerank (small cross-encoder, k = 100, recall@100 = 0.91), then stage 3 rerank (large cross-encoder or LLM, k = 10, recall@10 = 0.78), then the final top-K the user sees (k = 5, recall@5 = 0.62). Everything downstream is bound by k = 1000: a reranker can only reorder what first-pass surfaces, so a missed document is invisible. Optimize first-pass for recall; optimize rerankers for order.]

The dominant implementations are:

  • BM25: fast lexical search via inverted indexes
  • Dense embeddings + ANN: vector similarity via an approximate-nearest-neighbor index (HNSW, IVF, ScaNN)
  • Hybrid: both, merged

What they have in common: the ability to scan millions or billions of documents per query in milliseconds. What they trade away: per-pair accuracy. First-pass runs under a tight latency budget, so deeper per-pair scoring approaches such as cross-encoders come later in the pipeline.
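To make the lexical leg concrete, here is a minimal sketch of BM25 scoring over an in-memory inverted index. It assumes whitespace tokenization, the Lucene-style IDF variant, and the conventional defaults k1 = 1.2 and b = 0.75. The class name `TinyBM25` and the toy corpus are hypothetical; a production system would use a search engine rather than this code.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Deliberately naive: lowercase and split on whitespace.
    return text.lower().split()


class TinyBM25:
    """Toy BM25 over an in-memory inverted index (illustrative only)."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.doc_len = []
        # Inverted index: term -> {doc_id: term frequency in that doc}
        self.postings = defaultdict(dict)
        for doc_id, text in enumerate(docs):
            tokens = tokenize(text)
            self.doc_len.append(len(tokens))
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_id] = tf
        self.n_docs = len(docs)
        self.avg_len = sum(self.doc_len) / max(self.n_docs, 1)

    def search(self, query, k=10):
        scores = defaultdict(float)
        for term in set(tokenize(query)):
            plist = self.postings.get(term)
            if not plist:
                continue  # term not in the corpus
            df = len(plist)
            # Lucene-style IDF, always non-negative.
            idf = math.log(1 + (self.n_docs - df + 0.5) / (df + 0.5))
            # Only documents containing the term are touched,
            # which is why the inverted index keeps lookups cheap.
            for doc_id, tf in plist.items():
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / self.avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


docs = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "inverted indexes map terms to documents",
]
index = TinyBM25(docs)
hits = index.search("cat mat", k=2)
print(hits)
```

Note that the naive tokenizer does not match "cat" against "cats"; real lexical stacks add stemming or analyzers, which is outside the scope of this sketch.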

The metric to optimize

First-pass retrieval is judged by Recall@K: the fraction of queries whose relevant document appears anywhere in the top-K candidates. The reranker can fix ordering within those K, but it can’t recover a document that wasn’t surfaced. So Recall@100 is the silent ceiling on every RAG pipeline.

This is why you’ll see embedding models marketed on Recall@100 specifically. NDCG matters at the top of the result list, but Recall@100 is what determines whether the right answer can possibly reach the top.

Recall@K is the ceiling; everything downstream lives below it. A reranker can only reorder what first-pass surfaces — if the right document isn’t in the top-K, no amount of cross-attention recovers it.
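The metric as defined above can be sketched in a few lines. This assumes one labeled relevant document per query, matching the definition in this section; the function name `mean_recall_at_k` and the toy `runs` data are hypothetical.

```python
def mean_recall_at_k(runs, k):
    """Fraction of queries whose relevant doc appears in the top-k candidates.

    runs: list of (ranked_doc_ids, relevant_doc_id) pairs, one per query.
    """
    if not runs:
        return 0.0
    hits = sum(relevant in ranked[:k] for ranked, relevant in runs)
    return hits / len(runs)


# Toy eval set: three queries, each with one labeled relevant document.
runs = [
    (["d3", "d1", "d7"], "d1"),  # relevant doc at rank 2
    (["d2", "d9", "d4"], "d4"),  # relevant doc at rank 3
    (["d5", "d6", "d8"], "d0"),  # relevant doc never surfaced
]

for k in (1, 2, 3):
    print(k, mean_recall_at_k(runs, k))
```

The third query illustrates the ceiling: no value of K within the returned list recovers a document the first pass never surfaced.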

How wide should the candidate set be?

K is a knob. Wider candidate sets give more headroom for the reranker but cost more — both in network bandwidth and in reranker compute. Common production values: K = 50, 100, or 200. Going past 500 rarely helps unless your first-pass is genuinely poor; the marginal recall gain doesn’t justify the extra reranker cost.

[Figure: Recall vs. K, the design knob. X-axis: candidate-set size K from 1 to 10K (log scale). Curves: Recall@K (diminishing returns) and reranker cost (linear in K). Annotated sweet spot: K ≈ 200 (recall 0.93, cost 2.0%); typical production range K = 100 to 500. Recall climbs steeply until K ≈ 100, then plateaus, while reranker cost scales linearly with K. Widen K until the curve flattens; stop before cost outruns recall.]

The diagnostic is the Recall@K curve: sweep K from 10 up to several hundred and plot the recall on a held-out eval set. If recall is climbing steeply at K=200, your first-pass is leaving signal on the table — widen until the curve flattens. If recall plateaus early (say, gains under 1 percentage point between K=50 and K=500), the candidate set is saturated and the rare misses are pathological cases the first-pass fundamentally can’t catch. At that point the right move is hybrid retrieval (add BM25 if dense-only, add dense if BM25-only), domain-adapted embeddings, or query rewriting — not just more K. The cost asymmetry helps the decision: doubling K doubles reranker spend, while adding BM25 to a dense-only stack is a fixed engineering cost that pays off across every query.
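The sweep described above might look like the following sketch. It assumes one relevant document per query and treats the curve as saturated at the smallest K whose recall is within `min_gain` of the recall at the largest K swept. The function names and the example curve values are hypothetical, not measurements.

```python
def recall_curve(runs, ks):
    """Recall@K for each K in ks.

    runs: non-empty list of (ranked_doc_ids, relevant_doc_id) pairs.
    """
    curve = {}
    for k in ks:
        hits = sum(relevant in ranked[:k] for ranked, relevant in runs)
        curve[k] = hits / len(runs)
    return curve


def smallest_saturated_k(curve, min_gain=0.01):
    """Smallest K whose recall is within min_gain of the widest K swept."""
    ks = sorted(curve)
    best = curve[ks[-1]]
    for k in ks:
        if best - curve[k] < min_gain:
            return k
    return ks[-1]


# Hypothetical curve from a held-out eval set (illustrative numbers only).
example_curve = {10: 0.70, 50: 0.88, 100: 0.93, 200: 0.935, 500: 0.936}
print(smallest_saturated_k(example_curve))  # 100: wider K adds under 1 point
```

One caveat on the design: the check is relative to the largest K in the sweep, so it cannot tell whether recall would keep climbing beyond that point. If recall at the two largest K values still differs noticeably, extend the sweep before trusting the result.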

Typical production first-pass shapes

  • Pure BM25 over an inverted index — fast, exact, no model serving cost
  • Dense embeddings + an ANN index (HNSW or IVF, e.g. via FAISS or ScaNN): semantic, requires GPU/CPU embedding inference at query time
  • Hybrid (BM25 + dense, fused with RRF) — the most common production default
  • BM25 + dense + ColBERT late-interaction — three legs, expensive, used at the high end
  • Coarse-then-fine: cheap first-pass over binary-quantized embeddings, re-scored by full-precision dense, then reranker
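The hybrid shape in the list above fuses its legs with reciprocal rank fusion (RRF). A minimal sketch, assuming each leg returns a ranked list of document IDs and using the commonly cited smoothing constant of 60; the function name `rrf_fuse` and the sample rankings are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings, rrf_k=60, top_n=None):
    """Reciprocal rank fusion over several ranked lists of doc IDs.

    Each document scores sum(1 / (rrf_k + rank)) over the lists it appears
    in, with rank starting at 1. Only ranks are used, so the legs' raw
    scores never need to be on a comparable scale.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rrf_k + rank)
    # Sort by fused score, breaking ties by doc ID for determinism.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused if top_n is None else fused[:top_n]


bm25_hits = ["a", "b", "c"]   # hypothetical lexical ranking
dense_hits = ["b", "d", "a"]  # hypothetical dense ranking
print(rrf_fuse([bm25_hits, dense_hits]))  # ['b', 'a', 'd', 'c']
```

Documents found by both legs ("a" and "b") rise above those found by only one, which is the behavior that makes hybrid retrieval a reasonable default.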

Go further

Why not skip first-pass and just rerank everything?

A cross-encoder needs a fresh forward pass per (query, doc) pair, so reranking a million-document corpus means a million forward passes per query, thousands of times the cost of reranking a few hundred candidates. First-pass exists precisely to compress "all docs" down to a few hundred that the cross-encoder can afford.
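A back-of-the-envelope calculation shows the scale of the gap. The per-pair latency below is a hypothetical figure chosen for round numbers, not a benchmark, and the sketch ignores batching and parallelism.

```python
def rerank_seconds(num_pairs, ms_per_pair):
    """Total cross-encoder time if every pair costs one forward pass."""
    return num_pairs * ms_per_pair / 1000.0


MS_PER_PAIR = 1.0        # assumed cost of one (query, doc) forward pass
CORPUS_SIZE = 1_000_000  # rerank everything
SHORTLIST = 200          # rerank only first-pass candidates

full_corpus_seconds = rerank_seconds(CORPUS_SIZE, MS_PER_PAIR)
shortlist_seconds = rerank_seconds(SHORTLIST, MS_PER_PAIR)
cost_ratio = CORPUS_SIZE / SHORTLIST

print(full_corpus_seconds, shortlist_seconds, cost_ratio)
```

Under these assumptions, reranking the full corpus takes about 1,000 seconds per query versus about 0.2 seconds for the shortlist, a 5,000x difference that holds regardless of the per-pair latency chosen.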

What changes when the corpus reaches billions of documents?

ANN index choice (HNSW vs IVF vs ScaNN) starts to dominate cost and recall, embedding quantization becomes mandatory to fit memory, and you typically shard by topic or tenant to keep per-query latency flat.
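The memory pressure behind quantization is easy to estimate. The sketch below counts raw vector storage only (no graph or index overhead) and assumes a hypothetical 1024-dimensional embedding over one billion documents.

```python
def vector_bytes(num_vectors, dims, bits_per_dim):
    """Raw storage for the vectors alone, excluding any index overhead."""
    return num_vectors * dims * bits_per_dim // 8


NUM_DOCS = 1_000_000_000  # one vector per document (assumption)
DIMS = 1024               # hypothetical embedding width

footprint_gb = {
    name: vector_bytes(NUM_DOCS, DIMS, bits) / 1e9
    for name, bits in (("float32", 32), ("int8", 8), ("binary", 1))
}

for name, gb in footprint_gb.items():
    print(f"{name}: {gb:,.0f} GB")
```

With these numbers, full-precision vectors need about 4,096 GB, int8 about 1,024 GB, and binary about 128 GB, which is why quantization decides whether the index fits in memory at all.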
