Recall@K is the fraction of queries whose relevant document appears anywhere in the top-K results. It measures whether retrieval found the right document at all — the silent ceiling on every downstream stage.
Recall@K is the fraction of queries for which at least one relevant document appears somewhere in the top-K results. Concretely:
Recall@100 = (queries where the relevant doc is in top-100) / (total queries)
Unlike NDCG@K, it doesn’t care where in top-K the document is, only whether it’s there at all. So a relevant doc at position 1 and a relevant doc at position 100 both count fully.
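The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `recall_at_k`, the doc IDs, and the toy rankings are all made up for the example, and it assumes each query comes with a ranked list of doc IDs and a set of relevant doc IDs.

```python
from typing import Sequence, Set


def recall_at_k(
    rankings: Sequence[Sequence[str]],
    relevant: Sequence[Set[str]],
    k: int,
) -> float:
    """Fraction of queries with at least one relevant doc in the top-k results."""
    if len(rankings) != len(relevant):
        raise ValueError("rankings and relevant must have one entry per query")
    if not rankings:
        raise ValueError("need at least one query")
    hits = sum(
        1
        for ranked, rel in zip(rankings, relevant)
        if any(doc_id in rel for doc_id in ranked[:k])
    )
    return hits / len(rankings)


# Three toy queries with hypothetical doc IDs.
rankings = [
    ["d1", "d7", "d3"],  # relevant doc at position 1
    ["d4", "d9", "d2"],  # relevant doc at position 3
    ["d5", "d6", "d8"],  # relevant doc never retrieved
]
relevant = [{"d1"}, {"d2"}, {"d0"}]

print(recall_at_k(rankings, relevant, k=3))  # 2 of 3 queries hit
print(recall_at_k(rankings, relevant, k=1))  # only the first query hits
```

Note that the first two queries count equally at k=3 even though one relevant doc sits at position 1 and the other at position 3.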
Why Recall@100 is the load-bearing metric for retrieval
In a two-stage pipeline (first-pass → reranker), the reranker can only sort the candidates that first-pass surfaced. If the relevant document didn’t make it into the top-100, nothing the reranker does can put it at #1. Recall@K (with K typically equal to your candidate-set size, often 100) is therefore the ceiling on every downstream metric.
Recall@K is the silent ceiling on every metric you measure downstream.
That’s why embedding models are often marketed on Recall@100 specifically. The gap between, say, 0.77 and 0.79 might look small, but it means 2 more queries in every 100 where the relevant document is now reachable. Multiply that by every query you serve.
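The ceiling argument can be checked with a toy simulation. The sketch below assumes a hypothetical perfect reranker (`oracle_rerank`, which is handed the relevance labels and so could never exist in production) and made-up candidate lists. Even with that oracle, the share of queries with a relevant doc at #1 cannot exceed the share of queries whose relevant doc reached the candidate set.

```python
from typing import Callable, List, Sequence, Set


def end_to_end_hit_at_1(
    candidates: Sequence[Sequence[str]],
    relevant: Sequence[Set[str]],
    rerank: Callable[[List[str], Set[str]], List[str]],
    k: int,
) -> float:
    """Share of queries whose top result after reranking the top-k is relevant."""
    hits = 0
    for ranked, rel in zip(candidates, relevant):
        reranked = rerank(list(ranked[:k]), rel)
        if reranked and reranked[0] in rel:
            hits += 1
    return hits / len(candidates)


def oracle_rerank(docs: List[str], rel: Set[str]) -> List[str]:
    """Hypothetical perfect reranker: moves relevant docs to the front."""
    # sorted() is stable, and False sorts before True, so relevant docs lead.
    return sorted(docs, key=lambda d: d not in rel)


# Four toy queries; the third one's relevant doc ("z") was never retrieved.
candidates = [
    ["a", "b", "c"],
    ["d", "e", "f"],
    ["g", "h", "i"],
    ["j", "k", "l"],
]
relevant = [{"c"}, {"d"}, {"z"}, {"k"}]

ceiling_k3 = end_to_end_hit_at_1(candidates, relevant, oracle_rerank, k=3)
ceiling_k2 = end_to_end_hit_at_1(candidates, relevant, oracle_rerank, k=2)
print(ceiling_k3)  # equals Recall@3 on this data
print(ceiling_k2)  # equals Recall@2 on this data
```

A real reranker can only do worse than the oracle, so its end-to-end score lands at or below these values.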
K is not a free hyperparameter: it’s tied to your reranker’s latency budget. Cross-encoder rerankers run forward passes at roughly 5-30 ms per (query, doc) pair on commodity GPUs, so at K=100 you’re spending 500 ms to 3 s on reranking if pairs are scored one at a time. K=10 leaves the reranker no headroom: there’s almost nothing for it to recover, and a doc that missed the top 10 stays missed regardless. K=1000 multiplies reranker latency tenfold, usually past production budgets. K=100 is the sweet spot: large enough that the recall ceiling barely caps reranker performance, small enough that reranker latency fits a real-time SLO. As reranker throughput improves (smaller specialized models, batching tricks), K=200-500 becomes feasible and Recall@K at the larger K becomes the more honest first-pass metric. Tune K to your reranker’s throughput; don’t pick it from convention.
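The latency arithmetic is simple enough to script. This sketch assumes pairs are scored sequentially with no batching, which is the worst case; the helper names and the 1000 ms budget are illustrative assumptions, while the 5-30 ms per-pair range comes from the paragraph above.

```python
def rerank_latency_ms(k: int, ms_per_pair: float) -> float:
    """Sequential reranking cost: one forward pass per (query, doc) pair."""
    return k * ms_per_pair


def max_k_for_budget(budget_ms: float, ms_per_pair: float) -> int:
    """Largest candidate-set size whose sequential reranking fits the budget."""
    return int(budget_ms // ms_per_pair)


# Latency range for each candidate-set size at 5 ms and 30 ms per pair.
for k in (10, 100, 1000):
    low = rerank_latency_ms(k, 5)
    high = rerank_latency_ms(k, 30)
    print(f"K={k}: {low:.0f} ms to {high:.0f} ms")

# Hypothetical 1000 ms reranking budget at both ends of the per-pair range.
print(max_k_for_budget(budget_ms=1000, ms_per_pair=5))
print(max_k_for_budget(budget_ms=1000, ms_per_pair=30))
```

Measuring your own reranker's per-pair latency (with whatever batching you actually run) and plugging it in gives a K grounded in throughput rather than convention.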
Recall vs precision
Recall: “how many relevant items did we find?” (out of all relevant ones)
Precision: “of what we returned, how much was relevant?”
In retrieval the canonical metric set is Recall@K (the candidate set has the relevant doc) plus NDCG@K (the ordering puts it on top). Precision-style metrics (P@K, MAP) exist but are less common in modern retrieval evaluation.
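To make the recall/precision contrast concrete, here is a small per-query sketch. The function names and the toy ranking are assumptions for illustration. The recall here is the set-based form (share of all relevant docs found); it coincides with the per-query hit-or-miss form used for Recall@K earlier when each query has exactly one relevant document.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], rel: Set[str], k: int) -> float:
    """Of the top-k returned docs, the share that is relevant."""
    return sum(doc_id in rel for doc_id in ranked[:k]) / k


def set_recall_at_k(ranked: Sequence[str], rel: Set[str], k: int) -> float:
    """Of all relevant docs, the share that appears in the top-k."""
    return len(set(ranked[:k]) & rel) / len(rel)


# One toy query: three relevant docs exist, two of them were retrieved.
ranked = ["d1", "d2", "d3", "d4", "d5"]
rel = {"d2", "d5", "d9"}

print(precision_at_k(ranked, rel, k=5))   # 2 relevant out of 5 returned
print(set_recall_at_k(ranked, rel, k=5))  # 2 found out of 3 relevant
```

Growing K can only keep recall flat or raise it, while precision usually falls as more non-relevant docs enter the list.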
Concrete numbers
Recall@100 across embedders (28-dataset average)
OpenAI Large: Recall@100 = 0.751
Voyage-4: Recall@100 = 0.751
harrier-27b: Recall@100 = 0.749
zembed-1: Recall@100 = 0.771
A 2-point Recall@100 advantage means 2 more queries in every 100 where the relevant document survives first-pass. End-to-end, that’s a measurable lift in RAG answer quality and reranker headroom.
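As a quick sanity check on what a 2-point gap means in absolute terms, the sketch below converts the Recall@100 figures listed above into a count of additionally reachable queries. The query volume of 100,000 is a made-up assumption, as is the helper name.

```python
# Recall@100 figures as listed above (28-dataset average).
scores = {
    "OpenAI Large": 0.751,
    "Voyage-4": 0.751,
    "harrier-27b": 0.749,
    "zembed-1": 0.771,
}


def extra_reachable_queries(recall_a: float, recall_b: float, num_queries: int) -> int:
    """Queries reachable under recall_a but not recall_b, for a given volume."""
    return round((recall_a - recall_b) * num_queries)


NUM_QUERIES = 100_000  # hypothetical query volume
best_name = max(scores, key=scores.get)

for name, score in scores.items():
    gap = extra_reachable_queries(scores[best_name], score, NUM_QUERIES)
    print(f"{best_name} vs {name}: {gap} more reachable queries")
```

The counts scale linearly with volume, so the same gap at ten times the traffic means ten times as many recovered queries.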
Go further
Why pick K=100 specifically?
K is conventionally set to your candidate-set size — the number of documents you'll feed into the reranker. 100 is the modern default because it balances reranker latency budget against headroom: anything not in top-100 is invisible to the reranker no matter how good it is.
If Recall@100 is 0.95, am I done with first-pass retrieval?
Almost. The remaining 5% are queries where the relevant doc isn't reachable, and no reranker can save them. The fix is a stronger embedder, hybrid search that adds a sparse signal, or query rewriting to surface synonymous phrasings.
How is Recall@K different from NDCG@K?
Recall asks 'is the relevant doc anywhere in top-K?'; NDCG asks 'is it near the top?'. You typically track Recall at a large K (100) and NDCG at a small K (10): Recall measures first-pass; NDCG measures reranker quality on top of that.
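The difference between the two questions shows up clearly on a single ranked list. The sketch below uses binary relevance and the common log2 position discount for NDCG; the function names and the 100-doc toy ranking are assumptions for illustration.

```python
import math
from typing import Sequence, Set


def hit_at_k(ranked: Sequence[str], rel: Set[str], k: int) -> float:
    """Per-query recall signal: 1.0 if any relevant doc is in the top-k."""
    return 1.0 if any(doc_id in rel for doc_id in ranked[:k]) else 0.0


def ndcg_at_k(ranked: Sequence[str], rel: Set[str], k: int) -> float:
    """Binary-relevance NDCG@k: discounted gain over the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, doc_id in enumerate(ranked[:k])
        if doc_id in rel
    )
    ideal_hits = min(len(rel), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy ranking of 100 hypothetical doc IDs: d1 at position 1, d100 at position 100.
ranked = [f"d{i}" for i in range(1, 101)]

for rel in ({"d1"}, {"d3"}, {"d50"}):
    print(rel, hit_at_k(ranked, rel, k=100), ndcg_at_k(ranked, rel, k=10))
```

A relevant doc at position 50 scores a full hit at K=100 but contributes nothing to NDCG@10, which is exactly the gap a reranker is meant to close.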