Also known as: re-ranker, rerank model, second-stage ranker
TL;DR
A reranker is a second-stage retrieval model that takes a candidate set from first-pass retrieval and reorders it by relevance. It's how production search systems get high precision without paying full LLM cost on every query.
A reranker is a model that reorders a list of candidate documents for a given query, putting the most relevant ones first. It’s the second stage of a typical retrieval pipeline. The first stage (BM25, dense embeddings, or hybrid) casts a wide net and returns hundreds of candidates per query — fast but imprecise. The reranker then scores each (query, document) pair more carefully and reorders them.
The right division of labor is fast and approximate first, slow and precise second. Rerankers exist because that asymmetry is the only way to ship retrieval at production scale.
Why a separate stage?
You could run a single large model over both stages, but the cost would be prohibitive. First-pass retrieval has to consider millions of documents per query; a reranker only sees a few hundred. That is why the division of labor is “fast and approximate” first, “slow and precise” second: with a much smaller candidate set, the reranker can afford far more compute per pair.
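To make the two-stage split concrete, here is a minimal sketch in Python. Everything in it is illustrative: `term_overlap` stands in for a real first pass such as BM25 or an embedding index, `proximity_score` stands in for a real reranker model, and the function names and parameters are assumptions, not part of any library.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def term_overlap(query: str, doc: str) -> float:
    """Cheap first-pass stand-in: fraction of query terms present in the doc."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def proximity_score(query: str, doc: str) -> float:
    """Slower second-pass stand-in: rewards docs where the query terms sit close together."""
    tokens = doc.lower().split()
    positions = []
    for term in query.lower().split():
        if term not in tokens:
            return 0.0
        positions.append(tokens.index(term))
    if not positions:
        return 0.0
    return 1.0 / (1 + max(positions) - min(positions))


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    first_pass: Scorer,
    reranker: Scorer,
    n_candidates: int = 100,
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    # Stage 1: score the whole corpus cheaply and keep the best n_candidates.
    candidates = sorted(corpus, key=lambda d: first_pass(query, d), reverse=True)[:n_candidates]
    # Stage 2: run the expensive scorer only on the survivors, then reorder.
    rescored = [(doc, reranker(query, doc)) for doc in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_k]


corpus = [
    "password policy for the login page reset schedule",
    "reset your password from the login page",
    "billing and invoices",
]
results = retrieve_then_rerank(
    "reset password", corpus, term_overlap, proximity_score, n_candidates=2, top_k=2
)
for doc, score in results:
    print(f"{score:.3f}  {doc}")
```

In this toy corpus both password documents tie in the first pass, and the second stage is what moves the more relevant one to the top. The expensive scorer runs on two documents here, never on the whole corpus.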
What kinds of rerankers are there?
The dominant architecture today is a cross-encoder: the query and document are concatenated and fed jointly into a transformer, which produces a single relevance score per pair. This is much more accurate than a bi-encoder, which embeds the query and the document separately, but it is slower per pair. Because a reranker scores a few hundred pairs per query at most instead of millions, the cost is manageable.
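The difference is easiest to see in the function signatures. The sketch below uses deliberately crude stand-ins (bag-of-words counts and shared word pairs), not real neural models, and all names are hypothetical. Real bi-encoders are learned and are not blind to word order; the toy only shows where each kind of model gets to look at the pair.

```python
from collections import Counter
from typing import Dict


def embed(text: str) -> Dict[str, int]:
    """Toy 'bi-encoder' embedding: built from one text alone, so doc vectors can be cached."""
    return Counter(text.lower().split())


def bi_encoder_score(query_vec: Dict[str, int], doc_vec: Dict[str, int]) -> float:
    """Compare two independently computed vectors with a dot product."""
    return float(sum(count * doc_vec.get(term, 0) for term, count in query_vec.items()))


def cross_encoder_score(query: str, doc: str) -> float:
    """Toy 'cross-encoder': sees both texts together, so it can use joint features
    (here: adjacent word pairs that appear in both)."""
    def bigrams(text: str) -> set:
        tokens = text.lower().split()
        return set(zip(tokens, tokens[1:]))

    return float(len(bigrams(query) & bigrams(doc)))


query = "dog bites man"
docs = ["man bites dog", "dog bites man"]

doc_vecs = [embed(d) for d in docs]  # can be precomputed offline, once per document
query_vec = embed(query)             # one embedding call per query
bi_scores = [bi_encoder_score(query_vec, v) for v in doc_vecs]

cross_scores = [cross_encoder_score(query, d) for d in docs]  # one joint call per pair

print(bi_scores)     # [3.0, 3.0]
print(cross_scores)  # [0.0, 2.0]
```

The independent vectors cannot tell the two documents apart, while the joint scorer can. The price is one scoring call per (query, document) pair at query time, which is why this style of model is reserved for the small candidate set.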
Rerankers also differ in how they consume candidates: pointwise rerankers score each (query, document) pair independently, while listwise rerankers see the whole candidate set at once and produce a permutation. LLM-as-reranker is a recent variation that prompts a frontier LLM to do the ranking; it tends to be slower and more expensive than a specialized cross-encoder, with equal or lower accuracy.
Reranker shapes in the wild
Cross-encoder, pointwise — zerank-2, Cohere rerank-3, Voyage rerank-2.5. The production default.
Listwise prompted LLM — RankGPT, RankZephyr. Best accuracy on benchmarks; latency rules them out for most production traffic.
Late-interaction (ColBERT) — sits between bi-encoder and cross-encoder. Token-level embeddings with maxsim aggregation.
Cascade rerankers — cheap pointwise pass narrows top-100 to top-20, expensive listwise pass orders the survivors.
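The maxsim aggregation mentioned in the late-interaction entry above can be written in a few lines. This is a minimal sketch assuming token embeddings are already available as plain lists of floats; the function names and the toy vectors are illustrative, not taken from any ColBERT implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """For each query token, take its best-matching doc token; sum those maxima."""
    if not doc_tokens:
        return 0.0
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy 2-dimensional token embeddings (made up for illustration).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # matches both query tokens
doc_b = [[1.0, 0.0], [0.7, 0.0]]              # matches only the first query token

score_a = maxsim_score(query_tokens, doc_a)
score_b = maxsim_score(query_tokens, doc_b)
print(score_a, score_b)  # 2.0 1.0
```

Document token embeddings can still be computed ahead of time, as with a bi-encoder, but the score is built from token-level matches rather than a single vector per text.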
Do you need a reranker?
The honest answer is “almost always, if your top-1 quality matters.” A typical cross-encoder reranker adds 30-150ms to a retrieval pipeline that already takes 50-200ms. The gain in NDCG@10 is usually 10-30 absolute points on production traffic where the first pass already returns plausible candidates. That is a large accuracy gain for a sub-second latency cost.
The exceptions are narrow. (1) Pure-keyword domains where BM25 is already top-1 accurate — legal citations, SKU lookup, exact-match code search. (2) Workloads where p99 latency is a hard constraint below 200ms total. (3) Workloads where downstream consumers tolerate noisy top-k (some classification or summarization use cases). For everything else — RAG over docs, customer-support search, e-commerce product retrieval — the reranker is the single highest-leverage addition you can make.
Specialization. A reranker is asked one narrow question — “given this query and this document, how relevant is the document?” It doesn’t need to write code, do math, or hold long-form context. A 0.1-1B parameter cross-encoder, fine-tuned on retrieval supervision, beats a 70B+ frontier LLM at this task while running 100-1000× cheaper.
This is the central thesis of specialized-model retrieval: production AI is a constellation of small fine-tuned specialists, not one giant generalist. zerank-2 is 0.5B parameters and beats every commercial reranker on the BEIR benchmark by a wide margin. The size difference isn’t a tradeoff — it’s an advantage. Smaller models are easier to specialize, deploy, and serve at low latency.
When does it help most?
Rerankers shine when first-pass retrieval surfaces many plausible candidates but the actually-relevant document isn’t necessarily at the top. That describes nearly every production RAG pipeline. The accuracy gain from reranking compounds through the rest of the stack: a better top-k gives the LLM better context, which in turn produces better answers.
What ZeroEntropy ships
zerank-2 is our flagship reranker — calibrated, multilingual (100+ languages), instruction-following, and 50% cheaper than every other commercial reranker. See the Rerankers product page for the full specs.
Go further
How do I evaluate a reranker on my own data?
Build a small eval set (50-200 queries with graded-relevance judgments on top-100 candidates), measure NDCG@10 and recall@10 with and without the reranker, and compare across models. Don't rely on public benchmarks — they rarely match the shape of production traffic.
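Both metrics are short enough to compute yourself. The sketch below assumes graded judgments stored as a dict from document ID to gain, uses the linear-gain form of DCG (some tools use 2^gain - 1 instead), and treats any gain above zero as relevant for recall. The IDs and judgments are made up for illustration.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain: each gain is divided by log2(rank + 1), ranks from 1."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering of the judged docs."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_dcg = dcg_at_k(sorted(relevance.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """Share of relevant docs (gain > 0) that appear in the top k."""
    relevant = {doc_id for doc_id, gain in relevance.items() if gain > 0}
    if not relevant:
        return 0.0
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant) / len(relevant)


# One query: graded judgments plus the ranking before and after reranking.
relevance = {"a": 3.0, "b": 1.0}
first_pass_ranking = ["c", "b", "a"]
reranked_ranking = ["a", "b", "c"]

ndcg_before = ndcg_at_k(first_pass_ranking, relevance, k=10)
ndcg_after = ndcg_at_k(reranked_ranking, relevance, k=10)
print(f"NDCG@10 before={ndcg_before:.3f} after={ndcg_after:.3f}")
print(recall_at_k(first_pass_ranking, relevance, k=2), recall_at_k(reranked_ranking, relevance, k=2))
```

In practice you would average each metric over every query in the eval set and compare the averages with and without the reranker.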
What's the difference between a cross-encoder and a bi-encoder reranker?
Cross-encoders see the query and document jointly in one forward pass, capturing token-level interactions — much more accurate but more expensive per pair. Bi-encoders embed each separately and compare via dot product — fast enough for first-pass but typically not used as the reranking stage.
How is zerank-2 trained?
Via the [zELO](/concepts/zelo/) methodology: frontier LLMs vote pairwise on relevance, a Thurstone fit recovers continuous Elo-style scores, and a small specialized cross-encoder learns to predict those scores pointwise. The result is a calibrated, instruction-following reranker trained without human annotation.
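To give a feel for the middle step, here is a generic sketch of turning pairwise votes into continuous scores with a Thurstone-style model. It is not ZeroEntropy's implementation: it assumes the simplest form of the model, where the probability that document i beats document j is the standard normal CDF of the score difference, and fits it by plain gradient ascent with a small L2 penalty. The function name, hyperparameters, and votes are all hypothetical.

```python
import math
from typing import Dict, List, Tuple


def normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def fit_thurstone(
    votes: List[Tuple[str, str]], steps: int = 500, lr: float = 0.05, l2: float = 0.01
) -> Dict[str, float]:
    """Fit one score per document from (winner, loser) votes.

    Maximizes the sum of log Phi(score[winner] - score[loser]); the L2 term keeps
    scores finite when a document wins every comparison.
    """
    scores = {doc: 0.0 for pair in votes for doc in pair}
    for _ in range(steps):
        grad = {doc: -l2 * s for doc, s in scores.items()}
        for winner, loser in votes:
            diff = scores[winner] - scores[loser]
            # Derivative of log Phi(diff) with respect to diff is pdf / cdf.
            g = normal_pdf(diff) / max(normal_cdf(diff), 1e-12)
            grad[winner] += g
            grad[loser] -= g
        for doc in scores:
            scores[doc] += lr * grad[doc]
    return scores


# Hypothetical votes for one query: each tuple is (preferred doc, other doc).
votes = (
    [("a", "b")] * 3 + [("b", "a")]
    + [("b", "c")] * 3 + [("c", "b")]
    + [("a", "c")] * 4
)
scores = fit_thurstone(votes)
print(sorted(scores, key=scores.get, reverse=True))  # ['a', 'b', 'c']
```

Scores like these would then serve as regression targets for a pointwise model, so that at inference time a single (query, document) pair is enough to produce a score.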