Reranker

Also known as: re-ranker, rerank model, second-stage ranker

TL;DR

A reranker is a second-stage retrieval model that takes a candidate set from first-pass retrieval and reorders it by relevance. It's how production search systems get high precision without paying full LLM cost on every query.

A reranker is a model that reorders a list of candidate documents for a given query, putting the most relevant ones first. It’s the second stage of a typical retrieval pipeline. The first stage (BM25, dense embeddings, or hybrid) casts a wide net and returns hundreds of candidates per query — fast but imprecise. The reranker then scores each (query, document) pair more carefully and reorders them.

Figure: Reranker. Re-score the candidates, jointly. Query: "how do I sign back into my account?" Panel 1, first-pass top-7 (bi-encoder cosine order): D14 0.86, D27 0.83, D55 0.81, D08 0.78, D62 0.74, D33 0.70, D91 0.67. Panel 2, reranker: each (q, dᵢ) pair is scored jointly by a cross-encoder (q [SEP] d, N layers of joint attention, then CLS, linear layer, σ) to give score(q, dᵢ). Panel 3, reranked top-7 (cross-encoder score order): D08 0.96, D33 0.89, D14 0.61, D91 0.52, D55 0.42, D27 0.34, D62 0.28. The relevant doc (D08, "I can't log in…") was already in the candidate set; the reranker just put it on top. Ordering is the last mile, and NDCG@10 is where it shows up.

The right division of labor is fast and approximate first, slow and precise second. Rerankers exist because that asymmetry is the only way to ship retrieval at production scale.

Why a separate stage?

You could use a single large model for both stages, but the cost would be prohibitive. First-pass retrieval has to consider millions of documents per query; a reranker only sees a few hundred. So the right division of labor is “fast and approximate” first, “slow and precise” second: because the candidate set is so much smaller, the reranker can afford to spend far more compute on each pair.
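
As a rough illustration of this division of labor, the sketch below wires a cheap first pass to a more careful second pass. Everything in it is hypothetical: the corpus, `first_pass`, `rerank`, and `toy_pair_scorer` are illustrative names, and the Jaccard scorer only stands in for a real cross-encoder.

```python
from typing import Callable

# Hypothetical three-document corpus, for illustration only.
CORPUS = {
    "D1": "password rules password length password history and how to reset",
    "D2": "reset password",
    "D3": "billing invoices",
}

def first_pass(query: str, corpus: dict[str, str], k: int) -> list[str]:
    """Cheap stage: count query-term occurrences in every document, keep the top k."""
    q_terms = set(query.lower().split())
    hits = {
        doc_id: sum(1 for tok in text.lower().split() if tok in q_terms)
        for doc_id, text in corpus.items()
    }
    return sorted(corpus, key=lambda doc_id: hits[doc_id], reverse=True)[:k]

def toy_pair_scorer(query: str, document: str) -> float:
    """Stand-in for a cross-encoder: Jaccard overlap of the two term sets."""
    q, d = set(query.lower().split()), set(document.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

def rerank(
    query: str,
    candidates: list[str],
    corpus: dict[str, str],
    score_pair: Callable[[str, str], float],
) -> list[tuple[str, float]]:
    """Precise stage: score only the surviving candidates, then sort by score."""
    scored = [(doc_id, score_pair(query, corpus[doc_id])) for doc_id in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

candidates = first_pass("reset password", CORPUS, k=2)
reranked = rerank("reset password", candidates, CORPUS, toy_pair_scorer)
```

In this toy case the first pass favors the long, repetitive document, and the second pass moves the short exact match to the top. The expensive scorer never touches documents outside the candidate list.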

What kinds of rerankers are there?

The dominant architecture today is a cross-encoder: the query and document are concatenated and fed jointly into a transformer, producing a single relevance score per pair. This is much more accurate than a bi-encoder (which embeds the query and the document separately) but slower per pair. Because a reranker only scores a few hundred pairs per query at most, rather than millions, the cost is manageable.
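
The structural difference can be sketched in a few lines. This is a simplified illustration, not any particular model's implementation: the transformer itself is omitted, the BERT-style `[CLS]` and `[SEP]` markers are an assumption, and the function names, weights, and vectors are hypothetical.

```python
import math
from typing import Sequence

def bi_encoder_score(query_vec: Sequence[float], doc_vec: Sequence[float]) -> float:
    """Bi-encoder: each side was embedded on its own; compare with cosine similarity."""
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    norm = math.sqrt(sum(q * q for q in query_vec)) * math.sqrt(sum(d * d for d in doc_vec))
    return dot / norm if norm else 0.0

def cross_encoder_input(query: str, document: str) -> str:
    """Cross-encoder: one joint sequence, so attention can mix query and document tokens."""
    return f"[CLS] {query} [SEP] {document} [SEP]"

def cross_encoder_head(
    cls_state: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """Map the final [CLS] state to a score between 0 and 1: linear layer, then sigmoid."""
    logit = sum(w * x for w, x in zip(weights, cls_state)) + bias
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)
```

The bi-encoder path lets document vectors be computed once and cached, which is why it suits the first pass. The cross-encoder path needs a fresh forward pass for every (query, document) pair, which is why it is reserved for a short candidate list.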

Rerankers also differ in how they score: pointwise rerankers score each pair independently, while listwise rerankers see the whole candidate set at once and produce a permutation. LLM-as-reranker is a recent variation that prompts a frontier LLM; it tends to be slower and more expensive than a specialized cross-encoder, at equal or lower accuracy.

Reranker shapes in the wild
  • Cross-encoder, pointwise — zerank-2, Cohere rerank-3, Voyage rerank-2.5. The production default.
  • Listwise prompted LLM — RankGPT, RankZephyr. Best accuracy on benchmarks; latency rules them out for most production traffic.
  • Late-interaction (ColBERT) — sits between bi-encoder and cross-encoder. Token-level embeddings with maxsim aggregation.
  • Cascade rerankers — cheap pointwise pass narrows top-100 to top-20, expensive listwise pass orders the survivors.
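
The maxsim aggregation in the late-interaction entry above can be made concrete. A minimal sketch, assuming token embeddings have already been computed (the vectors in the usage lines are made up): for each query token, take its best-matching document token by dot product, then sum those maxima.

```python
from typing import Sequence

Vector = Sequence[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Late-interaction score: sum over query tokens of the max dot product
    against any document token."""
    if not doc_tokens:
        return 0.0
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Hypothetical 2-dimensional token embeddings.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [0.5, 0.5]]
score = maxsim(query_tokens, doc_tokens)
```

Document token embeddings can be precomputed, as with a bi-encoder, but the query still interacts with the document at token level, which is what places this design between the two.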

Should you add a reranker at all? The honest answer is “almost always, if your top-1 quality matters.” A typical cross-encoder reranker adds 30-150 ms to a retrieval pipeline that already takes 50-200 ms. The win on NDCG@10 is usually 10-30 absolute points on production traffic where the first pass already returns plausible candidates. That’s a large accuracy gain for a sub-second latency cost.

The exceptions are narrow. (1) Pure-keyword domains where BM25 is already top-1 accurate — legal citations, SKU lookup, exact-match code search. (2) Workloads where p99 latency is a hard constraint below 200ms total. (3) Workloads where downstream consumers tolerate noisy top-k (some classification or summarization use cases). For everything else — RAG over docs, customer-support search, e-commerce product retrieval — the reranker is the single highest-leverage addition you can make.

The reason a small reranker can outperform a much larger LLM is specialization. A reranker is asked one narrow question: “given this query and this document, how relevant is the document?” It doesn’t need to write code, do math, or hold long-form context. A 0.1-1B-parameter cross-encoder, fine-tuned on retrieval supervision, beats a 70B+ frontier LLM at this task while running 100-1000× cheaper.

This is the central thesis of specialized-model retrieval: production AI is a constellation of small fine-tuned specialists, not one giant generalist. zerank-2 is 0.5B parameters and beats every commercial reranker on the BEIR benchmark by a wide margin. The size difference isn’t a tradeoff — it’s an advantage. Smaller models are easier to specialize, deploy, and serve at low latency.

When does it help most?

Rerankers shine when first-pass retrieval surfaces many plausible candidates but the actually-relevant document isn’t necessarily at the top. That describes nearly every production RAG pipeline. The accuracy gain from reranking compounds through the rest of the stack: a better top-k gives the LLM better context, and better context produces better answers.

Figure: Reranker lift, NDCG@10. The last stage is where the points come from. NDCG@10 on a BEIR-style average: BM25 alone (sparse lexical retrieval) 0.45; dense embedder alone (bi-encoder cosine) 0.52; hybrid (BM25 + dense, reciprocal rank fusion) 0.58; hybrid + reranker (zerank-1 on the top-100) 0.71, a lift of +0.13. Stylized BEIR-like averages; absolute numbers vary by domain. The qualitative gap, with the reranker adding ~10-15 NDCG points on top of any first pass, is consistent.
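
The hybrid baseline in the figure is labeled reciprocal rank fusion. A minimal sketch of that fusion step, assuming the commonly used smoothing constant k = 60 and two hypothetical ranked lists; the function name and inputs are illustrative, not taken from any specific library.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each list contributes 1 / (k + rank) per document,
    with ranks starting at 1. Documents missing from a list get nothing from it."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical outputs of a lexical and a dense first pass.
bm25_order = ["D14", "D08", "D33"]
dense_order = ["D08", "D33", "D14"]
fused = reciprocal_rank_fusion([bm25_order, dense_order])
```

Because only ranks are used, the two retrievers' raw scores never need to be put on a common scale. The fused list is what would then be handed to the reranker.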

What ZeroEntropy ships

zerank-2 is our flagship reranker — calibrated, multilingual (100+ languages), instruction-following, and 50% cheaper than every other commercial reranker. See the Rerankers product page for the full specs.

Go further

How do I evaluate a reranker on my own data?

Build a small eval set (50-200 queries with graded-relevance judgments on top-100 candidates), measure NDCG@10 and recall@10 with and without the reranker, and compare across models. Don't rely on public benchmarks — they rarely match the shape of production traffic.
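
A minimal sketch of the two metrics, assuming linear gain and a log2 rank discount for NDCG (some implementations use 2^grade - 1 as the gain instead). The relevance grades below are hypothetical; the two orderings reuse the document IDs from the figure near the top of this page.

```python
import math

def dcg_at_k(gains: list[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids: list[str], relevance: dict[str, float], k: int = 10) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering of judged docs."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def recall_at_k(ranked_ids: list[str], relevance: dict[str, float], k: int = 10) -> float:
    """Fraction of relevant docs (grade > 0) that appear in the top k."""
    relevant = {doc_id for doc_id, grade in relevance.items() if grade > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)

# Hypothetical graded judgments for one query (higher is more relevant).
judgments = {"D08": 3, "D33": 2, "D14": 1}
first_pass_order = ["D14", "D27", "D55", "D08", "D62", "D33", "D91"]
reranked_order = ["D08", "D33", "D14", "D91", "D55", "D27", "D62"]

ndcg_before = ndcg_at_k(first_pass_order, judgments, k=10)
ndcg_after = ndcg_at_k(reranked_order, judgments, k=10)
recall_before = recall_at_k(first_pass_order, judgments, k=10)
recall_after = recall_at_k(reranked_order, judgments, k=10)
```

Averaging these per-query values over the whole eval set gives the numbers to compare across models. In this example recall@10 is identical before and after, since reranking only reorders the candidates, while NDCG@10 rises; that is the usual pattern when the cutoff covers the whole candidate list.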

What's the difference between a cross-encoder and a bi-encoder reranker?

Cross-encoders see the query and document jointly in one forward pass, capturing token-level interactions — much more accurate but more expensive per pair. Bi-encoders embed each separately and compare via dot product — fast enough for first-pass but typically not used as the reranking stage.

How was zerank-2 trained?

Via the [zELO](/concepts/zelo/) methodology: frontier LLMs vote pairwise on relevance, a Thurstone fit recovers continuous Elo-style scores, and a small specialized cross-encoder learns to predict those scores pointwise. The result is a calibrated, instruction-following reranker trained without human annotation.
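
To give a feel for the middle step, here is a simplified Thurstone-style fit over pairwise votes. It is a sketch under stated assumptions, not ZeroEntropy's actual training pipeline: the vote counts are made up, `thurstone_scores` is a hypothetical name, and the estimator is a basic Case V-style average of inverse-normal win rates.

```python
from statistics import NormalDist

def thurstone_scores(wins: list[list[int]], eps: float = 0.01) -> list[float]:
    """wins[i][j] = number of votes preferring document i over document j.
    Returns one continuous score per document (higher means more relevant)."""
    n = len(wins)
    inv_cdf = NormalDist().inv_cdf  # inverse of the standard normal CDF
    scores = []
    for i in range(n):
        z_total = 0.0
        for j in range(n):
            if i == j:
                continue
            total = wins[i][j] + wins[j][i]
            p = wins[i][j] / total if total else 0.5
            # Clip so unanimous votes do not map to infinite scores.
            p = min(max(p, eps), 1.0 - eps)
            z_total += inv_cdf(p)
        scores.append(z_total / (n - 1) if n > 1 else 0.0)
    return scores

# Hypothetical vote counts from ten pairwise comparisons per document pair.
votes = [
    [0, 9, 8],
    [1, 0, 7],
    [2, 3, 0],
]
scores = thurstone_scores(votes)
```

Scores like these would then serve as regression targets for the pointwise student model, which learns to predict them from a single (query, document) pair.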
