Listwise Reranking

Also known as: listwise ranker, list-aware reranking

TL;DR

Listwise reranking processes the entire candidate list as a single input and produces a permutation, rather than scoring each (query, document) pair independently. It is more expressive than pointwise scoring but more expensive, and it is typically powered by an LLM.

Listwise reranking is the family of methods where the model sees all candidate documents at once and produces an ordered ranking, rather than scoring each pair independently. The model can compare candidates to each other directly — “this one beats that one because…” — rather than committing to absolute scores.

In recent practice, listwise rerankers are most often LLMs prompted with a list of candidates and asked to return them in relevance order. RankGPT (Sun et al., 2023) established the recipe and showed it competitive with dedicated cross-encoders on TREC-DL and BEIR.
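The prompt-and-parse step can be sketched as follows. This is a minimal illustration, not RankGPT's actual prompt: the wording, the `[i]` identifier format, and the helper names `build_listwise_prompt` and `parse_permutation` are assumptions made for this example, and the LLM call itself is left out.

```python
import re

def build_listwise_prompt(query: str, docs: list[str]) -> str:
    """Number each candidate and ask for a full ordering (hypothetical wording)."""
    lines = [
        f"Rank the {len(docs)} passages below by relevance to the query.",
        f"Query: {query}",
        "",
    ]
    for i, doc in enumerate(docs, start=1):
        lines.append(f"[{i}] {doc}")
    lines.append("")
    lines.append("Answer with identifiers only, most relevant first, e.g. [2] > [1] > [3].")
    return "\n".join(lines)

def parse_permutation(response: str, n: int) -> list[int]:
    """Turn a reply such as '[2] > [1]' into 0-based indices, repairing bad output."""
    seen: list[int] = []
    for match in re.findall(r"\[(\d+)\]", response):
        idx = int(match) - 1
        # Drop identifiers that are out of range or repeated.
        if 0 <= idx < n and idx not in seen:
            seen.append(idx)
    # Append anything the model omitted, keeping the first-pass order.
    missing = [i for i in range(n) if i not in seen]
    return seen + missing
```

The repair step matters in practice because a generated permutation can skip, repeat, or invent identifiers; falling back to the first-pass order for omitted items is one reasonable choice, not the only one.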

Context length caps how many candidates the model can see at once. A frontier LLM with a 200K-token context can fit perhaps 50 to 100 medium-length documents alongside the query, so the 1,000 candidates from a first-pass retriever will not fit in one prompt. The usual workaround is a sliding window: rerank a window of, say, 30 documents, then move the window along the list, carrying the top of the previous window into the next.

The trouble is that each window only knows about the documents inside it. A document ranked 5th in window A might be globally better than the document ranked 1st in window B, but the model never compared them directly. RankGPT proposed a “bubble-sort” sliding pattern that mitigates this: the windows overlap and move from the bottom of the list toward the top, so the current top-K is carried through every window. You still pay for multiple LLM calls, and the global ordering is only approximate.
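A single sliding-window pass might look like the sketch below. It assumes a back-to-front pass with overlapping windows, in the spirit of the RankGPT pattern; `sliding_window_rerank`, the `rank_window` callback, and the `toy_rank` stand-in (which replaces the LLM call) are hypothetical names for this example.

```python
from typing import Callable, Sequence

# rank_window(query, docs) returns positions within docs, best first.
RankFn = Callable[[str, Sequence[str]], list[int]]

def sliding_window_rerank(query: str, docs: Sequence[str], rank_window: RankFn,
                          window: int = 30, step: int = 15) -> list[int]:
    """One back-to-front pass; overlapping windows carry winners toward the front."""
    if not 0 < step <= window:
        raise ValueError("step must be in 1..window")
    order = list(range(len(docs)))  # current ranking, as indices into docs
    if not order:
        return order
    end = len(order)
    while True:
        start = max(0, end - window)
        chunk = order[start:end]
        perm = rank_window(query, [docs[i] for i in chunk])
        order[start:end] = [chunk[p] for p in perm]  # reorder only this window
        if start == 0:
            break
        end -= step  # the overlap (window - step) is what carries the top forward
    return order

def toy_rank(query: str, window_docs: Sequence[str]) -> list[int]:
    # Stand-in for the LLM: pretend each document's text is its relevance score.
    return sorted(range(len(window_docs)), key=lambda i: -float(window_docs[i]))

docs = [str(i) for i in range(100)]  # the best document starts in last place
order = sliding_window_rerank("q", docs, toy_rank, window=10, step=5)
```

With this toy ranker, one pass lifts the best `window - step` documents to the front, but the positions just behind them are not globally sorted. That is the approximation the text describes, and each window is a separate LLM call.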

Listwise vs pointwise vs pairwise

Pointwise — score each (query, document) independently. The standard cross-encoder approach. Simple and parallelizable.
Pairwise — given (query, doc A, doc B), pick which is more relevant. The training signal behind zELO.
Listwise — given the full list, produce a permutation. Most context-aware; least parallelizable.
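The three interfaces differ mainly in what the ranker is allowed to see. The sketch below uses a toy word-overlap function in place of a real model; all function names are hypothetical, and the duplicate-demotion rule in `rank_listwise` is only an illustration of what list-level context makes possible.

```python
from functools import cmp_to_key

def overlap(query: str, doc: str) -> int:
    """Toy relevance: number of distinct query words that appear in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def rank_pointwise(query: str, docs: list[str]) -> list[int]:
    # Each score depends on one (query, doc) pair, so scoring parallelizes trivially.
    scores = [overlap(query, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: -scores[i])

def rank_pairwise(query: str, docs: list[str]) -> list[int]:
    # The judge only answers "is A better than B?"; the sort aggregates the answers.
    def prefer(i: int, j: int) -> int:
        return overlap(query, docs[j]) - overlap(query, docs[i])
    return sorted(range(len(docs)), key=cmp_to_key(prefer))

def rank_listwise(query: str, docs: list[str]) -> list[int]:
    # The ranker sees every candidate at once, so it can demote exact repeats.
    order: list[int] = []
    dupes: list[int] = []
    seen_texts: set[str] = set()
    for i in rank_pointwise(query, docs):
        (dupes if docs[i] in seen_texts else order).append(i)
        seen_texts.add(docs[i])
    return order + dupes

docs = ["cats purr", "cats purr", "dogs bark loudly", "cats and dogs"]
```

On this input the pointwise and pairwise rankers keep the two identical documents next to each other, while the listwise one pushes the repeat to the end.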

Where listwise wins

When candidates are similar in quality — pointwise scorers get noisy near the top of close lists; listwise can reason about subtle differences.
When the user wants a coherent ordering — listwise sees diversity / redundancy and can pick a varied set, not 5 near-duplicates.
When the LLM is already in the loop — at the cost of one prompt you avoid an extra dedicated reranker model.

Where it loses

Latency — running an LLM listwise on 100 candidates is much slower than a cross-encoder reranker.
Cost — frontier LLMs charge far more per token than a specialized reranker (~$0.025/1M tokens). The gap can reach three orders of magnitude in cost per rerank.
Calibration — listwise output is a permutation, not a score. You can’t threshold “anything below 0.6 is irrelevant” because there are no scores.
Context length limits — most LLMs can’t fit hundreds of long candidate documents in a single prompt; you end up sliding-window listwise, which loses coherence.
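The cost gap can be made concrete with back-of-the-envelope arithmetic. In the sketch below, only the $0.025/1M reranker price comes from the text; the LLM price, the workload size, and the re-read factor for overlapping windows are hypothetical values chosen for illustration.

```python
RERANKER_USD_PER_M = 0.025  # per-1M-token figure quoted in the text
LLM_USD_PER_M = 10.0        # HYPOTHETICAL frontier-LLM input price, not a quoted rate

def rerank_cost(candidates: int, tokens_per_doc: int, usd_per_million: float,
                reads_per_doc: float = 1.0) -> float:
    """Input-token cost in USD of reranking one query (output tokens ignored)."""
    tokens = candidates * tokens_per_doc * reads_per_doc
    return tokens * usd_per_million / 1_000_000

# Assumed workload: 100 candidates of about 500 tokens each.
pointwise = rerank_cost(100, 500, RERANKER_USD_PER_M)
# Overlapping windows re-read documents; 2.0 assumes each one is seen about twice.
listwise = rerank_cost(100, 500, LLM_USD_PER_M, reads_per_doc=2.0)
ratio = listwise / pointwise
```

Under these assumptions the listwise pass costs several hundred times more per query. Changing the assumed LLM price or window overlap moves the ratio proportionally, so treat the result as an order-of-magnitude estimate.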

Notable listwise rerankers

RankGPT (Sun et al., 2023) — the canonical zero-shot listwise prompt; established the sliding-window pattern.
RankVicuna and RankZephyr — open-weight listwise rerankers distilled from RankGPT supervision.
LRL (Listwise Reranker with a Large Language Model; Ma et al., 2023) — a zero-shot, prompt-based listwise reranker that returns a reordered list of document identifiers.
Cohere Rerank v3 — production-API reranker that exposes a listwise mode for low-volume premium ranking.

When to actually use it

Production systems that benefit from listwise are typically lower-volume / higher-stakes (legal review, medical literature search) where the absolute-best ordering matters more than per-query latency. For high-volume RAG, a calibrated cross-encoder beats listwise on most metrics including end-to-end answer quality.

Go further

How does listwise compare to pointwise in production?

On most production RAG benchmarks, a calibrated pointwise [cross-encoder](/concepts/cross-encoder/) matches or beats listwise LLM rerankers at 100× to 1,000× lower cost and latency. Listwise wins on small, high-stakes corpora where ordering matters more than throughput.

Related: Pointwise scoring, Cross-encoder, Reranker

Why doesn't listwise output a calibrated score?

By construction, listwise produces a permutation (ranking) rather than a per-pair score. There's no anchor that says 'rank 5 is the cutoff for irrelevant' — only that rank 5 was preferred over rank 6 within this query. Use [pointwise scoring](/concepts/pointwise-scoring/) when you need calibration.

Related: Score calibration, Pointwise scoring
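The practical consequence shows up when a query has no relevant candidates. In this small sketch (function names and the 0.6 threshold are illustrative assumptions), a score threshold can return nothing, while a permutation can only be cut at a rank.

```python
def filter_by_score(scored: list[tuple[str, float]], threshold: float) -> list[str]:
    """Pointwise output: (doc_id, score) pairs can be cut at an absolute threshold."""
    return [doc_id for doc_id, score in scored if score >= threshold]

def filter_by_rank(permutation: list[str], k: int) -> list[str]:
    """Listwise output: only a top-k cut is available, whatever the true relevance."""
    return permutation[:k]

# A query where every candidate is weak.
scored = [("a", 0.31), ("b", 0.12), ("c", 0.08)]
kept_by_score = filter_by_score(scored, threshold=0.6)
kept_by_rank = filter_by_rank(["a", "b", "c"], k=2)
```

Here the thresholded list is empty, which lets the caller decline to answer, whereas the top-k cut still passes two weak documents downstream.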

Can pairwise preferences give you the best of both?

Yes — pairwise data is much less noisy than absolute scores, and a [Thurstone fit](/concepts/thurstone-model/) over a sparse pairwise graph recovers calibrated pointwise scores you can use to train a fast pointwise reranker. That's the [zELO](/concepts/zelo/) recipe.

Related: Pairwise preference, Thurstone model, zELO
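Turning pairwise outcomes into per-document scores can be sketched with a small Thurstone-style fit. This is not the zELO implementation: it assumes unit-variance Gaussian noise on score differences, fits by plain gradient ascent, and adds a small L2 penalty (an assumption of this example) so scores stay finite on sparse data. The function name and hyperparameters are hypothetical.

```python
import math

def _pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def _cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def fit_thurstone(n_items: int, wins: list[tuple[int, int]],
                  steps: int = 2000, lr: float = 0.05, l2: float = 0.01) -> list[float]:
    """Maximize sum(log Phi(s_winner - s_loser)) - (l2 / 2) * sum(s^2)."""
    scores = [0.0] * n_items
    for _ in range(steps):
        grad = [-l2 * s for s in scores]          # gradient of the L2 penalty
        for winner, loser in wins:
            d = scores[winner] - scores[loser]
            g = _pdf(d) / max(_cdf(d), 1e-12)     # d/dd of log Phi(d)
            grad[winner] += g
            grad[loser] -= g
        scores = [s + lr * g for s, g in zip(scores, grad)]
    return scores

# (winner, loser) pairs: doc 0 usually beats doc 1, which usually beats doc 2.
wins = [(0, 1), (0, 1), (1, 0), (1, 2), (1, 2), (2, 1), (0, 2)]
scores = fit_thurstone(3, wins)
```

The fitted values live on a relative scale: only differences between them are meaningful, and they can serve as regression targets for a pointwise model. Mapping them to absolute relevance thresholds is a further step that this sketch does not cover.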
