Also known as: listwise ranker, list-aware reranking
TL;DR
Listwise reranking processes the entire candidate list as a single input and produces a permutation, rather than scoring each (query, document) pair independently. More expressive but more expensive — typically powered by an LLM.
Listwise reranking is the family of methods where the model sees all candidate documents at once and produces an ordered ranking, rather than scoring each pair independently. The model can compare candidates to each other directly — “this one beats that one because…” — rather than committing to absolute scores.
In recent practice, listwise rerankers are most often LLMs prompted with a list of candidates and asked to return them in relevance order. RankGPT (Sun et al., 2023) established the recipe and showed it competitive with dedicated cross-encoders on TREC-DL and BEIR.
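A minimal sketch of the prompt-and-parse loop described above, assuming a RankGPT-style output format of bracketed identifiers such as `[2] > [1] > [3]`. The prompt wording and the function names `build_listwise_prompt` and `parse_permutation` are illustrative assumptions, not the exact text used in the paper; the LLM call itself is left out.

```python
import re

def build_listwise_prompt(query: str, docs: list[str]) -> str:
    """Number each candidate and ask for an ordering such as '[2] > [1] > [3]'."""
    lines = [
        f"Rank the {len(docs)} passages below by relevance to the query.",
        f"Query: {query}",
        "",
    ]
    for i, doc in enumerate(docs, start=1):
        lines.append(f"[{i}] {doc}")
    lines.append("")
    lines.append("Answer with identifiers only, most relevant first, e.g. [2] > [1] > [3].")
    return "\n".join(lines)

def parse_permutation(response: str, n: int) -> list[int]:
    """Turn the model's reply into a full 0-based ordering of n candidates.

    Out-of-range and repeated identifiers are dropped. Candidates the model
    left out are appended in their original order, so the result is always
    a valid permutation.
    """
    order: list[int] = []
    seen: set[int] = set()
    for match in re.findall(r"\[(\d+)\]", response):
        idx = int(match) - 1
        if 0 <= idx < n and idx not in seen:
            seen.add(idx)
            order.append(idx)
    order.extend(i for i in range(n) if i not in seen)
    return order
```

The defensive parsing matters in practice: a generative model can skip, repeat, or invent identifiers, and the caller still needs a complete ordering.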
A frontier LLM with a 200K-token context can fit maybe 50 to 100 medium-length documents alongside the query. For top-K reranking over a 1,000-candidate first-pass output, you can’t fit them all. Instead you rerank a window of, say, 30 documents at a time, then slide the window along the list with some overlap, so that the best documents from one window are carried into the next.
The trouble: each window only knows about the documents inside it. A document that ranked 5th in window A might be globally better than the 1st-ranked document of window B, but the model never compared them directly. RankGPT proposed a “bubble-sort” sliding pattern that mitigates this: the window starts at the bottom of the first-pass list and moves toward the top, so the strongest candidates seen so far pass through every later window. You still pay multiple LLM calls, and the global ordering is approximate.
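The sliding pattern can be sketched as follows. This is an illustration under stated assumptions, not RankGPT's reference code: `rank_window` is a hypothetical callable standing in for one LLM call, and the defaults of 20 documents per window with a step of 10 are example values.

```python
from typing import Callable, Sequence

# Hypothetical interface: given a query and a window of documents, return the
# window's 0-based indices in relevance order. In practice this wraps one LLM call.
RankWindow = Callable[[str, Sequence[str]], list[int]]

def sliding_window_rerank(
    query: str,
    docs: Sequence[str],
    rank_window: RankWindow,
    window: int = 20,
    step: int = 10,
) -> list[str]:
    """One back-to-front pass of overlapping windows over the candidate list."""
    if not 0 < step <= window:
        raise ValueError("step must be in 1..window so that no document is skipped")
    ranked = list(docs)
    end = len(ranked)
    while end > 0:
        start = max(0, end - window)
        chunk = ranked[start:end]
        order = rank_window(query, chunk)
        # Reorder only this window; its best documents now sit at its front,
        # where the next (higher) window overlaps and picks them up.
        ranked[start:end] = [chunk[i] for i in order]
        if start == 0:
            break
        end -= step
    return ranked
```

With a perfectly accurate `rank_window`, one pass brings the best `window - step` documents to the front of the list. Everything below that is only partially ordered, which is the approximation described above.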
Listwise vs pointwise vs pairwise
Pointwise: score each (query, document) pair independently. The standard cross-encoder approach. Simple and parallelizable.
Pairwise: given (query, doc A, doc B), pick which is more relevant. The training signal behind zELO.
Listwise: given the full list, produce a permutation. Most context-aware; least parallelizable.
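The three formulations differ mainly in the shape of the model call. The sketch below makes that concrete; the callables `score`, `prefer`, and `permute` are hypothetical stand-ins for whatever model sits behind each approach, and the pairwise version assumes the preferences are consistent enough to sort on.

```python
from functools import cmp_to_key
from typing import Callable, Sequence

def rerank_pointwise(query: str, docs: Sequence[str],
                     score: Callable[[str, str], float]) -> list[str]:
    # One independent score per document, so calls can run in parallel.
    return sorted(docs, key=lambda d: score(query, d), reverse=True)

def rerank_pairwise(query: str, docs: Sequence[str],
                    prefer: Callable[[str, str, str], bool]) -> list[str]:
    # prefer(query, a, b) is True when a is judged more relevant than b.
    def compare(a: str, b: str) -> int:
        if prefer(query, a, b):
            return -1
        if prefer(query, b, a):
            return 1
        return 0
    return sorted(docs, key=cmp_to_key(compare))

def rerank_listwise(query: str, docs: Sequence[str],
                    permute: Callable[[str, Sequence[str]], list[int]]) -> list[str]:
    # A single call sees every candidate and returns a permutation of indices.
    order = permute(query, docs)
    return [docs[i] for i in order]
```

Pointwise makes one call per document, a comparison sort makes on the order of n log n pairwise calls, and listwise makes one call whose input grows with the whole list.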
Where listwise wins
When candidates are similar in quality — pointwise scorers get noisy near the top of close lists; listwise can reason about subtle differences.
When the user wants a coherent ordering — listwise sees diversity / redundancy and can pick a varied set, not 5 near-duplicates.
When the LLM is already in the loop — at the cost of one prompt you avoid an extra dedicated reranker model.
Where it loses
Latency: running an LLM listwise on 100 candidates is much slower than a cross-encoder reranker.
Cost: frontier-LLM per-token pricing versus ~$0.025/1M tokens for a specialized reranker, roughly two to three orders of magnitude in cost per rerank.
Calibration: listwise output is a permutation, not a score. You can’t threshold “anything below 0.6 is irrelevant” because there are no scores.
Context length limits: most LLMs can’t fit hundreds of long candidate documents in a single prompt, so you fall back to sliding-window listwise reranking, which gives up the single global comparison.
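A back-of-the-envelope version of the cost comparison, as a sketch only. The reranker price is the figure quoted above. The LLM price, the workload of 100 candidates at 300 tokens each, and the 20/10 window settings are placeholder assumptions chosen for illustration, so substitute real numbers before drawing conclusions. Query and output tokens are ignored.

```python
import math

def num_windows(n_docs: int, window: int, step: int) -> int:
    """LLM calls needed for one sliding pass with overlapping windows."""
    if n_docs <= window:
        return 1
    return math.ceil((n_docs - window) / step) + 1

def cost_usd(tokens: int, price_per_million: float) -> float:
    return tokens * price_per_million / 1_000_000

N_DOCS, TOKENS_PER_DOC = 100, 300   # assumed workload
WINDOW, STEP = 20, 10               # assumed sliding-window settings
RERANKER_PRICE = 0.025              # $/1M tokens, figure quoted in the text
LLM_PRICE = 3.00                    # $/1M tokens, PLACEHOLDER assumption

# The specialized reranker reads every candidate once.
reranker_tokens = N_DOCS * TOKENS_PER_DOC
# Overlapping windows mean the LLM reads most candidates more than once.
llm_tokens = num_windows(N_DOCS, WINDOW, STEP) * WINDOW * TOKENS_PER_DOC

ratio = cost_usd(llm_tokens, LLM_PRICE) / cost_usd(reranker_tokens, RERANKER_PRICE)
print(f"LLM listwise costs about {ratio:.0f}x the specialized reranker")
```

The ratio scales linearly with the assumed LLM price and with the overlap between windows, so a pricier model or a smaller step pushes it higher.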
Notable listwise rerankers
RankGPT (Sun et al., 2023) — the canonical zero-shot listwise prompt; established the sliding-window pattern.
RankVicuna and RankZephyr — open-weight listwise rerankers distilled from RankGPT supervision.
LRL (Listwise Reranker with a Large Language Model; Ma et al., 2023): a zero-shot listwise prompting method published around the same time as RankGPT.
Cohere Rerank v3 — production-API reranker that exposes a listwise mode for low-volume premium ranking.
When to actually use it
Production systems that benefit from listwise are typically lower-volume / higher-stakes (legal review, medical literature search) where the absolute-best ordering matters more than per-query latency. For high-volume RAG, a calibrated cross-encoder beats listwise on most metrics including end-to-end answer quality.
Listwise wins where ordering matters more than throughput.
Go further
How does listwise compare to pointwise in production?
On most production RAG benchmarks, a calibrated pointwise [cross-encoder](/concepts/cross-encoder/) matches or beats listwise LLM rerankers at 100×-1000× lower cost and latency. Listwise wins on small, high-stakes corpora where ordering matters more than throughput.
Why can't you threshold listwise output?
By construction, listwise produces a permutation (ranking) rather than a per-pair score. There's no anchor that says “rank 5 is the cutoff for irrelevant”, only that rank 5 was preferred over rank 6 within this query. Use [pointwise scoring](/concepts/pointwise-scoring/) when you need calibration.
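The practical consequence can be sketched in a few lines. The helper names and the 0.6 threshold are illustrative assumptions, and the scores are made up for the example.

```python
def filter_by_score(scored: list[tuple[str, float]], threshold: float) -> list[str]:
    """Pointwise: keep only documents whose score clears the threshold."""
    ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [doc for doc, score in ordered if score >= threshold]

def take_top_k(ranking: list[str], k: int) -> list[str]:
    """Listwise: the only available cut is positional."""
    return ranking[:k]

# A query for which nothing in the candidate set is actually relevant.
scored = [("doc_a", 0.21), ("doc_b", 0.08), ("doc_c", 0.15)]
ranking = ["doc_a", "doc_c", "doc_b"]  # the same preferences as a permutation

kept_pointwise = filter_by_score(scored, threshold=0.6)  # nothing survives
kept_listwise = take_top_k(ranking, k=2)                 # always returns k documents
```

The permutation is still a correct ordering, but it carries no signal that even the top-ranked document is a poor match.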
Can pairwise preferences give you the best of both?
Yes — pairwise data is much less noisy than absolute scores, and a [Thurstone fit](/concepts/thurstone-model/) over a sparse pairwise graph recovers calibrated pointwise scores you can use to train a fast pointwise reranker. That's the [zELO](/concepts/zelo/) recipe.