Listwise Reranking

Also known as: listwise ranker, list-aware reranking

TL;DR

Listwise reranking processes the entire candidate list as a single input and produces a permutation, rather than scoring each (query, document) pair independently. More expressive but more expensive — typically powered by an LLM.

Listwise reranking is the family of methods in which the model sees all candidate documents at once and produces an ordered ranking. The model can compare candidates to each other directly (“this one beats that one because…”) rather than committing to absolute scores.

[Figure: Listwise reranking. Score each candidate in isolation, or read the whole list at once. Pointwise: model(q, dᵢ) → score, k forward passes, output sorted by score. Listwise: model(q, [d₁…dₖ]) → permutation, one forward pass over the whole list with cross-document context. Six candidates arrive, three of them near-duplicates of the strongest hit; the pointwise output ranks the three near-duplicates first, while the listwise output spreads them through the ranking.]

In recent practice, listwise rerankers are most often LLMs prompted with a list of candidates and asked to return them in relevance order. RankGPT (Sun et al., 2023) established the recipe and showed it competitive with dedicated cross-encoders on TREC-DL and BEIR.
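
The prompt-and-parse loop behind this recipe can be sketched as follows. This is a minimal illustration, not RankGPT's actual prompt: the wording of the instructions, the `[2] > [1] > [3]` answer format, and the helper names `build_listwise_prompt` and `parse_permutation` are all assumptions made for the example. The LLM call itself is left out, since it depends on the provider.

```python
import re


def build_listwise_prompt(query: str, docs: list[str]) -> str:
    """Number each candidate and ask for an ordering such as "[2] > [1] > [3]"."""
    numbered = [f"[{i}] {doc}" for i, doc in enumerate(docs, start=1)]
    return (
        f"Rank the {len(docs)} passages below by relevance to the query.\n"
        f"Query: {query}\n\n"
        + "\n".join(numbered)
        + "\n\nAnswer with identifiers only, most relevant first, e.g. [2] > [1] > [3]."
    )


def parse_permutation(reply: str, n: int) -> list[int]:
    """Convert the model's reply into 0-based indices.

    LLM output is not guaranteed to be a valid permutation, so repeated and
    out-of-range identifiers are dropped, and any candidate the model left
    out is appended in its original order.
    """
    order: list[int] = []
    for token in re.findall(r"\[(\d+)\]", reply):
        idx = int(token) - 1
        if 0 <= idx < n and idx not in order:
            order.append(idx)
    order.extend(i for i in range(n) if i not in order)
    return order
```

The defensive parsing matters in practice: a reply that repeats or omits an identifier would otherwise silently drop candidates from the final ranking.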

A frontier LLM with a 200K-token context can fit maybe 50 to 100 medium-length documents alongside the query. For top-K reranking over a 1,000-candidate first-pass output, you can’t fit them all. Instead you slide a window of, say, 30 documents at a time: rerank within the window, then move to the next window, carrying the top of the previous window into it.

The trouble: each window only knows about the documents inside it. A document that ranked 5th in window A might be globally better than the 1st-ranked document of window B, but the model never compared them directly. RankGPT proposed a “bubble-sort” sliding pattern that mitigates this: overlapping windows start at the bottom of the first-pass list and move toward the top, so the strongest candidates seen so far are carried into every later window. You still pay multiple LLM calls, and the global ordering is approximate.
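
A rough sketch of the sliding pattern, under a few assumptions: `rank_window` stands in for one LLM call and returns a permutation of positions within the window, the `window=20` and `step=10` defaults are illustrative values rather than a recommendation, and the pass starts at the bottom of the first-pass list and moves toward the top so that strong candidates bubble upward. `toy_ranker` is a hypothetical stand-in that sorts numeric strings so the example runs without a model.

```python
from typing import Callable, Sequence

# One "LLM call": takes the query and the window's documents, returns a
# permutation of window positions, best first.
Ranker = Callable[[str, Sequence[str]], list[int]]


def sliding_window_rerank(
    query: str,
    docs: Sequence[str],
    rank_window: Ranker,
    window: int = 20,
    step: int = 10,
) -> tuple[list[int], int]:
    """Single back-to-front pass; returns (ranking as indices into docs, number of calls)."""
    if not 0 < step <= window:
        raise ValueError("need 0 < step <= window")
    order = list(range(len(docs)))  # start from the first-pass order
    if not order:
        return order, 0
    end, calls = len(order), 0
    while True:
        start = max(0, end - window)
        chunk = order[start:end]
        perm = rank_window(query, [docs[i] for i in chunk])
        order[start:end] = [chunk[p] for p in perm]  # reorder this window in place
        calls += 1
        if start == 0:
            break
        end -= step  # overlap: the top (window - step) docs ride into the next window
    return order, calls


def toy_ranker(query: str, window_docs: Sequence[str]) -> list[int]:
    """Hypothetical stand-in for the LLM: larger number means more relevant."""
    return sorted(range(len(window_docs)), key=lambda p: -int(window_docs[p]))
```

With 50 candidates and these defaults the pass makes four calls. Only the top `window - step` positions are guaranteed to match what a single global sort would produce; everything below them is approximate, which is the limitation described above.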

Listwise vs pointwise vs pairwise

  • Pointwise — score each (query, document) pair independently. The standard cross-encoder approach. Simple and parallelizable.
  • Pairwise — given (query, doc A, doc B), pick which is more relevant. The training signal behind zELO.
  • Listwise — given the full list, produce a permutation. Most context-aware; least parallelizable.
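
The three interfaces differ mainly in what the model sees per call and how many calls a list of k candidates needs. The sketch below makes that concrete with toy stand-ins: `overlap`, `toy_prefer`, and `toy_rank_list` are hypothetical word-overlap heuristics used only so the example runs, and the all-pairs win count in `pairwise_rank` is one simple aggregation among several possible ones.

```python
from itertools import combinations


def pointwise_rank(query, docs, score):
    """k independent calls: score(query, doc) -> float."""
    scores = [score(query, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: -scores[i])


def pairwise_rank(query, docs, prefer):
    """k(k-1)/2 calls: prefer(query, a, b) -> True if a is more relevant than b."""
    wins = [0] * len(docs)
    for i, j in combinations(range(len(docs)), 2):
        wins[i if prefer(query, docs[i], docs[j]) else j] += 1
    return sorted(range(len(docs)), key=lambda i: -wins[i])


def listwise_rank(query, docs, rank_list):
    """One call: rank_list(query, docs) -> permutation of indices."""
    return rank_list(query, docs)


# Toy stand-ins for a model (assumptions for illustration only).
def overlap(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))


def toy_prefer(query, a, b):
    return overlap(query, a) >= overlap(query, b)


def toy_rank_list(query, docs):
    return sorted(range(len(docs)), key=lambda i: -overlap(query, docs[i]))
```

For three candidates the call counts are 3, 3, and 1; for 100 candidates they are 100, 4,950, and 1, although the single listwise call carries all 100 documents in its input.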

Where listwise wins

  • When candidates are similar in quality — pointwise scorers get noisy near the top of close lists; listwise can reason about subtle differences.
  • When the user wants a coherent ordering — listwise sees diversity / redundancy and can pick a varied set, not 5 near-duplicates.
  • When the LLM is already in the loop — at the cost of one prompt you avoid an extra dedicated reranker model.

Where it loses

  • Latency — running an LLM listwise on 100 candidates is much slower than a cross-encoder reranker.
  • Cost — frontier LLM pricing per token is far higher than the ~$0.025/1M tokens of a specialized reranker: roughly two to three orders of magnitude in cost-per-rerank.
  • Calibration — listwise output is a permutation, not a score. You can’t threshold “anything below 0.6 is irrelevant” because there are no scores.
  • Context length limits — most LLMs can’t fit hundreds of long candidate documents in a single prompt; you end up sliding-window listwise, which loses coherence.

Notable listwise rerankers

  • RankGPT (Sun et al., 2023) — the canonical zero-shot listwise prompt; established the sliding-window pattern.
  • RankVicuna and RankZephyr — open-weight listwise rerankers distilled from RankGPT supervision.
  • LRL (Listwise Reranker with a Large Language Model; Ma et al., 2023) — zero-shot listwise prompting, published around the same time as RankGPT.
  • Cohere Rerank v3 — production-API reranker that exposes a listwise mode for low-volume premium ranking.

When to actually use it

Production systems that benefit from listwise reranking are typically lower-volume and higher-stakes (legal review, medical literature search), where the best possible ordering matters more than per-query latency. For high-volume RAG, a calibrated cross-encoder matches or beats listwise on most metrics, including end-to-end answer quality.

Listwise wins where ordering matters more than throughput.

Go further

How does listwise compare to pointwise in production?

On most production RAG benchmarks, a calibrated pointwise [cross-encoder](/concepts/cross-encoder/) matches or beats listwise LLM rerankers at 100×-1000× lower cost and latency. Listwise wins on small, high-stakes corpora where ordering matters more than throughput.

Why doesn't listwise output a calibrated score?

By construction, listwise produces a permutation (ranking) rather than a per-pair score. There's no anchor that says “rank 5 is the cutoff for irrelevant”, only that rank 5 was preferred over rank 6 within this query. Use [pointwise scoring](/concepts/pointwise-scoring/) when you need calibration.
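
A small illustration of the difference, with made-up document ids and scores: a calibrated score supports an absolute cutoff, while a permutation only supports a positional one. The function names `keep_above` and `keep_top_k` are hypothetical.

```python
def keep_above(scored: list[tuple[str, float]], threshold: float) -> list[str]:
    """Pointwise output: drop anything whose score falls under the threshold."""
    return [doc_id for doc_id, score in scored if score >= threshold]


def keep_top_k(permutation: list[str], k: int) -> list[str]:
    """Listwise output: the only available cutoff is a position in the list."""
    return permutation[:k]


# Illustrative values, not measurements.
pointwise_output = [("d1", 0.91), ("d2", 0.64), ("d3", 0.22)]
listwise_output = ["d1", "d2", "d3"]
```

When a query has no relevant candidates at all, `keep_above` can return an empty list, whereas `keep_top_k` still returns k documents because the permutation says nothing about absolute relevance.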

Can pairwise preferences give you the best of both?

Yes — pairwise data is much less noisy than absolute scores, and a [Thurstone fit](/concepts/thurstone-model/) over a sparse pairwise graph recovers calibrated pointwise scores you can use to train a fast pointwise reranker. That's the [zELO](/concepts/zelo/) recipe.
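
To make the idea of a Thurstone fit concrete, here is a minimal sketch of Thurstone's Case V model, where the probability that item i beats item j is Φ(sᵢ − sⱼ). It is not the zELO implementation: the plain gradient ascent, the small L2 penalty (added so that items with no losses keep finite scores), the mean-centering, and every name and hyperparameter below are assumptions chosen for illustration.

```python
import math


def _pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def _cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))


def thurstone_fit(
    n_items: int,
    comparisons: list[tuple[int, int]],  # (winner, loser) index pairs
    steps: int = 500,
    lr: float = 0.1,
    l2: float = 0.01,
) -> list[float]:
    """Maximize sum(log Phi(s_winner - s_loser)) - (l2 / 2) * sum(s_i^2) by gradient ascent."""
    s = [0.0] * n_items
    for _ in range(steps):
        grad = [-l2 * x for x in s]
        for winner, loser in comparisons:
            d = s[winner] - s[loser]
            g = _pdf(d) / max(_cdf(d), 1e-12)  # derivative of log Phi(d)
            grad[winner] += g
            grad[loser] -= g
        s = [x + lr * g for x, g in zip(s, grad)]
        mean = sum(s) / n_items
        s = [x - mean for x in s]  # scores are only defined up to a shift
    return s


# Sparse graph: items 0 and 2 are never compared directly.
example = [(0, 1), (0, 1), (0, 1), (1, 0), (1, 2), (1, 2), (1, 2), (2, 1)]
scores = thurstone_fit(3, example)
```

Even though items 0 and 2 never meet, the fit places them on a common scale through their comparisons with item 1, giving scores ordered 0 > 1 > 2. Scores recovered this way could then serve as regression targets for a pointwise model.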
