Pairwise Preference

Also known as: pairwise comparison, head-to-head preference

TL;DR

Pairwise preference is the supervision signal where, for a query and two candidate documents, an annotator (or LLM) picks which one is more relevant.

Pairwise preference is the supervision shape (query, doc A, doc B) → which is more relevant. The annotator (human or LLM) doesn’t commit to an absolute score for either document and just picks the better one. This sounds equivalent to scoring each document independently, but in practice the difference in noise is dramatic.

Why pairwise is less noisy than pointwise

Ask two annotators to score “how relevant is this document on a 0-1 scale?” and you’ll get wildly different numbers — different raters use different parts of the scale, drift over time, and hit bizarre attractor points (3/5 stars for everything mediocre). The variance per (query, document) score is enormous.

Ask the same two annotators “which of these two is more relevant?” and they almost always agree. The relative judgment is far more stable than the absolute one. This is true for human annotators, and it’s even more true for LLM annotators, which is why frontier-LLM-as-rater works at all.
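A toy illustration of the scale-use part of this argument, assuming two hypothetical raters who agree on the ordering of documents but map it onto different regions of a 0-1 scale. The rater functions and relevance values are invented for the example, and it deliberately ignores random per-judgment noise:

```python
# Hypothetical "true" relevance values, invented for this illustration.
true_relevance = {"doc_a": 0.9, "doc_b": 0.6, "doc_c": 0.2}

def rater_1(rel):
    # Uses the full 0-1 scale.
    return rel

def rater_2(rel):
    # Compresses every score into the middle of the scale.
    return 0.4 + 0.25 * rel

def pairwise_preferences(score_fn):
    """For every unordered pair of docs, record which one the rater prefers."""
    docs = sorted(true_relevance)
    prefs = {}
    for i, a in enumerate(docs):
        for b in docs[i + 1:]:
            a_wins = score_fn(true_relevance[a]) > score_fn(true_relevance[b])
            prefs[(a, b)] = a if a_wins else b
    return prefs

# Absolute scores disagree noticeably between the two raters...
abs_gap = max(abs(rater_1(r) - rater_2(r)) for r in true_relevance.values())
# ...but their pairwise preferences are identical.
same_prefs = pairwise_preferences(rater_1) == pairwise_preferences(rater_2)

print(f"largest absolute-score gap: {abs_gap:.3f}")
print(f"pairwise preferences identical: {same_prefs}")
```

Any monotone rescaling of the scores leaves the pairwise answers unchanged, which is one reason the relative question is more stable than the absolute one.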

Annotators disagree on absolute scores. They agree on which of two is better. zELO is built on that gap.

How preferences turn into scores

The catch with pairwise preferences: you don’t directly get pointwise scores out, just a graph of (A beats B) relationships. To produce a reranker you need scores. The bridge is a Thurstone model (or, nearly equivalently, a Bradley-Terry model), the same statistical idea behind chess ratings. Many pairwise outcomes feed a maximum-likelihood estimate (MLE) that recovers a single continuous score per document, fitted against every pairwise comparison observed.
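As a minimal sketch of how pairwise outcomes can become scores, the snippet below fits a Bradley-Terry model, where P(A beats B) = s_A / (s_A + s_B), using the standard minorization-maximization update. It uses Bradley-Terry rather than Thurstone for simplicity and is not zELO's implementation. The function name `fit_bradley_terry`, the `pseudo` regularizer (a small number of virtual wins and losses against a fixed reference item, so that documents with no wins keep a finite score), and the example data are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def fit_bradley_terry(outcomes, n_iters=500, pseudo=0.1):
    """Fit Bradley-Terry log-strengths from (winner, loser) pairs.

    Assumes winner != loser in every pair. `pseudo` adds that many virtual
    wins and losses against a reference item of strength 1.0.
    """
    wins = defaultdict(float)
    games = defaultdict(lambda: defaultdict(float))  # games[i][j] = number of i-vs-j comparisons
    for winner, loser in outcomes:
        wins[winner] += 1.0
        games[winner][loser] += 1.0
        games[loser][winner] += 1.0

    strength = {doc: 1.0 for doc in games}
    for _ in range(n_iters):
        updated = {}
        for doc, opponents in games.items():
            # Virtual games against the reference item (strength fixed at 1.0).
            denom = 2.0 * pseudo / (strength[doc] + 1.0)
            for other, n in opponents.items():
                denom += n / (strength[doc] + strength[other])
            updated[doc] = (wins[doc] + pseudo) / denom
        strength = updated
    return {doc: math.log(s) for doc, s in strength.items()}

# "a" usually beats "b", "b" usually beats "c"; "a" and "c" are never compared directly.
outcomes = [("a", "b")] * 3 + [("b", "a")] + [("b", "c")] * 3 + [("c", "b")]
scores = fit_bradley_terry(outcomes)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Note that "a" ends up above "c" even though the two were never compared: the ordering signal propagates through "b", which is what makes sparse comparison graphs workable.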

Naively you might expect that recovering scores from pairwise comparisons requires O(N²) pairs, i.e. dense coverage. The Thurstone / Bradley-Terry MLE is much more forgiving. Theoretical results on rank aggregation show that as long as the comparison graph is connected and has bounded diameter, on the order of N comparisons suffice for stable score recovery. zELO uses k-regular graphs with k = 4 (every doc is compared to four others), yielding a connected, low-diameter graph with only N·k/2 edges. That’s 200 comparisons for a 100-doc query instead of 4,950 for the dense case (a roughly 25× cost cut), and the recovered scores land within ~0.05 of the dense-pairwise gold standard. The MLE does the heavy lifting; the graph just has to be connected enough to propagate ordering signal.
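The edge-count arithmetic can be checked with a short sketch. The construction here, a union of edge-disjoint random Hamiltonian cycles, is one simple way to get a connected 4-regular comparison graph; it is an assumption for illustration, not necessarily how zELO samples its graphs, and `sparse_comparison_pairs` is a hypothetical helper name.

```python
import random
from math import comb

def sparse_comparison_pairs(docs, n_cycles=2, seed=0):
    """Return the edges of a connected, (2 * n_cycles)-regular comparison graph.

    Each random Hamiltonian cycle gives every doc two neighbours and keeps
    the graph connected; cycles are resampled until they share no edge.
    """
    rng = random.Random(seed)
    docs = list(docs)
    n = len(docs)
    if n < 2 * n_cycles + 1:
        raise ValueError("too few documents for this many edge-disjoint cycles")
    edges = set()
    for _ in range(n_cycles):
        while True:
            order = docs[:]
            rng.shuffle(order)
            cycle = {frozenset((order[i], order[(i + 1) % n])) for i in range(n)}
            if not (cycle & edges):  # reject cycles that repeat an existing pair
                edges |= cycle
                break
    return [tuple(sorted(edge)) for edge in edges]

docs = list(range(100))
pairs = sparse_comparison_pairs(docs)   # k = 4: each doc appears in four pairs
dense = comb(len(docs), 2)              # every unordered pair of docs
print(len(pairs), dense, round(dense / len(pairs), 2))
```

For 100 documents this yields 200 pairs against 4,950 dense pairs, a ratio of about 24.75, which matches the roughly 25× saving quoted above.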

What “pairwise” enables in practice

  • Unannotated training data at scale: frontier LLMs can produce pairwise preferences for millions of pairs cheaply, where they would struggle to produce well-calibrated absolute scores.
  • Robust to rater disagreement: different LLMs as raters can be calibrated against each other through their pairwise behavior, even if they disagree on absolute scoring.
  • Better target labels for rerankers: the pointwise scores recovered via Thurstone are smoother and better calibrated than scores you’d get by asking for absolute judgments directly.

Where pairwise supervision shows up

  • Reranker training: pairwise labels feeding Thurstone, then pointwise distillation
  • RLHF / DPO: preference pairs over LLM outputs, no absolute reward needed
  • LLM-as-judge benchmarking: pairwise A/B is much more stable than 1-10 ratings
  • Recommender systems: implicit pairwise signals from clicks (clicked > not clicked)
  • Chatbot Arena: Elo over millions of pairwise human votes

This is the entire premise behind zELO, the training methodology that produced zerank-1 and zerank-2.

Go further

If pairwise is so much better, why do most rerankers still train on pointwise labels?

Because pointwise is what the model needs to output at inference. zELO’s answer is a three-step pipeline: collect pairwise preferences (low-noise), fit Thurstone (recover continuous scores), then train pointwise (with the recovered scores as targets). You get the noise advantage of pairwise and the inference shape of pointwise.

Aren't N choose 2 pairs intractable for large candidate sets?

Yes, dense pairwise is O(N^2) per query. But the Thurstone MLE works on sparse graphs. zELO uses k-regular graphs (k=4): only 200 comparisons per 100-doc query, about 4% of the 4,950 dense pairs. The graph stays connected with a small diameter, which is enough for stable score recovery.

Does an LLM-as-rater really agree with itself on pairwise judgments?

Far more than on absolute scores. Within-rater pairwise agreement is typically 90%+; absolute-score self-agreement (rate this 0-10) is much lower. The pairwise question collapses scale-calibration ambiguity and just asks 'which one'.
