Elo is a continuous skill rating recovered from pairwise win/loss outcomes — originally for chess, now repurposed in retrieval to convert pairwise document preferences into pointwise relevance scores.
Elo, named after Arpad Elo who designed it for chess in the 1960s, is a method for assigning each player a continuous skill rating from a sequence of pairwise game outcomes. The premise: if A beats B, A’s rating should go up and B’s should go down, in proportion to how surprising the result was. Over many games, each player’s rating converges to a stable value that predicts their probability of beating any other rated player.
Mathematically, Elo is a logistic-link variant of the Thurstone model (the form also known as the Bradley-Terry model). With A and B denoting the two players' ratings, the probability that A beats B is:
P(A wins) = 1 / (1 + 10^((B - A) / 400))
This is just a logistic function of the rating difference, written in base 10 and scaled so that 400 points correspond to a tenfold change in odds.
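As a minimal sketch of the formula above, together with the "surprise-proportional" update described earlier: the function names and the step size `k` (the K-factor, set to 32 here) are illustrative assumptions, not values taken from this article.

```python
def win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """One online Elo update. `k` is an assumed step size (the K-factor)."""
    expected_a = win_probability(rating_a, rating_b)
    # The shift is proportional to how surprising the outcome was.
    delta = k * ((1.0 if a_won else 0.0) - expected_a)
    return rating_a + delta, rating_b - delta


even_match = update(1500.0, 1500.0, a_won=True)   # expected result: small shift
upset = update(1500.0, 1900.0, a_won=True)        # surprising result: large shift
```

With equal ratings the winner gains half of `k`; when a player rated 400 points lower wins, the shift approaches the full `k`.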
What Elo gives you that pairwise outcomes don’t
A list of pairwise comparisons is a graph. Elo (or Thurstone) collapses that graph into a line: every player sits on the same continuous scale, comparable even with opponents they never played. That projection is what turns “A beats B” judgments into a continuous, calibrated regression target, and it is the same trick that makes zELO work for retrieval.
The 400 sets the rating-difference scale: a 400-point gap means the higher-rated player is favored at 10:1 odds (about 91% win probability). Changing the constant only changes the units: with 200, the same skill gap spans half as many rating points; with 800, twice as many. How far ratings move per game is controlled separately, by the update step size (the K-factor). The value 400 is a chess convention that keeps per-game swings interpretable to players (a typical upset shifts ~15-25 points). For retrieval the constant is arbitrary: nothing downstream depends on the chess scale, and the scores are typically rescaled to [0, 1] anyway.
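The arithmetic behind the 10:1 figure can be checked with a short sketch. The helper names are hypothetical, and min-max scaling is only one assumed way to map scores onto [0, 1]; the article does not say which rescaling is used in practice.

```python
def win_probability(gap: float, scale: float = 400.0) -> float:
    """Win probability of the higher-rated side for a given rating gap."""
    return 1.0 / (1.0 + 10.0 ** (-gap / scale))


def rescale_to_unit(scores):
    """Min-max rescale to [0, 1] (an assumed choice of rescaling)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# A gap of exactly one scale unit always means 10:1 odds, whatever the unit is.
p_chess = win_probability(400.0, scale=400.0)
p_half = win_probability(200.0, scale=200.0)

# The same ratings expressed on a 400-point and a 200-point scale
# become identical once rescaled to [0, 1].
on_400 = rescale_to_unit([1200.0, 1500.0, 1800.0])
on_200 = rescale_to_unit([600.0, 750.0, 900.0])
```

Because the rescaled values coincide, the choice of 400 leaves no trace in the regression targets.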
From chess to retrieval
For a fixed query, replace “player” with “candidate document” and “game” with “pairwise preference judgment”. The math is identical. Many pairwise preferences (A more relevant than B for query Q) feed the Elo fit; out comes one continuous relevance score per (query, document), calibrated against every preference observed for that query.
This score is the regression target for training a pointwise reranker. The reranker learns to predict the Elo score directly, given (query, document), without ever needing pairwise inputs at inference.
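To make the fit concrete, here is a minimal sketch for a single query. It assumes plain gradient ascent on the Bradley-Terry log-likelihood; the function names, learning rate, and iteration count are illustrative, and this is not necessarily the fitting procedure used in the zELO pipeline.

```python
import math


def fit_ratings(comparisons, n_iters=500, lr=0.1):
    """Fit one rating per document from (winner, loser) pairs for one query.

    Ratings are in natural-log units and mean-centred, because only
    rating differences are identified by pairwise data.
    """
    docs = sorted({d for pair in comparisons for d in pair})
    theta = {d: 0.0 for d in docs}
    for _ in range(n_iters):
        grad = {d: 0.0 for d in docs}
        for winner, loser in comparisons:
            p_win = 1.0 / (1.0 + math.exp(theta[loser] - theta[winner]))
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        for d in docs:
            theta[d] += lr * grad[d]
        mean = sum(theta.values()) / len(docs)
        theta = {d: t - mean for d, t in theta.items()}
    return theta


def to_elo_points(theta, scale=400.0):
    """Convert natural-log ratings to the base-10, 400-point Elo scale."""
    return {d: t * scale / math.log(10.0) for d, t in theta.items()}


# Document "a" is preferred over "b" in 3 of 4 judgments.
ratings = fit_ratings([("a", "b")] * 3 + [("b", "a")])
elo = to_elo_points(ratings)
```

A 3-to-1 preference record implies 3:1 odds, so the fitted gap settles at ln 3 in natural-log units, or about 191 Elo points.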
Why “z” Elo?
The “z” in zELO just stands for ZeroEntropy. The methodology is otherwise classical: assemble pairwise judgments from frontier-LLM ensembles, fit a Thurstone/Elo model to recover continuous targets per (q, d), and distill those targets into a small specialized reranker. See the zELO paper for the full pipeline.
Why not just ask for absolute scores?
Because pairwise preferences are vastly less noisy than absolute scores. Annotators (and LLMs) disagree wildly on whether a document is “0.7 relevant” but almost always agree on “is A or B more relevant”. Elo lets you exploit the easy question and recover the hard answer.
Two pathologies show up when fitting. First, transitivity violations (A beats B, B beats C, C beats A) leave no consistent ordering to recover, so the ratings are pulled in conflicting directions and incremental updates can oscillate. Real annotation noise produces a small fraction of these, which the MLE smooths over fine, but if your judges are systematically inconsistent (e.g., different LLM judges with different criteria), Elo will fight itself. Second, degeneracy: if some document wins every comparison it appears in, the MLE wants to send its rating to +∞. Both are mitigated by adding a Gaussian prior on ratings (regularized Elo): penalizing extreme values gives a stable fit even with sparse or noisy data.
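The degeneracy case is easy to reproduce. The sketch below assumes a zero-mean Gaussian prior with standard deviation `prior_sigma` (a hypothetical parameter name) added to a Bradley-Terry log-likelihood and optimized by plain gradient ascent; it illustrates the idea rather than any specific production fit.

```python
import math


def fit_regularized(comparisons, prior_sigma=None, n_iters=500, lr=0.1):
    """Gradient ascent on the log-likelihood plus an optional Gaussian prior.

    prior_sigma=None means no prior, i.e. the plain MLE.
    """
    docs = sorted({d for pair in comparisons for d in pair})
    theta = {d: 0.0 for d in docs}
    for _ in range(n_iters):
        grad = {d: 0.0 for d in docs}
        for winner, loser in comparisons:
            p_win = 1.0 / (1.0 + math.exp(theta[loser] - theta[winner]))
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        if prior_sigma is not None:
            for d in docs:
                # Gradient of the log-prior: pulls every rating toward zero.
                grad[d] -= theta[d] / prior_sigma ** 2
        for d in docs:
            theta[d] += lr * grad[d]
    return theta


def gap(theta):
    return theta["a"] - theta["b"]


sweep = [("a", "b")] * 4  # "a" wins every comparison it appears in

gap_short = gap(fit_regularized(sweep, n_iters=200))
gap_long = gap(fit_regularized(sweep, n_iters=2000))
gap_prior = gap(fit_regularized(sweep, prior_sigma=1.0))
```

Without the prior the gap keeps growing the longer the optimizer runs; with it, the fit settles at a finite value no matter how many iterations are used.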
Go further
Why does Elo's 400-point spread show up in chess but not retrieval?
The 400 is just a scaling constant baked into chess convention — it makes the per-game rating change interpretable to humans. In retrieval applications the recovered scores are typically rescaled to [0, 1] for use as regression targets; the underlying math is identical.
How is the Elo fit done at scale on millions of (q, d) pairs?
Per-query MLE: each query is its own Elo tournament with its candidate documents as 'players'. The fit is independent across queries and embarrassingly parallel. The expensive part is gathering pairwise outcomes; the fit itself is cheap.
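A rough sketch of that per-query structure, assuming judgments arrive as (query, winner, loser) triples: `fit_one` stands in for whatever per-query fitter is used, and both function names are hypothetical.

```python
from collections import defaultdict


def group_by_query(judgments):
    """Split (query, winner_doc, loser_doc) triples into per-query tournaments."""
    tournaments = defaultdict(list)
    for query, winner, loser in judgments:
        tournaments[query].append((winner, loser))
    return dict(tournaments)


def fit_all(judgments, fit_one):
    """Fit every query independently with the supplied per-query fitter.

    No state is shared between queries, so this loop could be swapped for
    a process pool or a distributed map without changing the result.
    """
    return {q: fit_one(pairs) for q, pairs in group_by_query(judgments).items()}


judgments = [("q1", "a", "b"), ("q2", "c", "d"), ("q1", "b", "a")]
tournaments = group_by_query(judgments)
```

Each value in `tournaments` is a self-contained list of (winner, loser) pairs, which is all a per-query fit needs.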
Could you train a reranker directly on pairwise preferences and skip Elo?
Yes — that's contrastive ranking, what most rerankers do. But you only learn relative ordering, not how-much-better. Elo recovers a continuous, calibrated scalar; pointwise regression against it gives you graded scores you can threshold on.
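The difference between the two objectives can be seen in a small sketch. The pairwise logistic (RankNet-style) loss is used here as one common example of a pairwise objective, and squared error as one possible pointwise loss; both are assumptions for illustration.

```python
import math


def pairwise_logistic_loss(score_preferred: float, score_other: float) -> float:
    """Depends only on the score difference, so absolute level is unconstrained."""
    return math.log1p(math.exp(-(score_preferred - score_other)))


def pointwise_squared_error(score: float, target: float) -> float:
    """Ties each score to a specific target value, e.g. a rescaled Elo score."""
    return (score - target) ** 2


# Shifting both scores by 100 leaves the pairwise loss unchanged...
same_pairwise = (
    pairwise_logistic_loss(2.0, 1.0),
    pairwise_logistic_loss(102.0, 101.0),
)
# ...but the pointwise loss against a fixed target changes sharply.
on_target = pointwise_squared_error(0.8, 0.8)
shifted = pointwise_squared_error(100.8, 0.8)
```

That shift-invariance is why a purely pairwise-trained model gives an ordering but no score level to threshold on.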