zELO

Also known as: zELO method, zELO methodology

TL;DR

ZeroEntropy's training methodology for rerankers and embeddings. Frontier LLMs vote pairwise on document relevance; a Thurstone fit recovers continuous Elo-style scores; the scores become regression targets for a small specialized model.

zELO is a training methodology for retrieval models that converts cheap pairwise preference data into well-calibrated pointwise relevance scores via a Thurstone fit. It produced zerank-1, zerank-2, and (transitively, via distillation) zembed-1.

[Figure: The Same Math Behind Chess Elo. Pointwise panel: asked "how relevant is this, 0 to 1?", the model returns a different score on every call for the same (q, d). Pairwise panel: asked "which answers it better, A or B?", the model returns the same answer on every call for the same (q, A, B). Bottom panel: pairwise outcomes are aggregated by a Thurstone fit into continuous relevance scores, which are then used to train the reranker.]

The pipeline, in four stages:

[Figure: Pairwise Preferences → Continuous Relevance Scores. Step 1, ground truth: frontier LLMs (Claude, GPT, Gemini, each with CoT) judge one random pair (q, dᵢ, dⱼ) per query, giving an ensemble probability pᵢⱼ; 112K queries, 112K gold pairs. Step 2, distill pairwise: a pairwise SLM reranker R'_pair (Qwen3-4B init, ~1000× faster) trained with ℒ = BCE(pᵢⱼ, p'ᵢⱼ). Step 3, zELO fit: a graph of pairs built from k/2 random cycles, unioned (panel labels: k = 4, diam = 2, ~0.4% of all pairs), followed by a Thurstone fit with P = ½(1 + erf(Δ)), Δ = Eloᵢ − Eloⱼ, giving fitted Elos per (q, d). Step 4, distill pointwise: the pointwise reranker zerank-1 (Qwen3-4B → zerank-1) trained on ~5M (q, d, Elo) targets with ℒ = (R(q, d) − Elo)². Caption: 112K LLM-ensemble inferences → 5M MSE targets, no human annotations.]

1. Gather pairwise preferences from frontier LLMs

For each (query, doc_A, doc_B) triple, ask three frontier LLMs (Claude, GPT, Gemini) which document is more relevant. A pairwise question is much less noisy than a request for an absolute score, so even when the models disagree, their combined votes yield high-information judgments.

For zerank-1 this produced 112,000 gold pairs across 112,000 queries (one random pair per query). zerank-2 expanded the data and added per-rater calibration.
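As a minimal sketch of how three votes could be collapsed into one soft label, the snippet below averages per-model probabilities. The plain mean, the function name `ensemble_preference`, and the vote values are assumptions for illustration; the source says only that the pairwise target is an ensemble quantity.

```python
from statistics import mean


def ensemble_preference(votes):
    """Average per-rater probabilities that doc_A beats doc_B.

    `votes` maps a rater name to that rater's probability in [0, 1]
    that doc_A is more relevant than doc_B for the query.
    A plain mean is an assumption, not the documented aggregation.
    """
    if not votes:
        raise ValueError("need at least one vote")
    for name, p in votes.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"vote from {name} out of range: {p}")
    return mean(votes.values())


# Illustrative numbers only, not real model outputs.
votes = {"claude": 0.9, "gpt": 0.8, "gemini": 0.4}
p_ab = ensemble_preference(votes)  # soft label for "A beats B"
```

Keeping the label soft (a probability rather than a majority vote) preserves how strongly the raters agreed, which is the information the later stages rely on.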

2. Distill a fast pairwise SLM

Train a small (4B) cross-encoder to mimic the ensemble’s pairwise judgments. The loss is binary cross-entropy against the ensemble probability pᵢⱼ. This pairwise SLM (R'_pair) is initialized from Qwen3-4B and runs ~1000× faster than the LLM ensemble. Its job: judge any new pair at near-ensemble accuracy at SLM cost.
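The distillation loss can be illustrated with plain numbers. This sketch computes binary cross-entropy between a soft ensemble target and a student prediction; the function names and the clamping constant `eps` are illustrative choices, not taken from the source.

```python
import math


def bce(target, pred, eps=1e-12):
    """Binary cross-entropy for one pair with a soft target in [0, 1]."""
    pred = min(max(pred, eps), 1.0 - eps)  # avoid log(0)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def mean_bce(targets, preds):
    """Average BCE over a batch of (ensemble target, student prediction) pairs."""
    if len(targets) != len(preds) or not targets:
        raise ValueError("targets and preds must be non-empty and equal length")
    return sum(bce(t, p) for t, p in zip(targets, preds)) / len(targets)


# With a soft target of 0.7, the loss is lowest when the student also says 0.7.
loss_matched = bce(0.7, 0.7)
loss_overconfident = bce(0.7, 0.9)
```

With soft targets the loss never reaches zero; its minimum is the entropy of the target, reached when the student reproduces the ensemble probability exactly.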

3. Inference over a sparse preference graph

For each query with n candidate documents, run the pairwise SLM (R'_pair) on a sparse k-regular preference graph: the union of k/2 random Hamiltonian cycles over the candidates. At the k and n used, this visits only 0.4% of the dense comparison matrix yet still yields a connected graph with diameter 2. That is vastly cheaper than dense pairwise inference.
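One way to build such a graph is sketched below: shuffle the candidates k/2 times and link each shuffled order into a cycle. Here n is the number of candidates and k the target degree; the function name, the seed handling, and the example sizes are assumptions for illustration.

```python
import random


def hamiltonian_cycle_graph(n, k, seed=0):
    """Union of k/2 random Hamiltonian cycles over nodes 0..n-1.

    Returns a set of undirected edges (a, b) with a < b. Two random cycles
    can share an edge, so a node's degree may be slightly below k.
    """
    if k < 2 or k % 2:
        raise ValueError("k must be a positive even number")
    if n < 3:
        raise ValueError("need at least 3 nodes to form a cycle")
    rng = random.Random(seed)
    edges = set()
    for _ in range(k // 2):
        order = list(range(n))
        rng.shuffle(order)
        # Link consecutive nodes, closing the cycle back to the start.
        for a, b in zip(order, order[1:] + order[:1]):
            edges.add((min(a, b), max(a, b)))
    return edges


# Example sizes chosen for illustration only.
edges = hamiltonian_cycle_graph(n=100, k=4, seed=0)
fraction = len(edges) / (100 * 99 / 2)  # share of all pairs compared
```

The graph has at most n·k/2 edges out of n(n − 1)/2 possible pairs, so the share of pairs compared is at most k/(n − 1) and shrinks as the candidate set grows.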

Then a Thurstone fit on that sparse graph recovers a continuous Elo-style score per (query, doc). These are the continuous fitted relevance scores: not annotations, but values recovered statistically from many pairwise judgments.
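A minimal sketch of the fit follows. It assumes the link P(i beats j) = ½(1 + erf(Δ)) with Δ = Eloᵢ − Eloⱼ, as read from the pipeline figure, and uses plain gradient ascent on the log-likelihood. The optimizer, step size, iteration count, and function names are illustrative assumptions, not the source's implementation.

```python
import math


def win_prob(delta):
    """Assumed Thurstone link: probability that the higher-scored item wins."""
    return 0.5 * (1.0 + math.erf(delta))


def thurstone_fit(n, comparisons, steps=2000, lr=0.05):
    """Fit one score per item from (i, j, w) rows, w = P(i preferred over j).

    Scores are identified only up to a shift, so they are re-centered
    to mean zero after every step.
    """
    scores = [0.0] * n
    for _ in range(steps):
        grad = [0.0] * n
        for i, j, w in comparisons:
            delta = scores[i] - scores[j]
            p = min(max(win_prob(delta), 1e-9), 1.0 - 1e-9)
            dp = math.exp(-delta * delta) / math.sqrt(math.pi)  # dP/d(delta)
            g = (w / p - (1.0 - w) / (1.0 - p)) * dp
            grad[i] += g
            grad[j] -= g
        scores = [s + lr * g for s, g in zip(scores, grad)]
        center = sum(scores) / n
        scores = [s - center for s in scores]
    return scores


# Synthetic check: generate soft preferences from known scores on a 4-cycle.
true_scores = [0.6, 0.2, -0.2, -0.6]
pairs = [(0, 1), (1, 2), (2, 3), (3, 0)]
comparisons = [(i, j, win_prob(true_scores[i] - true_scores[j])) for i, j in pairs]
fitted = thurstone_fit(4, comparisons)
```

In this synthetic case the comparison graph is connected and the preferences are noise-free, so the fitted scores converge toward the generating scores. With hard 0/1 outcomes and no upsets the likelihood has no finite maximum, which is one reason soft preferences are convenient.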

4. Train the production pointwise reranker

The fitted Elo scores become the regression targets for a pointwise reranker, typically with an MSE loss, ℒ = (R(q, d) − Elo)². Qwen3-4B → zerank-1 (or zerank-2). At inference, the production model is a single forward pass per (query, doc), with no pairwise inputs needed.
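The pointwise objective is ordinary regression. The sketch below computes the mean squared error between model outputs R(q, d) and fitted Elo targets; the function name and the numbers are illustrative assumptions only.

```python
def mse_loss(predictions, elo_targets):
    """Mean squared error between reranker outputs and fitted Elo targets."""
    if len(predictions) != len(elo_targets) or not predictions:
        raise ValueError("inputs must be non-empty and equal length")
    return sum((r - e) ** 2 for r, e in zip(predictions, elo_targets)) / len(predictions)


# Illustrative values: three (q, d) pairs for one query.
elo_targets = [0.86, 0.68, 0.49]
predictions = [0.80, 0.70, 0.49]
loss = mse_loss(predictions, elo_targets)
```

Unlike a contrastive objective, each (q, d) row is scored independently, which matches how the production model is called at inference.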

Why this beats binary contrastive training

Most rerankers are trained on binary (query, relevant, irrelevant) triples with a contrastive loss. The signal is coarse: the model learns only “more relevant than irrelevant”, not “how much more”. zELO’s continuous targets carry graded information, so the difference between “extremely relevant” and “marginally relevant” is preserved. This is why zELO-trained models do disproportionately well on graded-relevance benchmarks.

The whole pipeline is a coordinate change. Pairwise judgments are cheap and high-information; pointwise scores are what production rerankers need to emit. Thurstone is the bridge.

The Thurstone MLE needs a connected comparison graph to recover scores: disconnected components share no observations, so their scores cannot be linked. Random sampling of edges sometimes produces a disconnected graph, especially at small n (few candidates) or small k (few edges per node).

A k-regular graph built from k/2 random Hamiltonian cycles guarantees three properties simultaneously: (1) every node has exactly k neighbors (perfect regularity makes the MLE well-conditioned), (2) the graph is k-edge-connected (so it stays connected even if up to k − 1 edges are corrupted), and (3) the diameter is small (typically 2 at the settings used), so every item is two hops from every other and information flows quickly across the score field.

Empirically, the sparse graph recovers Thurstone scores within 0.02 of the dense-graph fit while using 0.4% of the comparisons. That ratio is what makes per-query Thurstone fitting economically tractable across millions of queries.
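Connectivity and diameter are easy to verify before fitting. The sketch below runs a breadth-first search from every node and reports the diameter, or `None` when the graph is disconnected. The function name and the tiny example graphs are illustrative assumptions.

```python
from collections import deque


def graph_diameter(n, edges):
    """Diameter of an undirected graph on nodes 0..n-1, or None if disconnected."""
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    diameter = 0
    for src in range(n):
        dist = [-1] * n
        dist[src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if min(dist) < 0:  # some node was never reached
            return None
        diameter = max(diameter, max(dist))
    return diameter


# Two edge-disjoint Hamiltonian cycles on 5 nodes.
ring = [(i, (i + 1) % 5) for i in range(5)]  # 0-1-2-3-4-0
star = [(i, (i + 2) % 5) for i in range(5)]  # 0-2-4-1-3-0
one_cycle = graph_diameter(5, ring)
two_cycles = graph_diameter(5, ring + star)
broken = graph_diameter(4, [(0, 1), (2, 3)])  # two separate components
```

A single Hamiltonian cycle is already connected, so the union can never be disconnected; the additional cycles serve to shrink the diameter and add redundancy.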

What downstream gets distilled

zembed-1 is itself distilled from zerank-2’s pointwise scores: a bi-encoder trained to produce embeddings whose dot product approximates the cross-encoder’s score. So the same relevance signal flows: pairwise LLM votes → pairwise SLM → Thurstone fit → pointwise reranker → bi-encoder embeddings.
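The distillation target can be sketched as a regression from embedding dot products to teacher scores. The squared-error objective, the function names, and the toy vectors below are assumptions for illustration; the source states only that the dot product should approximate the cross-encoder's score.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def distill_loss(query_vec, doc_vecs, teacher_scores):
    """Mean squared gap between bi-encoder dot products and teacher scores."""
    if len(doc_vecs) != len(teacher_scores) or not doc_vecs:
        raise ValueError("need one teacher score per document")
    gaps = [dot(query_vec, d) - s for d, s in zip(doc_vecs, teacher_scores)]
    return sum(g * g for g in gaps) / len(gaps)


# Toy 2-d embeddings and hypothetical teacher (reranker) scores.
query_vec = [1.0, 0.0]
doc_vecs = [[0.8, 0.1], [0.2, 0.9]]
perfect = distill_loss(query_vec, doc_vecs, [0.8, 0.2])
off_by_a_bit = distill_loss(query_vec, doc_vecs, [0.9, 0.2])
```

Because the document vectors do not depend on the query, they can be computed once offline, which is what makes the distilled model cheap to serve.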

Reading more

The full paper: arXiv:2509.12541.

Paper

zELO: ELO-inspired Training Method for Rerankers and Embedding Models

Pipitone, Houir Alami, Avadhanam, Kaminskyi, Khoo

We introduce a novel training methodology named zELO, which optimizes retrieval performance via the analysis that ranking tasks are statically equivalent to a Thurstone model. Trained end-to-end from unannotated queries and documents in less than 10,000 H100-hours, zerank-1 achieves the highest retrieval scores across finance, legal, code, and STEM — outperforming closed-source proprietary rerankers on NDCG@10 and Recall.

arXiv abstract
Go further

How does zELO compare to standard contrastive reranker training?

Contrastive training uses binary (relevant, irrelevant) pairs, so the model learns ordering but not magnitude. zELO trains pointwise against continuous Thurstone-recovered targets, so the model learns a calibrated relevance scalar. The advantage shows up disproportionately on graded-relevance benchmarks.

Why distill into a bi-encoder if the cross-encoder is more accurate?

Cross-encoders score (q, d) pairs at query time: N forward passes per query for N candidates. Bi-encoders embed documents offline and embed only the query at runtime: one forward pass plus an ANN lookup. zembed-1 is the bi-encoder distillation of zerank-2: most of the quality, vastly more throughput.
