zELO

Also known as: zELO method, zELO methodology

TL;DR

ZeroEntropy's training methodology for rerankers and embeddings. Frontier LLMs vote pairwise on document relevance; a Thurstone fit recovers continuous Elo-style scores; the scores become regression targets for a small specialized model.

zELO is a training methodology for retrieval models that converts cheap pairwise preference data into well-calibrated pointwise relevance scores via a Thurstone fit. It produced zerank-1, zerank-2, and (transitively, via distillation) zembed-1.

The pipeline, in four stages:

1. Gather pairwise preferences from frontier LLMs

For each (query, doc_A, doc_B) triple, ask 3 frontier LLMs (Claude, GPT, Gemini) which is more relevant. The pairwise question is much less noisy than asking for absolute scores, so even with disagreeing models you get high-information judgments.

For zerank-1 this produced 112,000 gold pairs across 112,000 queries (one random pair per query). zerank-2 expanded the data and added per-rater calibration.

2. Distill a fast pairwise SLM

Train a small (4B) cross-encoder to mimic the ensemble’s pairwise judgments. The loss is binary cross-entropy against the ensemble’s preference probability. This pairwise SLM is initialized from Qwen3-4B and runs ~1000× faster than the LLM ensemble. Its job: judge any new pair at near-ensemble accuracy at SLM cost.
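A minimal sketch of that loss, under two assumptions not stated above: the ensemble probability is taken as the fraction of raters preferring doc_A, and the pairwise SLM emits a single logit for "A beats B". The function names are hypothetical.

```python
import math

def ensemble_probability(votes):
    """Fraction of raters that preferred doc_A (1 = A wins, 0 = B wins).
    Assumption: a plain vote fraction; the source does not specify this."""
    return sum(votes) / len(votes)

def soft_bce(logit, target_prob):
    """Binary cross-entropy between sigmoid(logit) and a soft target in [0, 1]."""
    def softplus(x):
        # Numerically stable log(1 + exp(x)).
        return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    # -log(sigmoid(x)) = softplus(-x) and -log(1 - sigmoid(x)) = softplus(x)
    return target_prob * softplus(-logit) + (1.0 - target_prob) * softplus(logit)

p = ensemble_probability([1, 1, 0])   # two of three raters prefer doc_A
loss_undecided = soft_bce(0.0, p)     # model outputs 50/50, loss is log 2
loss_agreeing = soft_bce(0.7, p)      # model leans toward A, loss is lower
```

Training against the soft target rather than a hard majority label keeps the information carried by rater disagreement.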

3. Inference over a sparse preference graph

For each query with n candidate documents, run the pairwise SLM on a sparse k-regular preference graph: the union of random Hamiltonian cycles over the candidates. With k small relative to n, you visit 0.4% of the dense comparison matrix and still recover a connected graph with diameter 2. That is vastly cheaper than dense pairwise inference.
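The graph construction can be sketched as follows. This is an illustration, not the authors' implementation: the function names are hypothetical, and the sketch simply merges duplicate edges when two cycles happen to share one, so a node's degree can fall slightly below twice the number of cycles. A stricter version would resample until the cycles are edge-disjoint.

```python
import random

def hamiltonian_cycle_graph(n, num_cycles, seed=0):
    """Union of `num_cycles` random Hamiltonian cycles over nodes 0..n-1.
    Returns a set of undirected edges stored as (low, high) tuples."""
    rng = random.Random(seed)
    edges = set()
    for _ in range(num_cycles):
        order = list(range(n))
        rng.shuffle(order)                    # a random visiting order
        for i in range(n):
            a, b = order[i], order[(i + 1) % n]   # wrap around to close the cycle
            edges.add((min(a, b), max(a, b)))
    return edges

def degrees(n, edges):
    """Number of distinct neighbors of each node."""
    deg = [0] * n
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg

# Hypothetical sizes for illustration only.
edges = hamiltonian_cycle_graph(n=50, num_cycles=3, seed=1)
deg = degrees(50, edges)
```

Each cycle adds two neighbors per node, so c edge-disjoint cycles give a 2c-regular graph, and a single cycle is already enough to guarantee connectivity.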

Then a Thurstone MLE on that sparse graph recovers a continuous Elo-style score per (query, doc). These are continuous fitted relevance scores: not annotations, but recovered statistically from many pairwise judgments.
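To make the fit concrete, here is a small sketch of a Thurstone maximum-likelihood fit. It assumes the unit-variance probit link P(i beats j) = Φ(s_i − s_j) and uses plain gradient ascent with mean-centering. Both are illustrative choices; the paper may use a different parameterization or solver, and `fit_thurstone` is a hypothetical name.

```python
import math

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def fit_thurstone(n, comparisons, steps=2000, lr=0.05, eps=1e-9):
    """comparisons: list of (i, j, w), where w is the observed probability
    that item i beats item j. Returns one score per item, centered at zero."""
    s = [0.0] * n
    for _ in range(steps):
        grad = [0.0] * n
        for i, j, w in comparisons:
            d = s[i] - s[j]
            p = min(max(normal_cdf(d), eps), 1.0 - eps)
            # Derivative of w*log(p) + (1-w)*log(1-p) with respect to d.
            g = normal_pdf(d) * (w / p - (1.0 - w) / (1.0 - p))
            grad[i] += g
            grad[j] -= g
        s = [si + lr * gi for si, gi in zip(s, grad)]
        mean = sum(s) / n
        s = [si - mean for si in s]   # scores are only defined up to a shift
    return s

# Synthetic check: generate noiseless preferences from known scores.
true_scores = [1.0, 0.0, -1.0]
pairs = [(0, 1), (1, 2), (0, 2)]
comparisons = [(i, j, normal_cdf(true_scores[i] - true_scores[j])) for i, j in pairs]
fitted = fit_thurstone(3, comparisons)
```

Only score differences enter the likelihood, which is why the fit pins the mean to zero and why the comparison graph has to be connected for the scores to be comparable across items.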

4. Train the production pointwise reranker

The fitted Elo scores become the regression targets for a pointwise reranker, typically with an MSE loss against the fitted score. The base model is Qwen3-4B, which is fine-tuned into zerank-1 (or zerank-2). At inference, the production model is a single forward pass per (query, doc), with no pairwise inputs needed.

Why this beats binary contrastive training

Most rerankers are trained on binary triples with a contrastive loss. The signal is coarse: the model only learns “more relevant than irrelevant”, not “how much more”. zELO’s continuous targets carry graded information: the difference between “extremely relevant” and “marginally relevant” is preserved. This is why zELO-trained models do disproportionately well on graded-relevance benchmarks like the MTEB rework.

The whole pipeline is a coordinate change. Pairwise judgments are cheap and high-information; pointwise scores are what production rerankers need to emit. Thurstone is the bridge.

The Thurstone MLE needs a connected comparison graph to recover scores: disconnected components share no observations, so their scores cannot be linked. Random sampling of edges sometimes produces a disconnected graph, especially when the number of candidates n or the number of comparisons per candidate k is small.

A k-regular graph built from random Hamiltonian cycles guarantees three properties simultaneously: (1) every node has exactly k neighbors (perfect regularity makes the MLE well-conditioned), (2) the graph is k-edge-connected (so it stays connected even if fewer than k edges are corrupted), and (3) the diameter is small (typically 2 in the reported configuration), so every item is two hops from every other and information flows quickly across the score field.

Empirically, the sparse graph recovers Thurstone scores within 0.02 of the dense-graph fit while using 0.4% of the comparisons. That ratio is what makes per-query Thurstone economically tractable across millions of queries.
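The comparison budget follows from simple counting: a k-regular graph on n nodes has nk/2 edges, against n(n−1)/2 in the dense graph, so the fraction visited is k/(n−1). The sketch below computes that fraction and checks connectivity and diameter by breadth-first search. The function names and the example sizes are hypothetical and are not the configuration reported above.

```python
from collections import deque

def comparison_fraction(n, k):
    """Share of the dense pairwise comparisons used by a k-regular graph."""
    return k / (n - 1)

def diameter(n, edges):
    """Longest shortest path in an undirected graph, or None if disconnected."""
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    worst = 0
    for src in range(n):
        dist = [-1] * n
        dist[src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if min(dist) < 0:
            return None          # some node was never reached
        worst = max(worst, max(dist))
    return worst

fraction = comparison_fraction(n=101, k=4)                          # 4 / 100
ring = diameter(5, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)])       # single 5-cycle
split = diameter(4, [(0, 1), (2, 3)])                               # two components
```

A `None` result from `diameter` is exactly the failure case for the Thurstone fit: scores in separate components cannot be placed on a common scale.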

What downstream gets distilled

zembed-1 is itself distilled from zerank-2’s pointwise scores — a bi-encoder trained to produce embeddings whose dot-product approximates the cross-encoder’s score. So the same relevance signal flows: pairwise LLM votes → pairwise SLM → Thurstone fit → pointwise reranker → bi-encoder embeddings.
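As an illustration of that distillation objective, the sketch below scores each (query, doc) pair by the dot product of their embeddings and measures the squared error against the cross-encoder's score. The choice of MSE and the function names are assumptions; the text above says only that the dot product should approximate the teacher score.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def distillation_mse(query_vec, doc_vecs, teacher_scores):
    """Mean squared gap between bi-encoder dot products and teacher scores
    for one query and its candidate documents."""
    errors = [
        (dot(query_vec, doc_vec) - target) ** 2
        for doc_vec, target in zip(doc_vecs, teacher_scores)
    ]
    return sum(errors) / len(errors)

# Toy 2-dimensional embeddings, purely illustrative.
query = [1.0, 0.0]
docs = [[0.5, 2.0], [1.0, -1.0]]
perfect = distillation_mse(query, docs, [0.5, 1.0])   # dot products match exactly
off = distillation_mse(query, docs, [0.5, 0.0])       # second doc is off by 1.0
```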

Reading more

The full paper: arXiv:2509.12541.

Paper

zELO: ELO-inspired Training Method for Rerankers and Embedding Models

Pipitone, Houir Alami, Avadhanam, Kaminskyi, Khoo

We introduce a novel training methodology named zELO, which optimizes retrieval performance via the analysis that ranking tasks are statically equivalent to a Thurstone model. Trained end-to-end from unannotated queries and documents in less than 10,000 H100-hours, zerank-1 achieves the highest retrieval scores across finance, legal, code, and STEM — outperforming closed-source proprietary rerankers on NDCG@10 and Recall.

Go further

How does zELO compare to standard contrastive reranker training?

Contrastive training uses binary (relevant, irrelevant) pairs, so the model learns ordering but not magnitude. zELO trains pointwise against continuous Thurstone-recovered targets, so the model learns a calibrated relevance scalar. The advantage shows up disproportionately on graded-relevance benchmarks.

Why distill into a bi-encoder if the cross-encoder is more accurate?

Cross-encoders score (q, d) pairs at query time, which costs N forward passes per query for N candidates. Bi-encoders embed docs offline and embed only the query at runtime: one forward pass plus an ANN lookup. zembed-1 is the bi-encoder distillation of zerank-2: most of the quality, vastly more throughput.

Could I run the zELO pipeline on my own domain?

Yes — that's the architectural point. The pipeline is dataset-agnostic: bring your queries and a document corpus, run pairwise LLM voting, fit Thurstone, distill. ZeroEntropy productizes this for customers as custom-trained rerankers and embedders.
