ZeroEntropy's training methodology for rerankers and embeddings. Frontier LLMs vote pairwise on document relevance; a Thurstone fit recovers continuous Elo-style scores; the scores become regression targets for a small specialized model.
zELO is a training methodology for retrieval models that converts cheap pairwise preference data into well-calibrated pointwise relevance scores via a Thurstone fit. It produced zerank-1, zerank-2, and (transitively, via distillation) zembed-1.
The pipeline, in four stages:
1. Gather pairwise preferences from frontier LLMs
For each (query, doc_A, doc_B) triple, ask 3 frontier LLMs (Claude, GPT, Gemini) which is more relevant. The pairwise question is much less noisy than asking for absolute scores, so even with disagreeing models you get high-information judgments.
For zerank-1 this produced 112,000 gold pairs across 112,000 queries (one random pair per query). zerank-2 expanded the data and added per-rater calibration.
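As a minimal sketch of how three votes might be collapsed into one soft label, the snippet below assumes a plain unweighted vote fraction. The source does not state the aggregation rule, and the function name `ensemble_probability` and the 'A'/'B' vote encoding are hypothetical.

```python
def ensemble_probability(votes):
    """Return the fraction of raters that preferred doc_A.

    `votes` is a list of 'A' or 'B' strings, one per rater (hypothetical encoding).
    """
    if not votes:
        raise ValueError("need at least one vote")
    return sum(1 for v in votes if v == "A") / len(votes)


# Hypothetical votes from three raters on one (query, doc_A, doc_B) triple.
p_a = ensemble_probability(["A", "A", "B"])
```

A split vote yields a soft label rather than a hard 0 or 1, which keeps rater disagreement visible to later stages.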
2. Distill a fast pairwise SLM
Train a small (4B) cross-encoder to mimic the ensemble’s pairwise judgments. The loss is binary cross-entropy against the ensemble probability that doc_A is preferred. This pairwise SLM is initialized from Qwen3-4B and runs ~1000× faster than the LLM ensemble. Its job: judge any new pair at near-ensemble accuracy at SLM cost.
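The loss described above can be sketched for a single pair as follows. This assumes the model emits one raw logit per pair and that the target is a soft probability in [0, 1]; the name `soft_bce` is illustrative, not taken from the source.

```python
import math


def soft_bce(logit, target):
    """Binary cross-entropy for one pair with a soft target in [0, 1].

    Numerically stable form of
    -[t * log(sigmoid(x)) + (1 - t) * log(1 - sigmoid(x))].
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


# With a target of 2/3, the loss is smallest when sigmoid(logit) == 2/3,
# i.e. at logit == log(2).
loss_at_optimum = soft_bce(math.log(2.0), 2.0 / 3.0)
```

Because the target is soft, the minimizer reproduces the ensemble's confidence instead of saturating toward 0 or 1.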
3. Inference over a sparse preference graph
For each query and its candidate documents, run the pairwise SLM on a sparse regular preference graph: the union of several random Hamiltonian cycles over the candidates. At the settings used, this visits 0.4% of the dense comparison matrix and still yields a connected graph with diameter 2, which is vastly cheaper than dense pairwise inference.
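One way to build such a graph is sketched below. The function name, the seed handling, and the example sizes (50 candidates, 3 cycles) are illustrative assumptions, not the values used in the source.

```python
import random


def hamiltonian_cycle_graph(n, num_cycles, seed=0):
    """Union of `num_cycles` random Hamiltonian cycles on nodes 0..n-1.

    Returns a set of undirected edges (i, j) with i < j. Edges repeated
    across cycles collapse, so a node's degree can be below 2 * num_cycles.
    """
    if n < 3:
        raise ValueError("a Hamiltonian cycle needs at least 3 nodes")
    rng = random.Random(seed)
    edges = set()
    for _ in range(num_cycles):
        order = list(range(n))
        rng.shuffle(order)  # a random permutation defines one cycle
        for a, b in zip(order, order[1:] + order[:1]):
            edges.add((min(a, b), max(a, b)))
    return edges


# Illustrative sizes only.
edges = hamiltonian_cycle_graph(n=50, num_cycles=3)
fraction_of_dense = len(edges) / (50 * 49 / 2)
```

Each edge is one pair to send to the pairwise SLM, so `len(edges)` is the inference budget for that query.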
Then a Thurstone MLE on that sparse graph recovers a continuous Elo-style score per (query, doc). These are continuous fitted relevance scores: not annotations, but statistically recovered from many pairwise judgments.
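A minimal sketch of a Thurstone fit, under stated assumptions: the model is P(i beats j) = Φ(s_i − s_j) with unit scale, the optimizer is plain degree-normalized gradient ascent, and a tiny L2 term keeps scores finite. The source does not describe its optimizer or scaling, so all names and hyperparameters here are illustrative.

```python
import math


def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def fit_thurstone(n, comparisons, lr=0.5, steps=500, l2=1e-4):
    """Fit zero-mean scores from soft pairwise outcomes.

    `comparisons` is a list of (i, j, p) where p is the probability
    that item i beats item j.
    """
    degree = [0] * n
    for i, j, _ in comparisons:
        degree[i] += 1
        degree[j] += 1
    s = [0.0] * n
    for _ in range(steps):
        grad = [0.0] * n
        for i, j, p in comparisons:
            d = s[i] - s[j]
            cdf = min(max(norm_cdf(d), 1e-9), 1.0 - 1e-9)
            # Derivative of p*log(Phi(d)) + (1-p)*log(1-Phi(d)) w.r.t. d.
            g = norm_pdf(d) * (p / cdf - (1.0 - p) / (1.0 - cdf))
            grad[i] += g
            grad[j] -= g
        for k in range(n):
            s[k] += lr * (grad[k] / max(degree[k], 1) - l2 * s[k])
        mean = sum(s) / n  # scores are only identified up to a shift
        s = [x - mean for x in s]
    return s


# Synthetic check: noiseless probabilities generated from known scores.
true_scores = [1.0, 0.3, -0.3, -1.0]
pairs = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
comparisons = [(i, j, norm_cdf(true_scores[i] - true_scores[j])) for i, j in pairs]
fitted = fit_thurstone(4, comparisons)
```

On this connected toy graph the fitted scores should land close to `true_scores`; a disconnected graph would leave the relative offset between components undetermined.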
4. Train the production pointwise reranker
The fitted Elo scores become the regression targets for a pointwise reranker, typically with an MSE loss against the fitted score. The base model is Qwen3-4B, fine-tuned into zerank-1 (or zerank-2). At inference, the production model is a single forward pass per (query, doc), with no pairwise inputs needed.
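Written out, the pointwise objective might look like the following. The notation is an assumption introduced here for illustration: f_θ(q, d) is the reranker's output, e_{q,d} is the fitted Elo score for that (query, doc), and B is a training batch.

```latex
\mathcal{L}_{\text{point}}(\theta)
  = \frac{1}{|B|} \sum_{(q,d) \in B} \bigl( f_\theta(q, d) - e_{q,d} \bigr)^2
```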
Why this beats binary contrastive training
Most rerankers are trained on binary triples with a contrastive loss. The signal is coarse: the model only learns “more relevant than irrelevant”, not “how much more”. zELO’s continuous targets carry graded information: the difference between “extremely relevant” and “marginally relevant” is preserved. This is why zELO-trained models do disproportionately well on graded-relevance benchmarks like the MTEB rework.
The whole pipeline is a coordinate change. Pairwise judgments are cheap and high-information; pointwise scores are what production rerankers need to emit. Thurstone is the bridge.
The Thurstone MLE needs a connected comparison graph to recover scores: disconnected components share no observations, so their scores cannot be linked. Random sampling of edges sometimes produces a disconnected graph, especially when the number of candidates or the number of sampled edges is small.
A k-regular graph built from random Hamiltonian cycles guarantees three properties simultaneously: (1) every node has exactly k neighbors (perfect regularity makes the MLE well-conditioned), (2) the graph is k-edge-connected (so it stays connected even if fewer than k edges are corrupted), and (3) the diameter is small (typically 2 at the settings used), so every item is at most two hops from every other and information flows quickly across the score field.
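These properties can be checked directly on any candidate graph. The helper below is a hypothetical sketch that reports degree range, connectivity, and diameter using breadth-first search from every node; it does not test edge-connectivity.

```python
from collections import deque


def graph_stats(n, edges):
    """Return (min_degree, max_degree, is_connected, diameter).

    `edges` is an iterable of undirected (a, b) pairs on nodes 0..n-1.
    The diameter is None when the graph is disconnected.
    """
    adj = [set() for _ in range(n)]
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    degrees = [len(neighbors) for neighbors in adj]
    diameter = 0
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) < n:  # some node is unreachable from src
            return min(degrees), max(degrees), False, None
        diameter = max(diameter, max(dist.values()))
    return min(degrees), max(degrees), True, diameter


# A single 5-node cycle: 2-regular, connected, diameter 2.
stats = graph_stats(5, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)])
```

Equal minimum and maximum degree confirms regularity, and a `False` connectivity flag signals that a Thurstone fit on that graph would not link all scores.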
Empirically, the sparse graph recovers Thurstone scores within 0.02 of the dense-graph fit while using 0.4% of the comparisons. That ratio is what makes per-query Thurstone economically tractable across millions of queries.
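The comparison ratio follows from simple edge counting. Assuming an exactly k-regular graph on n candidates (the symbols n and k are introduced here; the specific values behind the 0.4% figure are not restated), the fraction of dense comparisons used is:

```latex
\frac{\text{sparse comparisons}}{\text{dense comparisons}}
  = \frac{nk/2}{n(n-1)/2}
  = \frac{k}{n-1}
```

So the cost relative to dense inference falls roughly as 1/n for a fixed degree k.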
What downstream gets distilled
zembed-1 is itself distilled from zerank-2’s pointwise scores — a bi-encoder trained to produce embeddings whose dot-product approximates the cross-encoder’s score. So the same relevance signal flows: pairwise LLM votes → pairwise SLM → Thurstone fit → pointwise reranker → bi-encoder embeddings.
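A minimal sketch of the distillation objective this implies, assuming a mean-squared-error penalty between dot-product similarity and the teacher's pointwise score. The source says only that the dot product approximates the cross-encoder score; the MSE choice and the function names are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def distillation_loss(query_emb, doc_embs, teacher_scores):
    """Mean squared error between student dot products and teacher scores."""
    errors = [
        (dot(query_emb, doc_emb) - score) ** 2
        for doc_emb, score in zip(doc_embs, teacher_scores)
    ]
    return sum(errors) / len(errors)


# Toy 2-dimensional embeddings and hypothetical teacher scores.
loss = distillation_loss([1.0, 0.0], [[0.5, 0.2], [0.1, 0.9]], [1.5, 0.1])
```

In this toy case the first document's dot product is 0.5 against a teacher score of 1.5 and the second matches exactly, so the loss comes from the first document alone.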
zELO: ELO-inspired Training Method for Rerankers and Embedding Models
Pipitone, Houir Alami, Avadhanam, Kaminskyi, Khoo
We introduce a novel training methodology named zELO, which optimizes retrieval performance via the analysis that ranking tasks are statically equivalent to a Thurstone model. Trained end-to-end from unannotated queries and documents in less than 10,000 H100-hours, zerank-1 achieves the highest retrieval scores across finance, legal, code, and STEM — outperforming closed-source proprietary rerankers on NDCG@10 and Recall.
How does zELO compare to standard contrastive reranker training?
Contrastive trains on binary (relevant, irrelevant) pairs — the model learns ordering but not magnitude. zELO trains pointwise against continuous Thurstone-recovered targets, so the model learns a calibrated relevance scalar. The advantage shows up disproportionately on graded-relevance benchmarks.
Why distill into a bi-encoder if the cross-encoder is more accurate?
Cross-encoders score each (q, d) pair at query time, which costs N forward passes per query. Bi-encoders embed docs offline and embed only the query at runtime: one forward pass plus an ANN lookup. zembed-1 is the bi-encoder distillation of zerank-2: most of the quality, vastly more throughput.
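The runtime side of a bi-encoder can be sketched as a lookup over precomputed document embeddings. The example below uses an exact brute-force scan as a stand-in for an ANN index, and the function name `top_k` and the toy vectors are hypothetical.

```python
import heapq


def top_k(query_emb, doc_embs, k):
    """Return indices of the k documents with the highest dot-product score.

    Exact brute-force scan; a production system would use an ANN index.
    """
    scored = (
        (sum(q * d for q, d in zip(query_emb, emb)), idx)
        for idx, emb in enumerate(doc_embs)
    )
    return [idx for _, idx in heapq.nlargest(k, scored)]


# Document embeddings are computed offline; only the query is embedded at runtime.
doc_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
best = top_k([1.0, 0.2], doc_embs, k=2)
```

No model forward pass touches the documents at query time, which is where the throughput advantage over a cross-encoder comes from.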
Yes — that's the architectural point. The pipeline is dataset-agnostic: bring your queries and a document corpus, run pairwise LLM voting, fit Thurstone, distill. ZeroEntropy productizes this for customers as custom-trained rerankers and embedders.