Paper TLDR: How we trained zerank-1 with the zELO method

Sep 19, 2025
TL;DR

A 10-step summary of the zELO paper: how ZeroEntropy trained zerank-1 using Elo-inspired pairwise LLM comparisons, sparse sampling, and cross-query calibration to build the fastest and cheapest API-based reranker, one that outperforms all competitors by >+5% NDCG@10 on every domain.

By popular demand, this is an executive TLDR of the paper zELO: ELO-inspired Training Method for Rerankers and Embedding Models.

**Try zerank-1 and zerank-1-small in this 5-line Google Colab.**

1. What is a reranker?

A reranker is a cross-encoder that takes a query and candidate documents and reorders them for accuracy. It’s the step that makes RAG actually useful: the reranker decides which 5-10 documents an LLM sees.

A reranker takes a query and candidate documents and reorders them by relevance.
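In code, the contract is simple: score each (query, document) pair jointly, then sort and keep the top few. A minimal sketch, with a toy token-overlap scorer standing in for a real cross-encoder model:

```python
def score(query: str, document: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query tokens found in the
    document. A real reranker encodes query and document jointly in one model."""
    q_tokens = set(query.lower().split())
    d_tokens = set(document.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens)

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[str]:
    """Reorder candidates by relevance score; the top_k survivors go to the LLM."""
    ranked = sorted(candidates, key=lambda d: score(query, d), reverse=True)
    return ranked[:top_k]

docs = [
    "the cat sat on the mat",
    "elo ratings for chess",
    "rerankers reorder retrieved documents",
]
print(rerank("how do rerankers reorder documents", docs, top_k=2))
```

The `score` function here is purely illustrative; in practice it is the cross-encoder's relevance head.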

2. Why not triplet loss + human annotations?

Traditional rerankers are trained on queries with mined positive and negative results. The goal is to find negatives that look relevant but aren’t, so the model learns fine-grained distinctions. But as mining improves, many of those “negatives” are actually relevant (false negatives), which confuses the model and degrades performance.

As negative mining improves, many 'negatives' are actually relevant—confusing the model and degrading performance.
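For context, the traditional objective looks roughly like this: push the positive's similarity above the negative's by a margin. A minimal sketch on precomputed similarity scores (the numbers are illustrative), showing how a false negative generates a spurious penalty:

```python
def triplet_loss(pos_sim: float, neg_sim: float, margin: float = 0.2) -> float:
    """Margin triplet loss on similarity scores: zero once the positive
    beats the negative by at least `margin`."""
    return max(0.0, margin - (pos_sim - neg_sim))

# True negative: the positive wins by more than the margin, so loss is 0.
print(triplet_loss(0.9, 0.3))   # 0.0
# False negative: both documents are actually relevant, so their scores are
# close, and the loss wrongly pushes a good document away from the query.
print(triplet_loss(0.8, 0.75))
```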

3. Pairwise with LLMs

Instead of human triplets, we use LLMs to compare document pairs. Pairwise comparisons are more robust, scale cheaply, and align well with human intuition.
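Concretely, each comparison is one LLM call that judges which of two documents better answers the query. A hedged sketch; the prompt wording and the `llm_judge` stand-in below are illustrative, not the paper's exact setup:

```python
def build_pairwise_prompt(query: str, doc_a: str, doc_b: str) -> str:
    """Ask a judge LLM for a single-token verdict: A or B."""
    return (
        f"Query: {query}\n\n"
        f"Document A: {doc_a}\n\n"
        f"Document B: {doc_b}\n\n"
        "Which document better answers the query? Reply with exactly 'A' or 'B'."
    )

def llm_judge(prompt: str) -> str:
    """Stand-in for a real LLM call; for demo purposes the longer document
    wins. A real judge returns a reasoned 'A' or 'B' verdict."""
    body = prompt.split("Document A: ")[1]
    doc_a, rest = body.split("\n\nDocument B: ")
    doc_b = rest.split("\n\nWhich document")[0]
    return "A" if len(doc_a) >= len(doc_b) else "B"

verdict = llm_judge(build_pairwise_prompt(
    "what is Elo?",
    "Elo is a rating system for pairwise games.",
    "Cats.",
))
print(verdict)  # A
```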

4. From pairwise to pointwise: Elo

We model outcomes with the Bradley–Terry / Elo system: documents “battle,” scores accumulate, and we get calibrated continuous relevance values.
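The aggregation step can be sketched with standard Elo updates over the sampled battles. A minimal illustration only: iterative Elo updates are a simple online approximation to fitting Bradley–Terry scores, not the paper's exact fitting procedure:

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0):
    """One Elo update: the expected win probability is a logistic
    (Bradley-Terry) function of the rating gap."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta

# All documents start at the same rating; each pairwise verdict nudges them apart,
# yielding continuous pointwise relevance scores from purely pairwise outcomes.
ratings = {"doc_a": 1000.0, "doc_b": 1000.0, "doc_c": 1000.0}
for winner, loser in [("doc_a", "doc_b"), ("doc_a", "doc_c"), ("doc_b", "doc_c")]:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])

print(sorted(ratings, key=ratings.get, reverse=True))  # ['doc_a', 'doc_b', 'doc_c']
```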

5. Tackling scale

Naively, for one query with k=100 candidates, scoring every ordered pair means k·(k−1) ≈ 10k pairwise inferences per query. Even our small fine-tuned pairwise model is too costly to run at this scale.

6. Sparse sampling

We solved this with random cycles sampling: O(n) comparisons instead of O(n²). Only ~400 pairs per query are needed, instead of 10k, with little accuracy loss.

Random cycles sampling reduces comparisons from O(n²) to O(n)—~400 pairs instead of 10k per query with little accuracy loss.
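A minimal sketch of the sampling idea, under the assumption that "random cycles" means drawing a few random permutations of the candidates and comparing adjacent pairs along each resulting cycle, so each cycle contributes n pairs instead of the n(n−1)/2 needed for full coverage:

```python
import random

def random_cycle_pairs(n_docs: int, n_cycles: int, seed: int = 0):
    """Sample comparison pairs from random cycles over the candidate set:
    each cycle contributes n_docs adjacent pairs, so total work is O(n)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_cycles):
        order = list(range(n_docs))
        rng.shuffle(order)
        # Compare neighbours along the cycle, wrapping around at the end.
        pairs.extend((order[i], order[(i + 1) % n_docs]) for i in range(n_docs))
    return pairs

pairs = random_cycle_pairs(n_docs=100, n_cycles=4)
print(len(pairs))  # 400 comparisons, versus 4950 unordered pairs for full coverage
```

Cycles also keep the comparison graph connected, so every document's Elo score stays anchored to the rest.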

7. Cross-query calibration

Elo is relative within one query. We estimate and subtract cross-query biases to align scores across corpora, so the reranker generalizes everywhere, across verticals and tasks.
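One simple way to picture this (a sketch of the idea only: here the per-query bias is modeled as an additive offset estimated by the query's mean Elo, a stand-in for the paper's bias-estimation procedure):

```python
def calibrate(elo_by_query: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """Remove each query's additive bias (estimated here as its mean Elo)
    so scores from different queries live on one comparable scale."""
    calibrated = {}
    for query, scores in elo_by_query.items():
        bias = sum(scores.values()) / len(scores)
        calibrated[query] = {doc: s - bias for doc, s in scores.items()}
    return calibrated

raw = {
    "easy query": {"d1": 1200.0, "d2": 1100.0},  # inflated scale
    "hard query": {"d3": 950.0, "d4": 850.0},    # deflated scale
}
print(calibrate(raw))
```

After calibration, the best document for each query sits at the same offset, so a single threshold or ranking cut behaves consistently across verticals.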

8. Training setup

We LoRA fine-tuned Qwen-4B and Qwen-1.7B on queries from healthcare, finance, legal, manufacturing, STEM, and code. Ablation studies show that mixing diverse domains yields the strongest performance within each vertical.

9. Performance

See full benchmarks here.

  • Outperforms BM25, OpenAI embeddings, and hybrid search by >+15% NDCG@10
  • Outperforms all other API-based rerankers by >+5% NDCG@10 on every domain
zerank-1 outperforms all API-based rerankers by >+5% NDCG@10 on every domain.
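For reference, NDCG@10, the metric behind these comparisons, rewards placing relevant documents near the top of the ranking. A standard implementation sketch, not code from the paper:

```python
import math

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """NDCG@k: discounted cumulative gain of the ranking, normalized by
    the DCG of the ideal (relevance-sorted) ordering."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Relevance labels in ranked order: higher labels earlier means a better score.
print(ndcg_at_k([3, 2, 0, 1]))  # near-ideal ranking, close to 1.0
print(ndcg_at_k([0, 1, 2, 3]))  # relevant docs buried, noticeably lower
```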

10. Availability

  • Accessible via API + AWS Marketplace
  • Open weights on HuggingFace
  • Latency: 129 ms p50 for 75 kB payloads (the fastest API-based reranker we’re aware of)
  • Cost: $0.025 / 1M tokens (the cheapest API-based reranker we’re aware of)
