Paper TLDR: How we trained zerank-1 with the zELO method

Sep 19, 2025
TL;DR

A 10-step summary of the zELO paper: how ZeroEntropy trained zerank-1 using Elo-inspired pairwise LLM comparisons, sparse sampling, and cross-query calibration to build the fastest and cheapest API-based reranker, one that outperforms all competitors by >+5% NDCG@10 on every domain.

By popular demand, this is an executive TLDR of the paper zELO: ELO-inspired Training Method for Rerankers and Embedding Models.

**Try zerank-1 and zerank-1-small in this 5-line Google Colab.**

1. What is a reranker?

A reranker is a cross-encoder that takes a query and candidate documents and reorders them for accuracy. It’s the step that makes RAG actually useful: the reranker decides which 5-10 documents an LLM sees.

A reranker takes a query and candidate documents and reorders them by relevance.
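In code, the contract is simple: score each (query, document) pair jointly, then sort and keep the top few. A minimal sketch, with a toy token-overlap scorer standing in for a real cross-encoder model:

```python
def score(query: str, document: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query tokens found in the
    document. A real reranker encodes query and document jointly in one model."""
    q_tokens = set(query.lower().split())
    d_tokens = set(document.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens)

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[str]:
    """Reorder candidates by relevance score; the top_k survivors go to the LLM."""
    ranked = sorted(candidates, key=lambda d: score(query, d), reverse=True)
    return ranked[:top_k]

docs = [
    "the cat sat on the mat",
    "elo ratings for chess",
    "rerankers reorder retrieved documents",
]
print(rerank("how do rerankers reorder documents", docs, top_k=2))
```

The `score` function here is purely illustrative; in practice it is the cross-encoder's relevance head.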

2. Why not triplet loss + human annotations?

Traditional rerankers are trained on queries with mined positive and negative results. The goal is to find negatives that look relevant but aren’t, so the model learns fine-grained distinctions. But as mining improves, many of those “negatives” are actually relevant (false negatives), which confuses the model and degrades performance.

As negative mining improves, many 'negatives' are actually relevant—confusing the model and degrading performance.
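For context, the traditional objective looks roughly like this: push the positive's similarity above the negative's by a margin. A minimal sketch on precomputed similarity scores (the numbers are illustrative), showing how a false negative generates a spurious penalty:

```python
def triplet_loss(pos_sim: float, neg_sim: float, margin: float = 0.2) -> float:
    """Margin triplet loss on similarity scores: zero once the positive
    beats the negative by at least `margin`."""
    return max(0.0, margin - (pos_sim - neg_sim))

# True negative: the positive wins by more than the margin, so loss is 0.
print(triplet_loss(0.9, 0.3))   # 0.0
# False negative: both documents are actually relevant, so their scores are
# close, and the loss wrongly pushes a good document away from the query.
print(triplet_loss(0.8, 0.75))
```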

3. Pairwise with LLMs

Instead of human triplets, we use LLMs to compare document pairs. Pairwise comparisons are more robust, scale cheaply, and align well with human intuition.
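Concretely, each comparison is one LLM call that judges which of two documents better answers the query. A hedged sketch; the prompt wording and the `llm_judge` stand-in below are illustrative, not the paper's exact setup:

```python
def build_pairwise_prompt(query: str, doc_a: str, doc_b: str) -> str:
    """Ask a judge LLM for a single-token verdict: A or B."""
    return (
        f"Query: {query}\n\n"
        f"Document A: {doc_a}\n\n"
        f"Document B: {doc_b}\n\n"
        "Which document better answers the query? Reply with exactly 'A' or 'B'."
    )

def llm_judge(prompt: str) -> str:
    """Stand-in for a real LLM call; for demo purposes the longer document
    wins. A real judge returns a reasoned 'A' or 'B' verdict."""
    body = prompt.split("Document A: ")[1]
    doc_a, rest = body.split("\n\nDocument B: ")
    doc_b = rest.split("\n\nWhich document")[0]
    return "A" if len(doc_a) >= len(doc_b) else "B"

verdict = llm_judge(build_pairwise_prompt(
    "what is Elo?",
    "Elo is a rating system for pairwise games.",
    "Cats.",
))
print(verdict)  # A
```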

4. From pairwise to pointwise: Elo

We model outcomes with the Bradley–Terry / Elo system: documents “battle,” scores accumulate, and we get calibrated continuous relevance values.
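The aggregation step can be sketched with standard Elo updates over the sampled battles. A minimal illustration only: iterative Elo updates are a simple online approximation to fitting Bradley–Terry scores, not the paper's exact fitting procedure:

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0):
    """One Elo update: the expected win probability is a logistic
    (Bradley-Terry) function of the rating gap."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta

# All documents start at the same rating; each pairwise verdict nudges them apart,
# yielding continuous pointwise relevance scores from purely pairwise outcomes.
ratings = {"doc_a": 1000.0, "doc_b": 1000.0, "doc_c": 1000.0}
for winner, loser in [("doc_a", "doc_b"), ("doc_a", "doc_c"), ("doc_b", "doc_c")]:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])

print(sorted(ratings, key=ratings.get, reverse=True))  # ['doc_a', 'doc_b', 'doc_c']
```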

5. Tackling scale

Naively, for one query with k=100 candidates, scoring every ordered pair means k·(k−1) ≈ 10k pairwise inferences per query. Even our small fine-tuned pairwise model is too costly to run at this scale.

6. Sparse sampling

We solved this with random cycles sampling: O(n) comparisons instead of O(n²). Only ~400 pairs per query are needed, instead of 10k, with little accuracy loss.

Random cycles sampling reduces comparisons from O(n²) to O(n)—~400 pairs instead of 10k per query with little accuracy loss.
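A minimal sketch of the sampling idea, under the assumption that "random cycles" means drawing a few random permutations of the candidates and comparing adjacent pairs along each resulting cycle, so each cycle contributes n pairs instead of the n(n−1)/2 needed for full coverage:

```python
import random

def random_cycle_pairs(n_docs: int, n_cycles: int, seed: int = 0):
    """Sample comparison pairs from random cycles over the candidate set:
    each cycle contributes n_docs adjacent pairs, so total work is O(n)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_cycles):
        order = list(range(n_docs))
        rng.shuffle(order)
        # Compare neighbours along the cycle, wrapping around at the end.
        pairs.extend((order[i], order[(i + 1) % n_docs]) for i in range(n_docs))
    return pairs

pairs = random_cycle_pairs(n_docs=100, n_cycles=4)
print(len(pairs))  # 400 comparisons, versus 4950 unordered pairs for full coverage
```

Cycles also keep the comparison graph connected, so every document's Elo score stays anchored to the rest.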

7. Cross-query calibration

Elo is relative within one query. We estimate and subtract cross-query biases to align scores across corpora, so the reranker generalizes everywhere, across verticals and tasks.
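One simple way to picture this (a sketch of the idea only: here the per-query bias is modeled as an additive offset estimated by the query's mean Elo, a stand-in for the paper's bias-estimation procedure):

```python
def calibrate(elo_by_query: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """Remove each query's additive bias (estimated here as its mean Elo)
    so scores from different queries live on one comparable scale."""
    calibrated = {}
    for query, scores in elo_by_query.items():
        bias = sum(scores.values()) / len(scores)
        calibrated[query] = {doc: s - bias for doc, s in scores.items()}
    return calibrated

raw = {
    "easy query": {"d1": 1200.0, "d2": 1100.0},  # inflated scale
    "hard query": {"d3": 950.0, "d4": 850.0},    # deflated scale
}
print(calibrate(raw))
```

After calibration, the best document for each query sits at the same offset, so a single threshold or ranking cut behaves consistently across verticals.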

8. Training setup

We LoRA fine-tuned Qwen-4B and Qwen-1.7B on queries from healthcare, finance, legal, manufacturing, STEM, and code. Ablation studies show that mixing diverse domains yields the strongest performance within each vertical.

9. Performance

See full benchmarks here.

  • Outperforms BM25, OpenAI embeddings, and hybrid search by >+15% NDCG@10
  • Outperforms all other API-based rerankers by >+5% NDCG@10 on every domain
zerank-1 outperforms all API-based rerankers by >+5% NDCG@10 on every domain.
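For reference, NDCG@10, the metric behind these comparisons, rewards placing relevant documents near the top of the ranking. A standard implementation sketch, not code from the paper:

```python
import math

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """NDCG@k: discounted cumulative gain of the ranking, normalized by
    the DCG of the ideal (relevance-sorted) ordering."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Relevance labels in ranked order: higher labels earlier means a better score.
print(ndcg_at_k([3, 2, 0, 1]))  # near-ideal ranking, close to 1.0
print(ndcg_at_k([0, 1, 2, 3]))  # relevant docs buried, noticeably lower
```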

10. Availability

  • Accessible via API + AWS Marketplace
  • Open weights on HuggingFace
  • Latency: 129 ms p50 for 75 kB payloads (the fastest API-based reranker we’re aware of)
  • Cost: $0.025 / 1M tokens (the cheapest API-based reranker we’re aware of)
