Score Calibration (Rerankers)

Also known as: calibrated relevance score, absolute relevance

TL;DR

A calibrated reranker outputs scores whose absolute value is meaningful — 0.8 means roughly 80% relevance consistently across queries and domains, so you can threshold and filter reliably. Most rerankers are not calibrated.

A reranker’s scores can be either rank-correct (the order is right but the absolute numbers are arbitrary) or calibrated (the absolute number means something consistent across queries). Most rerankers are only rank-correct.

This matters because production AI systems frequently need to threshold:

Threshold-driven decisions that break without calibration

Filter weak candidates — drop anything below 0.5 before passing to the LLM.
Trigger clarification — if no candidate scores above 0.7, ask the user to refine.
Confidence labels — show “high confidence” only when top score is above 0.8.
Refusal logic — return “I do not know” instead of hallucinating when no source is highly relevant.
Routing — send high-confidence queries to a fast path, low-confidence to a slow agentic path.
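As a rough illustration of how these rules might be wired together, here is a minimal sketch. It assumes scores are calibrated and lie in [0, 1]; the `decide` function, the `Decision` type, and the mapping of "nothing survives the filter" to a refusal are hypothetical choices made for this example, and the thresholds are simply the illustrative values from the list above.

```python
from dataclasses import dataclass

# Illustrative thresholds from the list above; re-tune them per deployment.
FILTER_MIN = 0.5           # drop anything below this before the LLM sees it
CLARIFY_MIN = 0.7          # ask the user to refine if nothing scores above this
HIGH_CONFIDENCE_MIN = 0.8  # show a "high confidence" label only above this


@dataclass
class Decision:
    action: str            # "answer", "clarify", or "refuse"
    kept: list             # (doc_id, score) pairs passed on to the LLM
    high_confidence: bool


def decide(scored_docs):
    """scored_docs: list of (doc_id, score) pairs with calibrated scores in [0, 1]."""
    kept = [(d, s) for d, s in scored_docs if s >= FILTER_MIN]
    top = max((s for _, s in scored_docs), default=0.0)
    if not kept:
        # Assumption: with no usable source, refuse rather than guess.
        return Decision("refuse", [], False)
    if not top > CLARIFY_MIN:
        return Decision("clarify", kept, False)
    return Decision("answer", kept, top > HIGH_CONFIDENCE_MIN)
```

Every branch compares a score against a fixed constant, so the logic is only as trustworthy as the score scale is stable from one query to the next.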

If the scores aren’t calibrated, these thresholds work for some queries and break for others. A 0.7 from one query might mean “very relevant” and a 0.7 from another query might mean “barely relevant.” You can’t write robust rules against that.

Rank correctness is necessary for retrieval. Calibration is necessary for any decision built on top of retrieval — refusals, fallbacks, ensembling, abstention.

What calibration looks like in practice

A well-calibrated reranker has the property:

P(document is actually relevant | model scored it 0.7) ≈ 0.7

Plot model scores against ground-truth relevance and the line is approximately diagonal. Most off-the-shelf rerankers, by contrast, cluster scores in a narrow band regardless of true relevance — useful for relative ordering, useless as an absolute signal.

How it’s achieved in zerank-2

zerank-2’s training included explicit calibration via per-rater Beta distribution fits: each LLM rater (Claude, GPT, Gemini) gets a (μ, κ) calibration that’s iteratively mixed, so each rater’s judgment is weighted by how reliable it has actually been. The pointwise scores recovered from the Thurstone fit on top of those weighted preferences are calibrated against ground-truth relevance, not just against each other.

The result: a 0.8 from zerank-2 means roughly 80% relevance, consistently, across queries and domains.

Confidence statistic

Calibrated rerankers can additionally output a confidence score per call — “I’m 0.92 confident the relevance score I assigned is correct.” Together, score + confidence let downstream systems make robust decisions: “high score, high confidence → trust it; high score, low confidence → defer or ensemble.”

Three post-hoc approaches to calibrating an already-trained reranker, in order of cost and quality:

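Platt scaling. Fit a single logistic regression on top of the reranker’s raw scores using a held-out labeled set: P(relevant | s) = 1 / (1 + exp(−(a·s + b))), where s is the raw score and a, b are the two fitted parameters. Cheap, needs only hundreds of labeled examples, and corrects monotone miscalibration but not query-dependent bias.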
Isotonic regression. Fit a non-parametric monotone function from raw scores to empirical relevance rates. More flexible than Platt; needs more data (thousands of examples) and is prone to overfitting at distribution boundaries.
Per-query Z-normalization. Normalize each query’s scores to mean 0, std 1, then apply a single global calibration on the normalized scores. Handles query-dependent score shifts that Platt and isotonic miss.
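The three approaches above can be sketched in a few dozen lines of standard-library Python. This is a teaching sketch under stated assumptions, not a production implementation: the function names are hypothetical, Platt scaling is fitted with plain gradient descent rather than a tuned solver, isotonic regression uses the pool-adjacent-violators algorithm without special handling of tied scores, and labels are assumed to be binary (1 = relevant, 0 = not).

```python
import math
from statistics import mean, pstdev


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.5, steps=5000):
    """Platt scaling: fit P(relevant | s) = sigmoid(a*s + b) by minimizing log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # gradient of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def fit_isotonic(scores, labels):
    """Pool-adjacent-violators: returns [(right_edge_score, relevance_rate), ...]."""
    blocks = []  # each block is [sum_of_labels, count, right_edge_score]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the newest block does not exceed its left neighbor.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]):
            total, count, edge = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = edge
    return [(edge, total / count) for total, count, edge in blocks]


def predict_isotonic(blocks, s):
    # Step-function lookup; scores beyond the last edge get the last block's rate.
    for edge, value in blocks:
        if s <= edge:
            return value
    return blocks[-1][1]


def z_normalize(query_scores):
    """Per-query normalization to mean 0, std 1 (population std)."""
    mu = mean(query_scores)
    sd = pstdev(query_scores)
    if sd == 0:
        return [0.0 for _ in query_scores]
    return [(s - mu) / sd for s in query_scores]
```

For the third approach, one would call `z_normalize` on each query's candidate scores first and then fit a single global calibrator (for example `fit_platt`) on the pooled normalized scores.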

All three are bandages on a fundamentally uncalibrated training pipeline. The clean fix is the zerank-2 approach: use a Thurstone fit during training to produce calibrated targets, train pointwise against those, and skip the post-hoc calibration step entirely.

A reranker’s calibration is implicit in its training distribution. If 70% of your training pairs were “moderately relevant” and 10% were “highly relevant,” the model learns a score distribution biased toward the moderate range — and when deployed on a domain where the actual distribution is bimodal (50% irrelevant, 50% perfectly relevant), the score histogram looks completely different.

This is why off-the-shelf reranker scores almost never map directly onto your domain’s relevance scale. Two practical defenses: (1) re-validate the threshold on each new domain — never reuse “0.7 is the trust threshold” across deployments, and (2) prefer rerankers explicitly trained for cross-domain calibration. The zELO methodology was designed for this — Beta-distribution per-rater calibration plus Thurstone fits give consistent score semantics across domain shifts that Platt-corrected models can’t match.
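Re-validating a threshold on a new domain can be as simple as the following sketch. It assumes you have a small labeled sample from the target domain with binary relevance labels; `pick_threshold` and the target-precision criterion are hypothetical choices for illustration, and other criteria (recall, F1, cost-weighted loss) could be substituted.

```python
def pick_threshold(scores, labels, target_precision=0.9):
    """Return the lowest cutoff t such that the documents with score >= t
    reach target_precision on the labeled sample, or None if no cutoff does."""
    pairs = sorted(zip(scores, labels), reverse=True)  # highest score first
    best = None
    relevant = 0
    for kept, (s, y) in enumerate(pairs, start=1):
        relevant += y
        # Evaluate only at the last item of a run of tied scores,
        # so a cutoff never splits documents with the same score.
        if kept < len(pairs) and pairs[kept][0] == s:
            continue
        if relevant / kept >= target_precision:
            best = s
    return best
```

Running this separately on each domain's labeled sample makes any drift in the cutoff visible, rather than assuming one number carries over.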

Go further

How do you actually measure whether a reranker is calibrated?

Bin model scores (0.0-0.1, 0.1-0.2, …) and compute the empirical relevance rate within each bin against held-out labels. A calibrated model’s plot lies on the diagonal; a miscalibrated one curves or scatters. Expected Calibration Error (ECE), the average gap between mean score and empirical relevance rate per bin, weighted by the number of examples in each bin, is the standard summary statistic.
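The binning procedure described above might look like this in plain Python. It is a minimal sketch assuming scores lie in [0, 1] and labels are binary; the function name and the equal-width, ten-bin default are illustrative choices.

```python
def expected_calibration_error(scores, labels, n_bins=10):
    """Return (ece, rows), where each row is
    (bin_low, bin_high, mean_score, relevance_rate, count) for a non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of exactly 1.0 goes in the last bin
        bins[idx].append((s, y))

    n = len(scores)
    ece = 0.0
    rows = []
    for i, members in enumerate(bins):
        if not members:
            continue
        mean_score = sum(s for s, _ in members) / len(members)
        relevance_rate = sum(y for _, y in members) / len(members)
        rows.append((i / n_bins, (i + 1) / n_bins, mean_score, relevance_rate, len(members)))
        # Weight each bin's gap by the share of examples that fall in it.
        ece += len(members) / n * abs(mean_score - relevance_rate)
    return ece, rows
```

Plotting `mean_score` against `relevance_rate` from `rows` gives the reliability diagram; `ece` collapses it to a single number, where 0 means the binned scores match the observed relevance rates exactly.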


Why does this matter more for context compression than ranking?

For pure ranking, only the order matters — uncalibrated scores still sort correctly. The moment you start thresholding (drop everything below 0.5, or summarize sources scoring 0.3-0.6), uncalibrated scores silently pick different cutoffs on every query.


What's the connection between calibration and pairwise training?

A pure pairwise loss only constrains relative ordering, so absolute scores can drift anywhere. zELO recovers calibration by passing pairwise judgments through a Thurstone fit (which produces continuous targets on a fixed scale) and then training pointwise against those — calibration is baked in, not retrofitted.
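To make the idea of a Thurstone fit concrete, here is a generic textbook-style sketch, not the zELO implementation. It assumes a Thurstone-type model in which P(i preferred over j) = Φ(s_i − s_j), with the noise scale absorbed into the scores, and fits the latent scores by gradient ascent on the log-likelihood. The function name, the `(winner, loser, count)` input format, and the mean-zero convention are assumptions made for illustration; mapping the latent scores onto a [0, 1] training target is a separate step not shown here.

```python
import math


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def fit_thurstone(n_items, wins, lr=0.5, steps=5000):
    """wins: list of (winner, loser, count) tuples.
    Returns latent scores with mean 0 that maximize the probit log-likelihood."""
    total = sum(c for _, _, c in wins) or 1
    s = [0.0] * n_items
    for _ in range(steps):
        grad = [0.0] * n_items
        for i, j, c in wins:
            d = s[i] - s[j]
            p = max(normal_cdf(d), 1e-12)   # guard against division by zero
            g = c * normal_pdf(d) / p       # d/dd of c * log Phi(d)
            grad[i] += g
            grad[j] -= g
        s = [v + lr * g / total for v, g in zip(s, grad)]
        # Scores are identified only up to a shift, so pin the mean to 0.
        m = sum(s) / n_items
        s = [v - m for v in s]
    return s
```

Unlike a raw pairwise loss, the fitted scores live on a single shared scale across all items, which is what makes them usable as pointwise regression targets.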

