Calibration-Discrimination Analysis

Also known as: calibration AUC, score-vs-target discrimination, calibration residual analysis

TL;DR

When you compare two scoring systems on the same items — index score vs reranker score, model score vs ground-truth grade — the residuals from a regression line tell you where they disagree.

You have two scoring systems on the same set of items: say, a first-pass retriever’s similarity and a reranker’s relevance score, or a model’s predicted relevance and a graded-relevance judge’s label. The standard summary is Pearson or Spearman correlation. That gives you one number. The technique here gives you a per-item residual, lets you stratify those residuals by any known subset, and turns the gap between subset distributions into a Youden-J-style discrimination statistic.

The mental picture: regress one score on the other. The fitted line is the global agreement. The signed vertical distance from each point to the line (its regression residual) is its disagreement. If a known-positive subset has systematically larger residuals on the agreement-positive side, the two scorers are jointly identifying positives, and the size of the gap tells you how cleanly.

[Figure: Calibration vs discrimination, two orthogonal qualities of a good probabilistic model. Left panel: reliability diagram ("does 0.7 happen 70 percent of the time?"), predicted probability vs observed frequency, with the perfect line y = x and expected calibration error (ECE) 0.013. Right panel: ROC curve ("can it separate positives from negatives?"), false positive rate vs true positive rate, with the chance line AUC = 0.5 and AUC = 0.88. The two qualities are independent: calibrated and discriminating (the goal); miscalibrated but discriminating (over/under-confident but ranks right); calibrated but not discriminating (honest but uninformative); neither (broken).]

The Setup — Two Scoring Systems, One Ground-Truth Subset

Concretely, you have:

  • A set of items — typically (query, document) pairs.
  • Two scores per item $i$: $x_i$ (e.g., retriever similarity) and $y_i$ (e.g., reranker score).
  • A subset $G$ of ground-truth positives: items you know are correct, from labeled qrels or a held-out gold set.

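Fit a linear regression $\hat{y} = a x + b$ on the full set, with one score as $x$ and the other as $y$. For each item $i$, compute the signed residual $r_i = y_i - (a x_i + b)$. Positive residuals mean “$y$ scored higher than $x$ would predict”; negative residuals mean “$y$ scored lower.”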

The residuals are the per-point disagreement signal. Correlation summarizes their squared sum into a single number. Stratifying them by a known subset surfaces where the two scorers reach the same conclusion, and where they don’t.
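The regression-and-residual step can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `fit_residuals` and the five toy scores are assumptions made up for the example, and ordinary least squares via `np.polyfit` is assumed as the fitting method.

```python
import numpy as np

def fit_residuals(x, y):
    """Fit y ~ a*x + b by ordinary least squares; return (a, b, residuals)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a, b = np.polyfit(x, y, deg=1)   # coefficients come back highest degree first
    residuals = y - (a * x + b)      # signed vertical distance to the fitted line
    return a, b, residuals

# Toy data (hypothetical): five items scored by two systems.
x = np.array([0.10, 0.30, 0.50, 0.70, 0.90])   # e.g., retriever similarity
y = np.array([0.05, 0.20, 0.65, 0.60, 0.95])   # e.g., reranker score
a, b, r = fit_residuals(x, y)
print(a, b, r)
```

With an intercept in the fit, the residuals sum to zero, so "above the line" and "below the line" are measured against a balanced baseline.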

Residuals as Agreement Signal

Two scoring systems that perfectly agree (modulo a linear rescaling) have residuals concentrated tightly around zero. Two systems with weak agreement have residuals spread wide. The interesting case is when residuals are spread and structured — when subsets of items reliably sit above or below the regression line.

For our purposes, “structured” has a specific meaning: the ground-truth-positive subset should have residuals systematically above zero. The intuition: a positive item is one both scorers are likely to score high on, and “both high” lifts the residual above the regression-implied baseline. A non-positive item that one scorer happened to over-score and the other appropriately under-scored has a residual driven by single-scorer noise, not joint agreement.

If you plot two histograms, residuals over the ground-truth-positive subset $G$ vs residuals over the complement, they should be displaced. The displacement is the discrimination signal.

Discrimination Via Residual-Distribution Divergence

For a candidate threshold $\tau$, define:

$$J(\tau) = P(r_i > \tau \mid i \in G) - P(r_i > \tau \mid i \notin G)$$

where $r_i$ is the residual of item $i$ and $G$ is the ground-truth-positive subset. This is exactly the gap, at threshold $\tau$, between the GT-positive residual survival function and the residual survival function of the remaining items. Its maximum over $\tau$ is the Youden-J statistic for the residual distribution, and the argmax is the matching threshold:

$$J^* = \max_\tau J(\tau), \qquad \tau^* = \arg\max_\tau J(\tau)$$

$J^*$ is the maximum vertical gap between the two ROC-style curves traced out as $\tau$ sweeps. It is a single number on $[0, 1]$ that measures how cleanly the two scorers’ joint agreement separates ground-truth positives from the rest. $\tau^*$ is the threshold that achieves the gap: the operationally useful number, because it’s the cutoff you’d actually pick.

$J^* = 0$ means the two scorers’ agreement carries no information about positivity. $J^* = 1$ means the agreement perfectly separates $G$ from its complement. Real systems land in the middle, and the absolute number is comparable across model versions in a way correlation alone is not.
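A small sketch of the threshold sweep, assuming the statistic is the gap between the fraction of positive residuals above a threshold and the fraction of non-positive residuals above it. The function name `youden_j` and the six toy residuals are hypothetical. Candidate thresholds are taken to be the observed residual values, since the empirical gap only changes there.

```python
import numpy as np

def youden_j(residuals, is_positive):
    """Return (J_star, tau_star) for J(tau) = P(r > tau | G) - P(r > tau | not G)."""
    r = np.asarray(residuals, dtype=float)
    pos = np.asarray(is_positive, dtype=bool)
    r_pos = np.sort(r[pos])
    r_neg = np.sort(r[~pos])
    taus = np.unique(r)  # sorted candidate thresholds
    # searchsorted(..., side="right") counts values <= tau, so 1 - count/n is the
    # fraction of each group strictly above tau (the empirical survival function).
    surv_pos = 1.0 - np.searchsorted(r_pos, taus, side="right") / r_pos.size
    surv_neg = 1.0 - np.searchsorted(r_neg, taus, side="right") / r_neg.size
    j = surv_pos - surv_neg
    best = int(np.argmax(j))
    return float(j[best]), float(taus[best])

# Toy residuals (hypothetical): two labeled positives among six items.
residuals = np.array([-0.3, -0.1, 0.0, 0.1, 0.2, 0.4])
is_positive = np.array([False, False, False, True, False, True])
j_star, tau_star = youden_j(residuals, is_positive)
print(j_star, tau_star)
```

On this toy input the best cutoff is 0.0: both positives lie above it while only one of the four non-positives does, giving a gap of 1.0 - 0.25 = 0.75.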

What This Catches That Raw Correlation Doesn’t

A high Pearson correlation between two scoring systems is a necessary condition for them to be useful together. It is not sufficient. Two scorers can be highly correlated globally and still systematically disagree on the cases that matter — the positives.

Failure modes raw correlation hides
  • Correlated noise on negatives, agreement on positives. Both scorers are noisy on irrelevant items but precisely aligned on relevant ones. Pearson sees high global correlation; residual analysis sees a tight, displaced GT-positive cluster.
  • Correlated agreement on negatives, divergence on positives. Both scorers cleanly identify obvious negatives but disagree on which positives are most relevant. Pearson is high; residual GT distribution is wide and centered near zero. The metric flags it; correlation doesn’t.
  • One scorer is well-calibrated, one isn’t. The reranker assigns 0.9 to true positives and 0.1 to negatives. The retriever assigns 0.6 to positives and 0.4 to negatives: same order, much smaller dynamic range. Pearson is high. Residual analysis reveals the retriever’s residuals on the positive subset $G$ barely differ from its residuals on the complement, even though the reranker’s do.

This is the same logic behind the standard calibration diagnostics (model scores can have the right rank order and the wrong scale), applied jointly to two scorers instead of one scorer plus ground truth.
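To make "necessary but not sufficient" concrete, here is a synthetic sketch in which Pearson correlation is high yet the residuals say almost nothing about the labeled subset. Everything in it is an assumption for illustration: the data are simulated, the helper `j_star` is hypothetical, and positives are deliberately defined by one scorer alone (top 5% of `x`), so the second scorer adds no information about them beyond what the first already predicts.

```python
import numpy as np

def j_star(r, pos):
    """Maximum gap between the survival functions of r on pos and on ~pos."""
    taus = np.unique(r)
    def surv(v):
        return 1.0 - np.searchsorted(np.sort(v), taus, side="right") / v.size
    return float(np.max(surv(r[pos]) - surv(r[~pos])))

rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)                        # simulated first scorer
y = x + rng.normal(scale=0.5, size=n)         # second scorer: x plus independent noise
pos = x > np.quantile(x, 0.95)                # "positives" defined by x alone

rho = np.corrcoef(x, y)[0, 1]
a, b = np.polyfit(x, y, deg=1)
r = y - (a * x + b)
j = j_star(r, pos)
print(f"Pearson rho = {rho:.2f}, J* = {j:.2f}")
```

The correlation comes out high because the two scores share most of their variance, while the gap statistic stays close to zero because the residual is just the independent noise term, which is distributed the same way inside and outside the labeled subset.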

Worked Example — Index Score vs Reranker Score

Take a retrieval task with 10K (query, document) pairs. For each pair $i$, you have $x_i$ from a dense-retrieval bi-encoder and $y_i$ from a cross-encoder reranker. You also have a judged subset $G$ of items whose relevance grade marks them as positive.

The procedure:

  1. Fit the regression. $\hat{y} = a x + b$ on all 10K pairs. Pearson $\rho$ might be 0.6: the two scorers globally agree.
  2. Compute residuals. $r_i = y_i - (a x_i + b)$ for every $i$.
  3. Stratify. Split residuals into $\{r_i : i \in G\}$ and $\{r_i : i \notin G\}$.
  4. Sweep $\tau$. Compute $J(\tau)$ over a fine grid of candidate thresholds. Take the argmax.

A typical result on a well-behaved retriever-reranker pair: $J^* = 0.40$, $\tau^* = 0.05$, meaning the threshold “GT positives have residuals above 0.05” separates them from non-positives with a 40-percentage-point survival-function gap. That is more discrimination signal than either score alone provides on the same set, and it gives you both a quality summary ($J^*$) and an operating threshold ($\tau^*$).
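The four steps can be run end to end on simulated data. The sketch below is illustrative only: the scores are synthetic, the 5% positive rate and the size of the lift given to positives are arbitrary assumptions, and the numbers it prints are not the figures quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
is_pos = rng.random(n) < 0.05                 # synthetic ground-truth subset (about 5%)
z = rng.normal(size=n)                        # latent signal both scorers pick up
x = z + rng.normal(scale=0.5, size=n) + 0.5 * is_pos   # simulated retriever score
y = z + rng.normal(scale=0.5, size=n) + 1.5 * is_pos   # simulated reranker score

# 1. Fit the regression on all pairs.
a, b = np.polyfit(x, y, deg=1)
rho = np.corrcoef(x, y)[0, 1]

# 2. Compute residuals.
r = y - (a * x + b)

# 3. Stratify by the ground-truth subset.
r_pos, r_neg = r[is_pos], r[~is_pos]

# 4. Sweep tau over a fine grid and take the argmax of the survival-function gap.
taus = np.linspace(r.min(), r.max(), 2001)
surv_pos = (r_pos[None, :] > taus[:, None]).mean(axis=1)
surv_neg = (r_neg[None, :] > taus[:, None]).mean(axis=1)
j = surv_pos - surv_neg
j_star = float(j.max())
tau_star = float(taus[j.argmax()])

print(f"rho = {rho:.2f}, J* = {j_star:.2f}, tau* = {tau_star:.2f}")
```

The grid here is a fixed 2001-point `linspace`; sweeping over the observed residual values instead would give the exact empirical maximum at the cost of a larger comparison matrix.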

The regression residual is the natural choice because it is the projection of disagreement onto the dependent-variable axis, not onto an arbitrary diagonal. If you swap which score is dependent, the residuals change, and in general so do the rank order over items and the resulting argmax, so pick the dependent score once and keep it fixed across comparisons.

Perpendicular distances (i.e., total least squares residuals) are sometimes proposed for “symmetric” treatment of the two scores, but they are sensitive to the relative scaling of $x$ and $y$ in a way regression residuals are not. Z-normalize both first, then perpendicular distance is fine, but at that point, when the correlation is high, you’ve recovered something close to a rescaled regression residual anyway.

Raw deviations $y_i - x_i$ are the wrong move when the two scores live on different scales: they collapse calibration error and disagreement into one number. Residuals from the fit explicitly subtract the calibration effect, leaving only the disagreement.

The correct mental model: the regression line is the global “calibration map” between the two scorers, and the residuals are the leftover disagreement after that map is removed.
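A quick way to see the difference between raw deviations and fit residuals is to check how each relates to the input score. The sketch below uses simulated scores on deliberately mismatched scales (the slope of 10 and offset of 3 are arbitrary assumptions): the raw deviation mostly tracks the scale mismatch, while the least-squares residual is uncorrelated with `x` by construction.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
x = rng.uniform(0.0, 1.0, size=n)                     # scorer on a [0, 1] scale
y = 10.0 * x + 3.0 + rng.normal(scale=0.5, size=n)    # scorer on a different scale

raw_dev = y - x                          # mixes the scale mismatch with disagreement
a, b = np.polyfit(x, y, deg=1)
resid = y - (a * x + b)                  # the fitted "calibration map" is removed

corr_raw = np.corrcoef(x, raw_dev)[0, 1]     # close to 1: dominated by the scale effect
corr_resid = np.corrcoef(x, resid)[0, 1]     # numerically zero for an OLS fit
print(f"corr(x, raw deviation) = {corr_raw:.3f}")
print(f"corr(x, residual)      = {corr_resid:.1e}")
```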

Where This Connects To Standard Tools

This technique is essentially AUC-style ROC analysis carried out in residual space, summarized by the maximum gap rather than the area under the curve. The recipe matches standard ROC analysis (sweep a threshold, plot two survival functions against each other, take the maximum vertical gap), but the coordinate is residual, not raw score. The advantages over raw-score AUC are exactly the advantages of looking at calibration leftovers over raw outputs: you’ve subtracted the global agreement, and the remaining signal is purely about where the two scorers disagree.

It is most useful as a diagnostic: comparing $J^*$ across model versions, across datasets, or across reranker-retriever pairs tells you whether your stack’s joint discrimination is improving in a way correlation alone can hide. It is less useful as a deployed threshold: $\tau^*$ is overfit to the specific GT subset you used and will drift across domains the same way a raw-score threshold does.

Summary

Calibration-discrimination analysis is the right tool when you have two scoring systems and a known-positive subset and you want to know how cleanly the two scorers jointly identify positives. Fit a regression, take residuals, stratify by GT, compute the Youden-J of the residual distribution. The result is a single comparable number ($J^*$) and an operating threshold ($\tau^*$) that together summarize what global correlation cannot.

Go further

Why look at residuals instead of just correlating the two scores directly?

Correlation is a single scalar: it tells you the global shape of the relationship and nothing about where it breaks. Residuals are per-point: each pair gets a signed distance from the regression line, and you can stratify those distances by any subset (ground-truth-positive, ground-truth-negative, query domain, document length). The residual distribution is the high-resolution version of the correlation coefficient $\rho$, and it's where the diagnostic signal lives.

How is this different from AUC on raw scores?

Raw-score AUC asks: how well does a single score separate positives from negatives? Residual AUC asks: how well does the joint agreement of two scores separate positives from negatives? It catches cases where neither score alone is discriminative but their agreement is — e.g., a positive doc where both the retriever and the reranker scored high, even if either score alone is noisy. It's strictly more information than either single AUC.

When does this technique fail?

Three cases. (1) When the two scoring systems are essentially identical — a reranker distilled from the same teacher as the retriever — residuals are tiny and noise-dominated. (2) When the ground-truth-positive subset is too small (under ~50 items), the empirical residual distribution is too noisy to trust the argmax. (3) When the relationship between the two scores is non-linear; the regression line is the wrong reference, and you should fit a non-parametric (LOWESS, isotonic) calibration first, then take residuals from that.
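For case (3), the sketch below takes residuals from a non-parametric reference curve instead of a straight line. Note the swap: rather than LOWESS or isotonic regression, it uses a cruder stand-in, a piecewise-linear curve through quantile-bin means, chosen only because it needs nothing beyond NumPy. The function name, the bin count, and the quadratic test data are all assumptions, and the binning assumes `x` has few enough ties that no quantile bin is empty.

```python
import numpy as np

def binned_calibration_residuals(x, y, n_bins=20):
    """Residuals of y around a piecewise-linear curve through quantile-bin means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    # Assign each item to a quantile bin; clip so the largest x lands in the last bin.
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    centers = np.array([x[idx == k].mean() for k in range(n_bins)])
    means = np.array([y[idx == k].mean() for k in range(n_bins)])
    fitted = np.interp(x, centers, means)    # flat extrapolation beyond the end bins
    return y - fitted

# Simulated scores with a clearly non-linear relationship.
rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, size=5_000)
y = x ** 2 + rng.normal(scale=0.02, size=5_000)

a, b = np.polyfit(x, y, deg=1)
linear_resid = y - (a * x + b)               # still contains the curvature
curve_resid = binned_calibration_residuals(x, y)
print(linear_resid.std(), curve_resid.std())
```

The linear residuals carry the leftover curvature, so their spread is inflated by structure that has nothing to do with disagreement; the residuals around the binned curve are much closer to the injected noise level. The same stratify-and-sweep steps then apply to `curve_resid` unchanged.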
