Pearson Correlation

Q: When does a high Pearson r mislead you?

Whenever the true relationship is monotone but non-linear. A perfect $y = x^2$ over $[0, 1]$ has Pearson r around 0.97; over $[-1, 1]$ it has Pearson r exactly 0. It is the same function, with the same monotone-on-each-half structure; Pearson is just measuring the wrong thing. Switch to Spearman if you only care that bigger x predicts bigger y.

Q: Why does Pearson break on retrieval scores?

Reranker logits and dense-retrieval similarities are often heavy-tailed: a few queries score very high, most score modestly. Pearson is dominated by those tail observations because it weights squared deviations. A single high-score outlier can shift the reported r by 0.1 or more. Log-transform the scores, trim outliers, or report Spearman alongside.

Q: Is Pearson's r a real test statistic?

Yes. Under the null hypothesis of zero correlation and bivariate normality, $t = r\sqrt{(n-2)/(1-r^2)}$ follows a t-distribution with $n-2$ degrees of freedom. So you can attach a p-value to any r. But the bivariate-normality assumption is rarely satisfied for ML scores; bootstrap confidence intervals are usually more honest.

Also known as: Pearson r, Pearson product-moment correlation, linear correlation

TL;DR

Pearson's $r$ measures the strength of a linear relationship between two variables, on a scale of $[-1, 1]$.

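Pearson correlation is the textbook scalar for “how linearly related are these two variables.” Defined as the covariance normalized by the product of standard deviations:

$$r = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$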

The result lives in $[-1, 1]$. Plus one is a perfect upward line, minus one a perfect downward line, zero is no linear association. It is the most-cited correlation coefficient in science, and the most misused, because most relationships in the wild are not lines.
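As a minimal sketch of that definition, the snippet below computes Pearson's r from deviations about the mean and checks it against NumPy's `np.corrcoef`. The helper name `pearson_r` and the toy data are illustrative assumptions, not something the original text specifies.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    # The 1/n factors in the covariance and the standard deviations cancel,
    # so plain sums of deviations are sufficient.
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Toy data (hypothetical): roughly linear with a little noise.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r_manual = pearson_r(x, y)
r_numpy = float(np.corrcoef(x, y)[0, 1])
print(r_manual, r_numpy)
```

Both values should agree to floating-point precision; in practice `np.corrcoef` is the shorter route.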

Linear, not monotone

The single fact to internalize: Pearson measures linear dependence. A perfectly deterministic but curved relationship can have any Pearson value, including zero. The classic illustration is $y = x^2$ on a symmetric interval around the origin: the relationship is exact, but $x$ and $-x$ map to the same $y$, so every positive product of deviations is offset by an equal negative one and the covariance numerator cancels to zero. Pearson reports $r = 0$ for a relationship a five-year-old could draw.
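The parabola case is easy to check numerically. The sketch below evaluates $y = x^2$ on an evenly spaced grid, once over $[-1, 1]$ and once over $[0, 1]$; the grid size and variable names are arbitrary choices for illustration.

```python
import numpy as np

# Symmetric interval: the relationship is exact, yet r cancels to zero.
x_sym = np.linspace(-1.0, 1.0, 201)
r_sym = float(np.corrcoef(x_sym, x_sym ** 2)[0, 1])

# Non-negative interval: same function, now monotone, and r is high.
x_half = np.linspace(0.0, 1.0, 201)
r_half = float(np.corrcoef(x_half, x_half ** 2)[0, 1])

print(f"r on [-1, 1]: {r_sym:.3f}")
print(f"r on [0, 1]:  {r_half:.3f}")
```

On the symmetric grid r is zero up to rounding error; on the half interval it is close to 0.97, even though the function is the same.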

This is why Anscombe’s quartet (four datasets with nearly identical Pearson $r$, means, and regression lines, but radically different scatter plots) is a standard cautionary tale. Pearson summarizes one moment of one projection of your joint distribution. If the structure does not live there, Pearson cannot see it.

Why retrieval evals reach for it

Pearson is the right correlation when the absolute values of two score sequences need to track each other, that is, when you care about calibration, not just ranking. Three examples:

Reranker calibration check — a cross-encoder’s logits should track human-judged graded relevance (0-3 labels). Pearson on (logit, label) pairs answers “is the score linearly meaningful?” If r is high but Spearman is higher, the model knows the order but not the scale.
LLM-as-judge agreement — when an LLM judge outputs a 1-5 score, you want Pearson against human 1-5 scores. Same numerical scale on both sides; Pearson tells you they map to each other linearly.
Ensemble score fusion — fusing two retrievers with α · score_A + (1-α) · score_B requires that the two score scales be roughly comparable. Pearson between them tells you whether linear combination is even meaningful.

For pure ranking quality (“does retriever A’s order agree with retriever B’s order?”), Pearson is the wrong tool. Use Spearman.

Failure modes

Three places where Pearson silently lies:

Heavy tails. Pearson weights squared deviations, so one or two outliers dominate the sum. Reranker logits are unbounded and often log-normal-ish; a single high-confidence prediction can move r by 0.1. Either trim, log-transform, or switch to a rank-based correlation.
Non-linearity. As above. Even if the relationship is monotone, Pearson under-reports its strength when the curve is sigmoidal, exponential, or logarithmic.
Restricted range. If you correlate only over a narrow slice of $x$, you cut down the variance in the denominator and can artificially inflate or deflate r. Filtering your eval set to “hard queries” before computing correlation is a common silent footgun.
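To make the heavy-tail failure mode concrete, here is a small sketch with made-up numbers (not real reranker output): eight points with a weak negative relationship, then the same points plus one extreme observation.

```python
import numpy as np

# Hypothetical scores: a weak, slightly negative relationship.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([4.0, 1.0, 3.0, 2.0, 4.0, 1.0, 3.0, 2.0])
r_without = float(np.corrcoef(x, y)[0, 1])

# Append a single extreme pair, far from the bulk on both axes.
x_out = np.append(x, 50.0)
y_out = np.append(y, 50.0)
r_with = float(np.corrcoef(x_out, y_out)[0, 1])

print(f"r without outlier: {r_without:.3f}")
print(f"r with outlier:    {r_with:.3f}")
```

The single added point turns a weak negative r into a value near 1, because its squared deviations swamp every other term in the sums.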

The point estimate of $r$ is just an arithmetic formula and works on any data. But the standard t-test for $H_0\colon \rho = 0$ assumes the joint distribution of $(x, y)$ is bivariate normal. Under that assumption:

$$t = r\sqrt{\frac{n-2}{1-r^2}} \sim t_{n-2}$$

When the data are heavy-tailed, this t-test mis-states the false-positive rate, sometimes badly. The honest move is to bootstrap: resample the $(x_i, y_i)$ pairs with replacement $B$ times, compute $r$ on each resample, and report the empirical 95% interval.

For ML score correlations, where the per-pair distribution is rarely Gaussian, bootstrap CIs are essentially always the right answer. Cost: a few milliseconds in NumPy.
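One way the bootstrap could look in NumPy is sketched below. It is a plain percentile bootstrap; the function name `bootstrap_pearson_ci`, the default of 2000 resamples, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def bootstrap_pearson_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Pearson r (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)
    rs = np.empty(n_boot)
    for b in range(n_boot):
        # Resample (x_i, y_i) pairs together, with replacement.
        idx = rng.integers(0, n, size=n)
        rs[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    lo, hi = np.percentile(rs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

# Synthetic example data (assumed): linear signal plus Gaussian noise.
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = x + 0.5 * rng.normal(size=200)

r_hat = float(np.corrcoef(x, y)[0, 1])
lo, hi = bootstrap_pearson_ci(x, y)
print(f"r = {r_hat:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

The index array is shared between `x` and `y` so that pairs stay intact. A resample with zero variance would yield NaN; that is vanishingly unlikely for continuous scores at this sample size but worth guarding against on small or heavily tied data.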

A working diagnostic

Compute both Pearson and Spearman whenever you are correlating two score sequences. Three regimes:

Pearson ≈ Spearman, both high. A clean linear relationship. Score calibration is good.
Pearson < Spearman. The order is right but the scale is wrong — either non-linear monotone or distorted by outliers. Probably want a calibration map (Platt, isotonic) before using these scores in any downstream linear combination.
Pearson high, Spearman lower. Almost always means a few extreme observations are driving the linear fit while the bulk of the data shows weaker rank agreement. Treat with suspicion; trim and recompute.

The correlation pair takes one extra line and tells you which kind of relationship you actually have.
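The diagnostic above can be sketched with NumPy alone by computing Spearman as Pearson on average ranks. The helper names, the labels returned by `regime`, and the 0.05 tolerance are hypothetical choices, not thresholds from the original text.

```python
import numpy as np

def average_ranks(a):
    """1-based ranks, with tied values sharing their average rank."""
    a = np.asarray(a, dtype=float)
    order = np.argsort(a, kind="stable")
    ranks = np.empty(len(a), dtype=float)
    ranks[order] = np.arange(1, len(a) + 1)
    # Average the ranks within each group of equal values.
    _, inverse, counts = np.unique(a, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]

def correlation_pair(x, y):
    """Return (Pearson r, Spearman rho); rho is Pearson on the ranks."""
    r = float(np.corrcoef(x, y)[0, 1])
    rho = float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])
    return r, rho

def regime(r, rho, tol=0.05):
    """Label the relationship; tol is an arbitrary illustrative threshold."""
    if abs(r - rho) <= tol:
        return "pearson ~ spearman"
    return "pearson < spearman" if r < rho else "pearson > spearman"

# Monotone but strongly non-linear: the order is perfect, the scale is not.
x = np.linspace(0.1, 5.0, 50)
y = np.exp(x)
r, rho = correlation_pair(x, y)
print(f"Pearson = {r:.3f}, Spearman = {rho:.3f}, regime: {regime(r, rho)}")
```

For the exponential example Spearman is 1 while Pearson is noticeably lower, which lands in the "order right, scale wrong" regime.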
