Pointwise Scoring

Q: Why is calibration only possible with pointwise scoring?

Calibration requires that the score has meaning across queries — '0.7 means 70% relevant' should hold for query A and query B alike. Pointwise produces a per-pair score you can anchor to a target distribution; listwise output is just an ordering, with no cross-query signal.

Q: If pointwise targets are noisy, how do modern rerankers train on them?

They don't, directly. zELO collects much-less-noisy pairwise judgments, runs a Thurstone fit to recover continuous Elo-style pointwise scores, then trains the production reranker via MSE regression on those fitted scores. Pairwise in, pointwise out.

Also known as: pointwise reranking, independent scoring

TL;DR

Pointwise scoring evaluates each (query, document) pair independently, producing one relevance score per pair. The dominant pattern for cross-encoder rerankers because it's simple, parallelizable, and produces calibrated scores.

Pointwise scoring is the simplest reranking shape: take a query and one document, output a number indicating how relevant that document is to that query. To rerank a candidate set of N documents, run the model N times in parallel — once per pair — and sort by the resulting scores.

Most production cross-encoder rerankers are pointwise (zerank-2 included). The interface is just score = model(query, document).

Why pointwise is the production default

Why pointwise wins in production

Parallelizable — N pairs, N independent forward passes. Easy to batch on GPU.
Calibratable — scores can be calibrated so that 0.7 means roughly “70% relevant” across queries and domains. This is impossible with listwise ordering output.
Interpretable — the score per pair is a value you can threshold, log, monitor, and reason about. “Drop anything below 0.4” is a reasonable filter only if scores are calibrated and pointwise.
Composes with retrieval — a pointwise reranker is a drop-in second stage; you don’t need to re-architect anything.

What pointwise can’t see

The pointwise model evaluates each pair in isolation. It can’t:

Notice that two candidate documents are near-duplicates and break the tie by topic.
Decide that given the user has already seen doc A, doc B becomes less useful.
Re-rank for diversity.

For those, you compose pointwise scoring with a downstream MMR or diversity step, or use a listwise approach (with all the costs that entails).

Training pointwise rerankers

A classic challenge: pointwise targets (a single relevance score per (q, d)) are noisy because annotators disagree on absolute relevance. The zELO methodology sidesteps this by training on pairwise preferences and recovering pointwise targets via a Thurstone fit — same statistical trick that powers chess Elo rankings.

Ask two annotators “how relevant is doc D to query Q on a scale of 1-5?” and you’ll get correlations as low as 0.4-0.6 in real datasets. Humans anchor differently: one annotator’s “4” is another’s “3”, and a single annotator drifts across a session. Worse, the meaning of “3” depends on what they’ve seen recently — a mediocre doc looks great after a batch of irrelevant ones, terrible after a batch of perfect ones. Ask the same two annotators “is doc A or doc B more relevant to Q?” and inter-annotator agreement jumps to 0.85+. Pairwise comparisons cancel the absolute-scale drift. zELO collects pairwise judgments at scale, then uses a Thurstone-style maximum-likelihood fit to recover continuous pointwise scores that are consistent — and trains the production reranker via MSE regression on those fitted scores. Result: pointwise model with calibrated outputs, trained on data that was never directly pointwise.

Pointwise scoring trades global awareness for the one property production retrieval cannot give up: a calibrated, thresholdable, per-pair number.

Go further

Why is calibration only possible with pointwise scoring?

Calibration requires that the score has meaning across queries — '0.7 means 70% relevant' should hold for query A and query B alike. Pointwise produces a per-pair score you can anchor to a target distribution; listwise output is just an ordering, with no cross-query signal.

Score calibration Listwise reranking

If pointwise targets are noisy, how do modern rerankers train on them?

They don't, directly. [zELO](/concepts/zelo/) collects much-less-noisy pairwise judgments, runs a [Thurstone fit](/concepts/thurstone-model/) to recover continuous Elo-style pointwise scores, then trains the production reranker via MSE regression on those fitted scores. Pairwise in, pointwise out.

zELO Pairwise preference Thurstone model

What pointwise scoring can't do — and what to compose with?

Pointwise can't see other candidates, so it can't deduplicate near-duplicates or enforce diversity. The standard pattern is to compose pointwise scores with a downstream MMR-style diversity step, or fall back to a [listwise reranker](/concepts/listwise-reranking/) when ordering coherence dominates.

Listwise reranking Reranker

← All concepts

The best AI teams build with ZeroEntropy models

Book Demo View docs