Pointwise Scoring

Also known as: pointwise reranking, independent scoring

TL;DR

Pointwise scoring evaluates each (query, document) pair independently, producing one relevance score per pair. It is the dominant pattern for cross-encoder rerankers because it is simple, parallelizable, and produces scores that can be calibrated.

Pointwise scoring is the simplest reranking shape: take a query and one document, output a number indicating how relevant that document is to that query. To rerank a candidate set of N documents, run the model N times in parallel — once per pair — and sort by the resulting scores.

Most production rerankers are pointwise (zerank-2 included). The interface is just score = model(query, document).
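As a rough illustration of that interface, the sketch below reranks a small candidate set by scoring each pair on its own and sorting. The names rerank and toy_overlap_score are hypothetical, and the term-overlap scorer is only a stand-in so the example runs without a model; a real cross-encoder would take its place.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    documents: List[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score each (query, document) pair independently, then sort descending."""
    scored = [(doc, score_fn(query, doc)) for doc in documents]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_overlap_score(query: str, document: str) -> float:
    """Hypothetical stand-in for a reranker: fraction of query terms in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)


ranked = rerank(
    "reset password",
    ["How to reset your password", "Billing FAQ", "Password policy"],
    toy_overlap_score,
)
for doc, score in ranked:
    print(f"{score:.2f}  {doc}")
```

Because each call to score_fn sees only one pair, the list comprehension could be swapped for a batched or parallel call without changing the result.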

Why pointwise is the production default

  • Parallelizable — N pairs, N independent forward passes. Easy to batch on GPU.
  • Calibratable — scores can be calibrated so that 0.7 means roughly “70% relevant” across queries and domains. This is impossible with ordering output.
  • Interpretable — the score per pair is a value you can threshold, log, monitor, and reason about. “Drop anything below 0.4” is a reasonable filter only if scores are calibrated and pointwise.
  • Composes with retrieval — a pointwise reranker is a drop-in second stage; you don’t need to re-architect anything.
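Two of the properties above, batching and thresholding, can be sketched in a few lines. The helper names batches and keep_above are hypothetical, and the 0.4 cutoff is just the figure quoted in the list; whether it is a sensible cutoff depends on the scores actually being calibrated.

```python
from typing import Iterator, List, Sequence, Tuple


def batches(
    pairs: Sequence[Tuple[str, str]], batch_size: int
) -> Iterator[Sequence[Tuple[str, str]]]:
    """Yield fixed-size chunks of independent (query, document) pairs."""
    for start in range(0, len(pairs), batch_size):
        yield pairs[start:start + batch_size]


def keep_above(
    scored: List[Tuple[str, float]], threshold: float = 0.4
) -> List[Tuple[str, float]]:
    """Keep only documents whose pointwise score meets the threshold."""
    return [(doc, score) for doc, score in scored if score >= threshold]


pairs = [("q", f"doc {i}") for i in range(5)]
batch_sizes = [len(batch) for batch in batches(pairs, batch_size=2)]

kept = keep_above([("a", 0.9), ("b", 0.39), ("c", 0.4)])
```

Since no pair depends on another, the chunks can be sent to the model in any order or in parallel, and the filter can run on each score as soon as it is available.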

What pointwise can’t see

The pointwise model evaluates each pair in isolation. It can’t:

  • Notice that two candidate documents are near-duplicates and break the tie by topic.
  • Decide that given the user has already seen doc A, doc B becomes less useful.
  • Re-rank for diversity.

For those, you compose pointwise scoring with a downstream MMR or diversity step, or use a listwise approach (with all the costs that entails).
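A minimal sketch of that composition, assuming pointwise scores are already computed: a greedy MMR (maximal marginal relevance) pass that trades relevance against redundancy with documents already selected. The Jaccard token similarity and the lam weight are illustrative assumptions; in practice the similarity would more likely come from embeddings.

```python
from typing import Dict, List


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity, used here as a toy redundancy measure."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def mmr(scores: Dict[str, float], k: int, lam: float = 0.7) -> List[str]:
    """Greedily pick k documents, maximizing lam * relevance - (1 - lam) * redundancy."""
    selected: List[str] = []
    remaining = dict(scores)
    while remaining and len(selected) < k:

        def mmr_value(doc: str) -> float:
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max((jaccard(doc, s) for s in selected), default=0.0)
            return lam * remaining[doc] - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr_value)
        selected.append(best)
        del remaining[best]
    return selected


pointwise_scores = {
    "reset your password now": 0.90,
    "reset your password today": 0.88,
    "contact support by email": 0.60,
}
diverse = mmr(pointwise_scores, k=2, lam=0.5)
relevance_only = mmr(pointwise_scores, k=2, lam=1.0)
```

With lam=1.0 the pass reduces to sorting by pointwise score; lowering lam lets the less relevant but non-duplicate document displace the near-duplicate.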

Training pointwise rerankers

A classic challenge: pointwise targets (a single relevance score per (q, d)) are noisy because annotators disagree on absolute relevance. The zELO methodology sidesteps this by training on pairwise comparisons and recovering pointwise targets via a Thurstone fit, the same statistical trick that powers chess Elo rankings.

Ask two annotators “how relevant is doc D to query Q on a scale of 1-5?” and their ratings can correlate as weakly as 0.4-0.6 in real datasets. Humans anchor differently: one annotator’s “4” is another’s “3”, and a single annotator drifts across a session. Worse, the meaning of “3” depends on what they’ve seen recently: a mediocre doc looks great after a batch of irrelevant ones and terrible after a batch of perfect ones.

Ask the same two annotators “is doc A or doc B more relevant to Q?” and inter-annotator agreement jumps to 0.85+. Pairwise comparisons cancel the absolute-scale drift.

zELO collects pairwise judgments at scale, then uses a Thurstone-style maximum-likelihood fit to recover continuous pointwise scores that are consistent, and trains the production reranker via MSE regression on those fitted scores. The result is a pointwise model with calibrated outputs, trained on data that was never directly pointwise.
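To make the "pairwise in, pointwise out" step concrete, here is a toy sketch of a Thurstone-style fit, assuming the Case V form P(i beats j) = Φ(s_i − s_j) and plain gradient ascent on the log-likelihood. It is not the zELO implementation: the function name fit_thurstone, the learning rate, the step count, and the comparison data are all illustrative assumptions.

```python
import math
from typing import Dict, List, Tuple


def normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def fit_thurstone(
    items: List[str],
    comparisons: List[Tuple[str, str]],  # (winner, loser) pairs
    steps: int = 2000,
    lr: float = 0.01,
) -> Dict[str, float]:
    """Maximize sum of log Phi(s_winner - s_loser) by gradient ascent."""
    scores = {item: 0.0 for item in items}
    for _ in range(steps):
        grad = {item: 0.0 for item in items}
        for winner, loser in comparisons:
            diff = scores[winner] - scores[loser]
            # d/d(diff) of log Phi(diff) is pdf(diff) / cdf(diff).
            g = normal_pdf(diff) / max(normal_cdf(diff), 1e-12)
            grad[winner] += g
            grad[loser] -= g
        for item in items:
            scores[item] += lr * grad[item]
        # Scores are only identified up to a shift, so pin the mean at zero.
        mean = sum(scores.values()) / len(scores)
        for item in items:
            scores[item] -= mean
    return scores


# Toy judgments: A usually beats B, B usually beats C, A almost always beats C.
comparisons = (
    [("A", "B")] * 8 + [("B", "A")] * 2
    + [("B", "C")] * 7 + [("C", "B")] * 3
    + [("A", "C")] * 9 + [("C", "A")] * 1
)
fitted = fit_thurstone(["A", "B", "C"], comparisons)
```

The fitted values are continuous per-document scores that could then serve as regression targets. Note that the fit only has a finite optimum when no document wins or loses every comparison, which is why the toy data includes a few upsets.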

Pointwise scoring trades global awareness for the one property production retrieval cannot give up: a calibrated, thresholdable, per-pair number.

Go further

Why is calibration only possible with pointwise scoring?

Calibration requires that the score has meaning across queries — '0.7 means 70% relevant' should hold for query A and query B alike. Pointwise produces a per-pair score you can anchor to a target distribution; listwise output is just an ordering, with no cross-query signal.
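One way to check whether "0.7 means 70% relevant" actually holds is a reliability table: bucket pointwise scores, then compare each bucket's mean score with the fraction of documents in it that were labeled relevant. The sketch below assumes scores in [0, 1] and binary relevance labels pooled across queries; the function name and the data are made up for illustration.

```python
from typing import List, Tuple


def reliability_table(
    scores: List[float], labels: List[int], n_bins: int = 5
) -> List[Tuple[float, float, int]]:
    """Return (mean score, fraction relevant, count) for each non-empty score bin."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for score, label in zip(scores, labels):
        # Clamp so that a score of exactly 1.0 lands in the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        bins[index].append((score, label))
    rows = []
    for bucket in bins:
        if not bucket:
            continue
        mean_score = sum(score for score, _ in bucket) / len(bucket)
        fraction_relevant = sum(label for _, label in bucket) / len(bucket)
        rows.append((mean_score, fraction_relevant, len(bucket)))
    return rows


# Made-up data: ten pairs scored 0.7 (seven truly relevant), two scored 0.1 (none relevant).
scores = [0.7] * 10 + [0.1] * 2
labels = [1] * 7 + [0] * 3 + [0, 0]
table = reliability_table(scores, labels)
```

A well-calibrated scorer yields rows where the first two columns roughly match. An ordering-only output has no score column to bucket, so the check cannot even be set up.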

If pointwise targets are noisy, how do modern rerankers train on them?

They don't, directly. [zELO](/concepts/zelo/) collects much-less-noisy pairwise judgments, runs a [Thurstone fit](/concepts/thurstone-model/) to recover continuous Elo-style pointwise scores, then trains the production reranker via MSE regression on those fitted scores. Pairwise in, pointwise out.

What can't pointwise scoring do, and what should you compose it with?

Pointwise can't see other candidates, so it can't deduplicate near-duplicates or enforce diversity. The standard pattern is to compose pointwise scores with a downstream MMR-style diversity step, or fall back to a [listwise reranker](/concepts/listwise-reranking/) when ordering coherence dominates.
