NDCG@K

Also known as: NDCG, DCG@K, normalized DCG

TL;DR

Normalized Discounted Cumulative Gain at K is a ranking quality metric that rewards relevant documents appearing high in the result list, with logarithmic discounting for lower positions. It is the standard top-of-list quality metric for rerankers.

NDCG@K (Normalized Discounted Cumulative Gain at K) measures how good the ordering of the top-K results is, with credit weighted toward the top.

Figure: NDCG@10 with graded relevance (top-weighted ranking quality, measured). Each row's contribution is (2^grade − 1) / log₂(position + 1).

  Position   Grade   Discount   Contribution
  #1         3       /1.00      7.00
  #2         2       /1.58      1.89
  #3         3       /2.00      3.50
  #4         1       /2.32      0.43
  #5         2       /2.58      1.16
  #6         0       /2.81      0.00
  #7         1       /3.00      0.33
  #8         2       /3.17      0.95
  #9         0       /3.32      0.00
  #10        1       /3.46      0.29

  DCG = 15.553, ideal DCG (best possible order) = 16.374, NDCG@10 = 15.55 / 16.37 = 0.95

  Relevance grades: 3 = highly relevant, 2 = relevant, 1 = marginal, 0 = not relevant

The formula has three parts:

  1. Gain: each result has a relevance grade (binary 0/1, or graded 0-3 / 0-10).
  2. Discounted: each position’s gain is divided by log₂(position + 1), so position 1 counts fully, position 2 less, position 10 much less.
  3. Normalized: divide by the ideal DCG (the score you’d get if the K most relevant documents were perfectly ordered). NDCG ranges 0 to 1.

So NDCG@10 = 0.85 means the top-10 ordering earns 85% of the DCG that the ideal ordering would, with credit weighted toward the top.
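
The three parts can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes the exponential gain 2^grade − 1 and the log₂(position + 1) discount, and it assumes the list holds the grade of every judged document for the query, so the ideal order can be found by sorting that same list. The function names and the example grades are illustrative.

```python
import math

def dcg_at_k(grades, k):
    """DCG@k with gain 2^grade - 1 and discount log2(position + 1)."""
    return sum(
        (2 ** g - 1) / math.log2(pos + 1)
        for pos, g in enumerate(grades[:k], start=1)
    )

def ndcg_at_k(grades, k):
    """NDCG@k: DCG of the given order divided by DCG of the ideal order.

    Assumption: `grades` lists the grade of every judged document in ranked
    order, so sorting it in descending order gives the ideal ranking.
    """
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0

# Grades of ten ranked results, position 1 first (0 = not relevant).
ranked_grades = [3, 2, 3, 1, 2, 0, 1, 2, 0, 1]

print(round(dcg_at_k(ranked_grades, 10), 3))                        # 15.553
print(round(dcg_at_k(sorted(ranked_grades, reverse=True), 10), 3))  # 16.374
print(round(ndcg_at_k(ranked_grades, 10), 2))                       # 0.95
```

A query with no relevant documents has an ideal DCG of zero; returning 0.0 in that case is one common convention, and some evaluations skip such queries instead.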

Why position weighting matters

Users (and LLMs in RAG pipelines) read top-down and tire fast. A relevant doc at position 1 is worth far more than the same doc at position 10: about 3.5 times as much under the log₂ discount. The discount is steep at the top and flat further down, so the weight falls to one half by position 3 and to about 0.29 by position 10.
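
To see the shape of the discount, the weight 1 / log₂(position + 1) can be printed for each position. This is a small sketch assuming the log₂(position + 1) form of the discount.

```python
import math

# Discount weight 1 / log2(position + 1) for positions 1 through 10.
weights = {pos: 1 / math.log2(pos + 1) for pos in range(1, 11)}

for pos, w in weights.items():
    print(f"position {pos:2d}: weight {w:.2f}")
```

Most of the drop happens in the first three positions; between positions 5 and 10 the weight only moves from about 0.39 to about 0.29.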

Compare to plain Recall@K (does the relevant doc appear anywhere in top-K?), which doesn’t care about ordering inside the cutoff. Recall is a measure of “did we find it”; NDCG is a measure of “did we put it where it belongs”. You want both.
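
A minimal sketch of the difference, using binary relevance and two hypothetical rankings that contain the same relevant documents. The document ids and function names are made up for illustration, and the code assumes at least one relevant document per query.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k."""
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)

def binary_ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG@k with gain 1 for relevant documents and 0 otherwise."""
    dcg = sum(
        1 / math.log2(pos + 1)
        for pos, doc in enumerate(ranked_ids[:k], start=1)
        if doc in relevant_ids
    )
    # Ideal order: all relevant documents first (assumes relevant_ids is non-empty).
    n_ideal = min(k, len(relevant_ids))
    ideal = sum(1 / math.log2(pos + 1) for pos in range(1, n_ideal + 1))
    return dcg / ideal

relevant = {"d1", "d2"}                        # hypothetical document ids
top_heavy = ["d1", "d2", "x1", "x2", "x3"]     # relevant docs at positions 1-2
bottom_heavy = ["x1", "x2", "x3", "d1", "d2"]  # same docs at positions 4-5

for name, ranking in [("top_heavy", top_heavy), ("bottom_heavy", bottom_heavy)]:
    recall = recall_at_k(ranking, relevant, 5)
    ndcg = binary_ndcg_at_k(ranking, relevant, 5)
    print(f"{name}: Recall@5 = {recall:.2f}, NDCG@5 = {ndcg:.2f}")
```

Both rankings score Recall@5 = 1.00, but the second one gets only about half the NDCG@5 of the first because its relevant documents sit at the bottom of the cutoff.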

The original DCG paper (Järvelin & Kekäläinen, 2002) chose a logarithmic discount for two reasons: it’s smooth (no abrupt cliff like a hard top-K cutoff), and it roughly models user attention curves observed in eye-tracking studies, where attention drops steeply between ranks 1-3 and gently after that. The base of the log mostly affects how aggressive the discount is; log₂ is convention, not law. The “+1” is to avoid dividing by zero at position 1 (log₂(1) = 0). Some DCG formulations use log₂(position), with the discount starting at position 2; the rankings are identical, only the absolute scale shifts. None of this matters for comparing systems on the same evaluation, only for understanding what the absolute number means.
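
A short derivation makes the point about the log base concrete for the log(position + 1) form of the discount. The symbols are notation introduced here, not taken from the text above: g_i is the gain at position i and b is the log base. Formulations that leave the first ranks undiscounted do not reduce this cleanly.

```latex
\frac{1}{\log_b(i+1)} = \frac{\ln b}{\ln(i+1)}
\quad\Longrightarrow\quad
\mathrm{DCG}_b@K = \sum_{i=1}^{K} \frac{g_i}{\log_b(i+1)}
                 = \frac{\ln b}{\ln 2}\,\mathrm{DCG}_2@K

\mathrm{NDCG}_b@K = \frac{\mathrm{DCG}_b@K}{\mathrm{IDCG}_b@K}
                  = \frac{\tfrac{\ln b}{\ln 2}\,\mathrm{DCG}_2@K}{\tfrac{\ln b}{\ln 2}\,\mathrm{IDCG}_2@K}
                  = \mathrm{NDCG}_2@K
```

Under that assumption, changing the base rescales DCG by a constant factor, and the factor cancels once DCG is divided by the ideal DCG.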

Binary vs graded relevance

  • Binary NDCG: gain is 0 or 1. You either found a relevant doc or you didn’t.
  • Graded NDCG: gain follows a relevance scale (e.g., 0-3 for “irrelevant / marginal / relevant / highly relevant”), and is often passed through 2^grade − 1 to weight high-grade matches more heavily.

Graded NDCG is more expensive to evaluate (you need graded labels) but more informative — a reranker that consistently puts “highly relevant” above “marginally relevant” should score better than one that just gets “relevant” first.
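
The sketch below scores two hypothetical rerankers under both schemes. It assumes that "binary" means any positive grade counts as relevant and that "graded" uses the exponential gain 2^grade − 1; the grade lists and function names are invented for illustration.

```python
import math

def ndcg_at_k(grades, k, gain):
    """NDCG@k for a ranked list of grades under a pluggable gain function."""
    def dcg(gs):
        return sum(gain(g) / math.log2(pos + 1) for pos, g in enumerate(gs[:k], start=1))
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

def binary_gain(grade):
    # Any positive grade counts as "relevant".
    return 1 if grade > 0 else 0

def graded_gain(grade):
    # Exponential gain: a grade-3 document is worth 7, a grade-1 document 1.
    return 2 ** grade - 1

reranker_a = [3, 1, 0]  # highly relevant doc first, marginal doc second
reranker_b = [1, 3, 0]  # marginal doc first, highly relevant doc second

for name, grades in [("A", reranker_a), ("B", reranker_b)]:
    b = ndcg_at_k(grades, 3, binary_gain)
    g = ndcg_at_k(grades, 3, graded_gain)
    print(f"reranker {name}: binary NDCG@3 = {b:.2f}, graded NDCG@3 = {g:.2f}")
```

Binary NDCG cannot tell the two apart (both score 1.00), while graded NDCG penalizes reranker B for putting the marginal document above the highly relevant one.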

ZeroEntropy’s reannotation uses graded relevance with LLM judges for exactly this reason.

What’s a good number?

Corpus-dependent. On the 29-dataset reranker benchmark we use, NDCG@10 of 0.65-0.70 is “good” and 0.75+ is leaderboard-class; zerank-2 averages 0.7625. A lift of 2-3 points (0.02-0.03) in NDCG@10 typically moves end-to-end RAG answer quality enough to be visible in eval.

Common NDCG@K cutoffs and what they're for

  • NDCG@1: single-result consumers (top tool, top answer, navigational queries)
  • NDCG@5: RAG with tight context budgets feeding 3-5 passages to the LLM
  • NDCG@10: the field default; what reranker leaderboards report
  • NDCG@20-50: generous-context RAG and human-in-the-loop browsing
  • NDCG@100: ceiling check; should track Recall@100 closely
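
The cutoff can change the verdict on the same ranking. The sketch below scores one hypothetical result list at several values of K, assuming the exponential gain and log₂(position + 1) discount; the grades are invented for illustration.

```python
import math

def ndcg_at_k(grades, k):
    """NDCG@k with gain 2^grade - 1 and discount log2(position + 1)."""
    def dcg(gs):
        return sum((2 ** g - 1) / math.log2(pos + 1) for pos, g in enumerate(gs[:k], start=1))
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

# Hypothetical ranking: a marginal doc at position 1, strong docs right behind it.
ranked_grades = [1, 3, 3, 2, 2]

scores = {k: round(ndcg_at_k(ranked_grades, k), 3) for k in (1, 3, 5)}
print(scores)
```

Here NDCG@1 is poor because the top slot holds a marginal document, while the larger cutoffs look progressively better. A single-answer consumer would feel that miss; a pipeline that passes five passages to an LLM mostly would not.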

Go further

Should I optimize NDCG@10 or Recall@100?

Both, but at different stages. Recall@100 is the ceiling on what your reranker can possibly do — first-pass quality. NDCG@10 measures whether the reranker is putting the right documents at the top. A two-stage pipeline needs you to track them independently.
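
One way to make the "ceiling" idea concrete is to compute the best NDCG@K that any reranker could reach given only the first-pass candidates. The sketch below is an illustration under stated assumptions: the ideal DCG is taken over all judged documents for the query, gain is 2^grade − 1, and the ids, grades, and function names are hypothetical.

```python
import math

def dcg(grades):
    """DCG with gain 2^grade - 1 and discount log2(position + 1)."""
    return sum((2 ** g - 1) / math.log2(pos + 1) for pos, g in enumerate(grades, start=1))

def recall_at_k(candidates, grades, k):
    """Share of judged-relevant docs (grade > 0) found in the top-k candidates."""
    relevant = {d for d, g in grades.items() if g > 0}
    return len(set(candidates[:k]) & relevant) / len(relevant)

def best_ndcg_after_rerank(candidates, grades, k):
    """Highest NDCG@k reachable by reordering `candidates` only.

    `grades` maps every judged doc id to its grade; unjudged docs count as 0.
    """
    reachable = sorted((grades.get(d, 0) for d in candidates), reverse=True)[:k]
    ideal = sorted(grades.values(), reverse=True)[:k]
    return dcg(reachable) / dcg(ideal)

grades = {"a": 3, "b": 2, "c": 1}   # hypothetical graded labels for one query
candidates = ["c", "x", "b", "y"]   # first pass missed the best document, "a"

print(round(recall_at_k(candidates, grades, 4), 2))
print(round(best_ndcg_after_rerank(candidates, grades, 2), 2))
```

Because the first pass never retrieved the grade-3 document, even a perfect reranker stays well below 1.0 on this query. Tracking the two numbers separately shows which stage is responsible for a loss.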

What's the difference between binary and graded NDCG in practice?

Binary NDCG only knows whether a doc is relevant or not; graded NDCG distinguishes 'highly relevant' from 'marginally relevant'. Models that look strong on binary leaderboards can be noisy on graded labels — and graded is what production retrieval actually needs.

How do I measure NDCG on my own corpus?

Sample queries, generate graded relevance labels (LLM judges work well at scale), run your retrieval + reranker, then compute NDCG@10 against the labels. The playbook walks through the full evaluation loop end-to-end.
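
The last step of that loop, computing NDCG@10 against the labels and averaging over queries, might look like the sketch below. The labeling and retrieval steps are outside its scope. The data layout (`labels` mapping each query to graded doc ids, `run` mapping each query to a ranked list of ids) and every name in it are assumptions made for illustration, and unlabeled documents are treated as grade 0.

```python
import math
from statistics import mean

def ndcg_at_k(ranked_ids, grades, k=10):
    """NDCG@k for one query; `grades` maps doc id -> grade, unlabeled docs count as 0."""
    def dcg(gs):
        return sum((2 ** g - 1) / math.log2(pos + 1) for pos, g in enumerate(gs, start=1))
    got = [grades.get(doc, 0) for doc in ranked_ids[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    denom = dcg(ideal)
    return dcg(got) / denom if denom > 0 else 0.0

def evaluate(run, labels, k=10):
    """Mean NDCG@k over all labeled queries; a query missing from `run` scores 0."""
    return mean(ndcg_at_k(run.get(q, []), grades, k) for q, grades in labels.items())

# Hypothetical labels (query -> {doc id: grade}) and system output (query -> ranked ids).
labels = {"q1": {"d1": 3, "d2": 1}, "q2": {"d7": 2}}
run = {"q1": ["d1", "d2", "d9"], "q2": ["d8", "d7"]}

print(round(evaluate(run, labels), 3))
```

Reporting the per-query scores alongside the mean is often worthwhile, since a handful of zero-score queries can hide behind a healthy average.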
