Do you work in search, recommendations, or anything that ranks stuff? You’ve probably heard of NDCG or DCG to evaluate search relevance.

Unfortunately, for every team, there is a subtly different way of defining and using these metrics. I want to catalog all the (n)DCGs out there, why and how you might use them. I’ll start today with one aspect - how NDCG is normalized – the Ideal DCG it’s taken against.

But first and foremost, some definitions. DCG attempts to tell us how good query’s search results do compared to what we’ve labeled relevant / irrelevant search results for that query. IE a judgment list. Labels perhaps we’ve gotten from clicks, or from labeling results 1-5 (irrelevant to relevant) by hand. Really any number where a higher value is more relevant. Systems for deriving labels is a whole nother topic entirely.

But assuming a judgment list, DCG then discounts (by rank) the gain (ie labels or grades) of each result in a search result ranking. Then it adds them up into a score telling us how good the query is doing compared to the judgment list.

For example, here’s a simple DCG calculation for a query “zoolander” using relevance grades from 0-1.

First we get judgments, somehow, such as these:

query document grade
zoolander “Zoolander” (the movie) 1.0
zoolander Zoolander 2 0.9
zoolander My doggy is named “zoolander” 0.1

And if our search system returns this ranking:

Rank Document title grade discounted: grade * (1 / rank)
1 “Zoolander” (the movie) 1.0 1.0
2 My doggy is named “zoolander” 0.1 0.05
3 Zoolander 2 0.9 0.3

We take DCG as the sum of the position discounted grades: 1.0 + 0.05 + 0.3 = 1.35.

(Note 1/rank is just a naive discount for example’s sake, we actually do something more complicated – see the Wikipedia article)

The many ways of normalizing DCG

Sometimes we want to NORMALIZE this to 0-1. Where 1 is notionally “the best we possibly could do”. Why? Maybe because it’s a bit hard to reason about. Maybe whatever ML method you’re using to optimize search needs a normalized objective.

And that’s where an extremely consequential decision comes in that every team defines differently. Normalize to WHAT!?

Notionally we want to normalize to the “ideal DCG” so NDCG = DCG / iDCG. But people seem to define iDCG quite differently depending on the team… or weather… or amount of solar radiation.

How you normalize DCG - ie choose iDCG - completely changes how you’d interpret the resulting NDCG. So let’s discuss the options. Here’s the iDCG flavors I have seen in the wild:

  • NDCG-localLocal Ideal - your ideal DCG is taken as the current top N search results sorted by label. We ignore any labels not in the top N
  • NDCG-recallRecall set ideal - out of some large top K recalled documents, we compute an ideal. Like local except we extend the window to try to capture the ideal among the larger recall set
  • NDCG-globalGlobal ideal - our ideal is computed from the labels for a query - whether or not they were retrieved. We look at the labeled results for query and compute and ideal top N
  • NDCG-maxMax ideal - we just compute ideal DCG as if our top N got the max possible label - regardless of whether we have any results labeled as such

Here’s an example of each of these, expanding on the Zoolander example

Judgment List

query Document title grade
zoolander “Zoolander” (the movie) 1.0
zoolander Zoolander 2 0.9
zoolander Ben Stiller photo in Zoolander 0.7
zoolander A helicopter landing next to giraffes is a “zoo lander” 0.1
zoolander My doggy is named “zoolander” 0.1

Consider NDCG@2 (ie over top 2) and the following retrieved search results:

query Document title grade
zoolander A helicopter landing next to giraffes is a “zoo lander” 0.1
zoolander “Zoolander” (the movie) 1.0
zoolander Ben Stiller photo in Zoolander 0.7

NDCG@2-local: IDCG is DCG over top 2 results (just helicopter+zoolander the movie), sorting these top 2 in their ideal ordering NDCG@2-recall: IDCG is DCG over full results, sorting all the results in their ideal ordering (all 3 above)

Below we add the ideal rank the in the ideal ordering we’d use:

query Document title grade Ideal Rank (local) Ideal Rank (recall set)
zoolander A helicopter landing next to giraffes is a “zoo lander” 0.1 2  
zoolander “Zoolander” (the movie) 1.0 1 1
zoolander Ben Stiller photo in Zoolander 0.7   2

Using our naive DCG, we know DCG@2 is 0.1 / 1 + 1.0 / 2 = 0.1 + 0.5 = 0.6.

But to compute NDCG we take that relative to an ideal. In these two cases, we compute ideal DCGs using the ideal ranks above:

IDCG@2-local: 1.0 / 1 + 0.1 / 2 = 1.05

IDCG@2-recall: 1.0 / 1 + 0.7 / 2 = 1.35

So NDCG being DCG / IDCG, we get these DCG values:

NDCG@2-local = 0.6 / 1.05 = 0.571…

NDCG@2-recall = 0.6 / 1.35 = 0.4444…

Now let’s consider the GLOBAL. We look at our judgment list, and we sort by grade to get the top 2, and we see

IDCG@2-global = 1.0 / 1 + 0.9 / 2 = 1.45

NDCG@2-global = 0.6 / 1.45 = 0.414

And finally MAX. Our highest grade possible is a 1. If we had 1s in all our slots (ignoring the fact we don’t have that many 1s labeled) we get:

IDCG@2-max = 1.0 / 1 + 1.0 / 2 = 1.5

NDCG@2-max = 0.6 / 1.5 = 0.4

Now the astute reader will note some pros and cons, and you need to choose a metric that measures what you intend.

Local and recall set focus entirely on ranking. Assuming we recall some top N, we ask whether our ranker sorts them correctly. Recall is left as a separate measurement activity.

Global mixes recall and precision into one metric. We do better when we recall more labeled results and also rank them closer to ideal. This might give a more holistic picture, but also combining these two concerns in one metric might muddy the waters. If our experiment is simply focused on optimizing ranking - like an LTR model - and has no control over recall is it fair to punish the model for something out of its control? Second, consider if your labels are based on engagement+clicks, your labels will be biased towards control - you’ll find control is always the winner if recall changes.

Max helps measure not just ranking quality but our whole search systems ability to fill top N with relevant results. For example if we don’t have enough good content for this query - shouldn’t we go find some! It’s not just about how we do within our labels but whether the entire search ecosystem in our app can provide lots of relevant content. Again the downside being so holistic is it becomes less targeted at just one use case. Many concerns get mixed into the metric.

To stress test the metrics, think about the edge cases.

What happens when the recall set is really low (like 1-2 results)? For the local / recall cases to get an NDCG of 1 is to order these two results in the correct order - which may seem weird! For the global case, consider what happens when our labels come from clicks on some control, then well, almost every variant that improves precision within the recall set - but doesn’t recall ALL the labeled docs in control - will have lower NDCG?

So tred carefully with how you define NDCG! You may not be measuring what you intend to. Consider too that NDCG is only one view, and if your goal is to actually get into an online A/B test, NDCG may be overrated! Other methodologies that answer questions like “did I make the change I intended”? “How many queries did I impact?” might better help you ensure you’re doing a better job shipping the intended changes to your A/B test – leaving the “goodness” to the A/B test itself.

There are so many different goals in offline testing that no “one size fits all” can truly hope to cope with every need. Experiment carefully!

Doug Turnbull

More from Doug
Twitter | LinkedIn | Mastodon
Doug's articles at OpenSource Connections | Shopify Eng Blog