If we’re building a search app, we often want to ask ‘How good is its relevance?’ Since users will issue millions of unique search queries, we can’t just try 2-3 searches and go on gut feeling. We need to put a robust number on search quality, and as you experiment, you’ll want to compute such a statistic over thousands of queries. For this reason, I want to look at how Pandas can be used to rapidly compute one such statistic: Mean Reciprocal Rank.
Mean Reciprocal Rank, or MRR, measures how far down the ranking the first relevant document appears. If MRR is close to 1, relevant results sit near the top of the search results - what we want! Lower MRRs indicate poorer search quality, with the right answer farther down in the search results.
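In symbols: over a set of queries $Q$, where $\text{rank}_i$ is the position of the first relevant document for query $i$,

$$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$$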
As MRR really just cares about the ranking of the first relevant document, it’s usually used when we have ‘one relevant result’ per query. This occurs in applications such as question answering, where exactly one result is labeled relevant, as in the two questions below:
Query | Relevant Document |
---|---|
How far away is Mars? | NASA Mars Website |
Who is PM of Canada? | Justin Trudeau Wikipedia |
Each question here has one labeled, relevant answer. In question answering, everything else is presumed irrelevant.
If we search for “How far away is Mars?” and our result listing is the following, note how we know the rank of the correct answer. We can then compute a reciprocal rank, or just 1 / rank, as in the examples below:
Query | Search Result | rank | recip. rank |
---|---|---|---|
How far away is Mars? | Mars, God of War | 1 | |
 | NASA Mars Website | 2 | 1/2 |
Then, similarly, when we search for “Who is PM of Canada?”, we get back:
Query | Search Result | rank | recip. rank |
---|---|---|---|
Who is PM of Canada? | Joe Biden’s Website | 1 | |
 | Conor Landry for PM! | 2 | |
 | Justin Trudeau Wikipedia | 3 | 1/3 |
We see in the tables above the reciprocal rank of each query’s first relevant search result - in other words, 1 / rank of that result. The mean of these two reciprocal ranks is (1/2 + 1/3) / 2 == 0.4167. This is the mean reciprocal rank, or MRR. An MRR close to 1 means relevant results tend to be towards the top of the ranking.
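To make that arithmetic concrete, here’s the same calculation in plain Python, using the ranks from the two toy tables above:

# Rank of the first relevant result for each toy query
# 'How far away is Mars?' -> 2, 'Who is PM of Canada?' -> 3
ranks = [2, 3]
mrr = sum(1 / r for r in ranks) / len(ranks)
print(mrr)  # 0.41666...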
Of course, we do this over possibly many thousands of queries! Which is where Pandas comes in.
An MRR walk-thru in MSMarco with Pandas
MSMarco is a question-answering dataset used in competitions and to prototype new/interesting ideas. You can find the datasets here.
For exploring MRR, we really just care about one MSMarco file for now: the qrels. This holds the judgment list used as the ground truth of MSMarco. A judgment list is just a term of art for the documents labeled as relevant/irrelevant for each query. How we arrive at what’s relevant/irrelevant is itself a complicated topic, and I recommend my previous article if you’re curious.
But for now, let’s just dive into MSMarco’s data. If we load the qrels file, we can inspect its contents:
import pandas as pd

# Load the qrels (judgment list): qid, iteration, docid, relevancy grade
judgments = pd.read_csv('data/msmarco-doctrain-qrels.tsv.gz', delimiter=' ', header=None) \
    .rename(columns={0: 'qid', 1: 'iteration', 2: 'docid', 3: 'relevancy grade'})[['qid', 'docid', 'relevancy grade']]
judgments.head(20)
Output:
 | qid | docid | relevancy grade |
---|---|---|---|
0 | 3 | D312959 | 1 |
1 | 5 | D140227 | 1 |
2 | 12 | D213890 | 1 |
3 | 15 | D1033338 | 1 |
4 | 16 | D508131 | 1 |
5 | 18 | D2286511 | 1 |
6 | 24 | D69114 | 1 |
7 | 26 | D1350520 | 1 |
8 | 31 | D304123 | 1 |
9 | 42 | D1439360 | 1 |
10 | 48 | D322379 | 1 |
11 | 51 | D920249 | 1 |
12 | 54 | D361377 | 1 |
13 | 55 | D1366317 | 1 |
14 | 60 | D246777 | 1 |
15 | 63 | D1450821 | 1 |
16 | 67 | D494640 | 1 |
17 | 68 | D1896896 | 1 |
18 | 69 | D2155744 | 1 |
19 | 70 | D2845225 | 1 |
Notice how each unique query (the qid) has exactly one document labeled as relevant. A search solution would be evaluated on how well it gets that one document (in this case, an answer to a question) towards the top of the ranking. This is what we want our MRR metric to help measure.
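If you want to convince yourself of that, a quick sanity check along these lines (just a sketch against the judgments frame we loaded above) should report a maximum of one judged document per query:

# Count judged documents per query - for these qrels we expect the max to be 1
docs_per_query = judgments.groupby('qid')['docid'].count()
print(docs_per_query.max())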
So we might implement some kind of search system, and issue a couple of queries. It returns the following ranked search results:
search_results = pd.DataFrame([
    # Results for query id 1185869: 'How far away is Mars?'
    {'rank': 1, 'docid': 'D2008201', 'qid': 1185869},
    {'rank': 2, 'docid': 'D59219', 'qid': 1185869},
    # Results for query id 5: 'Who is PM of Canada?'
    {'rank': 1, 'docid': 'D494640', 'qid': 5},
    {'rank': 2, 'docid': 'D123456', 'qid': 5},
    {'rank': 3, 'docid': 'D140227', 'qid': 5},
])
search_results
Output:
 | rank | docid | qid |
---|---|---|---|
0 | 1 | D2008201 | 1185869 |
1 | 2 | D59219 | 1185869 |
2 | 1 | D494640 | 5 |
3 | 2 | D123456 | 5 |
4 | 3 | D140227 | 5 |
Our first step is to label each search result as relevant or not using our judgments. We do this by merging the judgments into the search results - effectively a left join of judgments into our search results on the query and doc id. Any correct answer is labeled 1; everything else we force to 0 (assumed irrelevant):
labeled_search_results = search_results.merge(judgments, how='left', on=['qid', 'docid']).fillna(0)
labeled_search_results
Output:
 | rank | docid | qid | relevancy grade |
---|---|---|---|---|
0 | 1 | D2008201 | 1185869 | 0 |
1 | 2 | D59219 | 1185869 | 1 |
2 | 1 | D494640 | 5 | 0 |
3 | 2 | D123456 | 5 | 0 |
4 | 3 | D140227 | 5 | 1 |
In the next bit of code, we inspect the best rank for each relevancy grade. In other words: what’s the lowest rank at which relevancy grade == 1 occurs? Notice how in the output we have a breakdown of the best rank (the min rank) each relevancy grade was seen at. We see, for example, that for qid 5 the best rank for a relevancy grade of 1 is rank 3.
relevances_rank = labeled_search_results.groupby(['qid', 'relevancy grade'])['rank'].min()
relevances_rank
Output:
qid relevancy grade
5 0.0 1
1.0 3
1185869 0.0 1
1.0 2
Name: rank, dtype: int64
Of course, for the reciprocal rank calculation, we only care about where relevant results ended up in the listing. Next we filter to just the relevancy grade of 1 for each query:
ranks = relevances_rank.loc[:, 1]
ranks
Output:
qid
5 3
1185869 2
Name: rank, dtype: int64
These are the ranks of each relevant document per query! We can now compute the reciprocal rank for each query.
reciprocal_ranks = 1 / ranks
reciprocal_ranks
Output:
qid
5 0.333333
1185869 0.500000
Name: rank, dtype: float64
Finally we arrive at the mean of each query’s reciprocal rank, by, you guessed it, taking the mean.
reciprocal_ranks.mean()
0.41666666666666663
And that is oooone mean reciprocal rank! :)
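If you want all of the above in one place, here’s a minimal sketch that bundles the steps into a single function. It assumes the same column names used throughout this post (qid, docid, rank, relevancy grade):

def mean_reciprocal_rank(search_results, judgments):
    """Compute MRR from ranked search results and a judgment list."""
    # Label each result with its relevancy grade (unjudged docs -> 0)
    labeled = search_results.merge(judgments, how='left',
                                   on=['qid', 'docid']).fillna(0)
    # Best (lowest) rank per query and relevancy grade
    best_ranks = labeled.groupby(['qid', 'relevancy grade'])['rank'].min()
    # Keep only the relevant (grade == 1) rows, then average 1 / rank
    ranks = best_ranks.loc[:, 1]
    return (1 / ranks).mean()

mean_reciprocal_rank(search_results, judgments)  # 0.41666...

One thing to watch: like the walkthrough above, this only averages over queries whose relevant document actually appears in the results - a query that misses its relevant document entirely simply drops out rather than contributing a 0.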
That’s all folks
This is just a dumb one-off post, mostly to help me remember how I arrived at some code ;). Please do get in touch if you notice any mistakes or have thoughts (or want to join me and my fellow relevance engineers at Shopify!).