In the previous tip, we discussed how pointwise 1-5 labels fall apart. An expert rater gives nit-picky ratings, fussing over distinctions actual users would never notice. A naive rater, with little knowledge of the domain, tends to consider most results relevant.
How do we handle this situation?
We handle it by using multiple raters for the same document. We can’t rely on just one!
Then, when we have enough ratings, we can use a metric like Fleiss's Kappa to measure whether raters tend to agree or disagree. You can apply it to your full dataset, or to a single query to understand how aligned your raters are on that specific query.
Fleiss's Kappa equals 1 when raters agree perfectly, hovers near 0 when agreement is no better than chance, and can even go negative when they systematically disagree.
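The statistic is simple enough to sketch by hand. Below is a minimal pure-Python version: each row counts how many raters gave a document each relevance grade (the documents and rater counts are made up for illustration):

```python
def fleiss_kappa(ratings):
    """Fleiss's Kappa from a table of per-document category counts.

    ratings: list of rows, one per document; each row counts how many
    raters assigned that document to each category (e.g. grades 1-5).
    Every document must be rated by the same number of raters.
    """
    N = len(ratings)        # number of documents
    n = sum(ratings[0])     # raters per document
    k = len(ratings[0])     # number of categories

    # Observed agreement: for each document, the proportion of
    # rater pairs that agreed, averaged over all documents.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N

    # Chance agreement: from the marginal proportion of each category.
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)

    return (P_bar - P_e) / (1 - P_e)

# Three raters grading four documents on a 1-5 scale:
ratings = [
    [0, 0, 0, 0, 3],   # all three raters chose grade 5
    [0, 0, 0, 3, 0],   # all three chose grade 4
    [1, 1, 1, 0, 0],   # total disagreement
    [0, 0, 0, 1, 2],   # partial agreement
]
print(fleiss_kappa(ratings))  # → 0.4
```

A kappa of 0.4 here signals only moderate agreement: the nitpicker and the naive rater are pulling in different directions even though two of the four documents were unanimous.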
If you're using pointwise evals, use a statistic like Fleiss's Kappa. Or be surprised when the nitpicker conflicts with the naive rater.
-Doug
This is part of Doug’s Daily Search tips - subscribe here
Enjoy softwaredoug in training course form!
Starting June 22!
I hope you join me at Cheat at Search with LLMs to learn how to apply LLMs to search applications. Check out this post for a sneak preview.