In the previous tip, we discussed how pointwise 1-5 labels fall apart. An expert rater gives nit-picky ratings, fussing over distinctions actual users would never notice. A naive rater, with little knowledge of the domain, tends to consider most results relevant.
How do we handle this situation?
We handle it by using multiple raters for the same document. We can’t rely on just one!
Then, when we have enough ratings, we can use a metric like Fleiss's Kappa to measure whether raters tend to agree or disagree. You can apply it to your full dataset, or to a single query to understand how aligned your raters are on that specific query.
Fleiss's Kappa equals 1 when raters agree perfectly, hovers near 0 when agreement is no better than chance, and can even go negative when they systematically disagree.
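The statistic is simple enough to sketch by hand. Below is a minimal pure-Python version: each row counts how many raters gave a document each relevance grade (the documents and rater counts are made up for illustration):

```python
def fleiss_kappa(ratings):
    """Fleiss's Kappa from a table of per-document category counts.

    ratings: list of rows, one per document; each row counts how many
    raters assigned that document to each category (e.g. grades 1-5).
    Every document must be rated by the same number of raters.
    """
    N = len(ratings)        # number of documents
    n = sum(ratings[0])     # raters per document
    k = len(ratings[0])     # number of categories

    # Observed agreement: for each document, the proportion of
    # rater pairs that agreed, averaged over all documents.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N

    # Chance agreement: from the marginal proportion of each category.
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)

    return (P_bar - P_e) / (1 - P_e)

# Three raters grading four documents on a 1-5 scale:
ratings = [
    [0, 0, 0, 0, 3],   # all three raters chose grade 5
    [0, 0, 0, 3, 0],   # all three chose grade 4
    [1, 1, 1, 0, 0],   # total disagreement
    [0, 0, 0, 1, 2],   # partial agreement
]
print(fleiss_kappa(ratings))  # → 0.4
```

A kappa of 0.4 here signals only moderate agreement: the nitpicker and the naive rater are pulling in different directions even though two of the four documents were unanimous.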
If you're using pointwise evals, use a statistic like Fleiss's Kappa. Or be surprised when the nitpicker conflicts with the naive rater.
-Doug
This is part of Doug’s Daily Search tips - subscribe here
Enjoy softwaredoug in training course form!
Starting June 22!
I hope you join me at Cheat at Search with LLMs to learn how to apply LLMs to search applications. Check out this post for a sneak preview.