A judgment list labels how relevant a document is for a query. So you get a grade, say 1-5, for how relevant the movie First Blood is for the query Rambo.
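To make that concrete, here's a minimal sketch of what a judgment list might look like as data. The field names and grade scale are illustrative, not a specific tool's format:

```python
# A judgment list is just (query, document, grade) triples.
# Toy example on a 1-5 grade scale; field names are illustrative.
judgments = [
    {"query": "Rambo", "doc": "First Blood", "grade": 5},
    {"query": "Rambo", "doc": "Rambo III",   "grade": 5},
    {"query": "Rambo", "doc": "Die Hard",    "grade": 1},
]
```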
Here’s what happens though in practice:
- First, a rater sees Rambo III - they give it a rating of 5/5
- Next they see First Blood, the original Rambo movie, and they also rate it 5/5
- That rater might reflect - wait, should I go back and adjust my earlier rating for the sequel?
Even with careful coaching, raters often use inconsistent rating criteria. Some raters, especially those less savvy in the domain, will give more optimistic labels - looks like a Rambo movie, 5/5. Other raters, especially those very savvy in the domain, can skew pessimistic - nit-picking far beyond what users think matters - “this specific BluRay isn’t the BEST edition of First Blood, it should get a 1/5.”
You can mitigate this with coaching, feedback, and great care. But it’s not easy.
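One way to put numbers on rater (in)consistency is to have two raters grade the same query/doc pairs and compute an agreement statistic. Here's a sketch using scikit-learn's cohen_kappa_score; the rater data is made up, and this is just one possible check, not a prescribed workflow:

```python
from sklearn.metrics import cohen_kappa_score

# Two raters' 1-5 grades on the SAME query/doc pairs (toy data).
rater_a = [5, 5, 4, 2, 1, 3]
rater_b = [5, 3, 5, 1, 1, 2]

# Quadratic weights treat near-misses (4 vs 5) as less severe
# than big disagreements (1 vs 5) on an ordinal grade scale.
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"Weighted kappa: {kappa:.2f}")  # low agreement flags raters who need recalibration
```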
-Doug
This is part of Doug’s Daily Search tips - subscribe here
Enjoy softwaredoug in training course form!
Starting June 22!
I hope you join me at Cheat at Search with LLMs to learn how to apply LLMs to search applications. Check out this post for a sneak preview.