A judgment list labels how relevant a document is for a query. So you get a grade, say 1-5, for how relevant the movie First Blood is for the query Rambo.
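To make that concrete, here's a minimal sketch of what a judgment list might look like as data. The field names and grade scale are illustrative, not a specific tool's format:

```python
# A judgment list is just (query, document, grade) triples.
# Toy example on a 1-5 grade scale; field names are illustrative.
judgments = [
    {"query": "Rambo", "doc": "First Blood", "grade": 5},
    {"query": "Rambo", "doc": "Rambo III",   "grade": 5},
    {"query": "Rambo", "doc": "Die Hard",    "grade": 1},
]
```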
Here’s what happens though in practice:
- First, a rater sees Rambo III - they give it a rating of 5/5
- Next they see First Blood, the original Rambo movie, and they also rate it 5/5
- That rater might reflect - wait, should I go back and adjust my earlier rating for the sequel?
Even with careful coaching, raters often use inconsistent rating criteria. Some raters, especially those less savvy in the domain, will give more optimistic labels - looks like a Rambo movie, 5/5. Other raters, especially those very savvy in the domain, can skew pessimistic - nit-picking far beyond what users think matters - “this specific BluRay isn’t the BEST edition of First Blood, it should get a 1/5.”
You can mitigate this with coaching, feedback, and great care. But it’s not easy.
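One way to put numbers on rater (in)consistency is to have two raters grade the same query/doc pairs and compute an agreement statistic. Here's a sketch using scikit-learn's cohen_kappa_score; the rater data is made up, and this is just one possible check, not a prescribed workflow:

```python
from sklearn.metrics import cohen_kappa_score

# Two raters' 1-5 grades on the SAME query/doc pairs (toy data).
rater_a = [5, 5, 4, 2, 1, 3]
rater_b = [5, 3, 5, 1, 1, 2]

# Quadratic weights treat near-misses (4 vs 5) as less severe
# than big disagreements (1 vs 5) on an ordinal grade scale.
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"Weighted kappa: {kappa:.2f}")  # low agreement flags raters who need recalibration
```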
-Doug
This is part of Doug’s Daily Search tips - subscribe here
Enjoy softwaredoug in training course form!
Starting June 22!
I hope you join me at Cheat at Search with LLMs to learn how to apply LLMs to search applications. Check out this post for a sneak preview.