Evaluating search?
Don’t jump to complex labeling systems; just do simple side-by-sides.
When I worked at Reddit, I would socialize a spreadsheet of search results. On each side were search results for a test query. One side was “control” - autogenerated from prod. The other “test” - my new fancy algorithm.
But the reviewers didn’t know which was which. They were blind.
What they saw: one side labeled “Pepsi”, the other “Coke” (a nod to the old taste-test commercials). They’d give me a preference over several dozen queries, isolated to the types of queries my change impacted. This gave me a good feel for whether deeper evaluation (i.e. an A/B test) was worth it.
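A minimal sketch of this blinding in Python, under my own assumptions (the “Pepsi”/“Coke” column names come from the anecdote; the function names and data shapes are illustrative, not Doug’s actual tooling). The idea is to randomize which real variant sits behind each label per query, keep a hidden key, and only unblind when tallying preferences:

```python
import random

def blind_side_by_side(queries, control_results, test_results, seed=42):
    """For each query, randomly assign control/test to the 'Pepsi'/'Coke'
    columns. Returns the blinded rows plus a hidden key for unblinding."""
    rng = random.Random(seed)
    rows, key = [], {}
    for q in queries:
        sides = ["control", "test"]
        rng.shuffle(sides)  # per-query randomization so reviewers can't infer a pattern
        key[q] = {"Pepsi": sides[0], "Coke": sides[1]}
        results = {"control": control_results[q], "test": test_results[q]}
        rows.append({"query": q,
                     "Pepsi": results[sides[0]],
                     "Coke": results[sides[1]]})
    return rows, key

def tally(preferences, key):
    """preferences: {query: 'Pepsi' or 'Coke'} -> win counts per real variant."""
    counts = {"control": 0, "test": 0}
    for q, pick in preferences.items():
        counts[key[q][pick]] += 1
    return counts
```

Reviewers only ever see the rows; the key stays with whoever runs the eval, so preferences are recorded blind and unblinded afterward.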
Your search evals need not involve a PhD. Start grug-brained. Don’t get out over your skis!
https://softwaredoug.com/blog/2025/06/22/grug-brained-search-eval
-Doug
This is part of Doug’s Daily Search tips - subscribe here