After years of chasing the perfect search eval, my dumb grug brain has finally realized:
No such thing "perfect metric". Playing with metric mean no real work get done. Grug just want to know if "change do what I expect". Let big brain A/B test tell me if good.
In other words: evals exist to see if a change works as intended. Not to tell us if it's "good".
NDCG is overrated
In search, NDCG is a metric that computes whether we're producing the expected search results. It ranges from 0 (bad?) to 1 (good?). NDCG relies on labeling individual search results accurately for a query. Is the movie "Forrest Gump" relevant for "Rambo"? No. But "First Blood" is. If First Blood ranks above Forrest Gump, NDCG for the "Rambo" query will be closer to 1. Get enough labeled search results, for enough queries, and we can test and iterate on our ranker.
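For concreteness, here's a minimal sketch of that computation, with made-up binary labels for the "Rambo" example (a simplified NDCG, not a production metric):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for a ranked list of relevance labels."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """DCG of the ranking we produced, divided by the DCG of the ideal ordering."""
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical labels for the query "rambo": 1 = relevant, 0 = not relevant.
ranking_a = [1, 0]  # ["First Blood", "Forrest Gump"]
ranking_b = [0, 1]  # ["Forrest Gump", "First Blood"]
print(ndcg(ranking_a))  # 1.0
print(ndcg(ranking_b))  # ~0.63
```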
Or so the story goes.
I've written in the past that this eval regime is overrated. Trying to label a search result accurately is hard. For many reasons:
- human labelers and LLMs are often not representative of your users.
- human labelers get tired and make mistakes
- internal, influential HiPPOs labeling results may have opinions that don't correlate with real users
- clickstreams are hard to interpret, and contain all kinds of biases that dictate why something was "clicked" (the UI, the tendency for users to click high up on the page regardless, if it's spicy, or not click if it looks boring, etc)
- long tail queries get few interactions, leaving us with interaction data only on the most common queries
People tie themselves in knots eliminating all these errors. And what happens? To eliminate errors, we add complexity. Complexity begets harder-to-understand errors.
Sure, getting Steve from accounting to label some search results will be error prone. But you know exactly how Steve will screw up. You know Steve well.
But modeling out how users click search results to tease out "relevance" can be a tricky task, for all the reasons listed above.
Not to mention, the fundamental assumption here may be flawed. Labeling a single search result as relevant/not, then improving that ranking, is only one piece of search quality. Search quality means so much more: diverse search results, clear query understanding, speed, reflecting the intent back to the user, good perceived relevance, and so on.
Given all these issues, which path do you go down?
Grug-brained eval loop
IMO, most people would do fine with the following algorithm:
- Identify a population of queries you want to fix
- Gather 10-20 labeled queries with a tool from internal users (i.e. Quepid or something)
- Tune those queries so their NDCG gets better (change ranking, improve query understanding, ??)
- Regression test old queries, from previous iterations of this loop, to make sure those didn't break (see the sketch after this list)
- GOTO 1
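A rough sketch of what steps 3-4 might look like in code. The `search()` function, the judgment dict, and the baseline scores are assumptions you'd replace with your own system and labels exported from your tool of choice:

```python
# Score the queries we're tuning plus the old regression set, and flag anything
# that got worse than the previous iteration.

def ndcg_for_query(query, judgments, search, k=10):
    """NDCG@k using labels from the judgment list; unlabeled docs count as 0.

    `judgments` maps (query, doc_id) -> relevance label.
    `search(query)` returns ranked doc_ids from your engine.
    ndcg() is the helper from the earlier sketch.
    """
    ranked = [judgments.get((query, doc_id), 0) for doc_id in search(query)[:k]]
    return ndcg(ranked)

def check_regressions(queries, judgments, search, baseline_scores, tolerance=0.05):
    """Compare this iteration's NDCG against the previous iteration's scores."""
    regressions = []
    for query in queries:
        score = ndcg_for_query(query, judgments, search)
        if score < baseline_scores.get(query, 0.0) - tolerance:
            regressions.append((query, baseline_scores[query], score))
    return regressions
```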
This doesn't measure "quality". It measures "are we fixing things we intend, without breaking other things?"
That is, the team is defining a (maybe wrong) goal and measuring its progress towards that goal. That is it. That's the tweet.
For example, if you intend your change to improve searches for products by name, then you'd go and ask the team to label some search results for those queries, including what the "right answer" ought to be. You then try some fixes, making sure you only solved this problem (steps 3/4), and ship if it works.
We know "the team" has error. But we can probably define that error. It boils down to the team needing to continually learn what users want from search. Which they should do anyway.
How do we measure quality then? We ship our changes to an actual A/B test. Or a usability study. These are explicit studies of quality. Actual "quality" means whether we sell more products, increase DAU, or help solve business problems. This goes far beyond just a relevance change. Indeed, it may be that relevance itself isn't what holds back our search, but other search quality issues.
A great deal of complexity comes from strongly coupling these two worlds: (a) do we solve the problem we intend to solve, and (b) does solving that problem matter to users? Tightly coupling the two with complex modeling is hard, and requires big brains. Do it last.
When to go big-brained
Of course, cases still exist when we want to go big-brained. If we want to train a model, our judgment list needs to point to the "true north" of quality. That's 90% of the work, and it takes considerable care. We don't spend our cycles on model training, but on getting trustworthy training data. This is what has mattered when I've built ranking models. Every. Single. Time.
Here you want to pull out all the stops. Get big-brained data scientists to learn about topics like click models. Get a good Learning to Rank book (like, naturally, AI Powered Search, chapters 10-12) and really buff up your relevance capabilities.
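For a taste of what "click models" means in practice, here's a toy clicks-over-expected-clicks (COEC) style estimate. The rank-CTR priors and the data shape are made up for illustration; real click models (position-based models, DBN, etc.) are considerably more involved:

```python
from collections import defaultdict

# Hypothetical prior: average CTR at each rank position, estimated across all queries.
# This is what lets us discount clicks that happened just because a result ranked high.
RANK_CTR_PRIOR = {0: 0.30, 1: 0.15, 2: 0.08, 3: 0.05}

def coec(impressions):
    """Clicks over expected clicks: a crude position-debiased relevance signal.

    `impressions` is a list of (query, doc_id, rank, clicked) tuples,
    where clicked is 0 or 1.
    """
    clicks = defaultdict(float)
    expected = defaultdict(float)
    for query, doc_id, rank, clicked in impressions:
        clicks[(query, doc_id)] += clicked
        expected[(query, doc_id)] += RANK_CTR_PRIOR.get(rank, 0.02)
    # Values > 1 mean the doc was clicked more than its positions would predict.
    return {key: clicks[key] / expected[key] for key in expected}
```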
But I'd argue many teams aren't here yet. They're just trying to make their Elasticsearch or whatever a bit better, and they'd do better to be grug-brained data scientists than to waste their limited cycles on big-brained dreams that may take away from actual work tuning relevance.
