OpenSource Connections founder Eric Pugh and I were chatting about his participation in my upcoming ML-Powered Search class on Sphere.

What’s your perspective on search today? Is there a point of view you’re advocating for?

Huh, I honestly hadn’t thought about that. What IS my perspective? Where’s the field at? What do I want everyone to know and understand that maybe they’re missing?

My perspective?

My knee-jerk reaction is to say people need to stop obsessing over sexy solutions and dig into the boring topic of evaluation.

This isn’t unique to search. Other systems where models respond to users share this issue. Want to know how well your chatbot resolves support tickets? How well your news feed engages users? Whether your recommendation system drives purchases? Well, knowing that is actually hard. Instead, it’s easier to live in ignorance, hoping that a miraculous solution X [AI / BERT / deep learning / vector search / some product …?] ‘just works’. Chasing silver bullets instead of doing the hard evaluation work has been the story of this field for the last 10 years.

When I talk about this, I sometimes get eye rolls. Stakeholders hear what I’m saying and assume I’m sidestepping the need for strategy, as if I were saying “we’ll just YOLO A/B test our way to greatness.” No. You can’t A/B test your way to greatness. A/B testing has little to nothing to do with strategy and vision.

Evaluation (A/B testing, etc.) is more like unit testing. You would never mistake your unit tests for a strategy. Having some sense of your goals is a prereq. With regular software development, you simulate input, assert exactly what you expect, and check whether the output met those goals. However, in a field like search relevance, recommender systems, or chatbots, the number of possible inputs is vast. It could correspond to any user’s purchase history, any search query, any chat message. Do you write a billion test cases and ensure each response does what it’s supposed to? And how do you even know what a chatbot is supposed to do when someone enters “YOLO my Spkr is teh broke!?”, times a million other weird queries?
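To make the contrast concrete, here’s a toy sketch (pytest-style, with a hypothetical search() stub standing in for your engine, not any real API) of what literal unit testing of search would look like:

```python
# Hypothetical: search() stands in for a call to your real engine and returns
# ranked result titles; wire it up to whatever search stack you actually run.
def search(query: str) -> list[str]:
    raise NotImplementedError("call your search engine here")

def test_diamond_ring_query():
    # Fine for one hand-picked query with a known "right" answer...
    results = search("diamond ring")
    assert results[0] == "1 Carat Diamond Solitaire Ring"

# ...but there is no assert to write for "YOLO my Spkr is teh broke!?",
# let alone for the millions of other queries real users will type.
```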

You can’t constrain the problem space by traditional unit testing.

Instead of ‘assert’, you measure the business goal you hope to achieve. The millions of ‘tests’ become ‘things we need to analyze with data science tools’. You need to know: did we actually achieve strategic goal X? Did we harm strategic constraints Y and Z? Is our data representative of millions of queries? We get incrementally better, going from a conversion rate of 2% to 2.02%. Can we trust that gain? Is it significant? Is the underlying data biased? We iterate and iterate, making our “unit testing” better and better within the existing strategic goals.
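That “can we trust the gain?” question, for instance, is a statistics question rather than an assert. Here’s a rough sketch of it (my own illustration with made-up numbers, using statsmodels’ two-proportion z-test, not anything tied to a specific product):

```python
# Illustrative only: is a 2% -> 2.02% conversion lift statistically significant?
from statsmodels.stats.proportion import proportions_ztest

conversions = [20_000, 20_200]       # control ~2.00%, variant ~2.02%
sessions = [1_000_000, 1_000_000]    # a million sessions per arm

stat, p_value = proportions_ztest(conversions, sessions)
print(f"z = {stat:.2f}, p = {p_value:.3f}")
# Even at this scale, p comes out around 0.3, so we cannot yet trust that gain.
```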

BUT… unit testing isn’t a product strategy.

So many constraints exist beyond our initial goals. If we just care about conversions, what do we do when fake diamonds are the top-bought product for diamond searches? Do we still serve it, despite the arguable harm to our brand perception? What about cases earlier in the user’s journey, where searches don’t naturally lead to conversions? Do we try to push a purchase when users are just browsing? What about times when search interfaces don’t even generate clicks, as in cases of good abandonment?

Well, your head may already be broken just from thinking about both the strategy and the evaluation here. A simple strategic goal of ‘increase conversion’ can be fraught with complication.

You can easily get lost: arguing about all the ways people will use search (or recommendations, chatbots, …) can become boiling the ocean. We can have deep philosophical discussions about the nature of the problem. Or we can…

Just get started on our broken definition of the problem and iterate as we go

And most importantly, we must be skeptics of what ‘correct’ means - or even that there can be ‘correct’. We must take our data-“driven” unit tests with several heapings of salt. Our current evaluation always lives within a broken view of the problem. Nevertheless, we need some kind of north star to optimize towards and prevent going backwards, even a simple one like manual labels from coworkers entered into a tool like Quepid.
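As a deliberately simple illustration of such a north star, here’s a sketch of turning coworker relevance grades into an NDCG@10 number you can track over time. This isn’t Quepid’s API, just the flavor of metric those judgment lists feed:

```python
import math

def ndcg_at_k(grades: list[int], k: int = 10) -> float:
    """NDCG@k over relevance grades (e.g. 0-3) of results in ranked order."""
    def dcg(gs):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gs[:k]))
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

# Made-up grades a coworker might give the top results for one query:
print(round(ndcg_at_k([3, 2, 0, 3, 1]), 2))  # 0.94, a broken-but-useful north star
```

Average something like that over a few hundred judged queries and you have a number to optimize toward, while staying skeptical about what ‘correct’ meant when the grades were assigned.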

And that perhaps is my perspective. You can’t move without some kind of goal and measurement framework, however broken. You work at that goal a bit, only to realize it’s not quite the right goal. You take in other, qualitative forms of feedback about how search is broken beyond your evaluation framework. You revisit the strategy, adding that nuance, then revisit your evaluation (the unit tests), and finally the solution.

Good search teams build the plane while they fly it. They ‘unit test’ and ship within the constraints of the current best guess at a strategy. But they don’t shut their ears to qualitative data sources pointing at deeper strategic flaws. Teams only learn those flaws by shipping. Instead of data-driven, we must be data-informed, or even data skeptics. Teams must be prepared to say “no” to winners of A/B tests when other, qualitative business concerns override them, to question how evaluation is performed, and to constantly revisit strategic north stars when new qualitative data arrives.

It’s messy. We don’t proceed in perfection. Through the muck, we need opinions, direction, and measurement: imperfect evaluation measures to move forward with, carefully deciding when to keep the blinders on and when to take a step back and question everything.

This is the way.

If you’re interested in this topic, or in machine learning powered search more generally, be sure to check out my course on ML Powered Search.

Special Thanks to Kim Falk for reviewing this post and giving substantive edits and feedback!

Doug Turnbull

More from Doug
Twitter | LinkedIn | Mastodon
Doug's articles at OpenSource Connections | Shopify Eng Blog