Really excited to announce that AI Powered Search is released!
It all started in 2018… at Haystack in a much ‘cozier’ search relevance community…
AI in the before times
Back then, in the before times, my friend and collaborator Trey Grainger invited me to help out with his new book AI Powered Search. Trey had been a key reviewer (and wrote the foreword) on my first book Relevant Search. Still a “relevant” book on ye olde lexical search.
ML techniques were timeless, we figured, meaning we could take our time with this book. Math never changes!
We weren’t wrong, but we weren’t right either. Everything about “AI” has changed.
At the time, “AI” meant something closer to “classic Machine Learning” with a marketing spin: the feedback loop of gathering training data, training, and deploying models. Despite this feeling more “classic,” today we still don’t appreciate the difficulty behind ML: getting reliable training data, exploring how and where new features might help, measuring their performance, and operationalizing them. Rinse and repeat. Of course, it’s not the model itself that’s hard (the `import sklearn... model.fit()`). It’s the thousands of lines before and after that matter.
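To make that concrete, here’s a toy sketch (every helper and dataset here is made up) of just how small the model code is next to everything around it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# --- The thousands of lines "before", stubbed out. In reality: mining
# --- clickstreams, cleaning judgments, debiasing, feature engineering...
def load_judgments():
    rng = np.random.default_rng(42)
    X = rng.normal(size=(1000, 5))   # e.g. BM25 score, recency, popularity...
    y = (X[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)
    return X, y

X, y = load_judgments()

# --- The part everyone fixates on ---
model = LogisticRegression()
model.fit(X, y)

# --- The thousands of lines "after": offline evaluation, deployment,
# --- monitoring, retraining. Stubbed as a single accuracy check.
print("training accuracy:", model.score(X, y))
```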
Ranking itself is an odd duck in Machine Learning. We often teach ML in terms of regression or classification, but ranking is a different formulation: a learned sort, where getting the top result correct often matters more than the 50th. Classic algorithms like SVMRank and LambdaMART reformulate ranking over tabular features. And now, with embeddings, we often reach for cross-encoders.
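For the curious, here’s a minimal sketch of the pairwise transform at the heart of SVMRank-style learning to rank (the features and grades below are made up): the ranking problem becomes a classification problem over document pairs.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Feature vectors for documents under one query, e.g. [BM25 score, recency]
X = np.array([[0.9, 0.2], [0.4, 0.8], [0.1, 0.1]])
grades = np.array([2, 1, 0])  # relevance judgments, higher = better

# Pairwise transform: each pair of docs with different grades becomes a
# training example; label +1 if the first doc should rank above the second
pairs, labels = [], []
for i in range(len(X)):
    for j in range(len(X)):
        if grades[i] > grades[j]:
            pairs.append(X[i] - X[j]); labels.append(1)
            pairs.append(X[j] - X[i]); labels.append(-1)

svm = LinearSVC().fit(np.array(pairs), np.array(labels))

# The learned weights define a scoring function; sorting by score is the "learned sort"
scores = X @ svm.coef_.ravel()
print(np.argsort(-scores))  # indices of docs, most relevant first
```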
Then search systems have even more challenges. There are massive biases in clickstream training data: users click on top results. If your legacy search stinks, you’ll never get positive labels; if your search is amazing, you’ll never get negative examples! And any time you add a new feature to your ranking system, you’ll probably upset the applecart.
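Position bias is the classic example. One common mitigation (among several) is inverse propensity weighting: divide each click by the estimated probability the user even examined that position. A toy sketch, with made-up examination probabilities:

```python
import numpy as np

# Hypothetical examination probabilities by rank, e.g. estimated from a
# result-randomization experiment: rank 1 is almost always seen, rank 5 rarely
examine_prob = np.array([1.0, 0.7, 0.45, 0.3, 0.2])

# Observed clicks on one query's results
clicks = np.array([1, 0, 1, 0, 0])

# Inverse propensity weighting: a click at rank 3 had to overcome long odds
# of being seen at all, so it counts for more than a click at rank 1
ipw_labels = clicks / examine_prob
print(ipw_labels)  # [1.0, 0.0, 2.22, 0.0, 0.0]
```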
At the time Solr and Elasticsearch were all the rage, and I was involved in Elasticsearch Learning to Rank, so I was invited to write about all this stuff. In 2020-2021 I intentionally chose the simplest model architecture (SVMRank) to focus on the deeper ML challenges: trying new ranking features, debiasing clickstream data, iterating on model quality. Balancing explore/exploit in search ranking, showing users prospectively relevant results just to explore the training data and feature space. Moving a traditional ML problem toward an active, and eventually reinforcement, learning problem.
I’m really proud of the three chapters I wrote on Learning to Rank (two just on training data!!) in AI Powered Search, as they’ll remain applicable as long as users type in keywords and get a list of results.
Enter ChatGPT
That was all state of the art, at least among what was relatively ubiquitous. But times changed in 2022 with the release of ChatGPT. Vector databases and indexes became a first stop for new AI engineers. Embeddings are now ubiquitous. Today many of us work on RAG systems, where an LLM deeply understands a user’s lengthy prompt, extracts complex query metadata, and searches with multiple embeddings, all at insane speed. Query understanding is seemingly made easier, but expectations for search quality have never been higher.
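Sketched as code, that loop looks something like the following; every function here is a hypothetical stand-in for an LLM call or a vector index, not any particular library’s API:

```python
# Toy corpus standing in for a real vector index
DOCS = {
    "doc1": "BM25 is a lexical ranking function.",
    "doc2": "Cross-encoders score query-document pairs jointly.",
}

def llm_extract(prompt: str) -> dict:
    # Stand-in: a real system prompts an LLM for structured query metadata
    return {"keywords": prompt.lower()}

def vector_search(keywords: str, k: int = 2) -> list[str]:
    # Stand-in: a real system embeds the query and searches a vector index
    return [t for t in DOCS.values()
            if any(w in t.lower() for w in keywords.split())][:k]

def llm_generate(prompt: str, context: list[str]) -> str:
    # Stand-in: a real system calls an LLM with the retrieved context
    return f"Answer to {prompt!r}, grounded in {len(context)} retrieved docs."

query = llm_extract("what is bm25")
print(llm_generate("what is bm25", vector_search(query["keywords"])))
```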
Both a new sense of “AI” and a brand-new way of searching (RAG) were born overnight.
That’s why I’m so grateful for both Trey’s and Max Irwin’s efforts. They really worked to revamp the book for these new use cases: embedding-first approaches to search, alongside the strengths of classic keyword systems.
Because here’s why this book matters: as people get into Information Retrieval from an embeddings-first direction, they “back into” some classic problems. We’ve seen the revenge of BM25 in these AI years. We’ve seen the rediscovery of the importance of evaluation. And I suspect we’ll see a rediscovery of classic ML approaches like LambdaMART for their appropriate use cases, complementing the two-tower models and cross-encoders so familiar to us today.
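As a taste of that complementarity, here’s a toy sketch of one popular hybrid trick, reciprocal rank fusion (RRF), which merges a BM25 ranking with an embedding ranking (the doc IDs and lists are made up):

```python
def rrf(rankings, k=60):
    """Fuse multiple ranked lists; k=60 is the constant from the RRF paper."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            # Each list votes for a doc with weight 1/(k + rank)
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]    # lexical results
vector_ranking = ["doc_c", "doc_a", "doc_d"]  # embedding results
print(rrf([bm25_ranking, vector_ranking]))    # fused hybrid ranking
```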
Trey in particular took balancing all of this on his back. He’s to be commended, and you’d not go wrong to seek out his search expertise via his company SearchKernel. After my measly 3 chapters were written, he dove in to really rethink the book, its structure, and how to speak to the RAG audience, while still keeping it applicable to a core search audience.
I’m really grateful that he was kind enough to include me in this journey, and let me have my name on a second cover.
Anyway, I hope you’ll check it out. If you’re passionate about the field, join us in the search relevance Slack community or get in touch on LinkedIn. I’m always happy to grab coffee and learn from new colleagues.