BM25 models the odds a term would be observed in a relevant document (vs the term occurring in an irrelevant doc).

It’s based on probabilistic relevance, capturing:

  • t - a query term match occurs
  • R - the doc is relevant

Queries of course contain multiple terms. How do we combine those odds? The odds of BOTH terms being in a relevant doc, we’d need to multiply Odds(t1) * Odds(t2).

If we take the log of these multiplied odds, we can take advantage of a property of logarithms: log(Odds(t1) * Odds(t2)) = log(Odds(t1)) + log(Odds(t2)). Or taken together across each term:

BM25 tries to model this!

How?

We can imagine cases where a term match would occur in a relevant doc (the numerator). Why might a term match occur an in an irrelevant doc?

  • The term’s ambiguous, not specific to the user’s intent. luke skywalker - luke occurs all over the place. It’s common. The rare term skywalker occurs only in only Star Wars → thats how we get Inverse Document Frequency.
  • The document is not ABOUT that term - A term occurring spuriously (low term frequency) would not mean the document is relevant. Or a term not being a large enough proportion of the text: Skywalker being mentioned a few times in a book length tome may just happen accidentally.

Through trial, error, and experimentation with open datasets we arrived at the exact BM25 formula.

That’s how we get a BM25 that’s probabilistic, but not a probability

My AI Powered Search training with Trey Grainger starts THURSDAY - signup here

-Doug

This is part of Doug’s Daily Search tips - subscribe here


Enjoy softwaredoug in training course form!

Starting June 22!

I hope you join me at Cheat at Search with LLMs to learn how to apply LLMs to search applications. Check out this post for a sneak preview.

Doug Turnbull

More from Doug
Twitter | LinkedIn | Newsletter | Bsky
Take My New Course - Cheat at Search with LLMs