Hybrid search means combining lexical and vector search results into one result listing.
“We’ll just use Reciprocal Rank Fusion,” I’m sure I’ve said from time to time.
As if RRF is some kind of “a miracle occurs” step. You get the best of both worlds, and suddenly your search looks incredible.
Take the query “hello to the planet”. Let’s say we start with reasonable results from a vector search system (follow along in this notebook):
vector_sim | texts | vector_rank |
---|---|---|
0.19054140351577573 | greetings to the people of the earth | 1 |
0.18714326530195094 | hello to the planets in my empire | 2 |
0.18575998354351458 | hello world | 3 |
0.176119269155595 | hello my world | 4 |
0.17393706389759572 | hi Planet Earth | 5 |
0.16546218247899153 | hello to my planet, where I lost my keys. | 6 |
0.16345108862553018 | hello to the planet where I keep my stuff, a beautiful place with trees. | 7 |
0.16196139040721674 | the planet says hello to bees | 8 |
0.16190546834847355 | hello mars! | 9 |
0.1486092230250017 | Hello to Terra! | 10 |
0.09424116471867336 | tomorrow is the first day of the rest of your life | 11 |
0.05778716802233709 | belching is a bad habit | 12 |
These are not bad. All the top results have to do with greeting ‘the planet’, primarily the planet Earth.
We might notice a minor improvement we could make.
What if the user actually remembers a document beginning with the phrase “hello to the planet”? Specifically, they want the document beginning with “hello to the planet where I keep my stuff…”. If we added some lexical search, we might promote it to the top.
To perform RRF, we also need each document’s BM25 rank. Then we can merge the two rankings with:
RRF_score = 1/vector_rank + 1/bm25_rank
Easy enough.
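As a quick sketch of the formula above, here is a tiny RRF over rank dictionaries. The function name `rrf` and its `k` parameter are our own illustration, not any particular library’s API. With `k=0` it matches the formula as written; the original RRF formulation adds a smoothing constant to each rank (commonly k = 60), which shrinks the gap between rank 1 and rank 2. The ranks are copied from the merged table that follows (document index to rank).

```python
from collections import defaultdict


def rrf(rankings, k=0):
    """Fuse several rankings with Reciprocal Rank Fusion.

    rankings: list of dicts mapping doc_id -> 1-based rank in that ranking.
    k: smoothing constant added to every rank. k=0 matches the formula above;
       the original RRF formulation uses a constant such as k=60.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for doc_id, rank in ranking.items():
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# doc index -> rank, taken from the merged bag-of-words table
vector_rank = {9: 1, 5: 2, 0: 3, 4: 4, 7: 5, 2: 6, 6: 7, 1: 8, 3: 9, 8: 10, 11: 11, 10: 12}
bm25_rank = {1: 1, 5: 2, 6: 3, 9: 4, 2: 5, 8: 6, 7: 7, 11: 8, 0: 9, 3: 10, 4: 11, 10: 12}

for doc_id, score in rrf([vector_rank, bm25_rank])[:3]:
    print(doc_id, round(score, 4))  # → 9 1.25, then 1 1.125, then 5 1.0
```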
We do this, and run our A/B test, only to see… 📉💥😢 Not Stonks.
What happened? We look under the hood at this specific query again, and…
index | vector_sim | texts | vector_rank | bm25_sim | bm25_rank | rrf_score |
---|---|---|---|---|---|---|
9 | 0.19054140351577573 | greetings to the people of the earth | 1 | 0.8085092902183533 | 4 | 1.25 |
1 | 0.16196139040721674 | the planet says hello to bees | 8 | 1.2901966571807861 | 1 | 1.125 |
5 | 0.18714326530195094 | hello to the planets in my empire | 2 | 1.2078437805175781 | 2 | 1.0 |
6 | 0.16345108862553018 | hello to the planet where I keep my stuff, a beautiful place with trees. | 7 | 0.8348331451416016 | 3 | 0.47619047619047616 |
0 | 0.18575998354351458 | hello world | 3 | 0.26555198431015015 | 9 | 0.4444444444444444 |
2 | 0.16546218247899153 | hello to my planet, where I lost my keys. | 6 | 0.7465024590492249 | 5 | 0.3666666666666667 |
7 | 0.17393706389759572 | hi Planet Earth | 5 | 0.49154359102249146 | 7 | 0.34285714285714286 |
4 | 0.176119269155595 | hello my world | 4 | 0.24279040098190308 | 11 | 0.34090909090909094 |
8 | 0.1486092230250017 | Hello to Terra! | 10 | 0.6388745307922363 | 6 | 0.26666666666666666 |
11 | 0.09424116471867336 | tomorrow is the first day of the rest of your life | 11 | 0.4355449378490448 | 8 | 0.2159090909090909 |
3 | 0.16190546834847355 | hello mars! | 9 | 0.26555198431015015 | 10 | 0.2111111111111111 |
10 | 0.05778716802233709 | belching is a bad habit | 12 | 0.0 | 12 | 0.16666666666666666 |
Huh, the results got WORSE!!
What happened!?
Well, the BM25 results kind of suck for this query, actually contradicting the already really good vector search results:
index | texts | bm25_sim | bm25_rank |
---|---|---|---|
1 | the planet says hello to bees | 1.2901966571807861 | 1 |
5 | hello to the planets in my empire | 1.2078437805175781 | 2 |
6 | hello to the planet where I keep my stuff, a beautiful place with trees. | 0.8348331451416016 | 3 |
9 | greetings to the people of the earth | 0.8085092902183533 | 4 |
2 | hello to my planet, where I lost my keys. | 0.7465024590492249 | 5 |
8 | Hello to Terra! | 0.6388745307922363 | 6 |
7 | hi Planet Earth | 0.49154359102249146 | 7 |
11 | tomorrow is the first day of the rest of your life | 0.4355449378490448 | 8 |
0 | hello world | 0.26555198431015015 | 9 |
3 | hello mars! | 0.26555198431015015 | 10 |
4 | hello my world | 0.24279040098190308 | 11 |
10 | belching is a bad habit | 0.0 | 12 |
We’re getting the worst-case scenario for bag-of-words results. The first result literally has nothing to do with greeting the planet; it just happens to contain all the same words.
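To see why bag-of-words retrieval stumbles here, consider what it actually compares. This minimal sketch uses a deliberately simple lowercase-and-split tokenizer (an assumption, not the analyzer from the notebook) and shows that every query term appears in “the planet says hello to bees”. Word order, and with it the meaning, is invisible to the model.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-letters (a simplified stand-in for a real analyzer)."""
    return re.findall(r"[a-z]+", text.lower())


def matched_terms(query, doc):
    """Query terms that occur anywhere in the document, regardless of order."""
    doc_terms = Counter(tokenize(doc))
    return {term for term in tokenize(query) if doc_terms[term] > 0}


print(sorted(matched_terms("hello to the planet", "the planet says hello to bees")))
# → ['hello', 'planet', 'the', 'to']

# A bag of words is just term counts, so reordering the words changes nothing:
print(Counter(tokenize("hello to the planet")) == Counter(tokenize("the planet to hello")))
# → True
```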
RRF’ing bad search into good search will just drag down the good search. To improve search, you actually have to take care that both result sets are relevant.
How to use RRF
Use RRF, however, when you actually have distinct, disjoint sources of relevant search results, each tuned for high precision.
If we change our BM25 solution to do a phrase search instead of a bag-of-words query, we improve the precision of those results and the overall experience:
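A phrase query only matches documents where the query terms appear contiguously and in order. Search engines do this with positional indexes (Elasticsearch’s `match_phrase` query, for example); the sketch below just applies the same check to raw strings. The trailing-`s` stripping is a crude, hypothetical stand-in for a stemmer, included so that “planets” matches “planet” as in the table below. BM25 scoring of the matching documents is left out.

```python
import re


def tokenize(text):
    """Lowercase, split on non-letters, and strip a trailing plural 's'.

    The plural stripping is a crude, hypothetical stand-in for a real stemmer.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def contains_phrase(text, phrase):
    """True if the phrase's tokens appear contiguously, in order, in the text."""
    doc, needle = tokenize(text), tokenize(phrase)
    n = len(needle)
    return any(doc[i:i + n] == needle for i in range(len(doc) - n + 1))


docs = [
    "the planet says hello to bees",
    "hello to the planets in my empire",
    "hello to the planet where I keep my stuff, a beautiful place with trees.",
    "hello to my planet, where I lost my keys.",
]
# Only these documents would go on to be scored and ranked by BM25
phrase_hits = [d for d in docs if contains_phrase(d, "hello to the planet")]
print(phrase_hits)
```

The bag-of-words winner, “the planet says hello to bees”, no longer matches at all, while both documents that really start with the phrase do.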
index | vector_sim | texts | vector_rank | bm25_sim | bm25_rank | rrf_score |
---|---|---|---|---|---|---|
5 | 0.18714326530195094 | hello to the planets in my empire | 2 | 1.2078437805175781 | 1 | 1.5 |
9 | 0.19054140351577573 | greetings to the people of the earth | 1 | 0.0 | 3 | 1.3333333333333333 |
6 | 0.16345108862553018 | hello to the planet where I keep my stuff, a beautiful place with trees. | 7 | 0.8348331451416016 | 2 | 0.6428571428571428 |
0 | 0.18575998354351458 | hello world | 3 | 0.0 | 4 | 0.5833333333333333 |
4 | 0.176119269155595 | hello my world | 4 | 0.0 | 5 | 0.45 |
7 | 0.17393706389759572 | hi Planet Earth | 5 | 0.0 | 6 | 0.3666666666666667 |
2 | 0.16546218247899153 | hello to my planet, where I lost my keys. | 6 | 0.0 | 7 | 0.30952380952380953 |
1 | 0.16196139040721674 | the planet says hello to bees | 8 | 0.0 | 8 | 0.25 |
3 | 0.16190546834847355 | hello mars! | 9 | 0.0 | 9 | 0.2222222222222222 |
8 | 0.1486092230250017 | Hello to Terra! | 10 | 0.0 | 10 | 0.2 |
11 | 0.09394807204762956 | tomorrow is the first day of the rest of your life | 11 | 0.0 | 11 | 0.18181818181818182 |
10 | 0.05778716802233709 | belching is a bad habit | 12 | 0.0 | 12 | 0.16666666666666666 |
When we have retrieval sources built on very different technologies, we increase the likelihood of disjoint results. Now, if we bias BOTH toward their highest degree of precision, intentionally remove weird results, and let each focus on a different, plausible use case, we improve recall AND can trust the RRF score to reflect overall relevance.
This runs a bit counter to the conventional wisdom about combining retrieval sources. We usually say we want to cast a wide net at these early retrieval layers. But maybe, in the end, RRF is a great way to combine two precise retrieval sources into one precise result set with a bit higher recall?
In this way, RRF improves recall and not precision?
Instead of RRF, first understand intent, then choose the best solution
In my opinion, a better path is to redefine the problem.
What’s the user’s intent with this query? Do they:
(a) want text similar to the “hello world” text, or
(b) look up a piece of text that uses this phrase?
Based on historical data, it’d be better to probabilistically decide which intent is more likely, then route the query to the system best suited to handle it.
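One hedged way to “probabilistically decide” from historical data: count how often each intent was observed for the query (for example from labeled sessions; the counts, intent names, and system names below are hypothetical), smooth the estimate so an unseen intent never gets zero probability, and route to the most likely one.

```python
def intent_probabilities(counts, alpha=1.0):
    """Estimate P(intent | query) from historical counts with add-alpha smoothing.

    counts: intent name -> how often that intent was observed for this query.
    """
    total = sum(counts.values()) + alpha * len(counts)
    return {intent: (n + alpha) / total for intent, n in counts.items()}


def route(counts, systems):
    """Return the system serving the most likely intent, plus the estimated probabilities."""
    probs = intent_probabilities(counts)
    best_intent = max(probs, key=probs.get)
    return systems[best_intent], probs


# Hypothetical historical counts for the query "hello to the planet"
counts = {"similar_text": 79, "phrase_lookup": 19}
systems = {"similar_text": "vector_search", "phrase_lookup": "phrase_search"}

system, probs = route(counts, systems)
print(system, probs)  # → vector_search {'similar_text': 0.8, 'phrase_lookup': 0.2}
```

Rather than only picking a winner, the same probabilities can also serve as weights when both systems share the page.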
Perhaps we decide it’s 80% (a) vs. 20% (b). Then we dedicate roughly 80% of our screen space to (a) and 20% to (b). We can now weight RRF accordingly:
RRF_score = (80/vector_rank) + (20/bm25_rank)
We can keep going. Why think in terms of “vector search” and “BM25 search” at all? We ought to think in terms of intent:
RRF_score = (80/user_wants_semantically_similar_text) + (20/user_wants_to_closely_match_the_words)
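Both weighted formulas above are the same computation with different labels. Here is a minimal sketch, assuming each intent is served by one ranking and the weights are the screen-space shares; the function name is hypothetical. One deliberate difference from the earlier tables: only documents that actually matched the phrase get a phrase rank, so non-matches contribute nothing from that side.

```python
from collections import defaultdict


def weighted_rrf(rankings_by_intent, weights):
    """score(doc) = sum over intents of weights[intent] / rank of doc in that intent's ranking.

    rankings_by_intent: intent -> {doc_id: 1-based rank} from the system serving that intent.
    weights: intent -> weight, e.g. the share of screen space given to that intent.
    A document missing from a ranking gets no contribution from it.
    """
    scores = defaultdict(float)
    for intent, ranking in rankings_by_intent.items():
        for doc_id, rank in ranking.items():
            scores[doc_id] += weights[intent] / rank
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Vector ranks from the tables above (doc index -> rank); phrase ranks only for phrase matches
vector_rank = {9: 1, 5: 2, 0: 3, 4: 4, 7: 5, 2: 6, 6: 7, 1: 8, 3: 9, 8: 10, 11: 11, 10: 12}
phrase_rank = {5: 1, 6: 2}

fused = weighted_rrf(
    {
        "user_wants_semantically_similar_text": vector_rank,
        "user_wants_to_closely_match_the_words": phrase_rank,
    },
    {"user_wants_semantically_similar_text": 80, "user_wants_to_closely_match_the_words": 20},
)
print([doc_id for doc_id, _ in fused[:4]])  # → [9, 5, 0, 6]
```

With 80/20 weights, “hello to the planet where I keep my stuff…” (index 6) lands fourth instead of seventh, while the vector results still dominate the top of the page.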
That is, we might generalize AWAY from thinking in terms of vector search and lexical search, and toward systems that solve the user’s specific problems: toward query understanding. We’ve always done this in search. Perhaps hybrid search simply means ‘choosing the right ranking solution for the job’.
Perhaps the REAL hybrid search has been inside of us all along ❤️