SearchArray adds boring, old BM25 full text search to the PyData stack.
A full-text search index comes with one underappreciated problem: phrase search. In this article, I want to recap a roaringish phrase search algorithm, encoding positions in a roaring-like numpy array for fast bit intersections.
But before such magic, I’m ashamed to say, SearchArray started with very primitive approaches. SearchArray took almost 5 minutes(!!) for a phrase search to complete on 100k MSMarco docs. Luckily, I’ve since gotten those numbers down quite a bit.
SearchArray refresher
Let’s remember how SearchArray works. It’s built as a Pandas extension array that represents an inverted index, demoed below.
In [1]: import pandas as pd
   ...: from searcharray import SearchArray
In [2]: df = pd.DataFrame({"text": ["mary had a little lamb the lamb ate mary",
   ...:                             "uhoh little mary dont eat the lamb it will get revenge",
   ...:                             "the cute little lamb ran past the little lazy sheep",
   ...:                             "little mary ate mutton then ran to the barn yard"]})
In [3]: df['text_indexed'] = SearchArray.index(df['text'])
In [4]: df
Out[4]:
text text_indexed
0 mary had a little lamb the lamb ate mary Terms({'little', 'mary', 'a', 'lamb', 'had', '...
1 uhoh little mary dont eat the lamb it will get... Terms({'eat', 'it', 'will', 'little', 'revenge...
2 the cute little lamb ran past the little lazy ... Terms({'little', 'sheep', 'lazy', 'cute', 'lam...
3 little mary ate mutton then ran to the barn yard Terms({'barn', 'little', 'mary', 'mutton', 'ya...
Now we can get a BM25 score for each row of text_indexed, looking for the phrase “little lamb”:
In [7]: df['text_indexed'].array.score(["little", "lamb"])
Out[7]: array([0.19216356, -0. , 0.18430232, -0. ])
Under the hood of text_indexed, we’ve got an inverted index that looks like:
mary:
docs:
- 0:
posns: [0, 8]
- 1:
posns: [2]
little:
docs:
- 0:
posns: [3]
- 1:
posns: [1]
- 2:
posns: [2, 7]
- 3:
posns: [0]
lamb:
docs:
- 0:
posns: [4, 6]
- 1:
posns: [6]
- 2:
posns: [3]
If a user searches text_indexed for lamb, we can recover its term frequencies (2 occurrences in doc 0, 1 each in docs 1 and 2) and its document frequency (lamb occurs in 3 docs): everything we need for a BM25 score.
But what about phrase search? How do we compute not just term frequencies, but phrase frequencies?
How would we count all the places where “little” comes one position before “lamb”? Looking at doc 2 above: if little occurs at positions [2, 7] (the 3rd and 8th tokens) and lamb at position [3], how do we detect that lamb has one case of occurring exactly one position after little?
Bigram by bigram search
To implement phrase search, we connect the dots, one bigram at a time.
If a user searches for “mary had a little lamb”, we first connect mary -> had. We then repeat the process, except now we find where mary+had bigram positions occur immediately before a’s. If the mary+had bigram ends at positions [1, 6] in a document, and that document has an a at position [2], then the mary+had+a trigram occurs ending at position [2].
We continue on, finding the 4-gram mary+had+a+little. Maybe little occurs at position [3] in this doc. Now we have narrowed to a mary+had+a+little 4-gram ending at [3].
Now we connect mary+had+a+little positions to lamb, getting to the end of our phrase. The number of times mary+had+a+little occurs immediately before lamb is our phrase frequency.
It’s this connecting detail that concerns us. How do we find where one set of positions occurs immediately before another set of positions (ie mary before had, etc)? How do we do this fast enough to make SearchArray usable?
From Boring to Roaring
Let’s dig into how to intersect two sets of positions.
Early, naive versions of SearchArray stored each doc/term’s position integers directly as raw integer arrays. I tried many different ways of finding bigram adjacencies, from subtracting one term’s positions from another to intersecting the positions.
Storing the integers directly takes up way too much memory. For a meaty search dataset, storing giant positional arrays per term is simply a non-starter for meaningful work.
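For a flavor of the naive approach, an adjacency check for one doc might look something like this sketch (an illustration of the idea, not SearchArray’s actual early code):

import numpy as np

little_posns = np.array([2, 7])  # doc 2 from the index above
lamb_posns = np.array([3])
# Count places where a 'lamb' position is exactly one after a 'little'
phrase_freq = np.isin(little_posns + 1, lamb_posns).sum()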
Time to rethink things.
I knew a bit about roaring bitmaps, which store sets of integers compactly as bitmaps. I won’t take you through a tutorial of roaring bitmaps themselves. What I did was roaring-adjacent, or roaringish. Let’s walk through that.
The main intuition I took from roaring bitmaps was the following (you can follow along in this notebook):
Let’s say we want to pack positions into an array of 32-bit integers. Looking just at one term, we have:
import numpy as np
posns = np.array([1, 5, 20, 21, 31, 100, 340], dtype=np.uint32)
groups = posns // 16
values = (posns % 16)
groups, values
Output:
(array([ 0, 0, 1, 1, 1, 6, 21], dtype=uint32),
array([ 1, 5, 4, 5, 15, 4, 4], dtype=uint32))
By dividing by 16 (groups) and taking the modulo by 16 (values), we’ve segmented the positions into groups. Let’s zoom in group-by-group:
Positions grouped by posn // 16:
-------------------------
Group 0: | Group 1: | Group 6: | Group 21:
values: 1, 5 | 4, 5, 15 | 4 | 4
We can bitwise OR each group’s values together into a 16-bit value per group:
Group 0 Group 1 ... Group 21
0000000000100010 1000000000110000 ... 0000000000010000
(bits 1 and 5 set) (bits 4, 5, 15 set) (bit 4 set)
Moving now to a 32-bit value, we can store the group number in the upper 16 bits:
Group 0 | Group 1 ... | Group 21
0000000000000000 0000000000100010 | 0000000000000001 1000000000110000 ... | 0000000000010101 0000000000010000
(group 0) (bits 1 and 5 set) | (group 1) (bits 4, 5, 15 set) ... | (group 21) (bit 4 set)
We can do this in numpy, pretty easily, continuing the code from above:
# Find the indices where a new MSB group starts
change_indices = np.nonzero(np.diff(groups))[0] + 1
change_indices = np.insert(change_indices, 0, 0)
# Set each position's bit within its group
values = (1 << values)
# Pack into 7 32-bit integers: group in the upper 16 bits, value bit below
packed = (groups << 16) | values
# Combine into one integer per group by ORing at the group boundaries
packed = np.bitwise_or.reduceat(packed, change_indices)
The key here is doing a bitwise_or, but only over a specific set of indices, change_indices. This ORs values together within each group, starting a new packed integer wherever the MSBs (the group numbers) change.
And now… 7 original positions are packed into 4 32-bit values. Shrinkage!
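We can sanity-check packed by printing it in hex (output from the code above; the upper 16 bits are the group, the lower 16 the position bitmap):

print([hex(p) for p in packed])
# ['0x22', '0x18030', '0x60010', '0x150010']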
Finding bigrams through bitwise intersection
Now, given two sets of term posns, we might find their bigram frequency by simple bit arithmetic.
Given two sets of positions for little and lamb:
- Intersect the two terms’ MSBs (ie the upper 16 bits)
- Shift the first term’s lower 16 LSBs (the posns) left by 1
- Bitwise AND the LSBs
- Count the number of set bits in the 16 LSBs == phrase matches!
Walking through an example:
Say we have the term “little”, with positions encoded:
Group 0 ... | Group 21
0000000000000000 0000000000100010 ... | 0000000000010101 0000000000010100
And given “lamb”:
Group 1 ... | Group 21
0000000000000001 0000000000100010 ... | 0000000000010101 0000000000101000
We note they share a “group 21” set of posns, corresponding to posns >= 336 but < 352 where adjacent positions might be found:
Group 21, little
0000000000010101 0000000000010100
Group 21, lamb
0000000000010101 0000000000101000
We now zero in on just the LSBs, to see if any littles occur immediately before lambs in group 21 (posns 336-351). We do that simply with bit twiddling:
(0000000000010100 << 1) & (0000000000101000) = 0000000000101000
(shift left by 1 to test adjacency)  (number of set bits == num adjacent terms)
Or two phrase matches (two bits set).
All in numpy, we can just do this:
# Find where common MSBs occur between terms
_, lamb_indices, little_indices = np.intersect1d(lamb >> 16, little >> 16, return_indices=True)
# Just the lower 16 where they intersect
lamb_intersect_lsb = (lamb[lamb_indices] & 0x0000FFFF)
little_intersect_lsb = (little[little_indices] & 0x0000FFFF)
# Find direct phrase match
(little_intersect_lsb << 1) & lamb_intersect_lsb
Counting the bits of each value in the output gives us the phrase frequency.
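For the bit counting itself, here’s a minimal sketch, assuming NumPy 2.0+ for np.bitwise_count (the bit_count64 helper you’ll see later in this article fills this role on SearchArray’s side):

matches = (little_intersect_lsb << 1) & lamb_intersect_lsb
phrase_freq_per_group = np.bitwise_count(matches)  # popcount per element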
Wait there’s more! Also packing the doc id
Here’s where roaringish gets really fast.
So far we’ve looked at phrase search one document at a time. What if we could phrase match the entire corpus for two terms in a handful of numpy operations?
Indeed, we can do just that by also packing the doc id… and concatenating every doc’s packed posns together.
# Using 64 bits, put the doc id in the upper 32 (cast to uint64 first)
doc1_encoded = (np.uint64(doc_1_id) << np.uint64(32)) | packed_doc_1.astype(np.uint64)
doc2_encoded = (np.uint64(doc_2_id) << np.uint64(32)) | packed_doc_2.astype(np.uint64)
# For every doc with 'lamb', store all its packed positions
term_posns["lamb"] = np.concatenate([doc1_encoded, doc2_encoded])
Now the 48 MSBs let us intersect doc ids and bit-group MSBs at once. At search time we do the exact same as before:
lamb = term_posns["lamb"]      # Every doc's packed posns for "lamb"
little = term_posns["little"]  # Every doc's packed posns for "little"
# Find where common MSBs _and_ documents occur between terms (doc id AND bit group)
_, lamb_indices, little_indices = np.intersect1d(lamb >> 16, little >> 16, return_indices=True)
# Grab just the lower 16 bits where they intersect
lamb_intersect_lsb = lamb[lamb_indices] & 0xFFFF
little_intersect_lsb = little[little_indices] & 0xFFFF
# Get every doc's phrase matches for 'little lamb'
matches = (little_intersect_lsb << 1) & lamb_intersect_lsb
This now finds phrase matches over all documents for the two terms.
To count phrase matches by doc id, we just add a little step: count the set bits, then sum the phrase freqs at the doc id boundaries.
# Grab the intersected values so we can get doc ids (from either term, they're identical)
intersections = lamb[lamb_indices]
# Demarcate the indices where doc ids differ
doc_id_change = np.argwhere(np.diff(intersections >> 32) != 0).flatten() + 1
doc_id_change = np.insert(doc_id_change, 0, 0)
# Count set bits (phrase matches) per group, then sum them per doc id
counted_bits = bit_count64((little_intersect_lsb << 1) & lamb_intersect_lsb)
phrase_freqs = np.add.reduceat(counted_bits, doc_id_change)
We’ve now got bigram phrase frequencies for “little->lamb”. If we had a trigram “little->lamb->tacos”, we’d now repeat the same algorithm, just with little+lamb posns -> tacos.
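To make the whole pipeline concrete, here’s a tiny self-contained sketch (the pack helper is my own hypothetical stand-in, not SearchArray’s actual API) that packs the “little”/“lamb” positions from the toy index above and finds bigram matches across all docs at once:

import numpy as np

def pack(doc_id, posns):
    # Hypothetical helper: doc id in the upper 32 bits, group in the
    # next 16, position bitmap in the lower 16. Assumes sorted posns.
    posns = np.asarray(posns, dtype=np.uint64)
    groups, values = posns // 16, posns % 16
    change = np.insert(np.nonzero(np.diff(groups))[0] + 1, 0, 0)
    packed = ((np.uint64(doc_id) << np.uint64(32))
              | (groups << np.uint64(16))
              | (np.uint64(1) << values))
    return np.bitwise_or.reduceat(packed, change)

term_posns = {
    "little": np.concatenate([pack(0, [3]), pack(2, [2, 7])]),
    "lamb":   np.concatenate([pack(0, [4, 6]), pack(2, [3])]),
}
little, lamb = term_posns["little"], term_posns["lamb"]
_, lamb_idx, little_idx = np.intersect1d(lamb >> 16, little >> 16,
                                         return_indices=True)
matches = (((little[little_idx] & np.uint64(0xFFFF)) << np.uint64(1))
           & (lamb[lamb_idx] & np.uint64(0xFFFF)))
# One "little lamb" match each in docs 0 and 2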
Optimizations and annoying little issues
Our data, with doc ids in the upper bits, can be stored sorted (by doc id, then group MSBs), theoretically making intersections much faster. Numpy’s own intersect1d doesn’t take advantage of sorted data. Luckily Frank Sauerburger has us covered with a handy sortednp library for intersections, merging, and other operations on sorted numpy arrays, allowing us to find intersections with exponential search.
Another issue, of course: the algorithm I describe will miss phrase matches that straddle two MSB groups. IE:
“little” before “lamb”, but across two MSB groups:
"little" group 21
0000000000010101 1000000000000000
"lamb" group 22
0000000000010110 0000000000000001
This isn’t that hard to handle: we just need a second bit-twiddling pass, intersecting the first term’s groups + 1 against the second term’s groups and looking for this exact situation (first term’s bit 15 set, second term’s bit 0 set).
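A minimal sketch of that second pass (my own illustration, not SearchArray’s exact implementation; it ignores the corner case where adding 1 to group 0xFFFF would carry into the doc id bits):

# Intersect little's group+1 against lamb's group
_, little_idx, lamb_idx = np.intersect1d(
    (little >> 16) + np.uint64(1), lamb >> 16, return_indices=True)
# A straddling match = little's top posn bit (15) set AND lamb's bit 0 set
straddle = (((little[little_idx] >> np.uint64(15)) & np.uint64(1))
            & (lamb[lamb_idx] & np.uint64(1)))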
Additionally, my bigram intersection above walks the phrase left to right: mary -> mary+had, etc. We might be smarter and intersect the two rarest terms first, dramatically limiting the search space for the other bigrams. Maybe, for example, we note lamb and little are the rarest, so we intersect those two terms first and limit the other intersections to a much smaller search space.
Finally, this algorithm focuses on direct phrase matching. What do we do when a user asks for slop, indicating the terms can be within N positions of a direct phrase match (“little bobby lamb” still matches “little” “lamb” with a slop of 1)? This algorithm doesn’t (yet) handle that case (and all its little edge cases :) ). That’s very much WIP.
Results
The speedup from all this was impressive. On 100K very text-heavy MSMarco docs, searches on my local machine went from several seconds on a stopword-ish search (what is the) to tens of milliseconds. The primary bottleneck is now the intersection itself (snp.intersect).
The next step, I can see now, would be to optimize that layer. Sorted intersections on id sets can also be done with skip lists, which I believe is how most search engines implement this. But for now, this performance is good enough for my needs!
Additionally, memory usage for 100K MSMarco docs shrank by an order of magnitude, making it closer to reasonable to do meaty, medium-data search relevance experimentation with SearchArray.
Roaringish taking over SearchArray
A nice side effect of all this: in addition to speeding up phrase positions, I can use the packed positions to implement the rest of the inverted index:
- Term freq - count all of a term’s set positions for a document. Get term_posns["lamb"], narrow to just the doc I want (via intersecting), and count the set bits.
- Doc freq - how many unique doc ids exist in a term’s roaringish data structure? Get term_posns["lamb"] and count the unique doc ids (the 32 MSBs), as sketched below.
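A minimal sketch of both ideas (my own illustration, again assuming NumPy 2.0+ for np.bitwise_count):

lamb = term_posns["lamb"]
# Doc freq: number of unique doc ids (upper 32 bits)
doc_freq = len(np.unique(lamb >> np.uint64(32)))
# Term freq for doc 2: popcount the lower 16 bits of doc 2's rows
doc_rows = lamb[(lamb >> np.uint64(32)) == 2]
term_freq = np.bitwise_count(doc_rows & np.uint64(0xFFFF)).sum()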
This shrinks memory usage even further, and still maintains the needed speed.
That’s it for now on phrase search. But if you’re curious to dig deeper into roaringish, and see all the itty-bitty optimizations, it’s been pulled out into a utility within SearchArray. Check it out and give me some feedback.