My Pandas text search extension, SearchArray, encodes term positions as roaring-like sparse bitsets. This allows for phrase search in a memory efficient and fast manner using some basic bittwiddling.

Given these two documents:

Posn. 0    1      2 3      4     5      6      7      8
doc1: Mary had    a little lamb, little lamb,  little lamb.
doc2: Tom  hugged a little lamb  at     the    farm   yesterday.

SearchArray packs these positions into a bitset, marking position 1 where the term occurs, and 0 where it does not.

IE, assuming sane tokenization, given lamb and little, for our 9 docs, we map the terms positions to specific locations in the bits:

Posn.   0    1      2 3      4     5      6      7      8
doc1:   Mary had    a little lamb, little lamb,  little lamb.
little  0    0      0 1      0     1      0      1      0    
lamb:   0    0      0 0      1     0      1      0      1
doc2:   Tom  hugged a little lamb  at     the    farm   yesterday.
lamb:   0    0      0 0      1     0      0      0      0
little: 0    0      0 1      0     0      0      0      0   

Or just the bitset, looking at doc1:

doc1
little: 000101010
  lamb: 000010101    

Now counting phrase matchs for “little lamb” is as simple as taking lamb’s bitset, left shifting it by 1, and AND’ing lamb with little. Finally we can count the number of bits (aka CPU instruction popcount)… then naturally pass this onto BM25 for scoring.

doc1
little     : 000101010
  lamb << 1: 000101010  <- lamb left shifted 1 bit
             ---------
    ANDed  : 000101010  <- (lamb << 1) & little
   popcount: 3          <- this is the phrase frequency -> for BM25 scoring

As these are sparse bitmasks, we can us a roaring-bitmap like algorithm to encode them. As this is an inverted index, given a set of terms, we want to look across all of the documents with our terms to collect the entire index’s phrase freqs. But that’s a topic for my previous article on the roaringish algorithm.

Getting sloppy

SearchArray has not had functional phrase slop. That is allowing edit distance between the terms. The normal phrase “little lamb” matching “little lamb” directly has slop=0. It’s just a boring phrase. But “lamb little” would be an edit distance of 1 (slop=1). A term in between - “little ___ lamb” - would also be slop 1.

To make your head hurt, think of searching for the phrase “what is the purpose of life” but with slop 5. There are so many combinations, edge cases, and considerations.

First of all, if you look at this document below, you’ll see there are two candidate phrase matches of slop=5.

What do you think? What is the purpose of life? What! It's 42?
  1. Candidate one: the direct “what is the purpose of life”, slop = 0
  2. Candidate two: the sloppier “what __ __ __ ___ is the purpose of life”, slop = 4.
  3. Candidate three: the sloppier “is the purpose of life? What!”

You can see the term is from the document could participate in multiple variations!

We call these term groupings spans as in they span multiple terms and might participate in a position-aware match.

Starting small, building up - min spans

We might assume, reasonably, for most use cases, we want to first count the lower slop matches, and then ensure that those terms no longer can participate in higher slop matches.

What if our document instead was the following, with a slighly lower slop match embedded:

doc1:    What do you think? What the? Is the purpose of life 42? What!
   what: 1    0  0   0      1    0    0  0   0       0  0    0   1
     is: 0    0  0   0      0    0    1  0   0       0  0    0   0
    the: 0    0  0   0      0    1    0  1   0       0  0    0   0
purpose: 0    0  0   0      0    0    0  0   1       0  0    0   0 
     of: 0    0  0   0      0    0    0  0   0       1  0    0   0 
   life: 0    0  0   0      0    0    0  0   0       0  1    0   0 

One guess might be we’d want to find spans by collecting the smallest possible regions where every term’s bit is set - we’ll call this a min span.

For example the region of 11 bits underlined below is NOT a min span, there are spans within it.

   doc1: What do you think? What the? Is the purpose of life 42? What!
   what: 1    0  0   0      1    0    0  0   0       0  0    0   1
     is: 0    0  0   0      0    0    1  0   0       0  0    0   0
    the: 0    0  0   0      0    1    0  1   0       0  0    0   0
purpose: 0    0  0   0      0    0    0  0   1       0  0    0   0 
     of: 0    0  0   0      0    0    0  0   0       1  0    0   0 
   life: 0    0  0   0      0    0    0  0   0       0  1    0   0 
         ------------------------------------------------
                     (11 bits)
                  --not a min span--

We know this is not a min span, because we can find a smaller span within it:

   doc1: What do you think? What the? Is the purpose of life 42? What!
   what: 1    0  0   0      1    0    0  0   0       0  0    0   1
     is: 0    0  0   0      0    0    1  0   0       0  0    0   0
    the: 0    0  0   0      0    1    0  1   0       0  0    0   0
purpose: 0    0  0   0      0    0    0  0   1       0  0    0   0 
     of: 0    0  0   0      0    0    0  0   0       1  0    0   0 
   life: 0    0  0   0      0    0    0  0   0       0  1    0   0 
                            -----------------------------
                                   ( 7 bits)
                               --actually a min span--

Also the following 8 bits are a min-span, we could not shrink it further and have at least one set bit/position for each term.

   doc1: What do you think? What the? Is the purpose of life 42? What!
   what: 1    0  0   0      1    0    0  0   0       0  0    0   1
     is: 0    0  0   0      0    0    1  0   0       0  0    0   0
    the: 0    0  0   0      0    1    0  1   0       0  0    0   0
purpose: 0    0  0   0      0    0    0  0   1       0  0    0   0 
     of: 0    0  0   0      0    0    0  0   0       1  0    0   0 
   life: 0    0  0   0      0    0    0  0   0       0  1    0   0 
                                 ---------------------------------
                                         (8 bits)

We can’t know yet which minspan is best. It’s possible the tiny span of 7 positions span has every term out of order, while the larger 8 position span has every term in order with two unrelated terms in-between.

Searching for a phrase of length 6 (what is the purpose of life), for a min-span of size 7:

The best case is that slop = 1. All terms are in order, one term is in between. The worst case is (I believe) slop len - 1 + slep len - 2 + ... slop = 6 + 5 + 4 + 3 + 2 + 1 = 21. Every term is in reverse order and one term is in between.

In general, when we find these minimal spans, the range of possible slops ranges from span_length - phrase_len to span_length - 1.

So if we score the 3 candidate spans we’ve found so far:

Span len Min slop Max slop Min Span?
11 bits 5 55 No
8 bits 2 28 Yes
7 bits 1 21 Yes

Since our search is slop = 5, we need to compute slops for the latter two min spans before deciding which is best.

It turns out actual slop of the 7 bit span is the best one (slop=1). If we eliminate these positions now, the other span no longer exists, and we move on with a match for this document.

   doc1: What do you think? What the? Is the purpose of life 42? What!
   what: 1    0  0   0      0    0    0  0   0       0  0    0   1
     is: 0    0  0   0      0    0    0  0   0       0  0    0   0
    the: 0    0  0   0      0    1    0  0   0       0  0    0   0
purpose: 0    0  0   0      0    0    0  0   0       0  0    0   0 
     of: 0    0  0   0      0    0    0  0   0       0  0    0   0 
   life: 0    0  0   0      0    0    0  0   0       0  0    0   0 
                            -----------------------------
                            ( span collected - bits 0d)
                                 ---------------------------------
                                         ( no longer a span)

Then we repeat with the next min span until we’ve exhausted the min-spans in this document.

To sum up, a rough sketch of the algorithm, given phrase of N terms, and a slop of s

  1. Find all candidate min-spans from length N to N + s
  2. Compute slops of each min span, if slop < s then:
  3. Collect the smallest slops as a match
  4. Clear bits of those spans
  5. Remove any min spans that are no longer spans (or if bits have been removed)
  6. GOTO 1.

Each of these, in the context of a large sparse bitset, have complications and challenges to do in a fast way.

  1. How do we find minspans efficiently?
  2. How do we score a minspans slop, accounting for both order and terms being placed in between?
  3. Does deleting bits in the collect span create any new (possibly larger) min spans that should now be candidates for collection?
  4. How do we do all this, over an entire corpus, in a small set of numpy bit twiddling operations?
  5. How do we find minspans have terms overlapping positions (ie synonyms) or are identical (ie “to be or not to be”)?

I’ll have to leave those for a future post. My hacking+headaches on this with SearchArray are certain to have just begun 🫣.


Doug Turnbull

More from Doug
Twitter | LinkedIn | Mastodon
Doug's articles at OpenSource Connections | Shopify Eng Blog