Elasticsearch has a `match_phrase` query which lets you search for a phrase and match any document containing that exact phrase.
However, matching a phrase is not the same as matching the full, entire field with your phrase.
For example, if I search for:
{
"query": {
"match_phrase": {
"title": "Steam deck review"
}
}
}
I’ll match BOTH of these documents, as they both contain the phrase “steam deck review”:
Title: Steam deck reviews
Title: Steam deck review from a PC gamer
While BM25 scoring will tend to rank the shorter field higher, you sometimes want to ONLY match the first one here.
As I write about in Relevant Search, lexical search is all about engineering signals. Or, in a pure machine learning context, it’s about feature engineering. Knowing confidently whether a field fully matches the query is a useful ranking signal, a useful feature for a reranker / learning to rank, or just something to use manually to prioritize the match.
Why not a keyword field?
Elasticsearch lets you just disable tokenization altogether when indexing and searching documents. So we can take the full string (with a bit of lowercasing) for matching, like:
{
"analyzer": {
"as_keyword": {
"tokenizer": "keyword",
"filter": ["lowercase"]
}
}
}
Title: Steam deck reviews
(token: `steam deck reviews`)
Title: Steam deck review from a PC gamer
(token: `steam deck review from a pc gamer`)
- GOOD: ✅ If a query comes in for `steam deck reviews`, we will match exactly on the indexed `steam deck reviews` token and NOT the longer `steam deck review from a pc gamer` token.
- BAD: ❌ If a query comes in for the slightly different `steam deck review`, we WILL NOT match the indexed `steam deck reviews` at all.
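For reference, wiring the as_keyword analyzer into an index might look something like this (a sketch; the settings/mappings body and the title field are just for illustration):
{
"settings": {
"analysis": {
"analyzer": {
"as_keyword": {
"tokenizer": "keyword",
"filter": ["lowercase"]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "as_keyword"
}
}
}
}
A plain match query against title then behaves like a whole-field, lowercased string comparison, which is exactly the behavior described above.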
So when I say full match, it’s NEITHER ‘exact’ NOR ‘phrase’. What I want is:
- Tokenize the query and document into individual tokens (allowing for simple variations in word forms, stemming, etc)
- Match the query as a phrase
- AND… only match when the first token of my query phrase matches the first token of the indexed text, and the final token of my query matches the last indexed token
Moreover, as a bonus, such a system might also let us match on edit distance: potentially scoring the document by how far the field’s phrase is from the query phrase. Phrase search already has this in the concept of “phrase slop”.
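For example, a match_phrase query with slop tolerates a bit of positional movement between the query terms and the indexed phrase, and scores closer matches higher (same title field as above, slop value just for illustration):
{
"query": {
"match_phrase": {
"title": {
"query": "Steam deck review",
"slop": 1
}
}
}
}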
Sentinel tokens before / after the text
The astute Relevant Search fanboy will know that in the book, I mention one technique where we inject some kind of artificial token at the beginning and end of each token stream, similar to the CLS and SEP tokens you see in BERT.
We add a regex char_filter to insert these special tokens:
"char_filter": {
"sentinel_tokens": {
"type": "pattern_replace",
"pattern": "^(.*)$",
"replacement": "__SENTINEL_BEGIN__ $1 __SENTINEL_END__"
}
},
Then we recreate the Elasticsearch English analyzer, adding in our char_filter
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_keywords": {
"type": "keyword_marker",
"keywords": []
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
},
"english_stemmer": {
"type": "stemmer",
"name": "english"
}
},
"analyzer": {
"configured_analyzer": {
"char_filter": ["sentinel_tokens"],
"tokenizer": "standard",
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_keywords",
"english_stemmer"
]
}
}
Now we should have:
Title: Steam deck reviews
(tokens: `__SENTINEL_BEGIN__ steam deck review __SENTINEL_END__`)
Title: Steam deck review from a PC gamer
(tokens: `__SENTINEL_BEGIN__ steam deck review from a PC gamer __SENTINEL_END__`)
We issue a `match_phrase` query against this field, and it works!
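Concretely, assuming title is mapped with the configured_analyzer above, the query stays an ordinary match_phrase; since the same analyzer (char_filter included) runs at search time, the sentinels get injected into the query phrase too:
{
"query": {
"match_phrase": {
"title": "Steam deck reviews"
}
}
}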
- GOOD: ✅ We match, including token variants due to stemming etc., when our query analyzes to `__SENTINEL_BEGIN__ steam deck review __SENTINEL_END__`
- BAD: ❌ We have added a token with very high cardinality in our index, possibly creating performance problems
Functionally, this works! And it meets all our requirements. Further, it pretty much leaves the token stream alone and lets us do matches within the token stream, edit distance (phrase slop) matches, etc.
Unfortunately, search engines tend to perform poorly when we ask them to match on terms that occur in (nearly) every document, and our sentinel tokens will occur in every doc in the index. I’ve heard from some readers that this approach caused them performance problems.
Concat all the tokens together after analysis
Usually we just care about doing this on short fields. One thing we could do is tokenize, stem, etc., and then concat the tokens back into one uber-token. It’s like the keyword approach, but allowing for some per-token stemming, stopwording, or whatever else we want to do.
At the end of our token stream we want to generate one giant token with all the terms concatenated. Essentially we “untokenize” the text.
There is no “concat” token filter in Elasticsearch. However, we can use shingles – term n-grams – to achieve a similar effect. Basically, for a reasonably short field like a title (where this use case matters most), generate all term n-grams from 2 up to 10 or so (10 being the max query length we expect to match).
So we add this token filter:
"filter":{
...
"big_shingles": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 10
},
}
For `Steam deck review from a PC gamer`, this would generate all kinds of shingles like `[steam deck]`, `[steam deck review]`, `[deck review]`…
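You can eyeball these shingles with the _analyze API. A sketch, assuming the request is sent to the _analyze endpoint of an index whose settings define big_shingles (and whose max_shingle_diff has been raised, as discussed below):
{
"tokenizer": "standard",
"filter": ["lowercase", "big_shingles"],
"text": "Steam deck review from a PC gamer"
}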
We really want the shingle that corresponds to ALL the terms concatenated together. We can achieve this by first keeping only the shingles at position 0, with a predicate filter on the token position:
"filter":{
...
"only_position_zero": {
"type": "predicate_token_filter",
"script": {
"source": "token.position == 0"
},
},
Now we just have the tokens that start at position 0 (`[steam deck]`, `[steam deck review]`…) and not mid-field shingles like `[deck review]`.
Ideally we could now pluck out only the longest token at position 0, but there’s no easy way to do this. So instead we can just fingerprint (take all unique tokens).
"filter":{
...
"shingle_fingerprint": {
"type": "fingerprint",
"max_output_size": 1000,
"separator": "+"
}
Now we have one big token that looks like `steam deck+steam deck review+ ... +steam deck review from a PC gamer`
Taken together:
"analyzer": {
"configured_analyzer": {
"tokenizer": "standard",
"filter": [
"<english steps, see previous>",
# **********
# ADDED:
"big_shingles",
"only_position_zero",
"shingle_fingerprint"
]
}
},
Searching with `Steam deck reviews`, we will generate a single token at query time, `steam deck+steam deck review`, and need to match on this.
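At query time, a plain match query should do the job, since the search analyzer collapses the query text into that single fingerprint token before matching (a sketch, assuming title is mapped with the analyzer above):
{
"query": {
"match": {
"title": "Steam deck reviews"
}
}
}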
- GOOD: ✅ Full field matching without the high cardinality token
- BAD: ❌ Complex shingles may be performance intensive
- BAD: ❌ Long terms may take longer to match
- BAD: ❌ No ability for phrase slop matching
So while this works, it comes with the downsides mentioned above. Indeed, we have to set an index setting, `max_shingle_diff`, to allow for the long, complex shingles.
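With min_shingle_size of 2 and max_shingle_size of 10 the difference is 8, well over the default limit of 3, so the index settings need something like:
"settings": {
"index": {
"max_shingle_diff": 8
},
...
}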
Open question: while there is a `limit` token filter, we cannot use it to keep only the longest token. It just arbitrarily keeps the first tokens it sees in the stream, with no way to express a preference.
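For reference, the limit filter in question is configured something like this (the filter name keep_first_token is just illustrative); it keeps the first max_token_count tokens it encounters and drops the rest:
"filter": {
"keep_first_token": {
"type": "limit",
"max_token_count": 1,
"consume_all_tokens": false
}
}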
Conditionally ‘marking’ begin and end tokens
One thing that could help us do a ‘full match’ is to somehow modify the first (and last?) tokens to look a bit different.
For example, after our analysis is done, we can easily uppercase the first token:
"analyzer": {
"first_upper": {
"tokenizer": "standard",
"filter": [
# Normal English analysis
"lowercase",
"english_stop",
"english_keywords",
"english_stemmer",
# Uppercase the first term
"upper_begin",
]
}
},
"filter": {
...
"upper_begin": {
"type": "condition",
"filter": "uppercase",
"script": {
"source": "token.position == 0"
}
},
}
Our indexed token stream will look like the following for two docs:
STEAM deck review
STEAM deck review from a pc gamer
And when we search for “steam deck review”, we have a query tokenized as `[STEAM] [deck] [review]`.
Not quite what we want though, as this will match both documents.
If we could somehow ALSO uppercase the last term and/or just have a sentinel token for this term, we could get closer.
I.e., can we do this?
STEAM deck REVIEW
STEAM deck review from a pc GAMER
Or?
STEAM deck review SENTINEL_END
STEAM deck review from a pc gamer SENTINEL_END
The latter is easiest to accomplish. Unfortunately, there’s not a clean way to detect the last token in the token stream, but we can append some marker text to the end of the string with a char filter:
"char_filter": {
"append_end_of_data": {
"type": "pattern_replace",
"pattern": "(^.*$)",
"replacement": f"$1END_OF_DATA"
}
},
Then perhaps, later, conditionally uppercase this term:
"mark_begin_end": {
"type": "condition",
"filter": "uppercase",
"script": {
"source": f"""
((token.position == 0) ||
(token.term.length() > {len('END_OF_DATA')}
&& token.term.subSequence(token.term.length() - {len('END_OF_DATA')}, token.term.length()).equals('{'END_OF_DATA'.lower()}')))
"""
},
},
However, this comes with other complications. We have to do this step immediately after lowercasing (so that END_OF_DATA itself doesn’t get mangled by stemming, while still letting the final term be stemmed). Sadly, many stemmers, like Porter, do not work on upper-cased words, so the marked token won’t be stemmed. Perhaps there is another way to mark this token? Please let me know if you think of one.
- GOOD: ✅ Full field matching without at least one of the high cardinality tokens
- BAD: ❌ Need to munge/mark last token, which is not straightforward
- BAD: ❌ OR… need to add a high cardinality token to the end
Any ideas?
Let me know your thoughts about this problem. Have you tried the sentinel approach and had it work successfully? Have you found a way to mitigate the downsides of the other full-matching approaches?
Contact me and let me know.