Elasticsearch: Handling Multi-Word Phrase and Synonyms
What is auto phrasing?
Auto phrases are sequences of tokens that describe a single thing and should be searched for as such. For example, “lung cancer” represents a single search entity, not “lung” and “cancer” as two distinct entities.
Elasticsearch’s behaviour towards multi-word phrases
By default, Elasticsearch word-tokenizes a document and stores the individual words in the inverted index. For example, if your document contains “lung cancer”, it is represented in the inverted index as two separate terms, “lung” and “cancer”, each with its own postings list.
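To make this concrete, here is a minimal sketch (the documents and doc IDs are illustrative, not from any real index) of how a word tokenizer populates an inverted index of per-term postings:

```python
from collections import defaultdict

def index_documents(docs):
    """Map each token to a list of (doc_id, position) postings,
    roughly as a standard word tokenizer would."""
    inverted_index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            inverted_index[token].append((doc_id, position))
    return dict(inverted_index)

index = index_documents({1: "lung cancer research", 2: "cancer of the lung"})
# "lung cancer" is stored as two separate postings lists:
# index["lung"]   -> [(1, 0), (2, 3)]
# index["cancer"] -> [(1, 1), (2, 0)]
```

Note that the phrase “lung cancer” itself never appears as a key; only its individual words do.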
When you search for the phrase “lung cancer”, Elasticsearch does the following:
- filters all documents containing the word “lung”
- filters all documents containing the word “cancer”
- intersects the two sets, keeping documents where “lung” and “cancer” appear in adjacent positions
Notice that Elasticsearch performs three distinct operations, so performance is not optimal.
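The three steps above can be sketched as a toy phrase search over such an index (this is an illustration of the idea, not Elasticsearch’s actual implementation):

```python
def phrase_search(inverted_index, phrase):
    """Find docs containing the words of `phrase` in adjacent positions."""
    terms = phrase.lower().split()
    # Steps 1 and 2: fetch the postings list for each term.
    postings = [inverted_index.get(term, []) for term in terms]
    # Step 3: intersect by doc id, keeping only docs where the
    # positions are consecutive (i.e. the words are adjacent).
    matches = set()
    for doc_id, pos in postings[0]:
        if all((doc_id, pos + i) in postings[i] for i in range(1, len(terms))):
            matches.add(doc_id)
    return matches

index = {"lung": [(1, 0), (2, 3)], "cancer": [(1, 1), (2, 0)]}
phrase_search(index, "lung cancer")  # only doc 1 has the adjacent phrase
```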
Auto phrasing to the rescue
This is where auto phrasing jumps in. We can simply tell Elasticsearch to interpret and store “lung cancer” as a single term in the inverted index.
Now, when you search for “lung cancer”, all Elasticsearch has to do is a single lookup in the inverted index.
Also, you might be aware that properly handling multi-word synonyms is hard because Lucene’s general strategy is to break text into single tokens. Auto phrasing tricks Elasticsearch into interpreting a multi-word phrase as a single token, and thus makes it possible to use multi-word synonyms as part of index-time synonyms.
How to achieve auto phrasing
Option 1: Reducing multiple words to canonical form
You can leverage Elasticsearch’s synonyms.txt to achieve this by either:
1. Replacing multi-token terms with single-token IDs, or
2. Replacing whitespace with an underscore so that the multi-token term is interpreted as a single token.
The second approach is my personal favourite; I use it myself because I find it more intuitive and it makes my debugging life easier.
Option 2: Using a tokenizer based on your custom vocabulary
1. Use your vocabulary to create artificial boundaries in your document.
2. Use an Elasticsearch analyzer to tokenize on those boundaries.
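One way to realize Option 2 (an assumption on my part, not the only approach) is a `mapping` char filter that rewrites known vocabulary phrases before tokenization, so the standard tokenizer sees each phrase as a single underscore-joined token. The vocabulary, filter name, and analyzer name here are illustrative:

```python
vocabulary = ["lung cancer", "heart attack"]

settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "vocab_phrases": {
                    "type": "mapping",
                    # e.g. "lung cancer => lung_cancer"
                    "mappings": [f"{p} => {p.replace(' ', '_')}" for p in vocabulary],
                }
            },
            "analyzer": {
                "vocab_analyzer": {
                    "char_filter": ["vocab_phrases"],
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                }
            },
        }
    }
}
```

Because the char filter runs before the tokenizer, every phrase in your vocabulary reaches the inverted index as one term.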
Conclusion
Auto phrasing is an important tool for dealing with multi-term synonyms. It helps solve one of the more difficult problems with Lucene-based search.
By shifting the focus towards phrases rather than words, we can improve results based on ‘what’ the user is looking for, rather than just returning documents containing words that match the query. We are moving from searching with a “bag of words” to searching for a “bag of things”.