Elasticsearch: Handling Multi-Word Phrase and Synonyms
What is auto phrasing?
Auto phrases are sequences of tokens that describe a single thing and should be searched for as such. For example, “lung cancer” represents a single search entity, not “lung” and “cancer” as two distinct entities.
Elasticsearch’s behaviour towards multi-word phrases
By default, Elasticsearch word-tokenizes a document and stores the individual words in the inverted index. For example, if your document contains “lung cancer”, it is represented in the inverted index as two separate terms, “lung” and “cancer”, each with its own postings list.
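To make this concrete, here is a minimal sketch (the documents and doc IDs are illustrative, not from any real index) of how a word tokenizer populates an inverted index of per-term postings:

```python
from collections import defaultdict

def index_documents(docs):
    """Map each token to a list of (doc_id, position) postings,
    roughly as a standard word tokenizer would."""
    inverted_index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            inverted_index[token].append((doc_id, position))
    return dict(inverted_index)

index = index_documents({1: "lung cancer research", 2: "cancer of the lung"})
# "lung cancer" is stored as two separate postings lists:
# index["lung"]   -> [(1, 0), (2, 3)]
# index["cancer"] -> [(1, 1), (2, 0)]
```

Note that the phrase “lung cancer” itself never appears as a key; only its individual words do.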
When you search for the phrase “lung cancer”, Elasticsearch does the following:
- filters all documents containing the word “lung”
- filters all documents containing the word “cancer”
- intersects the two sets, keeping documents where “lung” and “cancer” appear in adjacent positions
Notice that Elasticsearch performs three distinct operations, so performance is not optimal.
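The three steps above can be sketched as a toy phrase search over such an index (this is an illustration of the idea, not Elasticsearch’s actual implementation):

```python
def phrase_search(inverted_index, phrase):
    """Find docs containing the words of `phrase` in adjacent positions."""
    terms = phrase.lower().split()
    # Steps 1 and 2: fetch the postings list for each term.
    postings = [inverted_index.get(term, []) for term in terms]
    # Step 3: intersect by doc id, keeping only docs where the
    # positions are consecutive (i.e. the words are adjacent).
    matches = set()
    for doc_id, pos in postings[0]:
        if all((doc_id, pos + i) in postings[i] for i in range(1, len(terms))):
            matches.add(doc_id)
    return matches

index = {"lung": [(1, 0), (2, 3)], "cancer": [(1, 1), (2, 0)]}
phrase_search(index, "lung cancer")  # only doc 1 has the adjacent phrase
```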
Auto phrasing to the rescue
This is where auto phrasing jumps in. We can simply tell Elasticsearch to interpret and store “lung cancer” as a single term in the inverted index.
Now, when you search for “lung cancer”, all Elasticsearch has to do is a single lookup in the inverted index.
Also, you might be aware that properly handling multi-word synonyms is hard because Lucene’s general strategy is to break text into single tokens. Auto phrasing tricks Elasticsearch into interpreting a multi-word phrase as a single token, and thus makes it possible to use multi-word synonyms as part of index-time synonyms.
How to achieve auto phrasing
Option 1: Reducing multiple words to canonical form
You can leverage Elasticsearch’s synonyms.txt to achieve this by either:
1. Replacing multi-token terms with single-token IDs, or
2. Replacing whitespace with an underscore so that the multi-token term is interpreted as a single token.
The second approach is my personal favourite; I use it myself because I find it more intuitive and it makes my debugging life easier.
Option 2: Using a tokenizer based on your custom vocabulary
1. Use your vocabulary to create artificial boundaries in your document.
2. Use an Elasticsearch analyzer to tokenize on those boundaries.
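One way to realize Option 2 (an assumption on my part, not the only approach) is a `mapping` char filter that rewrites known vocabulary phrases before tokenization, so the standard tokenizer sees each phrase as a single underscore-joined token. The vocabulary, filter name, and analyzer name here are illustrative:

```python
vocabulary = ["lung cancer", "heart attack"]

settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "vocab_phrases": {
                    "type": "mapping",
                    # e.g. "lung cancer => lung_cancer"
                    "mappings": [f"{p} => {p.replace(' ', '_')}" for p in vocabulary],
                }
            },
            "analyzer": {
                "vocab_analyzer": {
                    "char_filter": ["vocab_phrases"],
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                }
            },
        }
    }
}
```

Because the char filter runs before the tokenizer, every phrase in your vocabulary reaches the inverted index as one term.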
Conclusion
Auto phrasing is an important tool for dealing with multi-term synonyms. It helps solve one of the more difficult problems with Lucene-based search.
By shifting the focus towards phrases rather than words, we can improve results based on ‘what’ the user is looking for, rather than just returning documents containing words that match the query. We are moving from searching with a “bag of words” to searching for a “bag of things”.