Synonyms

Simon edited this page Aug 19, 2016 · 3 revisions

Without mapping and re-indexing

    1. In .corc, add the elasticsearchAdmin property and define the host with a user who can update the settings of the Elasticsearch database:
  "elasticsearchAdmin": {
    "apiVersion": "2.3",
    "host": ""
  }
    1. Run node synonyms/deploy-synonyms.js.
    1. To add new synonyms, edit the file synonyms.json and repeat step 2.

With mapping (and re-indexing)

How synonyms are defined

This section is based on the Elasticsearch synonyms documentation.

Synonyms analyzer

Synonyms are defined with analyzers. An analyzer is the process which normalises a string. It is composed of a tokenizer (which creates tokens from the string) and filters (which transform the tokens). A very basic way to see analyzers is as a composition of the functions split and map:

"Hello There".split(' ').map(token => {
	return token.toLowerCase()
})

Here split is the tokenizer and the function "toLowerCase" is the filter.

So a synonyms analyzer transforms a specific token into the list of its matching synonyms.
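The split/map picture can be extended with a synonym step. The sketch below only illustrates the idea and is not how Elasticsearch implements its analysis chain; the `synonymGroups` table and the function names are made up for this example.

```javascript
// Hypothetical synonym table: every word in a group is a synonym of the others.
const synonymGroups = [["train", "locomotive"]];

// Filter: replace a token with its whole synonym group, or keep it unchanged.
function expandToken(token) {
  const group = synonymGroups.find(g => g.includes(token));
  return group ? group : [token];
}

// Analyzer = tokenizer (split) + filters applied in order (lowercase, then synonyms).
function analyzeWithSynonyms(text) {
  return text
    .split(" ")
    .filter(token => token.length > 0)
    .map(token => token.toLowerCase())
    .flatMap(expandToken);
}

console.log(analyzeWithSynonyms("Hello Train"));
// [ 'hello', 'train', 'locomotive' ]
```

The lowercase step runs before the synonym step, which is why "Train" still matches the lowercase entry in the table.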

Define the synonym analyzer

In Elasticsearch, analyzers are defined in the settings:

{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "train, locomotive"
          ]
        }
      },
      "analyzer": {
        "synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "synonym_filter"
          ]
        }
      }
    }
  }
}

Here the filter "synonym_filter" is defined with the type "synonym" and a list of synonyms given as an array. The analyzer "synonyms" then uses the "lowercase" filter followed by our "synonym_filter" filter. The filters are applied in order.

To define the analyzer in an index you need to:

    1. close the index
    1. update the settings of the index
    1. open the index

The script in synonyms/deploy-synonyms.js is a simple way to update the settings of the synonyms analyzer: it executes steps 1, 2 and 3 automatically.
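As an illustration of these three steps, the sketch below builds the REST calls involved. It is not the content of deploy-synonyms.js: the function name is hypothetical, and it assumes the synonyms are available as an array of rule strings. The paths are the standard close, update settings and open endpoints of Elasticsearch.

```javascript
// Describe the three REST calls needed to change the synonyms of an index.
// Assumption: `synonyms` is an array of rules such as ["train, locomotive"].
function buildSynonymDeploySteps(indexName, synonyms) {
  const index = encodeURIComponent(indexName);
  return [
    // 1. Analysis settings can only be changed on a closed index.
    { method: "POST", path: `/${index}/_close` },
    // 2. Replace the list of synonyms used by "synonym_filter".
    {
      method: "PUT",
      path: `/${index}/_settings`,
      body: {
        analysis: {
          filter: { synonym_filter: { type: "synonym", synonyms } }
        }
      }
    },
    // 3. Reopen the index so that it can be searched again.
    { method: "POST", path: `/${index}/_open` }
  ];
}

for (const step of buildSynonymDeploySteps("theindex", ["train, locomotive"])) {
  console.log(step.method, step.path);
}
```

Each step would be sent in order to the host configured in elasticsearchAdmin, and the index is unavailable for searches between the close and the open.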

The synonyms deployed by the script are applied only at query time, so the database doesn't need to be re-indexed. However, this first version might struggle with multi-word synonyms like "rolling stock". To resolve this problem we need to understand how synonyms are represented and what the difference is between index time and query time.

Three formats to represent synonyms (expand vs contract)

Simple expansion

"train, locomotive, convoy"

One keyword is matched to a list of keywords: if we search for "train", the query replaces "train" with the list "train, locomotive, convoy".

Simple contraction

"train, locomotive, convoy => train"

Several keywords are matched to only one keyword: if we search for "locomotive", the query replaces it with "train". So here we reduce a list of keywords to one keyword.

Genre expansion

"cat => cat, pet"

One keyword is matched to a more general list of keywords.

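To make the three formats concrete, here is a small sketch that parses a rule and shows what a single keyword is replaced with. It only mimics the behaviour described above for single-word rules; `parseSynonymRule` and `applyRule` are hypothetical helpers, not part of Elasticsearch.

```javascript
// Parse one rule into a lookup table: input keyword -> keywords that replace it.
function parseSynonymRule(rule) {
  const toList = text =>
    text.split(",").map(word => word.trim().toLowerCase()).filter(Boolean);
  const table = new Map();
  if (rule.includes("=>")) {
    // "a, b => c, d": each keyword on the left is replaced by the right-hand list.
    const [left, right] = rule.split("=>");
    const replacements = toList(right);
    for (const word of toList(left)) table.set(word, replacements);
  } else {
    // "a, b, c": each keyword is replaced by the full list.
    const words = toList(rule);
    for (const word of words) table.set(word, words);
  }
  return table;
}

// Keywords not mentioned on the input side of the rule are left unchanged.
function applyRule(rule, keyword) {
  return parseSynonymRule(rule).get(keyword) || [keyword];
}

console.log(applyRule("train, locomotive, convoy", "train"));           // [ 'train', 'locomotive', 'convoy' ]
console.log(applyRule("train, locomotive, convoy => train", "convoy")); // [ 'train' ]
console.log(applyRule("cat => cat, pet", "cat"));                       // [ 'cat', 'pet' ]
console.log(applyRule("cat => cat, pet", "pet"));                       // [ 'pet' ]
```

The last call shows that genre expansion is one-directional: "cat" is widened to "pet", but "pet" is not narrowed to "cat".
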
Index time vs Query time

When a document is added to an index, Elasticsearch applies the analyzers and adds every keyword they produce to the inverted index. For example, if the synonyms analyzer is defined with "train,locomotive" and the added document contains "train", Elasticsearch also adds the keyword "locomotive" and links it to the new document. This process happens at index time.

When a query (a search) is executed, the search terms are analyzed, and if any of them has synonyms the query is expanded to search for all the synonyms in the inverted index. This process happens at query time.
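The difference can be sketched with a toy inverted index. This is a simplified model under stated assumptions (whitespace tokenizer, a hard-coded `SYNONYMS` table for the rule "train,locomotive", made-up documents), not a description of Elasticsearch internals.

```javascript
// Hypothetical synonym table for the rule "train,locomotive".
const SYNONYMS = new Map([
  ["train", ["train", "locomotive"]],
  ["locomotive", ["train", "locomotive"]]
]);

const tokenize = text => text.toLowerCase().split(/\s+/).filter(Boolean);
const expand = tokens => tokens.flatMap(token => SYNONYMS.get(token) || [token]);

// Build an inverted index: keyword -> set of ids of the documents containing it.
function buildIndex(docs, analyze) {
  const index = new Map();
  docs.forEach((text, id) => {
    for (const token of analyze(text)) {
      if (!index.has(token)) index.set(token, new Set());
      index.get(token).add(id);
    }
  });
  return index;
}

// Return the sorted ids of the documents matching any analyzed search term.
function search(index, query, analyze) {
  const hits = new Set();
  for (const token of analyze(query)) {
    for (const id of index.get(token) || []) hits.add(id);
  }
  return [...hits].sort((a, b) => a - b);
}

const docs = ["A steam train", "An old locomotive", "A red bus"];

// Index time: synonyms are stored in the index, the query is left as typed.
const expandedIndex = buildIndex(docs, text => expand(tokenize(text)));
console.log(search(expandedIndex, "locomotive", tokenize)); // [ 0, 1 ]

// Query time: the index holds the original words, the query is expanded.
const plainIndex = buildIndex(docs, tokenize);
console.log(search(plainIndex, "locomotive", text => expand(tokenize(text)))); // [ 0, 1 ]
```

Both approaches find the same documents here. The practical difference is that changing the synonyms of `expandedIndex` means rebuilding it, while `plainIndex` never changes.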

Apply the synonyms analyzer to an index

To apply the synonyms to the documents at index time we need to tell Elasticsearch which fields the analyzer applies to. For that we define a mapping which uses the analyzer. "Mapping is the process of defining how a document, and the fields it contains, are stored and indexed". However, once the mapping of a field has been defined, Elasticsearch can't automatically change it for the documents already indexed. The only way to do this is to re-index the database with the new mapping.
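As a rough sketch, a mapping that uses the analyzer could be generated as follows. The document type "item" and the field names are placeholders, not the real fields of this project; the shape assumes Elasticsearch 2.x, where text fields have the type "string" and are declared under a document type.

```javascript
// Build the "mappings" part of an index definition for Elasticsearch 2.x.
// typeName and fieldNames are placeholders chosen by the caller.
function buildSynonymMapping(typeName, fieldNames, analyzerName = "synonyms") {
  const properties = {};
  for (const field of fieldNames) {
    // "analyzer" is used at index time and, unless "search_analyzer"
    // is also set, at query time as well.
    properties[field] = { type: "string", analyzer: analyzerName };
  }
  return { mappings: { [typeName]: { properties } } };
}

console.log(JSON.stringify(buildSynonymMapping("item", ["title", "description"]), null, 2));
```

The analyzer name must match the name under "analyzer" in the index settings ("synonyms" in the settings shown earlier).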

Solution for multi-word synonyms

To resolve this problem we need to use the simple contraction format to define the synonyms. We also need to apply the synonyms at index time AND query time.

"train, locomotive => train": in the index, both "train" and "locomotive" are represented by "train".

A query with the keyword "locomotive" or "train" is transformed into a "train" query. This query then searches the index for "train" and returns any document that contained "train" or "locomotive".

So here we can map a multi-word keyword such as "rolling stock" to a single simplified synonym, "train" for example.
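The sketch below illustrates why contraction helps with multi-word keywords: a sequence of tokens is collapsed into one keyword, and because the same analysis runs on documents and on queries, both sides end up with "train". The rules and helper names are hypothetical and the matching is deliberately naive.

```javascript
// Hypothetical contraction rules: a token sequence is replaced by one keyword.
const contractionRules = [
  { from: ["rolling", "stock"], to: "train" },
  { from: ["locomotive"], to: "train" }
];

// Replace every sequence of tokens matching a rule by the rule's keyword.
function contractTokens(tokens) {
  const output = [];
  let i = 0;
  while (i < tokens.length) {
    const rule = contractionRules.find(r =>
      r.from.every((word, offset) => tokens[i + offset] === word)
    );
    if (rule) {
      output.push(rule.to);
      i += rule.from.length; // skip all the tokens consumed by the rule
    } else {
      output.push(tokens[i]);
      i += 1;
    }
  }
  return output;
}

const analyzeContracted = text =>
  contractTokens(text.toLowerCase().split(/\s+/).filter(Boolean));

console.log(analyzeContracted("New rolling stock delivered")); // [ 'new', 'train', 'delivered' ]
console.log(analyzeContracted("rolling stock"));               // [ 'train' ]
console.log(analyzeContracted("locomotive"));                  // [ 'train' ]
```

A document containing "rolling stock" and a query for "locomotive" both reduce to the token "train", so they match.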

Steps to implement the multi-word solution

This solution needs to be carried out by an admin, as the database needs to be re-indexed and the settings of the index need to be updated.

    1. Define and update the synonym analyzer in the settings of Elasticsearch:
curl -XPUT "host/theindex/_settings" -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "train, locomotive"
          ]
        }
      },
      "analyzer": {
        "synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "synonym_filter"
          ]
        }
      }
    }
  }
}'

You will need to repeat this step each time a new synonym is added to the filter.

    1. Define the mapping for the fields where the synonyms must apply. We currently search on four fields, so we need to define the mapping for at least these fields. Changing the mapping of the index requires re-indexing the database with the new mapping. Another way to add the synonyms is the multi-field feature, but it means creating another field in the database. Re-indexing the database seems to be the cleaner solution. See the Elasticsearch guide on how to re-index the database.
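One possible shape for the re-indexing step is sketched below. It assumes a cluster on version 2.3 or later, where the `_reindex` API is available; the guide mentioned above may describe a different procedure. The function name and the index names are hypothetical.

```javascript
// Describe the REST calls for copying an index into a new one that has
// the synonym analyzer (settings) and the mapping that uses it (mappings).
function buildReindexSteps(oldIndex, newIndex, settings, mappings) {
  return [
    // 1. Create the new index with its settings and mappings in one call.
    {
      method: "PUT",
      path: `/${encodeURIComponent(newIndex)}`,
      body: { settings, mappings }
    },
    // 2. Copy every document; each one is analyzed again with the new mapping.
    {
      method: "POST",
      path: "/_reindex",
      body: { source: { index: oldIndex }, dest: { index: newIndex } }
    }
  ];
}

const reindexSteps = buildReindexSteps("theindex", "theindex_v2", {}, {});
for (const step of reindexSteps) {
  console.log(step.method, step.path);
}
```

After the copy, the application has to be pointed at the new index, for example by changing the index name in its configuration.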