layout: default
title: Multi-match
parent: Full-text queries
grand_parent: Query DSL
nav_order: 50

Multi-match queries

A multi_match query works like a match query but lets you search multiple fields with a single query string.

The ^ "boosts" certain fields. Boosts are multipliers that weigh matches in one field more heavily than matches in other fields. In the following example, a match for "wind" in the title field influences _score four times as much as a match in the plot field:

GET _search
{
  "query": {
    "multi_match": {
      "query": "wind",
      "fields": ["title^4", "plot"]
    }
  }
}

{% include copy-curl.html %}

The result is that films like The Wind Rises and Gone with the Wind are near the top of the search results, and films like Twister, which presumably have "wind" in their plot summaries, are near the bottom.
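As a rough illustration of the boost syntax, the following Python sketch formats a mapping of field names to boost values as the field strings used in the preceding request. The helper names boosted_fields and multi_match_body are hypothetical and not part of any OpenSearch client; the sketch only builds the request body and does not send it.

```python
import json


def boosted_fields(boosts):
    """Format {field: boost} pairs as multi_match field strings.

    A boost of 1 is the default, so it is left out of the field string.
    """
    fields = []
    for name, boost in boosts.items():
        fields.append(name if boost == 1 else f"{name}^{boost}")
    return fields


def multi_match_body(query, boosts):
    """Build a multi_match request body (hypothetical helper)."""
    return {
        "query": {
            "multi_match": {
                "query": query,
                "fields": boosted_fields(boosts),
            }
        }
    }


body = multi_match_body("wind", {"title": 4, "plot": 1})
print(json.dumps(body, indent=2))
```

Keeping boosts in a dictionary makes it easier to tune them in one place before serializing the request.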

You can use wildcards in the field name. For example, the following query searches the speaker field and all fields that start with play_, such as play_name or play_title:

GET _search
{
  "query": {
    "multi_match": {
      "query": "hamlet",
      "fields": ["speaker", "play_*"]
    }
  }
}

{% include copy-curl.html %}
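To make the wildcard behavior concrete, the following Python sketch approximates how a field pattern selects field names from a mapping. OpenSearch resolves the patterns on the server; this sketch only imitates the outcome with the standard fnmatch module, which also accepts ? and [] patterns that may not apply to OpenSearch field names. The helper expand_fields and the mapped field list are assumptions made for illustration.

```python
from fnmatch import fnmatchcase


def expand_fields(patterns, mapped_fields):
    """Return the mapped fields selected by each name or wildcard pattern."""
    selected = []
    for pattern in patterns:
        for field in mapped_fields:
            if fnmatchcase(field, pattern) and field not in selected:
                selected.append(field)
    return selected


# Hypothetical field names, for illustration only.
mapped = ["speaker", "play_name", "play_title", "line_number", "text_entry"]
print(expand_fields(["speaker", "play_*"], mapped))
```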

If you don't provide the fields parameter, the multi_match query searches the fields specified in the index.query.default_field setting, which defaults to *. By default, OpenSearch extracts all fields in the mapping that are eligible for term-level queries, filters out the metadata fields, and combines the extracted fields to build a query.

The maximum number of clauses in a query is defined in the indices.query.bool.max_clause_count setting, which defaults to 1,024. {: .note}

Multi-match query types

OpenSearch supports the following multi-match query types, which differ in the way the query is executed internally:

  • best_fields (default): Returns documents that match any field. Uses the _score of the best-matching field.
  • most_fields: Returns documents that match any field. Uses a combined score of each matching field.
  • cross_fields: Treats all fields as if they were one field. Processes fields with the same analyzer and matches words in any field.
  • phrase: Runs a match_phrase query on each field. Uses the _score of the best-matching field.
  • phrase_prefix: Runs a match_phrase_prefix query on each field. Uses the _score of the best-matching field.
  • bool_prefix: Runs a match_bool_prefix query on each field. Uses a combined score of each matched field.

Best fields

If you're searching for two words that specify a concept, you want the results where the two words are next to each other to score higher.

For example, consider an index that contains the following scientific articles:

PUT /articles/_doc/1
{
  "title": "Aurora borealis",
  "description": "Northern lights, or aurora borealis, explained"
}

{% include copy-curl.html %}

PUT /articles/_doc/2
{
  "title": "Sun deprivation in the Northern countries",
  "description": "Using fluorescent lights for therapy"
}

{% include copy-curl.html %}

You can search for articles containing northern lights in the title or description:

GET articles/_search
{
  "query": {
    "multi_match" : {
      "query": "northern lights",
      "type": "best_fields",
      "fields": [ "title", "description" ],
      "tie_breaker": 0.3
    }
  }
}

{% include copy-curl.html %}

The preceding query is executed as the following dis_max query with a match query for each field:

GET /articles/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "northern lights" }},
        { "match": { "description": "northern lights" }}
      ],
      "tie_breaker": 0.3
    }
  }
}

The results contain both documents, but document 1 is scored higher because both words are in the description field:

{
  "took": 30,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.84407747,
    "hits": [
      {
        "_index": "articles",
        "_id": "1",
        "_score": 0.84407747,
        "_source": {
          "title": "Aurora borealis",
          "description": "Northern lights, or aurora borealis, explained"
        }
      },
      {
        "_index": "articles",
        "_id": "2",
        "_score": 0.6322521,
        "_source": {
          "title": "Sun deprivation in the Northern countries",
          "description": "Using fluorescent lights for therapy"
        }
      }
    ]
  }
}

The best_fields query uses the score of the best-matching field. If you specify a tie_breaker, the score is calculated using the following algorithm:

Take the score of the best-matching field and add (tie_breaker * _score) for all other matching fields.
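The calculation can be sketched in a few lines of Python. The function name best_fields_score and the per-field scores are assumptions made for illustration; the scores are not taken from the preceding response.

```python
def best_fields_score(field_scores, tie_breaker=0.0):
    """Best field score plus tie_breaker times every other field score."""
    if not field_scores:
        return 0.0
    ordered = sorted(field_scores, reverse=True)
    best, others = ordered[0], ordered[1:]
    return best + tie_breaker * sum(others)


# Hypothetical per-field scores for one document.
scores = [0.8, 0.5]
print(best_fields_score(scores))        # only the best field counts
print(best_fields_score(scores, 0.3))   # other fields contribute 30% of their score
```

With a tie_breaker of 0, documents that match in several fields score the same as documents that match only in their best field; raising the value rewards additional matching fields.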

Most fields

Use the most_fields query for multiple fields that contain the same text that is analyzed in different ways. For example, the original field may contain text analyzed with the standard analyzer and another field may contain the same text analyzed with the english analyzer, which performs stemming:

PUT /articles
{
  "mappings": {
    "properties": {
      "title": { 
        "type": "text",
        "fields": {
          "english": { 
            "type": "text",
            "analyzer": "english"
          }
        }
      }
    }
  }
}

{% include copy-curl.html %}

Consider the following two documents that are indexed in the articles index:

PUT /articles/_doc/1
{
  "title": "Buttered toasts"
}

{% include copy-curl.html %}

PUT /articles/_doc/2
{
  "title": "Buttering a toast"
}

{% include copy-curl.html %}

The standard analyzer analyzes the title Buttered toasts into [buttered, toasts] and the title Buttering a toast into [buttering, a, toast]. On the other hand, the english analyzer produces the same token list [butter, toast] for both titles because of stemming.

You can use the most_fields query in order to return as many documents as possible:

GET /articles/_search
{
  "query": {
    "multi_match": {
      "query": "buttered toasts",
      "fields": [ 
        "title",
        "title.english"
      ],
      "type": "most_fields" 
    }
  }
}

{% include copy-curl.html %}

The preceding query is executed as the following Boolean query:

GET articles/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "buttered toasts" }},
        { "match": { "title.english": "buttered toasts" }}
      ]
    }
  }
}

To calculate the relevance score, a document's scores for all match clauses are added together and then the result is divided by the number of match clauses.

Including the title.english field retrieves the second document that matches the stemmed tokens:

{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.4418206,
    "hits": [
      {
        "_index": "articles",
        "_id": "1",
        "_score": 1.4418206,
        "_source": {
          "title": "Buttered toasts"
        }
      },
      {
        "_index": "articles",
        "_id": "2",
        "_score": 0.09304003,
        "_source": {
          "title": "Buttering a toast"
        }
      }
    ]
  }
}

Because both title and title.english fields match for the first document, it has a higher relevance score.

Operator and minimum should match

The best_fields and most_fields queries generate a match query on a field basis (one per field). Thus, the minimum_should_match and operator parameters are applied to each field, which is normally not the desired behavior.

For example, consider a customers index with the following documents:

PUT customers/_doc/1 
{
  "first_name": "John",
  "last_name": "Doe"
}

{% include copy-curl.html %}

PUT customers/_doc/2 
{
  "first_name": "Jane",
  "last_name": "Doe"
}

{% include copy-curl.html %}

If you're searching for John Doe in the customers index, you might construct the following query:

GET customers/_search
{
  "query": {
    "multi_match" : {
      "query": "John Doe",
      "type": "best_fields",
      "fields": [ "first_name", "last_name" ],
      "operator": "and" 
    }
  }
}

{% include copy-curl.html %}

The intent of the and operator in this query is to find a document that matches John and Doe. However, the query does not return any results. You can learn how the query is executed by running the Validate API:

GET customers/_validate/query?explain
{
  "query": {
    "multi_match" : {
      "query":      "John Doe",
      "type":       "best_fields",
      "fields":     [ "first_name", "last_name" ],
      "operator":   "and" 
    }
  }
}

{% include copy-curl.html %}

From the response, you can see that the query is trying to match both John and Doe to either the first_name or last_name field:

{
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "valid": true,
  "explanations": [
    {
      "index": "customers",
      "valid": true,
      "explanation": "((+first_name:john +first_name:doe) | (+last_name:john +last_name:doe))"
    }
  ]
}

Because neither field contains both words, no results are returned.

A better alternative for searching across fields is to use the cross_fields query. Unlike the field-centric best_fields and most_fields queries, the cross_fields query is term-centric.
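The difference between the two approaches can be illustrated with a small Python simulation of the customers example. This is a simplified sketch: the tokens function is a crude stand-in for the standard analyzer, scoring is ignored, and the helper names field_centric_and and term_centric_and are hypothetical.

```python
def tokens(text):
    """Lowercase and split on whitespace (a rough stand-in for an analyzer)."""
    return text.lower().split()


def field_centric_and(doc, query, fields):
    """Field-centric AND: all terms must appear together in a single field."""
    terms = tokens(query)
    return any(all(t in tokens(doc[f]) for t in terms) for f in fields)


def term_centric_and(doc, query, fields):
    """Term-centric AND: each term must appear in at least one field."""
    terms = tokens(query)
    return all(any(t in tokens(doc[f]) for f in fields) for t in terms)


customers = {
    "1": {"first_name": "John", "last_name": "Doe"},
    "2": {"first_name": "Jane", "last_name": "Doe"},
}
fields = ["first_name", "last_name"]

field_hits = [i for i, d in customers.items() if field_centric_and(d, "John Doe", fields)]
term_hits = [i for i, d in customers.items() if term_centric_and(d, "John Doe", fields)]
print(field_hits)  # no single field contains both terms
print(term_hits)   # only document 1 contains both terms across its fields
```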

Cross fields

Use the cross_fields query to search for data across multiple fields. For example, if an index contains customer data, the first name and last name of the customer reside in different fields. Yet, when you search for John Doe, you want to receive documents in which John is in the first_name field and Doe is in the last_name field.

The most_fields query does not work in this case because of the following problems:

  • The operator and minimum_should_match parameters are applied on a field basis instead of on a term basis.
  • Term frequencies in the first_name and last_name fields can lead to unexpected results. For example, if someone's first name happens to be Doe, a document with this name will be presumed a better match because this first name will not appear in any other documents.

The cross_fields query analyzes the query string into individual terms and then searches for each of the terms in any of the fields, as if they were one field.

The following is the cross_fields query for John Doe:

GET /customers/_search
{
  "query": {
    "multi_match" : {
      "query": "John Doe",
      "type": "cross_fields",
      "fields": [ "first_name", "last_name" ],
      "operator": "and"
    }
  }
}

{% include copy-curl.html %}

The response contains the only document in which both John and Doe are present:

{
  "took": 19,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.8754687,
    "hits": [
      {
        "_index": "customers",
        "_id": "1",
        "_score": 0.8754687,
        "_source": {
          "first_name": "John",
          "last_name": "Doe"
        }
      }
    ]
  }
}

You can use the Validate API operation to gain insight into how the preceding query is executed:

GET /customers/_validate/query?explain
{
  "query": {
    "multi_match" : {
      "query": "John Doe",
      "type": "cross_fields",
      "fields": [ "first_name", "last_name" ],
      "operator": "and"
    }
  }
}

{% include copy-curl.html %}

From the response, you can see that the query is searching for all terms in at least one field:

{
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "valid": true,
  "explanations": [
    {
      "index": "customers",
      "valid": true,
      "explanation": "+blended(terms:[last_name:john, first_name:john]) +blended(terms:[last_name:doe, first_name:doe])"
    }
  ]
}

Blending the term frequencies across all fields corrects for the differences between the per-field term frequencies.

The cross_fields query is usually only useful on short string fields with a boost of 1. In other cases, the score does not produce a meaningful blend of term statistics because of the way boosts, term frequencies, and length normalization contribute to the score. {: .note}

The fuzziness parameter is not supported for cross_fields queries. {: .note}

Analysis

The cross_fields query only works as a term-centric query on fields with the same analyzer. Fields with the same analyzer are grouped together and these groups are combined with a Boolean query.

For example, consider an index where the first_name and last_name fields are analyzed with the default standard analyzer and their .edge subfields are analyzed with an edge n-gram analyzer:

PUT customers
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "first_name": { 
        "type": "text",
        "fields": {
          "edge": { 
            "type": "text",
            "analyzer": "my_analyzer"
          }
        }
      },
      "last_name": { 
        "type": "text",
        "fields": {
          "edge": { 
            "type": "text",
            "analyzer": "my_analyzer"
          }
        }
      }
    }
  }
}

{% include copy-curl.html %}

You index one document in the customers index:

PUT /customers/_doc/1
{
  "first_name": "John",
  "last_name": "Doe"
}

{% include copy-curl.html %}

You can use a cross_fields query to search across the fields for John:

GET /customers/_search
{
  "query": {
    "multi_match" : {
      "query": "John",
      "type": "cross_fields",
      "fields": [
        "first_name", "first_name.edge",
        "last_name",  "last_name.edge"
      ]
    }
  }
}

{% include copy-curl.html %}

To see how the query is executed, you can run the Validate API:

GET /customers/_validate/query?explain
{
  "query": {
    "multi_match" : {
      "query": "John",
      "type": "cross_fields",
      "fields": [
        "first_name", "first_name.edge",
        "last_name",  "last_name.edge"
      ]
    }
  }
}

{% include copy-curl.html %}

The response shows that the last_name and first_name fields are grouped together and treated as a single field. Similarly, the last_name.edge and first_name.edge fields are grouped together and treated as a single field:

{
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "valid": true,
  "explanations": [
    {
      "index": "customers",
      "valid": true,
      "explanation": "(blended(terms:[last_name:john, first_name:john]) | (blended(terms:[last_name.edge:Jo, first_name.edge:Jo]) blended(terms:[last_name.edge:Joh, first_name.edge:Joh]) blended(terms:[last_name.edge:John, first_name.edge:John])))"
    }
  ]
}
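The terms in the preceding explanation can be reproduced with a short Python sketch. It assumes that an edge n-gram tokenizer emits the prefixes of a token whose lengths fall between min_gram and max_gram, and that fields are grouped purely by analyzer name. The helpers edge_ngrams and group_by_analyzer are hypothetical and only imitate the outcome, not the OpenSearch implementation.

```python
from collections import defaultdict


def edge_ngrams(text, min_gram=2, max_gram=10):
    """Return the prefixes of text with lengths from min_gram to max_gram."""
    longest = min(len(text), max_gram)
    return [text[:n] for n in range(min_gram, longest + 1)]


def group_by_analyzer(field_analyzers):
    """Group field names by the analyzer assigned to them."""
    groups = defaultdict(list)
    for field, analyzer in field_analyzers.items():
        groups[analyzer].append(field)
    return dict(groups)


# Analyzer assignments taken from the mapping in this section.
field_analyzers = {
    "first_name": "standard",
    "first_name.edge": "my_analyzer",
    "last_name": "standard",
    "last_name.edge": "my_analyzer",
}
print(edge_ngrams("John"))
print(group_by_analyzer(field_analyzers))
```

Because my_analyzer defines only a tokenizer and no lowercase filter, its terms keep the original capitalization (Jo, Joh, John), whereas the standard analyzer produces john.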

Using the operator or minimum_should_match parameters with multiple field groups like the preceding ones can lead to the problem described in the previous section. To avoid it, you can rewrite the previous query as two cross_fields subqueries combined with a Boolean query and apply the minimum_should_match to one of the subqueries:

GET /customers/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "John Doe",
            "type": "cross_fields",
            "fields": [
              "first_name",
              "last_name"
            ],
            "minimum_should_match": "1"
          }
        },
        {
          "multi_match": {
            "query": "John Doe",
            "type": "cross_fields",
            "fields": [
              "first_name.edge",
              "last_name.edge"
            ]
          }
        }
      ]
    }
  }
}

{% include copy-curl.html %}

To create one group for all fields, specify an analyzer in your query:

GET customers/_search
{
  "query": {
   "multi_match" : {
      "query": "John Doe",
      "type": "cross_fields",
      "analyzer": "standard", 
      "fields": [ "first_name", "last_name", "*.edge" ]
    }
  }
}

{% include copy-curl.html %}

Running the Validate API on the previous query shows how the query is executed:

{
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "valid": true,
  "explanations": [
    {
      "index": "customers",
      "valid": true,
      "explanation": "blended(terms:[last_name.edge:john, last_name:john, first_name:john, first_name.edge:john]) blended(terms:[last_name.edge:doe, last_name:doe, first_name:doe, first_name.edge:doe])"
    }
  ]
}

Phrase

The phrase query behaves similarly to the best_fields query but uses a match_phrase query instead of a match query.

The following is an example phrase query for the index described in the best_fields section:

GET articles/_search
{
  "query": {
    "multi_match" : {
      "query": "northern lights",
      "type": "phrase",
      "fields": [ "title", "description" ]
    }
  }
}

{% include copy-curl.html %}

The preceding query is executed as the following dis_max query with a match_phrase query for each field:

GET articles/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match_phrase": { "title": "northern lights" }},
        { "match_phrase": { "description": "northern lights" }}
      ]
    }
  }
}

Because by default a phrase query matches text only when the terms appear in the same order, only document 1 is returned in the results:

Response {: .text-delta}
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.84407747,
    "hits": [
      {
        "_index": "articles",
        "_id": "1",
        "_score": 0.84407747,
        "_source": {
          "title": "Aurora borealis",
          "description": "Northern lights, or aurora borealis, explained"
        }
      }
    ]
  }
}

You can use the slop parameter to allow other words between the words in the query phrase. For example, the following query accepts text as a match if up to two words are between fluorescent and therapy:

GET articles/_search
{
  "query": {
    "multi_match" : {
      "query": "fluorescent therapy",
      "type": "phrase",
      "fields": [ "title", "description" ],
      "slop": 2
    }
  }
}

{% include copy-curl.html %}

The response contains document 2:

Response {: .text-delta}
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.7003825,
    "hits": [
      {
        "_index": "articles",
        "_id": "2",
        "_score": 0.7003825,
        "_source": {
          "title": "Sun deprivation in the Northern countries",
          "description": "Using fluorescent lights for therapy"
        }
      }
    ]
  }
}

For slop values less than 2, no documents are returned.
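The following Python sketch shows why a slop of 2 is needed for this document. It is a deliberately simplified model: it handles only two query terms that appear in query order and counts the words between them, whereas Lucene also allows reordered terms at a higher slop cost. The function names are hypothetical.

```python
def min_in_order_slop(text, first, second):
    """Return the fewest words between first and second, or None if no match.

    Only occurrences where first precedes second are considered.
    """
    words = text.lower().split()
    gaps = [
        j - i - 1
        for i, w1 in enumerate(words) if w1 == first
        for j, w2 in enumerate(words) if w2 == second and j > i
    ]
    return min(gaps) if gaps else None


def matches_with_slop(text, first, second, slop):
    """Check whether the in-order gap between the terms fits within slop."""
    needed = min_in_order_slop(text, first, second)
    return needed is not None and needed <= slop


description = "Using fluorescent lights for therapy"
print(min_in_order_slop(description, "fluorescent", "therapy"))
print(matches_with_slop(description, "fluorescent", "therapy", 1))
print(matches_with_slop(description, "fluorescent", "therapy", 2))
```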

The fuzziness parameter is not supported for phrase queries. {: .note}

Phrase prefix

The phrase_prefix query behaves similarly to the phrase query but uses a match_phrase_prefix query instead of a match_phrase query.

The following is an example phrase_prefix query for the index described in the best_fields section:

GET articles/_search
{
  "query": {
    "multi_match" : {
      "query": "northern light",
      "type": "phrase_prefix",
      "fields": [ "title", "description" ]
    }
  }
}

{% include copy-curl.html %}

The preceding query is executed as the following dis_max query with a match_phrase_prefix query for each field:

GET articles/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match_phrase_prefix": { "title": "northern light" }},
        { "match_phrase_prefix": { "description": "northern light" }}
      ]
    }
  }
}

You can use the slop parameter to allow other words between the words in the query phrase.

The fuzziness parameter is not supported for phrase_prefix queries. {: .note}

Boolean prefix

The bool_prefix query scores documents similarly to the most_fields query but uses a match_bool_prefix query instead of a match query.

The following is an example bool_prefix query for the index described in the best_fields section:

GET articles/_search
{
  "query": {
    "multi_match" : {
      "query": "li northern",
      "type": "bool_prefix",
      "fields": [ "title", "description" ]
    }
  }
}

{% include copy-curl.html %}

The preceding query is executed as the following dis_max query with a match_bool_prefix query for each field:

GET articles/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match_bool_prefix": { "title": "li northern" }},
        { "match_bool_prefix": { "description": "li northern" }}
      ]
    }
  }
}

The fuzziness, prefix_length, max_expansions, fuzzy_rewrite, and fuzzy_transpositions parameters are supported for the terms that are used to construct term queries, but they do not have an effect on the prefix query constructed from the final term. {: .note}
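To illustrate how the final term is treated differently, the following Python sketch splits a query into term clauses and a prefix clause and checks them against a single field using the default OR operator. It is a simplified model that ignores scoring and fuzziness; the tokens function is a rough stand-in for an analyzer, and the helper names are hypothetical.

```python
def tokens(text):
    """Lowercase, drop commas, and split on whitespace."""
    return text.lower().replace(",", " ").split()


def bool_prefix_clauses(query):
    """Return (term clauses, prefix clause) for a query string."""
    terms = tokens(query)
    if not terms:
        return [], None
    return terms[:-1], terms[-1]


def matches_bool_prefix(field_text, query):
    """OR semantics: any term clause or the prefix clause may match."""
    exact_terms, prefix = bool_prefix_clauses(query)
    field_tokens = tokens(field_text)
    term_match = any(t in field_tokens for t in exact_terms)
    prefix_match = prefix is not None and any(
        tok.startswith(prefix) for tok in field_tokens
    )
    return term_match or prefix_match


description = "Northern lights, or aurora borealis, explained"
print(bool_prefix_clauses("li northern"))
print(matches_bool_prefix(description, "li northern"))
print(matches_bool_prefix("Using fluorescent lights for therapy", "northern li"))
```

In the query li northern, only northern is treated as a prefix; li is an exact term and does not match lights. Reordering the query to northern li makes li the prefix, so it matches lights.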

Parameters

The query accepts the following parameters. All parameters except query are optional.

Parameter | Data type | Description
:--- | :--- | :---
query | String | The query string to use for search. Required.
auto_generate_synonyms_phrase_query | Boolean | Specifies whether to create a match phrase query automatically for multi-term synonyms. For example, if you specify ba,batting average as synonyms and search for ba, OpenSearch searches for ba OR "batting average" (if this option is true) or ba OR (batting AND average) (if this option is false). Default is true.
analyzer | String | The analyzer used to tokenize the query string text. Default is the index-time analyzer specified for the default_field. If no analyzer is specified for the default_field, the analyzer is the default analyzer for the index.
boost | Floating-point | Boosts the clause by the given multiplier. Useful for weighing clauses in compound queries. Values in the [0, 1) range decrease relevance, and values greater than 1 increase relevance. Default is 1.
fields | Array of strings | The list of fields in which to search. If you don't provide the fields parameter, the multi_match query searches the fields specified in the index.query.default_field setting, which defaults to *.
fuzziness | String | The number of character edits (insert, delete, substitute) that it takes to change one word to another when determining whether a term matched a value. For example, the distance between wined and wind is 1. Valid values are non-negative integers or AUTO. The default, AUTO, chooses a value based on the length of each term and is a good choice for most use cases. Not supported for phrase, phrase_prefix, and cross_fields queries.
fuzzy_rewrite | String | Determines how OpenSearch rewrites the query. Valid values are constant_score, scoring_boolean, constant_score_boolean, top_terms_N, top_terms_boost_N, and top_terms_blended_freqs_N. If the fuzziness parameter is not 0, the query uses a fuzzy_rewrite method of top_terms_blended_freqs_${max_expansions} by default. Default is constant_score.
fuzzy_transpositions | Boolean | Setting fuzzy_transpositions to true (default) adds swaps of adjacent characters to the insert, delete, and substitute operations of the fuzziness option. For example, the distance between wind and wnid is 1 if fuzzy_transpositions is true (swap "n" and "i") and 2 if it is false (delete "n", insert "n"). If fuzzy_transpositions is false, rewind and wnid have the same distance (2) from wind, despite the more human-centric opinion that wnid is an obvious typo. The default is a good choice for most use cases.
lenient | Boolean | Setting lenient to true ignores data type mismatches between the query and the document field. For example, a query string of "8.2" could match a field of type float. Default is false.
max_expansions | Positive integer | The maximum number of terms to which the query can expand. Fuzzy queries “expand to” a number of matching terms that are within the distance specified in fuzziness. Then OpenSearch tries to match those terms. Default is 50.
minimum_should_match | Positive or negative integer, positive or negative percentage, combination | If the query string contains multiple search terms and you use the or operator, the number of terms that need to match for the document to be considered a match. For example, if minimum_should_match is 2, wind often rising does not match The Wind Rises. If minimum_should_match is 1, it matches. For details, see Minimum should match.
operator | String | If the query string contains multiple search terms, whether all terms need to match (AND) or only one term needs to match (OR) for a document to be considered a match. Valid values are OR (the string to be is interpreted as to OR be) and AND (the string to be is interpreted as to AND be). Default is OR.
prefix_length | Non-negative integer | The number of leading characters that are not considered in fuzziness. Default is 0.
slop | 0 (default) or a positive integer | Controls the degree to which words in a query can be misordered and still be considered a match. From the Lucene documentation: "The number of other words permitted between words in query phrase. For example, to switch the order of two words requires two moves (the first move places the words atop one another), so to permit reorderings of phrases, the slop must be at least two. A value of zero requires an exact match." Supported for phrase and phrase_prefix query types.
tie_breaker | Floating-point | A factor between 0 and 1.0 that is used to give more weight to documents that match multiple query clauses. For more information, see The tie_breaker parameter.
type | String | The multi-match query type. Valid values are best_fields, most_fields, cross_fields, phrase, phrase_prefix, bool_prefix. Default is best_fields.
zero_terms_query | String | In some cases, the analyzer removes all terms from a query string. For example, the stop analyzer removes all terms from the string an but this. In those cases, zero_terms_query specifies whether to match no documents (none) or all documents (all). Valid values are none and all. Default is none.

The fuzziness parameter is not supported for phrase, phrase_prefix, and cross_fields queries. {: .note}
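The edit distances quoted in the fuzziness and fuzzy_transpositions descriptions can be checked with a short Python sketch. It implements a standard dynamic-programming edit distance with an optional adjacent-swap step; it is an illustration of the distance measure only and is not the code OpenSearch or Lucene uses. The function name edit_distance is an assumption.

```python
def edit_distance(a, b, transpositions=True):
    """Edit distance with insert, delete, substitute, and optional adjacent swaps."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # substitute or keep
            )
            swapped = (
                i > 1 and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            )
            if transpositions and swapped:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


print(edit_distance("wined", "wind"))
print(edit_distance("wind", "wnid", transpositions=True))
print(edit_distance("wind", "wnid", transpositions=False))
print(edit_distance("rewind", "wind", transpositions=False))
```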

The slop parameter is only supported for phrase and phrase_prefix queries. {: .note}

The tie_breaker parameter

Each term-level blended query calculates the document score as the best score returned by any field in a group. The scores from all blended queries are added together to produce the final score. You can change the way the score is calculated by using the tie_breaker parameter. The tie_breaker parameter accepts the following values:

  • 0.0 (default for best_fields, cross_fields, phrase, and phrase_prefix queries): Take the single best score returned by any field in a group.
  • 1.0 (default for most_fields and bool_prefix queries): Add the scores for all fields in a group.
  • A floating-point value in the (0, 1) range: Take the single best score of the best-matching field and add (tie_breaker * _score) for all other matching fields.
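As a minimal sketch of the scoring described in this section, the following Python code combines hypothetical per-field scores for each query term and then adds the per-term results together. The function names and the score values are assumptions made for illustration; actual scores depend on the index statistics.

```python
def combine(scores, tie_breaker):
    """Best score plus tie_breaker times the remaining scores."""
    if not scores:
        return 0.0
    ordered = sorted(scores, reverse=True)
    return ordered[0] + tie_breaker * sum(ordered[1:])


def blended_score(per_term_field_scores, tie_breaker=0.0):
    """Sum the combined score of each term's blended query."""
    return sum(
        combine(list(field_scores.values()), tie_breaker)
        for field_scores in per_term_field_scores.values()
    )


# Hypothetical per-field scores for each query term of one document.
scores = {
    "john": {"first_name": 0.6, "last_name": 0.2},
    "doe": {"first_name": 0.1, "last_name": 0.25},
}
print(blended_score(scores, 0.0))  # best field per term
print(blended_score(scores, 1.0))  # all fields per term
print(blended_score(scores, 0.5))  # best field plus half of the others
```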