From 1d6baf8c8aa1049dfbd819ea7c2c63b70677782e Mon Sep 17 00:00:00 2001
From: Carlos Delgado <6339205+carlosdelest@users.noreply.github.com>
Date: Thu, 18 Jul 2024 10:20:26 +0200
Subject: [PATCH] Clarify synonyms docs (#110822)

---
 .../synonym-graph-tokenfilter.asciidoc        | 135 +++++++++++------
 .../tokenfilters/synonym-tokenfilter.asciidoc | 139 ++++++++++++------
 .../tokenfilters/synonyms-format.asciidoc     |   2 +-
 .../search-with-synonyms.asciidoc             |  13 ++
 .../synonyms/apis/synonyms-apis.asciidoc      |  17 +++
 5 files changed, 220 insertions(+), 86 deletions(-)

diff --git a/docs/reference/analysis/tokenfilters/synonym-graph-tokenfilter.asciidoc b/docs/reference/analysis/tokenfilters/synonym-graph-tokenfilter.asciidoc
index 3efb8f6de9b3e..e37118019a55c 100644
--- a/docs/reference/analysis/tokenfilters/synonym-graph-tokenfilter.asciidoc
+++ b/docs/reference/analysis/tokenfilters/synonym-graph-tokenfilter.asciidoc
@@ -85,45 +85,45 @@ Additional settings are:
 <> search analyzers to pick up
 changes to synonym files. Only to be used for search analyzers.
 * `expand` (defaults to `true`).
-* `lenient` (defaults to `false`). If `true` ignores exceptions while parsing the synonym configuration. It is important
-to note that only those synonym rules which cannot get parsed are ignored. For instance consider the following request:
-
-[source,console]
---------------------------------------------------
-PUT /test_index
-{
-  "settings": {
-    "index": {
-      "analysis": {
-        "analyzer": {
-          "synonym": {
-            "tokenizer": "standard",
-            "filter": [ "my_stop", "synonym_graph" ]
-          }
-        },
-        "filter": {
-          "my_stop": {
-            "type": "stop",
-            "stopwords": [ "bar" ]
-          },
-          "synonym_graph": {
-            "type": "synonym_graph",
-            "lenient": true,
-            "synonyms": [ "foo, bar => baz" ]
-          }
-        }
-      }
-    }
-  }
-}
---------------------------------------------------
+Expands definitions for equivalent synonym rules.
+See <>.
+* `lenient` (defaults to `false`).
+If `true` ignores errors while parsing the synonym configuration.
+It is important to note that only those synonym rules which cannot get parsed are ignored.
+See <> for an example of `lenient` behaviour for invalid synonym rules.
+
+[discrete]
+[[synonym-graph-tokenizer-expand-equivalent-synonyms]]
+===== `expand` equivalent synonym rules
+
+The `expand` parameter controls whether to expand equivalent synonym rules.
+Consider a synonym defined like:
+
+`foo, bar, baz`
+
+Using `expand: true`, the synonym rule would be expanded into:

-With the above request the word `bar` gets skipped but a mapping `foo => baz` is still added. However, if the mapping
-being added was `foo, baz => bar` nothing would get added to the synonym list. This is because the target word for the
-mapping is itself eliminated because it was a stop word. Similarly, if the mapping was "bar, foo, baz" and `expand` was
-set to `false` no mapping would get added as when `expand=false` the target mapping is the first word. However, if
-`expand=true` then the mappings added would be equivalent to `foo, baz => foo, baz` i.e, all mappings other than the
-stop word.
+```
+foo => foo
+foo => bar
+foo => baz
+bar => foo
+bar => bar
+bar => baz
+baz => foo
+baz => bar
+baz => baz
+```
+
+When `expand` is set to `false`, the synonym rule is not expanded and the first synonym is treated as the canonical representation. The synonym would be equivalent to:
+
+```
+foo => foo
+bar => foo
+baz => foo
+```
+
+The `expand` parameter does not affect explicit synonym rules, like `foo, bar => baz`.

 [discrete]
 [[synonym-graph-tokenizer-ignore_case-deprecated]]
@@ -160,12 +160,65 @@ Text will be processed first through filters preceding the synonym filter before
 {es} will also use the token filters preceding the synonym filter in a tokenizer chain to parse the entries in a synonym file or synonym set.
 In the above example, the synonyms graph token filter is placed after a stemmer. The stemmer will also be applied to the synonym entries.
-The synonym rules should not contain words that are removed by a filter that appears later in the chain (like a `stop` filter).
-Removing a term from a synonym rule means there will be no matching for it at query time.
-
 Because entries in the synonym map cannot have stacked positions, some token filters may cause issues here.
 Token filters that produce multiple versions of a token may choose which version of the token to emit when parsing synonyms.
 For example, `asciifolding` will only produce the folded version of the token.
 Others, like `multiplexer`, `word_delimiter_graph` or `ngram` will throw an error.

 If you need to build analyzers that include both multi-token filters and synonym filters, consider using the <> filter, with the multi-token filters in one branch and the synonym filter in the other.
+
+[discrete]
+[[synonym-graph-tokenizer-stop-token-filter]]
+===== Synonyms and `stop` token filters
+
+Synonyms and <> interact with each other in the following ways:
+
+[discrete]
+====== Stop token filter *before* synonym token filter
+
+Stop words will be removed from the synonym rule definition.
+This can cause errors on the synonym rule.
+
+[WARNING]
+====
+Invalid synonym rules can cause errors when applying analyzer changes.
+For reloadable analyzers, this prevents reloading and applying changes.
+You must correct errors in the synonym rules and reload the analyzer.
+
+An index with invalid synonym rules cannot be reopened, making it inoperable when:
+
+* A node containing the index starts
+* The index is opened from a closed state
+* A node restart occurs (which reopens the node assigned shards)
+====
+
+For *explicit synonym rules* like `foo, bar => baz` with a stop filter that removes `bar`:
+
+- If `lenient` is set to `false`, an error will be raised as `bar` would be removed from the left hand side of the synonym rule.
+- If `lenient` is set to `true`, the rule `foo => baz` will be added and `bar => baz` will be ignored.
+
+If the stop filter removed `baz` instead:
+
+- If `lenient` is set to `false`, an error will be raised as `baz` would be removed from the right hand side of the synonym rule.
+- If `lenient` is set to `true`, the synonym will have no effect as the target word is removed.
+
+For *equivalent synonym rules* like `foo, bar, baz` and `expand: true`, with a stop filter that removes `bar`:
+
+- If `lenient` is set to `false`, an error will be raised as `bar` would be removed from the synonym rule.
+- If `lenient` is set to `true`, the synonyms added would be equivalent to the following synonym rules, which do not contain the removed word:
+
+```
+foo => foo
+foo => baz
+baz => foo
+baz => baz
+```
+
+[discrete]
+====== Stop token filter *after* synonym token filter
+
+The stop filter will remove the terms from the resulting synonym expansion.
+
+For example, a synonym rule like `foo, bar => baz` and a stop filter that removes `baz` will get no matches for `foo` or `bar`, as both would get expanded to `baz` which is removed by the stop filter.
+
+If the stop filter removed `foo` instead, then searching for `foo` would get expanded to `baz`, which is not removed by the stop filter thus potentially providing matches for `baz`.
diff --git a/docs/reference/analysis/tokenfilters/synonym-tokenfilter.asciidoc b/docs/reference/analysis/tokenfilters/synonym-tokenfilter.asciidoc
index 046cd297b5092..1658f016db60b 100644
--- a/docs/reference/analysis/tokenfilters/synonym-tokenfilter.asciidoc
+++ b/docs/reference/analysis/tokenfilters/synonym-tokenfilter.asciidoc
@@ -73,47 +73,45 @@ Additional settings are:
 <> search analyzers to pick up
 changes to synonym files. Only to be used for search analyzers.
 * `expand` (defaults to `true`).
-* `lenient` (defaults to `false`). If `true` ignores exceptions while parsing the synonym configuration. It is important
-to note that only those synonym rules which cannot get parsed are ignored. For instance consider the following request:
-
-
-[source,console]
---------------------------------------------------
-PUT /test_index
-{
-  "settings": {
-    "index": {
-      "analysis": {
-        "analyzer": {
-          "synonym": {
-            "tokenizer": "standard",
-            "filter": [ "my_stop", "synonym" ]
-          }
-        },
-        "filter": {
-          "my_stop": {
-            "type": "stop",
-            "stopwords": [ "bar" ]
-          },
-          "synonym": {
-            "type": "synonym",
-            "lenient": true,
-            "synonyms": [ "foo, bar => baz" ]
-          }
-        }
-      }
-    }
-  }
-}
---------------------------------------------------
+Expands definitions for equivalent synonym rules.
+See <>.
+* `lenient` (defaults to `false`).
+If `true` ignores errors while parsing the synonym configuration.
+It is important to note that only those synonym rules which cannot get parsed are ignored.
+See <> for an example of `lenient` behaviour for invalid synonym rules.
+
+[discrete]
+[[synonym-tokenizer-expand-equivalent-synonyms]]
+===== `expand` equivalent synonym rules
+
+The `expand` parameter controls whether to expand equivalent synonym rules.
+Consider a synonym defined like:
+
+`foo, bar, baz`
+
+Using `expand: true`, the synonym rule would be expanded into:

-With the above request the word `bar` gets skipped but a mapping `foo => baz` is still added. However, if the mapping
-being added was `foo, baz => bar` nothing would get added to the synonym list. This is because the target word for the
-mapping is itself eliminated because it was a stop word. Similarly, if the mapping was "bar, foo, baz" and `expand` was
-set to `false` no mapping would get added as when `expand=false` the target mapping is the first word. However, if
-`expand=true` then the mappings added would be equivalent to `foo, baz => foo, baz` i.e, all mappings other than the
-stop word.
+```
+foo => foo
+foo => bar
+foo => baz
+bar => foo
+bar => bar
+bar => baz
+baz => foo
+baz => bar
+baz => baz
+```

+When `expand` is set to `false`, the synonym rule is not expanded and the first synonym is treated as the canonical representation. The synonym would be equivalent to:
+
+```
+foo => foo
+bar => foo
+baz => foo
+```
+
+The `expand` parameter does not affect explicit synonym rules, like `foo, bar => baz`.

 [discrete]
 [[synonym-tokenizer-ignore_case-deprecated]]
@@ -135,7 +133,7 @@ To apply synonyms, you will need to include a synonym token filters into an anal
     "my_analyzer": {
       "type": "custom",
       "tokenizer": "standard",
-      "filter": ["stemmer", "synonym_graph"]
+      "filter": ["stemmer", "synonym"]
     }
   }
 ----
@@ -148,10 +146,7 @@ Order is important for your token filters.
 Text will be processed first through filters preceding the synonym filter before being processed by the synonym filter.

 {es} will also use the token filters preceding the synonym filter in a tokenizer chain to parse the entries in a synonym file or synonym set.
-In the above example, the synonyms graph token filter is placed after a stemmer. The stemmer will also be applied to the synonym entries.
-
-The synonym rules should not contain words that are removed by a filter that appears later in the chain (like a `stop` filter).
-Removing a term from a synonym rule means there will be no matching for it at query time.
+In the above example, the synonyms token filter is placed after a stemmer. The stemmer will also be applied to the synonym entries.

 Because entries in the synonym map cannot have stacked positions, some token filters may cause issues here.
 Token filters that produce multiple versions of a token may choose which version of the token to emit when parsing synonyms.
@@ -159,3 +154,59 @@ For example, `asciifolding` will only produce the folded version of the token.
 Others, like `multiplexer`, `word_delimiter_graph` or `ngram` will throw an error.
 If you need to build analyzers that include both multi-token filters and synonym filters, consider using the <> filter, with the multi-token filters in one branch and the synonym filter in the other.
+
+[discrete]
+[[synonym-tokenizer-stop-token-filter]]
+===== Synonyms and `stop` token filters
+
+Synonyms and <> interact with each other in the following ways:
+
+[discrete]
+====== Stop token filter *before* synonym token filter
+
+Stop words will be removed from the synonym rule definition.
+This can cause errors on the synonym rule.
+
+[WARNING]
+====
+Invalid synonym rules can cause errors when applying analyzer changes.
+For reloadable analyzers, this prevents reloading and applying changes.
+You must correct errors in the synonym rules and reload the analyzer.
+
+An index with invalid synonym rules cannot be reopened, making it inoperable when:
+
+* A node containing the index starts
+* The index is opened from a closed state
+* A node restart occurs (which reopens the node assigned shards)
+====
+
+For *explicit synonym rules* like `foo, bar => baz` with a stop filter that removes `bar`:
+
+- If `lenient` is set to `false`, an error will be raised as `bar` would be removed from the left hand side of the synonym rule.
+- If `lenient` is set to `true`, the rule `foo => baz` will be added and `bar => baz` will be ignored.
+
+If the stop filter removed `baz` instead:
+
+- If `lenient` is set to `false`, an error will be raised as `baz` would be removed from the right hand side of the synonym rule.
+- If `lenient` is set to `true`, the synonym will have no effect as the target word is removed.
+
+For *equivalent synonym rules* like `foo, bar, baz` and `expand: true`, with a stop filter that removes `bar`:
+
+- If `lenient` is set to `false`, an error will be raised as `bar` would be removed from the synonym rule.
+- If `lenient` is set to `true`, the synonyms added would be equivalent to the following synonym rules, which do not contain the removed word:
+
+```
+foo => foo
+foo => baz
+baz => foo
+baz => baz
+```
+
+[discrete]
+====== Stop token filter *after* synonym token filter
+
+The stop filter will remove the terms from the resulting synonym expansion.
+
+For example, a synonym rule like `foo, bar => baz` and a stop filter that removes `baz` will get no matches for `foo` or `bar`, as both would get expanded to `baz` which is removed by the stop filter.
+
+If the stop filter removed `foo` instead, then searching for `foo` would get expanded to `baz`, which is not removed by the stop filter thus potentially providing matches for `baz`.
diff --git a/docs/reference/analysis/tokenfilters/synonyms-format.asciidoc b/docs/reference/analysis/tokenfilters/synonyms-format.asciidoc
index 63dd72dade8d0..e780c24963312 100644
--- a/docs/reference/analysis/tokenfilters/synonyms-format.asciidoc
+++ b/docs/reference/analysis/tokenfilters/synonyms-format.asciidoc
@@ -15,7 +15,7 @@ This format uses two different definitions:
 ipod, i-pod, i pod
 computer, pc, laptop
 ----
-* Explicit mappings: Matches a group of words to other words. Words on the left hand side of the rule definition are expanded into all the possibilities described on the right hand side. Example:
+* Explicit synonyms: Matches a group of words to other words. Words on the left hand side of the rule definition are expanded into all the possibilities described on the right hand side. Example:
 +
 [source,synonyms]
 ----
diff --git a/docs/reference/search/search-your-data/search-with-synonyms.asciidoc b/docs/reference/search/search-your-data/search-with-synonyms.asciidoc
index 596af695b7910..61d3a1d8f925b 100644
--- a/docs/reference/search/search-your-data/search-with-synonyms.asciidoc
+++ b/docs/reference/search/search-your-data/search-with-synonyms.asciidoc
@@ -82,6 +82,19 @@ If an index is created referencing a nonexistent synonyms set, the index will re
 The only way to recover from this scenario is to ensure the synonyms set exists then either delete and re-create the index, or close and re-open the index.
 ======

+[WARNING]
+====
+Invalid synonym rules can cause errors when applying analyzer changes.
+For reloadable analyzers, this prevents reloading and applying changes.
+You must correct errors in the synonym rules and reload the analyzer.
+
+An index with invalid synonym rules cannot be reopened, making it inoperable when:
+
+* A node containing the index starts
+* The index is opened from a closed state
+* A node restart occurs (which reopens the node assigned shards)
+====
+
 {es} uses synonyms as part of the <>.
 You can use two types of <> to include synonyms:

diff --git a/docs/reference/synonyms/apis/synonyms-apis.asciidoc b/docs/reference/synonyms/apis/synonyms-apis.asciidoc
index c9de52939b2fe..dbbc26c36d3df 100644
--- a/docs/reference/synonyms/apis/synonyms-apis.asciidoc
+++ b/docs/reference/synonyms/apis/synonyms-apis.asciidoc
@@ -21,6 +21,23 @@ These filters are applied as part of the <> process by the <<
 NOTE: Synonyms sets are limited to a maximum of 10,000 synonym rules per set.
 If you need to manage more synonym rules, you can create multiple synonyms sets.

+WARNING: Synonyms sets must exist before they can be added to indices.
+If an index is created referencing a nonexistent synonyms set, the index will remain in a partially created and inoperable state.
+The only way to recover from this scenario is to ensure the synonyms set exists then either delete and re-create the index, or close and re-open the index.
+
+[WARNING]
+====
+Invalid synonym rules can cause errors when applying analyzer changes.
+For reloadable analyzers, this prevents reloading and applying changes.
+You must correct errors in the synonym rules and reload the analyzer.
+
+An index with invalid synonym rules cannot be reopened, making it inoperable when:
+
+* A node containing the index starts
+* The index is opened from a closed state
+* A node restart occurs (which reopens the node assigned shards)
+====
+
 [discrete]
 [[synonyms-sets-apis]]
 === Synonyms sets APIs