Resolve merge conflicts (#8847)
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: AntonEliatra <[email protected]>
kolchfa-aws and AntonEliatra authored Dec 2, 2024
1 parent 5213bf3 commit d812a72
Showing 2 changed files with 140 additions and 1 deletion.
2 changes: 1 addition & 1 deletion _analyzers/token-filters/index.md
@@ -55,7 +55,7 @@ Token filter | Underlying Lucene token filter | Description
[`shingle`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/shingle/) | [ShingleFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/shingle/ShingleFilter.html) | Generates shingles of lengths between `min_shingle_size` and `max_shingle_size` for tokens in the token stream. Shingles are similar to n-grams but are generated using words instead of letters. For example, two-word shingles added to the list of unigrams [`contribute`, `to`, `opensearch`] are [`contribute to`, `to opensearch`].
[`snowball`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/snowball/) | N/A | Stems words using a [Snowball-generated stemmer](https://snowballstem.org/). The `snowball` token filter supports using the following languages in the `language` field: `Arabic`, `Armenian`, `Basque`, `Catalan`, `Danish`, `Dutch`, `English`, `Estonian`, `Finnish`, `French`, `German`, `German2`, `Hungarian`, `Irish`, `Italian`, `Kp`, `Lithuanian`, `Lovins`, `Norwegian`, `Porter`, `Portuguese`, `Romanian`, `Russian`, `Spanish`, `Swedish`, `Turkish`.
[`stemmer`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/stemmer/) | N/A | Provides algorithmic stemming for the following languages used in the `language` field: `arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`, `catalan`, `czech`, `danish`, `dutch`, `dutch_kp`, `english`, `light_english`, `lovins`, `minimal_english`, `porter2`, `possessive_english`, `estonian`, `finnish`, `light_finnish`, `french`, `light_french`, `minimal_french`, `galician`, `minimal_galician`, `german`, `german2`, `light_german`, `minimal_german`, `greek`, `hindi`, `hungarian`, `light_hungarian`, `indonesian`, `irish`, `italian`, `light_italian`, `latvian`, `Lithuanian`, `norwegian`, `light_norwegian`, `minimal_norwegian`, `light_nynorsk`, `minimal_nynorsk`, `portuguese`, `light_portuguese`, `minimal_portuguese`, `portuguese_rslp`, `romanian`, `russian`, `light_russian`, `sorani`, `spanish`, `light_spanish`, `swedish`, `light_swedish`, `turkish`.
-`stemmer_override` | N/A | Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed.
+[`stemmer_override`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/stemmer-override/) | N/A | Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed.
`stop` | [StopFilter](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/analysis/StopFilter.html) | Removes stop words from a token stream.
[`synonym`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/synonym/) | N/A | Supplies a synonym list for the analysis process. The synonym list is provided using a configuration file.
[`synonym_graph`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/synonym-graph/) | N/A | Supplies a synonym list, including multiword synonyms, for the analysis process.
139 changes: 139 additions & 0 deletions _analyzers/token-filters/stemmer-override.md
@@ -0,0 +1,139 @@
---
layout: default
title: Stemmer override
parent: Token filters
nav_order: 400
---

# Stemmer override token filter

The `stemmer_override` token filter allows you to define custom stemming rules that override the behavior of default stemmers, such as Porter or Snowball. This is useful when you want to apply specific stemming behavior to words that the standard stemming algorithms might not handle correctly.

## Parameters

The `stemmer_override` token filter must be configured with exactly one of the following parameters.

Parameter | Data type | Description
:--- | :--- | :---
`rules` | String | Defines the override rules directly in the settings.
`rules_path` | String | Specifies the path to the file containing custom rules (mappings). The path can be either an absolute path or a path relative to the config directory.
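
If you have many rules, it may be more convenient to store them in a file and reference it using `rules_path`. The following is a minimal sketch, assuming a hypothetical rules file at `analysis/stemmer-override-rules.txt` (relative to the OpenSearch config directory) that contains one `pattern => replacement` rule per line, such as `running, runner => run`:

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stemmer_override_filter": {
          "type": "stemmer_override",
          "rules_path": "analysis/stemmer-override-rules.txt"
        }
      }
    }
  }
}
```
{% include copy-curl.html %}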

## Example

The following example request creates a new index named `my-index` and configures an analyzer with a `stemmer_override` filter:

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stemmer_override_filter": {
          "type": "stemmer_override",
          "rules": [
            "running, runner => run",
            "bought => buy",
            "best => good"
          ]
        }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stemmer_override_filter"
          ]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
GET /my-index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I am a runner and bought the best shoes"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "i",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "am",
      "start_offset": 2,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "a",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "run",
      "start_offset": 7,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "and",
      "start_offset": 14,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "buy",
      "start_offset": 18,
      "end_offset": 24,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "the",
      "start_offset": 25,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "good",
      "start_offset": 29,
      "end_offset": 33,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "shoes",
      "start_offset": 34,
      "end_offset": 39,
      "type": "<ALPHANUM>",
      "position": 8
    }
  ]
}
```
