Add Cjk width token filter (#7917) (#8258)

opensearch-project · Sep 13, 2024 · 7c2b136
1 parent 8137ba7
commit 7c2b136

Showing 2 changed files with 97 additions and 1 deletion.
diff --git a/_analyzers/token-filters/cjk-width.md b/_analyzers/token-filters/cjk-width.md
@@ -0,0 +1,96 @@
+---
+layout: default
+title: CJK width
+parent: Token filters
+nav_order: 40
+---
+
+# CJK width token filter
+
+The `cjk_width` token filter normalizes Chinese, Japanese, and Korean (CJK) tokens by converting full-width ASCII characters to their standard (half-width) ASCII equivalents and half-width katakana characters to their full-width equivalents.
+
+### Converting full-width ASCII characters
+
+In CJK texts, ASCII characters (such as letters and numbers) can appear in full-width form, occupying the space of two half-width characters. Full-width ASCII characters are typically used in East Asian typography for alignment with the width of CJK characters. However, for indexing and searching, these full-width characters need to be normalized to their standard (half-width) ASCII equivalents so that, for example, a search for `ABC` matches text containing `ＡＢＣ`.
+
+The following example illustrates ASCII character normalization:
+
+```
+        Full-width:              ＡＢＣＤＥ １２３４５
+        Normalized (half-width): ABCDE 12345
+```
+
+### Converting half-width katakana characters
+
+The `cjk_width` token filter converts half-width katakana characters to their full-width counterparts, which are the standard form used in Japanese text. This normalization, illustrated in the following example, is important for consistency in text processing and searching:
+
+
+```
+        Half-width katakana:               ｶﾀｶﾅ
+        Normalized (full-width) katakana:  カタカナ
+```
+
+## Example
+
+The following example request creates a new index named `cjk_width_example_index` and defines an analyzer with the `cjk_width` filter:
+
+```json
+PUT /cjk_width_example_index
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "cjk_width_analyzer": {
+          "type": "custom",
+          "tokenizer": "standard",
+          "filter": ["cjk_width"]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /cjk_width_example_index/_analyze
+{
+  "analyzer": "cjk_width_analyzer",
+  "text": "Ｔｏｋｙｏ 2024 ｶﾀｶﾅ"
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens:
+
+```json
+{
+  "tokens": [
+    {
+      "token": "Tokyo",
+      "start_offset": 0,
+      "end_offset": 5,
+      "type": "<ALPHANUM>",
+      "position": 0
+    },
+    {
+      "token": "2024",
+      "start_offset": 6,
+      "end_offset": 10,
+      "type": "<NUM>",
+      "position": 1
+    },
+    {
+      "token": "カタカナ",
+      "start_offset": 11,
+      "end_offset": 15,
+      "type": "<KATAKANA>",
+      "position": 2
+    }
+  ]
+}
+```
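
As a reading aid for the page added above (not part of the committed file), the two folding rules can be approximated in a few lines of Python. This is a minimal sketch, not the Lucene `CJKWidthFilter` implementation: the function name `fold_width` is hypothetical, the code point ranges are taken from the Unicode "Halfwidth and Fullwidth Forms" block, and the katakana step borrows Unicode NFKC/NFC normalization as a stand-in for the filter's own mapping.

```python
import unicodedata

# Assumed ranges from the Unicode "Halfwidth and Fullwidth Forms" block.
FULLWIDTH_ASCII = range(0xFF01, 0xFF5F)     # U+FF01..U+FF5E
HALFWIDTH_KATAKANA = range(0xFF65, 0xFFA0)  # U+FF65..U+FF9F
ASCII_OFFSET = 0xFEE0                       # U+FF01 minus U+0021


def fold_width(text: str) -> str:
    """Approximate cjk_width folding (hypothetical helper, not Lucene code)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp in FULLWIDTH_ASCII:
            # Full-width ASCIIvariant -> basic Latin: fixed code point offset.
            out.append(chr(cp - ASCII_OFFSET))
        elif cp in HALFWIDTH_KATAKANA:
            # Half-width katakana -> full-width kana via compatibility mapping.
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    # Compose kana with a following voiced sound mark (for example, カ + ゙ -> ガ).
    # Note: NFC may also compose unrelated sequences elsewhere in the text.
    return unicodedata.normalize("NFC", "".join(out))


print(fold_width("Ｔｏｋｙｏ 2024 ｶﾀｶﾅ"))  # Tokyo 2024 カタカナ
```

Unlike the token filter, this sketch works on a whole string rather than on individual tokens, so it only illustrates the character mapping, not tokenization, offsets, or token types.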
diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md
@@ -16,7 +16,7 @@ Token filter | Underlying Lucene token filter|  Description
 [`apostrophe`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/apostrophe/) | [ApostropheFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/ApostropheFilter.html) | In each token containing an apostrophe, the `apostrophe` token filter removes the apostrophe itself and all characters following it. 
 [`asciifolding`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/asciifolding/) | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters.
 `cjk_bigram` | [CJKBigramFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens. 
-`cjk_width` | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules: <br> - Folds full-width ASCII character variants into the equivalent basic Latin characters. <br> - Folds half-width Katakana character variants into the equivalent Kana characters.
+[`cjk_width`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/cjk-width/) | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules: <br> - Folds full-width ASCII character variants into their equivalent basic Latin characters. <br> - Folds half-width katakana character variants into their equivalent kana characters.
 `classic` | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms.
 `common_grams` | [CommonGramsFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html) | Generates bigrams for a list of frequently occurring terms. The output contains both single terms and bigrams.
 `conditional` | [ConditionalTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html) | Applies an ordered list of token filters to tokens that match the conditions provided in a script.