From f0897ae9b2f5fa3ba0fc1d385f4ae59e7118b14f Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Fri, 9 Aug 2024 10:38:07 +0100 Subject: [PATCH 1/8] adding condition token filter docs #7923 Signed-off-by: AntonEliatra --- _analyzers/token-filters/condition.md | 134 ++++++++++++++++++++++++++ _analyzers/token-filters/index.md | 2 +- 2 files changed, 135 insertions(+), 1 deletion(-) create mode 100644 _analyzers/token-filters/condition.md diff --git a/_analyzers/token-filters/condition.md b/_analyzers/token-filters/condition.md new file mode 100644 index 0000000000..3a89fe031a --- /dev/null +++ b/_analyzers/token-filters/condition.md @@ -0,0 +1,134 @@ +--- +layout: default +title: condition +parent: Token filters +nav_order: 70 +--- + +# Condition token filter + +The `condition` token filter in OpenSearch is a special type of filter that allows you to apply other token filters conditionally based on certain criteria. This provides more control over when certain token filters should be applied during text analysis. +Multiple filters can be configured and only applied based on the conditions you define. +This token filter can be very useful for language-specific processing and handling of special characters. + + +## Parameters + +There are two *required* parameters that have to be configured to use `condition` token filter: + +`filter`: specifies which token filters should be applied to the tokens when the specified condition (defined by the script parameter) is met. + +`script`: defines the condition that needs to be met for the filters specified in the filter parameter to be applied. This condition is expressed in the form of an inline script. + + +## Example + +The following example request creates a new index named `my_conditional_index` and configures an analyzer with `condition` filter. 
This filter applies `lowercase` filter to any tokens which contain characters "UM":

```json
PUT /my_conditional_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_conditional_filter": {
          "type": "condition",
          "filter": ["lowercase"],
          "script": {
            "source": "token.getTerm().toString().contains('UM')"
          }
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "my_conditional_filter"
          ]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the created analyzer:

```json
GET /my_conditional_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "THE BLACK CAT JUMPS OVER A LAZY DOG"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "THE",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "BLACK",
      "start_offset": 4,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "CAT",
      "start_offset": 10,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "jumps",
      "start_offset": 14,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "OVER",
      "start_offset": 20,
      "end_offset": 24,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "A",
      "start_offset": 25,
      "end_offset": 26,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "LAZY",
      "start_offset": 27,
      "end_offset": 31,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "DOG",
      "start_offset": 32,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 7
    }
  ]
}
```

diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md
index f4e9c434e7..9730e25b18 100644
--- a/_analyzers/token-filters/index.md
+++ b/_analyzers/token-filters/index.md
@@ -19,7 +19,7 @@ Token filter | Underlying Lucene token filter| Description
`cjk_width` | 
[CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules:
- Folds full-width ASCII character variants into the equivalent basic Latin characters.
- Folds half-width Katakana character variants into the equivalent Kana characters. `classic` | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms. `common_grams` | [CommonGramsFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html) | Generates bigrams for a list of frequently occurring terms. The output contains both single terms and bigrams. -`conditional` | [ConditionalTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html) | Applies an ordered list of token filters to tokens that match the conditions provided in a script. +[`conditional`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/condition/) | [ConditionalTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html) | Applies an ordered list of token filters to tokens that match the conditions provided in a script. `decimal_digit` | [DecimalDigitFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/DecimalDigitFilter.html) | Converts all digits in the Unicode decimal number general category to basic Latin digits (0--9). `delimited_payload` | [DelimitedPayloadTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/payloads/DelimitedPayloadTokenFilter.html) | Separates a token stream into tokens with corresponding payloads, based on a provided delimiter. A token consists of all characters before the delimiter, and a payload consists of all characters after the delimiter. For example, if the delimiter is `|`, then for the string `foo|bar`, `foo` is the token and `bar` is the payload. 
[`delimited_term_freq`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/delimited-term-frequency/) | [DelimitedTermFrequencyTokenFilter](https://lucene.apache.org/core/9_7_0/analysis/common/org/apache/lucene/analysis/miscellaneous/DelimitedTermFrequencyTokenFilter.html) | Separates a token stream into tokens with corresponding term frequencies, based on a provided delimiter. A token consists of all characters before the delimiter, and a term frequency is the integer after the delimiter. For example, if the delimiter is `|`, then for the string `foo|5`, `foo` is the token and `5` is the term frequency. From b4deec0132f161c52af2fdc103abafffb6450915 Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Thu, 12 Sep 2024 11:08:00 +0100 Subject: [PATCH 2/8] Update condition.md Signed-off-by: AntonEliatra --- _analyzers/token-filters/condition.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_analyzers/token-filters/condition.md b/_analyzers/token-filters/condition.md index 3a89fe031a..890f5b3ae5 100644 --- a/_analyzers/token-filters/condition.md +++ b/_analyzers/token-filters/condition.md @@ -56,7 +56,7 @@ PUT /my_conditional_index ## Generated tokens -Use the following request to examine the tokens generated using the created analyzer: +Use the following request to examine the tokens generated using the analyzer: ```json GET /my_conditional_index/_analyze From 441bc26f8c2961181cc196db97221488e09bb2b5 Mon Sep 17 00:00:00 2001 From: Anton Rubin Date: Wed, 16 Oct 2024 19:38:30 +0100 Subject: [PATCH 3/8] updating parameter table Signed-off-by: Anton Rubin --- _analyzers/token-filters/condition.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/_analyzers/token-filters/condition.md b/_analyzers/token-filters/condition.md index 890f5b3ae5..20914b9707 100644 --- a/_analyzers/token-filters/condition.md +++ b/_analyzers/token-filters/condition.md @@ -7,18 +7,19 @@ nav_order: 70 # Condition token filter -The `condition` token 
filter in OpenSearch is a special type of filter that allows you to apply other token filters conditionally based on certain criteria. This provides more control over when certain token filters should be applied during text analysis. +The `condition` token filter is a special type of filter that allows you to apply other token filters conditionally based on certain criteria. This provides more control over when certain token filters should be applied during text analysis. Multiple filters can be configured and only applied based on the conditions you define. This token filter can be very useful for language-specific processing and handling of special characters. ## Parameters -There are two *required* parameters that have to be configured to use `condition` token filter: +There are two parameters that need to be configured to use `condition` token filter. -`filter`: specifies which token filters should be applied to the tokens when the specified condition (defined by the script parameter) is met. - -`script`: defines the condition that needs to be met for the filters specified in the filter parameter to be applied. This condition is expressed in the form of an inline script. +Parameter | Required/Optional | Data type | Description +:--- | :--- | :--- | :--- +`filter` | Required | String | Specifies which token filters should be applied to the tokens when the specified condition (defined by the script parameter) is met. +`script` | Required | String | Defines the condition that needs to be met for the filters specified in the filter parameter to be applied. This condition is expressed in the form of an inline script. 
## Example From 7f88fe43ecc080ec1f86b46c1e4b11d0722a481d Mon Sep 17 00:00:00 2001 From: Anton Rubin Date: Mon, 21 Oct 2024 11:21:18 +0100 Subject: [PATCH 4/8] addressing PR comments Signed-off-by: Anton Rubin --- _analyzers/token-filters/condition.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_analyzers/token-filters/condition.md b/_analyzers/token-filters/condition.md index 20914b9707..56902e1635 100644 --- a/_analyzers/token-filters/condition.md +++ b/_analyzers/token-filters/condition.md @@ -19,7 +19,7 @@ There are two parameters that need to be configured to use `condition` token fil Parameter | Required/Optional | Data type | Description :--- | :--- | :--- | :--- `filter` | Required | String | Specifies which token filters should be applied to the tokens when the specified condition (defined by the script parameter) is met. -`script` | Required | String | Defines the condition that needs to be met for the filters specified in the filter parameter to be applied. This condition is expressed in the form of an inline script. +`script` | Required | String | Configures an [inline script]({{site.url}}{{site.baseurl}}/api-reference/script-apis/exec-script/) which defines the condition that needs to be met for the filters specified in the `filter` parameter to be applied. 
## Example From ac0fc11f9132366d90abc33cc00eed4cb3e79f0c Mon Sep 17 00:00:00 2001 From: Anton Rubin Date: Mon, 21 Oct 2024 11:23:21 +0100 Subject: [PATCH 5/8] addressing PR comments Signed-off-by: Anton Rubin --- _analyzers/token-filters/condition.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_analyzers/token-filters/condition.md b/_analyzers/token-filters/condition.md index 56902e1635..e94f96ed9a 100644 --- a/_analyzers/token-filters/condition.md +++ b/_analyzers/token-filters/condition.md @@ -19,7 +19,7 @@ There are two parameters that need to be configured to use `condition` token fil Parameter | Required/Optional | Data type | Description :--- | :--- | :--- | :--- `filter` | Required | String | Specifies which token filters should be applied to the tokens when the specified condition (defined by the script parameter) is met. -`script` | Required | String | Configures an [inline script]({{site.url}}{{site.baseurl}}/api-reference/script-apis/exec-script/) which defines the condition that needs to be met for the filters specified in the `filter` parameter to be applied. +`script` | Required | String | Configures an [inline script]({{site.url}}{{site.baseurl}}/api-reference/script-apis/exec-script/) which defines the condition that needs to be met for the filters specified in the `filter` parameter to be applied. (Only inline scripts can be accepted). 
## Example From 368208df9e8f30bf186ad40b4e29c258499e74a4 Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Fri, 1 Nov 2024 11:06:54 +0000 Subject: [PATCH 6/8] Apply suggestions from code review Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Signed-off-by: AntonEliatra --- _analyzers/token-filters/condition.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/_analyzers/token-filters/condition.md b/_analyzers/token-filters/condition.md index e94f96ed9a..5325d95692 100644 --- a/_analyzers/token-filters/condition.md +++ b/_analyzers/token-filters/condition.md @@ -14,17 +14,17 @@ This token filter can be very useful for language-specific processing and handli ## Parameters -There are two parameters that need to be configured to use `condition` token filter. +There are two parameters that must be configured in order to use `condition` token filter. Parameter | Required/Optional | Data type | Description :--- | :--- | :--- | :--- -`filter` | Required | String | Specifies which token filters should be applied to the tokens when the specified condition (defined by the script parameter) is met. -`script` | Required | String | Configures an [inline script]({{site.url}}{{site.baseurl}}/api-reference/script-apis/exec-script/) which defines the condition that needs to be met for the filters specified in the `filter` parameter to be applied. (Only inline scripts can be accepted). +`filter` | Required | Array | Specifies which token filters should be applied to the tokens when the specified condition (defined by the `script` parameter) is met. +`script` | Required | Object | Configures an [inline script]({{site.url}}{{site.baseurl}}/api-reference/script-apis/exec-script/) that defines the condition that needs to be met for the filters specified in the `filter` parameter to be applied. (Only inline scripts are accepted). 
## Example -The following example request creates a new index named `my_conditional_index` and configures an analyzer with `condition` filter. This filter applies `lowercase` filter to any tokens which contain characters "UM": +The following example request creates a new index named `my_conditional_index` and configures an analyzer with a `condition` filter. This filter applies a `lowercase` filter to any tokens which contain the character sequence "UM": ```json PUT /my_conditional_index From 5fd73e76cac56603c0a50054ed3dbc3bbe4ceb72 Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Fri, 1 Nov 2024 11:09:36 +0000 Subject: [PATCH 7/8] Update condition.md Signed-off-by: AntonEliatra --- _analyzers/token-filters/condition.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_analyzers/token-filters/condition.md b/_analyzers/token-filters/condition.md index 5325d95692..5586c6ffc9 100644 --- a/_analyzers/token-filters/condition.md +++ b/_analyzers/token-filters/condition.md @@ -24,7 +24,7 @@ Parameter | Required/Optional | Data type | Description ## Example -The following example request creates a new index named `my_conditional_index` and configures an analyzer with a `condition` filter. This filter applies a `lowercase` filter to any tokens which contain the character sequence "UM": +The following example request creates a new index named `my_conditional_index` and configures an analyzer with a `condition` filter. 
This filter applies a `lowercase` filter to any tokens which contain the character sequence "um": ```json PUT /my_conditional_index @@ -36,7 +36,7 @@ PUT /my_conditional_index "type": "condition", "filter": ["lowercase"], "script": { - "source": "token.getTerm().toString().contains('UM')" + "source": "token.getTerm().toString().contains('um')" } } }, From 748a85c840314811f9304f35fb0b3c31d4255d7e Mon Sep 17 00:00:00 2001 From: AntonEliatra Date: Tue, 5 Nov 2024 11:02:58 +0000 Subject: [PATCH 8/8] Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: AntonEliatra --- _analyzers/token-filters/condition.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/_analyzers/token-filters/condition.md b/_analyzers/token-filters/condition.md index 5586c6ffc9..eb3c348728 100644 --- a/_analyzers/token-filters/condition.md +++ b/_analyzers/token-filters/condition.md @@ -8,23 +8,23 @@ nav_order: 70 # Condition token filter The `condition` token filter is a special type of filter that allows you to apply other token filters conditionally based on certain criteria. This provides more control over when certain token filters should be applied during text analysis. -Multiple filters can be configured and only applied based on the conditions you define. +Multiple filters can be configured and only applied when they meet the conditions you define. This token filter can be very useful for language-specific processing and handling of special characters. ## Parameters -There are two parameters that must be configured in order to use `condition` token filter. +There are two parameters that must be configured in order to use the `condition` token filter. Parameter | Required/Optional | Data type | Description :--- | :--- | :--- | :--- `filter` | Required | Array | Specifies which token filters should be applied to the tokens when the specified condition (defined by the `script` parameter) is met. 
-`script` | Required | Object | Configures an [inline script]({{site.url}}{{site.baseurl}}/api-reference/script-apis/exec-script/) that defines the condition that needs to be met for the filters specified in the `filter` parameter to be applied. (Only inline scripts are accepted). +`script` | Required | Object | Configures an [inline script]({{site.url}}{{site.baseurl}}/api-reference/script-apis/exec-script/) that defines the condition that needs to be met in order for the filters specified in the `filter` parameter to be applied (only inline scripts are accepted). ## Example -The following example request creates a new index named `my_conditional_index` and configures an analyzer with a `condition` filter. This filter applies a `lowercase` filter to any tokens which contain the character sequence "um": +The following example request creates a new index named `my_conditional_index` and configures an analyzer with a `condition` filter. This filter applies a `lowercase` filter to any tokens that contain the character sequence "um": ```json PUT /my_conditional_index
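The per-token behavior documented throughout this patch series — run the wrapped filters only on tokens for which the inline script's condition holds, and pass every other token through untouched — can be sketched in plain Python. The helper below is illustrative only; its names are not part of OpenSearch or any of its APIs:

```python
# Sketch of `condition` token filter semantics (hypothetical helper,
# not an OpenSearch API): apply each wrapped filter only to tokens
# that satisfy the condition predicate; leave other tokens unchanged.
def conditional_filter(tokens, condition, filters):
    out = []
    for token in tokens:
        if condition(token):          # analogous to the inline `script`
            for f in filters:         # analogous to the `filter` array
                token = f(token)
        out.append(token)
    return out

# Mirrors the patch-1 example: lowercase any token containing "UM".
tokens = "THE BLACK CAT JUMPS OVER A LAZY DOG".split()
result = conditional_filter(
    tokens,
    condition=lambda t: "UM" in t,    # token.getTerm().toString().contains('UM')
    filters=[str.lower],              # ["lowercase"]
)
print(result)  # only "JUMPS" matches, so only it is lowercased to "jumps"
```

Run against the example sentence, only `JUMPS` satisfies the condition, matching the `_analyze` response in the docs, where `jumps` is the sole lowercased token.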