Add predicate token filter docs #8272 (#8279)
* adding predicate token filter docs #8272

Signed-off-by: Anton Rubin <[email protected]>

* Doc review

Signed-off-by: Fanit Kolchina <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

---------

Signed-off-by: Anton Rubin <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Fanit Kolchina <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
(cherry picked from commit 3f6fe1c)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
4 people committed Nov 25, 2024
1 parent 0e5b6aa commit 3645c49
Showing 2 changed files with 83 additions and 1 deletion.
2 changes: 1 addition & 1 deletion _analyzers/token-filters/index.md
@@ -49,7 +49,7 @@ Normalization | `arabic_normalization`: [ArabicNormalizer](https://lucene.apache
`pattern_replace` | N/A | Matches a pattern in the provided regular expression and replaces matching substrings. Uses [Java regular expression syntax](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html).
`phonetic` | N/A | Uses a phonetic encoder to emit a metaphone token for each token in the token stream. Requires installing the `analysis-phonetic` plugin.
`porter_stem` | [PorterStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/PorterStemFilter.html) | Uses the [Porter stemming algorithm](https://tartarus.org/martin/PorterStemmer/) to perform algorithmic stemming for the English language.
- `predicate_token_filter` | N/A | Removes tokens that don’t match the specified predicate script. Supports inline Painless scripts only.
+ [`predicate_token_filter`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/predicate-token-filter/) | N/A | Removes tokens that do not match the specified predicate script. Supports only inline Painless scripts.
`remove_duplicates` | [RemoveDuplicatesTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/RemoveDuplicatesTokenFilter.html) | Removes duplicate tokens that are in the same position.
`reverse` | [ReverseStringFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html) | Reverses the string corresponding to each token in the token stream. For example, the token `dog` becomes `god`.
`shingle` | [ShingleFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/shingle/ShingleFilter.html) | Generates shingles of lengths between `min_shingle_size` and `max_shingle_size` for tokens in the token stream. Shingles are similar to n-grams but apply to words instead of letters. For example, two-word shingles added to the list of unigrams [`contribute`, `to`, `opensearch`] are [`contribute to`, `to opensearch`].
82 changes: 82 additions & 0 deletions _analyzers/token-filters/predicate-token-filter.md
@@ -0,0 +1,82 @@
---
layout: default
title: Predicate token filter
parent: Token filters
nav_order: 340
---

# Predicate token filter

The `predicate_token_filter` evaluates each token against a condition defined in a custom script and keeps only the tokens for which the script returns `true`. The tokens are evaluated in the analysis predicate context. This filter supports only inline Painless scripts.

## Parameters

The `predicate_token_filter` has one required parameter: `script`. This parameter specifies the inline Painless script containing the condition used to determine whether each token is kept.
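
The script accesses the current token through the `token` variable. In addition to `token.term`, the analysis predicate context exposes properties such as `token.position`, `token.type`, `token.startOffset`, and `token.endOffset`; this list comes from the underlying analysis predicate context rather than from this page, so verify it against your OpenSearch version. As a minimal sketch, the following request (the index name `predicate_type_index` and filter name `keep_alphanum_filter` are illustrative) defines a filter that keeps only tokens whose type is `<ALPHANUM>`:

```json
PUT /predicate_type_index
{
  "settings": {
    "analysis": {
      "filter": {
        "keep_alphanum_filter": {
          "type": "predicate_token_filter",
          "script": {
            "source": "token.type == \"<ALPHANUM>\""
          }
        }
      },
      "analyzer": {
        "alphanum_analyzer": {
          "tokenizer": "standard",
          "filter": ["keep_alphanum_filter"]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

With the `standard` tokenizer, such a filter would discard tokens typed `<NUM>`, for example, standalone numbers.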

## Example

The following example request creates a new index named `predicate_index` and configures an analyzer with a `predicate_token_filter`. The filter keeps only tokens that are longer than 7 characters:

```json
PUT /predicate_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_predicate_filter": {
          "type": "predicate_token_filter",
          "script": {
            "source": "token.term.length() > 7"
          }
        }
      },
      "analyzer": {
        "predicate_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_predicate_filter"
          ]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /predicate_index/_analyze
{
"text": "The OpenSearch community is growing rapidly",
"analyzer": "predicate_analyzer"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "opensearch",
      "start_offset": 4,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "community",
      "start_offset": 15,
      "end_offset": 24,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
```
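
To apply the analyzer at index time, reference it in a field mapping. The following is a minimal sketch that maps a hypothetical `description` field in `predicate_index` to the `predicate_analyzer`:

```json
PUT /predicate_index/_mapping
{
  "properties": {
    "description": {
      "type": "text",
      "analyzer": "predicate_analyzer"
    }
  }
}
```
{% include copy-curl.html %}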
