Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add keep words token filter docs #8064 #8124

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion _analyzers/token-filters/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ Token filter | Underlying Lucene token filter| Description
`hunspell` | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token. Requires you to configure one or more language-specific Hunspell dictionaries.
`hyphenation_decompounder` | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list.
[`keep_types`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keep-types/) | [TypeTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type.
`keep_word` | [KeepWordFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list.
[`keep_words`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keep-words/) | [KeepWordFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list.
`keyword_marker` | [KeywordMarkerFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilter.html) | Marks specified tokens as keywords, preventing them from being stemmed.
`keyword_repeat` | [KeywordRepeatFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilter.html) | Emits each incoming token twice: once as a keyword and once as a non-keyword.
`kstem` | [KStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/KStemFilter.html) | Provides kstem-based stemming for the English language. Combines algorithmic stemming with a built-in dictionary.
Expand Down
92 changes: 92 additions & 0 deletions _analyzers/token-filters/keep-words.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,92 @@
---
layout: default
title: Keep words
parent: Token filters
nav_order: 190
---

# Keep words token filter

The `keep_words` token filter is designed to keep only certain words during the analysis process. This filter is useful if you have a large body of text but are only interested in certain keywords or terms.

## Parameters

The `keep_words` token filter can be configured with the following parameters.

Parameter | Required/Optional | Data type | Description
:--- | :--- | :--- | :---
`keep_words` | Required if `keep_words_path` is not configured | List of strings | The list of words to keep.
`keep_words_path` | Required if `keep_words` is not configured | String | The path to the file containing the list of words to keep.
`keep_words_case` | Optional | Boolean | Whether to lowercase all words during comparison. Default is `false`.


## Example

The following example request creates a new index named `my_index` and configures an analyzer with a `keep_words` filter:

```json
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"custom_keep_word": {
"tokenizer": "standard",
"filter": [ "keep_words_filter" ]
}
},
"filter": {
"keep_words_filter": {
"type": "keep",
"keep_words": ["example", "world", "opensearch"],
"keep_words_case": true
}
}
}
}
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
GET /my_index/_analyze
{
"analyzer": "custom_keep_word",
"text": "Hello, world! This is an OpenSearch example."
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
"tokens": [
{
"token": "world",
"start_offset": 7,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "OpenSearch",
"start_offset": 25,
"end_offset": 35,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "example",
"start_offset": 36,
"end_offset": 43,
"type": "<ALPHANUM>",
"position": 6
}
]
}
```
Loading