Skip to content

Commit

Permalink
Browse files Browse the repository at this point in the history
* adding keep type docs opensearch-project#8063

Signed-off-by: Anton Rubin <[email protected]>

* Update keep-types.md

Signed-off-by: AntonEliatra <[email protected]>

* updating parameter table

Signed-off-by: Anton Rubin <[email protected]>

* Update keep-types.md

Signed-off-by: AntonEliatra <[email protected]>

* Apply suggestions from code review

Co-authored-by: kolchfa-aws <[email protected]>
Signed-off-by: AntonEliatra <[email protected]>

* fixing types

Signed-off-by: Anton Rubin <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

---------

Signed-off-by: Anton Rubin <[email protected]>
Signed-off-by: AntonEliatra <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Eric Pugh <[email protected]>
  • Loading branch information
3 people authored and epugh committed Nov 23, 2024
1 parent 4aa7636 commit 9582102
Show file tree
Hide file tree
Showing 2 changed files with 116 additions and 1 deletion.
2 changes: 1 addition & 1 deletion _analyzers/token-filters/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@ Token filter | Underlying Lucene token filter| Description
`flatten_graph` | [FlattenGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing.
`hunspell` | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token. Requires you to configure one or more language-specific Hunspell dictionaries.
`hyphenation_decompounder` | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list.
`keep_types` | [TypeTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type.
[`keep_types`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/keep-types/) | [TypeTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/TypeTokenFilter.html) | Keeps or removes tokens of a specific type.
`keep_word` | [KeepWordFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeepWordFilter.html) | Checks the tokens against the specified word list and keeps only those that are in the list.
`keyword_marker` | [KeywordMarkerFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordMarkerFilter.html) | Marks specified tokens as keywords, preventing them from being stemmed.
`keyword_repeat` | [KeywordRepeatFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilter.html) | Emits each incoming token twice: once as a keyword and once as a non-keyword.
Expand Down
115 changes: 115 additions & 0 deletions _analyzers/token-filters/keep-types.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,115 @@
---
layout: default
title: Keep types
parent: Token filters
nav_order: 180
---

# Keep types token filter

The `keep_types` token filter is a type of token filter used in text analysis to control which token types are kept or discarded. Different tokenizers produce different token types, for example, `<HOST>`, `<NUM>`, or `<ALPHANUM>`.

The `keyword`, `simple_pattern`, and `simple_pattern_split` tokenizers do not support the `keep_types` token filter because these tokenizers do not support token type attributes.
{: .note}

## Parameters

The `keep_types` token filter can be configured with the following parameters.

Parameter | Required/Optional | Data type | Description
:--- | :--- | :--- | :---
`types` | Required | List of strings | List of token types to be kept or discarded (determined by the `mode`).
`mode`| Optional | String | Whether to `include` or `exclude` the token types specified in `types`. Default is `include`.


## Example

The following example request creates a new index named `test_index` and configures an analyzer with a `keep_types` filter:

```json
PUT /test_index
{
"settings": {
"analysis": {
"analyzer": {
"custom_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "keep_types_filter"]
}
},
"filter": {
"keep_types_filter": {
"type": "keep_types",
"types": ["<ALPHANUM>"]
}
}
}
}
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
GET /test_index/_analyze
{
"analyzer": "custom_analyzer",
"text": "Hello 2 world! This is an example."
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
"tokens": [
{
"token": "hello",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "world",
"start_offset": 8,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "this",
"start_offset": 15,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "is",
"start_offset": 20,
"end_offset": 22,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "an",
"start_offset": 23,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "example",
"start_offset": 26,
"end_offset": 33,
"type": "<ALPHANUM>",
"position": 6
}
]
}
```

0 comments on commit 9582102

Please sign in to comment.