Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Classic token filter docs #7918

Merged
93 changes: 93 additions & 0 deletions _analyzers/token-filters/classic.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,93 @@
---
layout: default
title: Classic
parent: Token filters
nav_order: 50
---

# Classic token filter

The primary function of the classic token filter is to work alongside the classic tokenizer. It processes tokens by applying the following common transformations, which aid in text analysis and search:
- Removal of possessive endings such as *'s*. For example, *John's* becomes *John*.
- Removal of periods from acronyms. For example, *D.A.R.P.A.* becomes *DARPA*.


## Example

The following example request creates a new index named `custom_classic_filter` and configures an analyzer with the `classic` filter:

```json
PUT /custom_classic_filter
{
"settings": {
"analysis": {
"analyzer": {
"custom_classic": {
"type": "custom",
"tokenizer": "classic",
"filter": ["classic"]
}
}
}
}
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /custom_classic_filter/_analyze
{
"analyzer": "custom_classic",
"text": "John's co-operate was excellent."
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
"tokens": [
{
"token": "John",
"start_offset": 0,
"end_offset": 6,
"type": "<APOSTROPHE>",
"position": 0
},
{
"token": "co",
"start_offset": 7,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "operate",
"start_offset": 10,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "was",
"start_offset": 18,
"end_offset": 21,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "excellent",
"start_offset": 22,
"end_offset": 31,
"type": "<ALPHANUM>",
"position": 4
}
]
}
```

2 changes: 1 addition & 1 deletion _analyzers/token-filters/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@
[`asciifolding`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/asciifolding/) | [ASCIIFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.html) | Converts alphabetic, numeric, and symbolic characters.
`cjk_bigram` | [CJKBigramFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html) | Forms bigrams of Chinese, Japanese, and Korean (CJK) tokens.
[`cjk_width`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/cjk-width/) | [CJKWidthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html) | Normalizes Chinese, Japanese, and Korean (CJK) tokens according to the following rules: <br> - Folds full-width ASCII character variants into their equivalent basic Latin characters. <br> - Folds half-width katakana character variants into their equivalent kana characters.
`classic` | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms.
[`classic`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/classic) | [ClassicFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/classic/ClassicFilter.html) | Performs optional post-processing on the tokens generated by the classic tokenizer. Removes possessives (`'s`) and removes `.` from acronyms.

Check failure on line 22 in _analyzers/token-filters/index.md

View workflow job for this annotation

GitHub Actions / style-job

[vale] reported by reviewdog 🐶 [OpenSearch.LinksEndSlash] Add a trailing slash to the link '({{site.url}}{{site.baseurl}}/analyzers/token-filters/classic)'. Raw Output: {"message": "[OpenSearch.LinksEndSlash] Add a trailing slash to the link '({{site.url}}{{site.baseurl}}/analyzers/token-filters/classic)'.", "location": {"path": "_analyzers/token-filters/index.md", "range": {"start": {"line": 22, "column": 12}}}, "severity": "ERROR"}
`common_grams` | [CommonGramsFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html) | Generates bigrams for a list of frequently occurring terms. The output contains both single terms and bigrams.
`conditional` | [ConditionalTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html) | Applies an ordered list of token filters to tokens that match the conditions provided in a script.
`decimal_digit` | [DecimalDigitFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/DecimalDigitFilter.html) | Converts all digits in the Unicode decimal number general category to basic Latin digits (0--9).
Expand Down
Loading