[7.6] [DOCS] Reformat kstem token filter (#55823) #55929

Merged
merged 1 commit into from
Apr 29, 2020
112 changes: 109 additions & 3 deletions docs/reference/analysis/tokenfilters/kstem-tokenfilter.asciidoc
@@ -4,6 +4,112 @@
<titleabbrev>KStem</titleabbrev>
++++

The `kstem` token filter is a high performance filter for english. All
terms must already be lowercased (use `lowercase` filter) for this
filter to work correctly.
Provides http://ciir.cs.umass.edu/pubfiles/ir-35.pdf[KStem]-based stemming for
the English language. The `kstem` filter combines
<<algorithmic-stemmers,algorithmic stemming>> with a built-in
<<dictionary-stemmers,dictionary>>.

The `kstem` filter tends to stem less aggressively than other English stemmer
filters, such as the <<analysis-porterstem-tokenfilter,`porter_stem`>> filter.
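
One way to compare the two filters on your own text is to run the same input
through each with the analyze API. The following request is a minimal sketch
that uses the `porter_stem` filter; the sample text is illustrative only.
Replace `porter_stem` with `kstem` and compare the returned tokens.

[source,console]
----
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "porter_stem" ],
  "text": "the foxes jumping quickly"
}
----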

The `kstem` filter is equivalent to the
<<analysis-stemmer-tokenfilter,`stemmer`>> filter's
<<analysis-stemmer-tokenfilter-language-parm,`light_english`>> variant.
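
As a minimal sketch of that equivalence, the following create index API request
defines a `stemmer` filter that uses the `light_english` language. The index
name `my_stemmer_index` and the filter and analyzer names are hypothetical
placeholders chosen for illustration.

[source,console]
----
PUT /my_stemmer_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        }
      },
      "analyzer": {
        "my_light_english_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "my_light_english_stemmer"
          ]
        }
      }
    }
  }
}
----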

This filter uses Lucene's
{lucene-analysis-docs}/en/KStemFilter.html[KStemFilter].

[[analysis-kstem-tokenfilter-analyze-ex]]
==== Example

The following analyze API request uses the `kstem` filter to stem `the foxes
jumping quickly` to `the fox jump quick`:

[source,console]
----
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "kstem" ],
  "text": "the foxes jumping quickly"
}
----

The filter produces the following tokens:

[source,text]
----
[ the, fox, jump, quick ]
----

////
[source,console-result]
----
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "fox",
      "start_offset": 4,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "jump",
      "start_offset": 10,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "quick",
      "start_offset": 18,
      "end_offset": 25,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
----
////

[[analysis-kstem-tokenfilter-analyzer-ex]]
==== Add to an analyzer

The following <<indices-create-index,create index API>> request uses the
`kstem` filter to configure a new <<analysis-custom-analyzer,custom
analyzer>>.

[IMPORTANT]
====
To work properly, the `kstem` filter requires lowercase tokens. To ensure tokens
are lowercased, add the <<analysis-lowercase-tokenfilter,`lowercase`>> filter
before the `kstem` filter in the analyzer configuration.
====

[source,console]
----
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "kstem"
          ]
        }
      }
    }
  }
}
----
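
To check the analyzer, you could send mixed-case text through it with the
analyze API. This sketch assumes the `my_index` index from the request above
already exists; the sample text is illustrative only. Because `lowercase` runs
before `kstem`, the stemmer receives lowercase tokens.

[source,console]
----
GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The Foxes Jumping Quickly"
}
----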