[DOCS] Reformat remove_duplicates token filter (#53608)
Makes the following changes to the `remove_duplicates` token filter
docs:

* Rewrites description and adds Lucene link
* Adds detailed analyze example
* Adds custom analyzer example
jrodewig authored Mar 16, 2020
<titleabbrev>Remove duplicates</titleabbrev>
++++

Removes duplicate tokens in the same position.

The `remove_duplicates` filter uses Lucene's
{lucene-analysis-docs}/miscellaneous/RemoveDuplicatesTokenFilter.html[RemoveDuplicatesTokenFilter].

[[analysis-remove-duplicates-tokenfilter-analyze-ex]]
==== Example

To see how the `remove_duplicates` filter works, you first need to produce a
token stream containing duplicate tokens in the same position.

The following <<indices-analyze,analyze API>> request uses the
<<analysis-keyword-repeat-tokenfilter,`keyword_repeat`>> and
<<analysis-stemmer-tokenfilter,`stemmer`>> filters to create stemmed and
unstemmed tokens for `jumping dog`.

[source,console]
----
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "keyword_repeat",
    "stemmer"
  ],
  "text": "jumping dog"
}
----

The API returns the following response. Note that the `dog` token in position
`1` is duplicated.

[source,console-result]
----
{
  "tokens": [
    {
      "token": "jumping",
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 0
    },
    {
      "token": "jump",
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 0
    },
    {
      "token": "dog",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "dog",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 1
    }
  ]
}
----

To remove one of the duplicate `dog` tokens, add the `remove_duplicates` filter
to the previous analyze API request.

[source,console]
----
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "keyword_repeat",
    "stemmer",
    "remove_duplicates"
  ],
  "text": "jumping dog"
}
----

The API returns the following response. There is now only one `dog` token in
position `1`.

[source,console-result]
----
{
  "tokens": [
    {
      "token": "jumping",
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 0
    },
    {
      "token": "jump",
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 0
    },
    {
      "token": "dog",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 1
    }
  ]
}
----

[[analysis-remove-duplicates-tokenfilter-analyzer-ex]]
==== Add to an analyzer

The following <<indices-create-index,create index API>> request uses the
`remove_duplicates` filter to configure a new <<analysis-custom-analyzer,custom
analyzer>>.

This custom analyzer uses the `keyword_repeat` and `stemmer` filters to create a
stemmed and unstemmed version of each token in a stream. The `remove_duplicates`
filter then removes any duplicate tokens in the same position.

[source,console]
----
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "keyword_repeat",
            "stemmer",
            "remove_duplicates"
          ]
        }
      }
    }
  }
}
----
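
To verify the analyzer, you can send a follow-up <<indices-analyze,analyze API>>
request against the new index. This verification step is a sketch added here for
illustration; it reuses the `my_index` index and `my_custom_analyzer` analyzer
defined in the request above.

[source,console]
----
GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "jumping dog"
}
----

The returned token stream should match the deduplicated output shown in the
earlier example: `jumping` and `jump` at position `0`, and a single `dog` token
at position `1`.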
