[DOCS] Reformat delimited payload token filter docs #49380

Merged Nov 25, 2019 (6 commits)

Changes from 1 commit
@@ -1,21 +1,331 @@
[[analysis-delimited-payload-tokenfilter]]
=== Delimited payload token filter
++++
<titleabbrev>Delimited payload</titleabbrev>
++++

[WARNING]
====
The older name `delimited_payload_filter` is deprecated and should not be used
with new indices. Use `delimited_payload` instead.
====

Separates a token stream into tokens and payloads based on a specified
delimiter.

For example, you can use the `delimited_payload` filter with a `|` delimiter to
split `the|1 quick|2 fox|3` into the tokens `the`, `quick`, and `fox`
with respective payloads of `1`, `2`, and `3`.
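
For illustration only, here is a minimal Python sketch of that splitting
behavior (a conceptual model, not the Lucene implementation; it assumes the
payload follows the last delimiter in each whitespace-separated token):

[source,python]
--------------------------------------------------
def split_payloads(text, delimiter="|"):
    """Conceptual model of the filter: split each whitespace token at
    the last delimiter into a (token, payload) pair."""
    pairs = []
    for token in text.split():
        term, sep, payload = token.rpartition(delimiter)
        if sep:
            pairs.append((term, payload))
        else:
            # rpartition returns the whole token in the last slot when
            # the delimiter is absent: a token with no payload.
            pairs.append((payload, None))
    return pairs

print(split_payloads("the|1 quick|2 fox|3"))
# [('the', '1'), ('quick', '2'), ('fox', '3')]
--------------------------------------------------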

This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/payloads/DelimitedPayloadTokenFilter.html[DelimitedPayloadTokenFilter].

[NOTE]
.Payloads
====
A payload is user-defined binary data associated with a token position and
stored as base64-encoded bytes. Payloads are often used with the
<<query-dsl-script-score-query,`script_score`>> query to calculate custom scores
for documents during a search.

Contributor: I guess it would be nice to do this some time, but currently there is no way for script_score to access payloads. I don't know any other ES query that can deal with payloads either. The only way to access them is through the _termvectors API.

Contributor Author: Thank you for catching this. I've updated this note to state that you can view stored payloads using the term vectors API.

{es} does not store token payloads by default. To store payloads, you must:


* Set mapping parameters as follows for any field storing payloads:
** <<mapping-store,`store`>> to `true`
** <<term-vector,`term_vector`>> to `with_positions_payloads` or
`with_positions_offsets_payloads`

* Use an index analyzer that includes the `delimted_payload` filter
====
Contributor: I am not sure about the correctness of this paragraph. For a text/keyword field, ES can create 3 Lucene fields: 1) the usual indexed field, broken into terms and used for search. Here we don't store payloads (I think) and definitely never use them in any search query. 2) A stored field with the index option store: true. But I am not sure why you mentioned it here (it seems to me that it doesn't have anything to do with payloads). 3) A term vectors field with an index option of term_vector. These are primarily used for highlighting, but can be used just for retrieval purposes as well (as in your examples). Here we can store payloads, but again we don't use payloads for anything intelligent besides just retrieval.

nit: delimted_payload -> delimited_payload

Contributor Author: Thanks for another great catch. I've removed references to the store requirement throughout. I also fixed the typo.


[[analysis-delimited-payload-tokenfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request uses the
`delimited_payload` filter with the default `|` delimiter to split
`the|0 brown|10 fox|5 is|0 quick|10` into tokens and payloads.

[source,console]
--------------------------------------------------
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [ "delimited_payload" ],
  "text": "the|0 brown|10 fox|5 is|0 quick|10"
}
--------------------------------------------------

The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ the, brown, fox, is, quick ]
--------------------------------------------------

Note that the analyze API does not return stored payloads. For an example that
includes returned payloads, see
<<analysis-delimited-payload-tokenfilter-return-stored-payloads>>.

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "brown",
      "start_offset": 6,
      "end_offset": 14,
      "type": "word",
      "position": 1
    },
    {
      "token": "fox",
      "start_offset": 15,
      "end_offset": 20,
      "type": "word",
      "position": 2
    },
    {
      "token": "is",
      "start_offset": 21,
      "end_offset": 25,
      "type": "word",
      "position": 3
    },
    {
      "token": "quick",
      "start_offset": 26,
      "end_offset": 34,
      "type": "word",
      "position": 4
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-delimited-payload-tokenfilter-analyzer-ex]]
==== Add to an analyzer

The following <<indices-create-index,create index API>> request uses the
`delimited_payload` filter to configure a new <<analysis-custom-analyzer,custom
analyzer>>.

[source,console]
--------------------------------------------------
PUT delimited_payload
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_delimited_payload": {
          "tokenizer": "whitespace",
          "filter": [ "delimited_payload" ]
        }
      }
    }
  }
}
--------------------------------------------------

[[analysis-delimited-payload-tokenfilter-configure-parms]]
==== Configurable parameters

`delimiter`::
(Optional, string)
Character used to separate tokens from payloads. Defaults to `|`.

`encoding`::
+
--
(Optional, string)
Datatype for the stored payload. Valid values are:

`float`:::
(Default) Float

`identity`:::
Characters

`int`:::
Integer
--
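
For illustration, the following Python sketch shows how each `encoding` value
would serialize the payload `10` (the big-endian float layout matches the
payloads in the term vectors example below; treating `identity` as UTF-8 bytes
is an assumption, not something stated on this page):

[source,python]
--------------------------------------------------
import base64
import struct

as_float = struct.pack(">f", 10.0)   # float (default): 4-byte IEEE 754
as_int = struct.pack(">i", 10)       # int: 4-byte big-endian integer
as_identity = "10".encode("utf-8")   # identity: raw characters (assumed UTF-8)

for name, raw in [("float", as_float), ("int", as_int), ("identity", as_identity)]:
    print(name, base64.b64encode(raw).decode())
# float QSAAAA==
# int AAAACg==
# identity MTA=
--------------------------------------------------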

[[analysis-delimited-payload-tokenfilter-customize]]
==== Customize and add to an analyzer

To customize the `delimited_payload` filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.

For example, the following <<indices-create-index,create index API>> request
uses a custom `delimited_payload` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>. The custom `delimited_payload`
filter uses the `+` delimiter to separate tokens from payloads. Payloads are
encoded as integers.

[source,console]
--------------------------------------------------
PUT delimited_payload_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_plus_delimited": {
          "tokenizer": "whitespace",
          "filter": [ "plus_delimited" ]
        }
      },
      "filter": {
        "plus_delimited": {
          "type": "delimited_payload",
          "delimiter": "+",
          "encoding": "int"
        }
      }
    }
  }
}
--------------------------------------------------

[[analysis-delimited-payload-tokenfilter-return-stored-payloads]]
==== Return stored payloads

Use the <<indices-create-index,create index API>> to create an index that:

* Includes a field that stores payloads. For this field, set the
<<mapping-store,`store`>> mapping parameter to `true` and the
<<term-vector,`term_vector`>> mapping parameter to `with_positions_payloads`
or `with_positions_offsets_payloads`.

Contributor: I am not sure why store: true is necessary? "Includes a field that stores payloads" -> I would reformulate it to something like "stores term vectors with payloads", as it's not a usual indexed field with payloads.

Contributor Author: It's not. I've rephrased this bullet to "Includes a field that stores term vectors with payloads." as you suggested. Thanks!

* Uses a <<analysis-custom-analyzer,custom index analyzer>> with the
`delimited_payload` filter.

[source,console]
--------------------------------------------------
PUT text_payloads
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "term_vector": "with_positions_payloads",
        "store": true,
        "analyzer": "payload_delimiter"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "payload_delimiter": {
          "tokenizer": "whitespace",
          "filter": [ "delimited_payload" ]
        }
      }
    }
  }
}
--------------------------------------------------

Add a document containing payloads to the index.

[source,console]
--------------------------------------------------
POST text_payloads/_doc/1
{
  "text": "the|0 brown|3 fox|4 is|0 quick|10"
}
--------------------------------------------------
// TEST[continued]

Use the <<docs-termvectors,term vectors API>> to return the document's tokens
and base64-encoded payloads.

[source,console]
--------------------------------------------------
GET text_payloads/_termvectors/1
{
  "fields": [ "text" ],
  "payloads": true
}
--------------------------------------------------
// TEST[continued]

The API returns the following response:

[source,console-result]
--------------------------------------------------
{
  "_index": "text_payloads",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 8,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 5,
        "doc_count": 1,
        "sum_ttf": 5
      },
      "terms": {
        "brown": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "payload": "QEAAAA=="
            }
          ]
        },
        "fox": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 2,
              "payload": "QIAAAA=="
            }
          ]
        },
        "is": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 3,
              "payload": "AAAAAA=="
            }
          ]
        },
        "quick": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 4,
              "payload": "QSAAAA=="
            }
          ]
        },
        "the": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "payload": "AAAAAA=="
            }
          ]
        }
      }
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[s/"took": 8/"took": "$body.took"/]
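
With the default `float` encoding, each payload is a 4-byte big-endian
IEEE 754 float, base64-encoded. As a quick way to verify, here is a minimal
Python sketch that decodes the payloads in the response above:

[source,python]
--------------------------------------------------
import base64
import struct

payloads = {
    "the": "AAAAAA==",
    "brown": "QEAAAA==",
    "fox": "QIAAAA==",
    "is": "AAAAAA==",
    "quick": "QSAAAA==",
}

for term, payload in payloads.items():
    # Unpack one big-endian 32-bit float from the decoded bytes.
    value = struct.unpack(">f", base64.b64decode(payload))[0]
    print(term, value)
# the 0.0, brown 3.0, fox 4.0, is 0.0, quick 10.0
--------------------------------------------------

These values match the payloads in the indexed document,
`the|0 brown|3 fox|4 is|0 quick|10`.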