[DOCS] Add token graph concept docs #53339

Merged: 11 commits, Mar 19, 2020
3 changes: 3 additions & 0 deletions docs/reference/analysis/anatomy.asciidoc
@@ -10,6 +10,7 @@ blocks into analyzers suitable for different languages and types of text.
Elasticsearch also exposes the individual building blocks so that they can be
combined to define new <<analysis-custom-analyzer,`custom`>> analyzers.

[[analyzer-anatomy-character-filters]]
==== Character filters

A _character filter_ receives the original text as a stream of characters and
@@ -21,6 +22,7 @@ elements like `<b>` from the stream.
An analyzer may have *zero or more* <<analysis-charfilters,character filters>>,
which are applied in order.

[[analyzer-anatomy-tokenizer]]
==== Tokenizer

A _tokenizer_ receives a stream of characters, breaks it up into individual
@@ -35,6 +37,7 @@ the term represents.

An analyzer must have *exactly one* <<analysis-tokenizers,tokenizer>>.

[[analyzer-anatomy-token-filters]]
==== Token filters

A _token filter_ receives the token stream and may add, remove, or change
4 changes: 3 additions & 1 deletion docs/reference/analysis/concepts.asciidoc
@@ -8,6 +8,8 @@ This section explains the fundamental concepts of text analysis in {es}.

* <<analyzer-anatomy>>
* <<analysis-index-search-time>>
* <<token-graphs>>

include::anatomy.asciidoc[]
include::index-search-time.asciidoc[]
include::token-graphs.asciidoc[]
104 changes: 104 additions & 0 deletions docs/reference/analysis/token-graphs.asciidoc
@@ -0,0 +1,104 @@
[[token-graphs]]
=== Token graphs

When a <<analyzer-anatomy-tokenizer,tokenizer>> converts a text into a stream of
tokens, it also records the following:

* The `position` of each token in the stream
* The `positionLength`, the number of positions that a token spans

Using these, you can create a
https://en.wikipedia.org/wiki/Directed_acyclic_graph[directed acyclic graph],
called a _token graph_, for a stream. In a token graph, each position represents
a node. Each token represents an edge or arc, pointing to the next position.

image::images/analysis/token-graph-qbf-ex.svg[align="center"]
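
As a rough illustration, the mapping from `position` and `positionLength` to
graph arcs can be sketched in Python. The `Token` type and the `to_arcs`
function are hypothetical names used only for this sketch and are not part of
{es} or Lucene. The `quick brown fox` tokens are an assumed example stream.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    """A token with the two attributes described above (hypothetical type)."""
    term: str
    position: int
    position_length: int = 1  # default: the token spans one position


def to_arcs(tokens: List[Token]) -> List[Tuple[int, int, str]]:
    """Turn each token into an arc (from_node, to_node, term).

    Nodes are positions. A token starts at its position and points to
    position + position_length.
    """
    return [(t.position, t.position + t.position_length, t.term) for t in tokens]


# Assumed example stream: three single-position tokens.
tokens = [Token("quick", 0), Token("brown", 1), Token("fox", 2)]
arcs = to_arcs(tokens)
print(arcs)  # [(0, 1, 'quick'), (1, 2, 'brown'), (2, 3, 'fox')]
```

Each arc ends at a higher position than it starts, which is why the resulting
graph is acyclic.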

[[token-graphs-synonyms]]
==== Synonyms

Some <<analyzer-anatomy-token-filters,token filters>> can add new tokens, like
synonyms, to an existing token stream. These synonyms often span the same
positions as existing tokens.

In the following graph, `quick` and its synonym `fast` both have a position of
`0`. They span the same positions.

image::images/analysis/token-graph-qbf-synonym-ex.svg[align="center"]

[[token-graphs-multi-position-tokens]]
==== Multi-position tokens

Some token filters can add tokens that span multiple positions. These can
include tokens for multi-word synonyms, such as using "atm" as a synonym for
"automatic teller machine."

However, only some token filters, known as _graph token filters_, accurately
record the `positionLength` for multi-position tokens. These filters include:

* <<analysis-synonym-graph-tokenfilter,`synonym_graph`>>
* <<analysis-word-delimiter-graph-tokenfilter,`word_delimiter_graph`>>

In the following graph, `domain name system` and its synonym, `dns`, both have a
position of `0`. However, `dns` has a `positionLength` of `3`. Other tokens in
the graph have a default `positionLength` of `1`.

image::images/analysis/token-graph-dns-synonym-ex.svg[align="center"]

[[token-graphs-token-graphs-search]]
===== Using token graphs for search

<<analysis-index-search-time,Indexing>> ignores the `positionLength` attribute
and does not support token graphs containing multi-position tokens.

However, queries, such as the <<query-dsl-match-query,`match`>> or
<<query-dsl-match-query-phrase,`match_phrase`>> query, can use these graphs to
generate multiple sub-queries from a single query string.

.*Example*
[%collapsible]
====

A user runs a search for the following phrase using the `match_phrase` query:

`domain name system is fragile`

During <<analysis-index-search-time,search analysis>>, `dns`, a synonym for
`domain name system`, is added to the query string's token stream. The `dns`
token has a `positionLength` of `3`.

image::images/analysis/token-graph-dns-synonym-ex.svg[align="center"]

The `match_phrase` query uses this graph to generate sub-queries for the
following phrases:

[source,text]
------
dns is fragile
domain name system is fragile
------

This means the query matches documents containing either `dns is fragile` _or_
`domain name system is fragile`.
====
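
The sub-query generation in the example above can be approximated by
enumerating every path through the token graph. The following is a minimal
Python sketch under that assumption, not the actual `match_phrase`
implementation. The `phrases` function and the tuple layout are hypothetical
and exist only for illustration.

```python
from collections import defaultdict


def phrases(tokens):
    """Return the phrase spelled by each path from position 0 to the end.

    tokens: list of (term, position, position_length) tuples.
    Assumes every position_length is at least 1, so the graph is acyclic.
    """
    arcs = defaultdict(list)  # start node -> [(end node, term), ...]
    for term, pos, length in tokens:
        arcs[pos].append((pos + length, term))
    end = max(pos + length for _, pos, length in tokens)

    def walk(node):
        if node == end:
            yield []
            return
        for next_node, term in arcs[node]:
            for rest in walk(next_node):
                yield [term] + rest

    return sorted(" ".join(path) for path in walk(0))


# Token stream for `domain name system is fragile` with the `dns` synonym.
tokens = [
    ("dns", 0, 3),  # multi-position synonym spanning positions 0-2
    ("domain", 0, 1),
    ("name", 1, 1),
    ("system", 2, 1),
    ("is", 3, 1),
    ("fragile", 4, 1),
]
result = phrases(tokens)
print(result)  # ['dns is fragile', 'domain name system is fragile']
```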

[[token-graphs-invalid-token-graphs]]
===== Invalid token graphs

The following token filters can add tokens that span multiple positions but
only record a default `positionLength` of `1`:

* <<analysis-synonym-tokenfilter,`synonym`>>
* <<analysis-word-delimiter-tokenfilter,`word_delimiter`>>

This means these filters will produce invalid token graphs for streams
containing such tokens.

In the following graph, `dns` is a multi-position synonym for `domain name
system`. However, `dns` has the default `positionLength` value of `1`, resulting
in an invalid graph.

image::images/analysis/token-graph-dns-invalid-ex.svg[align="center"]

Avoid using invalid token graphs for search. Invalid graphs can cause unexpected
search results.
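
To see why, consider what happens if the invalid graph is read literally. The
following Python sketch is a simplified model, not how {es} executes queries,
and the `paths` function is a hypothetical helper. With `dns` recorded at a
`positionLength` of `1`, `dns` becomes an alternative to `domain` alone rather
than to the whole phrase.

```python
def paths(tokens):
    """Enumerate the phrase along every path from position 0 to the end.

    tokens: list of (term, position, position_length) tuples.
    Assumes every position_length is at least 1.
    """
    end = max(pos + length for _, pos, length in tokens)
    found = []
    stack = [(0, [])]  # (current node, terms collected so far)
    while stack:
        node, terms = stack.pop()
        if node == end:
            found.append(" ".join(terms))
            continue
        for term, pos, length in tokens:
            if pos == node:
                stack.append((pos + length, terms + [term]))
    return sorted(found)


# Same stream as a valid graph would have, but `dns` keeps the default
# positionLength of 1 instead of 3.
invalid = [
    ("dns", 0, 1),
    ("domain", 0, 1),
    ("name", 1, 1),
    ("system", 2, 1),
    ("is", 3, 1),
    ("fragile", 4, 1),
]
result = paths(invalid)
print(result)  # ['dns name system is fragile', 'domain name system is fragile']
```

In this model, the intended phrase `dns is fragile` is never generated, while
the unintended phrase `dns name system is fragile` is.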
@@ -8,8 +8,8 @@ The `synonym_graph` token filter allows to easily handle synonyms,
including multi-word synonyms correctly during the analysis process.

In order to properly handle multi-word synonyms this token filter
creates a <<token-graphs,graph token stream>> during processing. For more
information on this topic and its various complexities, please read the
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html[Lucene's TokenStreams are actually graphs] blog post.

["NOTE",id="synonym-graph-index-note"]
@@ -440,8 +440,8 @@ that span multiple positions when any of the following parameters are `true`:

However, only the `word_delimiter_graph` filter assigns multi-position tokens a
`positionLength` attribute, which indicates the number of positions a token
spans. This ensures the `word_delimiter_graph` filter always produces valid
<<token-graphs,token graphs>>.

The `word_delimiter` filter does not assign multi-position tokens a
`positionLength` attribute. This means it produces invalid graphs for streams
65 changes: 65 additions & 0 deletions docs/reference/images/analysis/token-graph-dns-ex.svg
72 changes: 72 additions & 0 deletions docs/reference/images/analysis/token-graph-dns-invalid-ex.svg
72 changes: 72 additions & 0 deletions docs/reference/images/analysis/token-graph-dns-synonym-ex.svg
45 changes: 45 additions & 0 deletions docs/reference/images/analysis/token-graph-qbf-ex.svg
52 changes: 52 additions & 0 deletions docs/reference/images/analysis/token-graph-qbf-synonym-ex.svg