[DOCS] Add token graph concept docs (#53339)
Adds conceptual docs for token graphs.
These docs cover:

* How a token graph is constructed from a token stream
* How synonyms and multi-position tokens impact token graphs
* How token graphs are used during search
* Why some token filters produce invalid token graphs

Also makes the following supporting changes:
* Adds anchors to the 'Anatomy of an Analyzer' docs for cross-linking
* Adds several SVGs for token graph diagrams
jrodewig committed Mar 19, 2020
1 parent 4b0ae15 commit 8f4a3eb
Showing 10 changed files with 420 additions and 5 deletions.
3 changes: 3 additions & 0 deletions docs/reference/analysis/anatomy.asciidoc
@@ -10,6 +10,7 @@ blocks into analyzers suitable for different languages and types of text.
Elasticsearch also exposes the individual building blocks so that they can be
combined to define new <<analysis-custom-analyzer,`custom`>> analyzers.

[[analyzer-anatomy-character-filters]]
==== Character filters

A _character filter_ receives the original text as a stream of characters and
@@ -21,6 +22,7 @@ elements like `<b>` from the stream.
An analyzer may have *zero or more* <<analysis-charfilters,character filters>>,
which are applied in order.

[[analyzer-anatomy-tokenizer]]
==== Tokenizer

A _tokenizer_ receives a stream of characters, breaks it up into individual
@@ -35,6 +37,7 @@ the term represents.

An analyzer must have *exactly one* <<analysis-tokenizers,tokenizer>>.

[[analyzer-anatomy-token-filters]]
==== Token filters

A _token filter_ receives the token stream and may add, remove, or change
4 changes: 3 additions & 1 deletion docs/reference/analysis/concepts.asciidoc
@@ -8,6 +8,8 @@ This section explains the fundamental concepts of text analysis in {es}.

* <<analyzer-anatomy>>
* <<analysis-index-search-time>>
* <<token-graphs>>

include::anatomy.asciidoc[]
include::index-search-time.asciidoc[]
include::token-graphs.asciidoc[]
104 changes: 104 additions & 0 deletions docs/reference/analysis/token-graphs.asciidoc
@@ -0,0 +1,104 @@
[[token-graphs]]
=== Token graphs

When a <<analyzer-anatomy-tokenizer,tokenizer>> converts a text into a stream of
tokens, it also records the following:

* The `position` of each token in the stream
* The `positionLength`, the number of positions that a token spans

Using these, you can create a
https://en.wikipedia.org/wiki/Directed_acyclic_graph[directed acyclic graph],
called a _token graph_, for a stream. In a token graph, each position represents
a node. Each token represents an edge or arc, pointing to the next position.

image::images/analysis/token-graph-qbf-ex.svg[align="center"]
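
As a minimal sketch of the construction described above, and not how {es} or
Lucene implements it, the following Python snippet turns a token stream into
graph edges. The `Token` type, the `build_edges` function, and the sample
stream `quick brown fox` are assumptions made for illustration. Only the
`position` and `positionLength` attributes come from the text.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """Hypothetical token record; not an Elasticsearch or Lucene type."""
    term: str
    position: int
    position_length: int = 1  # default: the token spans one position


def build_edges(tokens):
    """Return one (from_node, to_node, term) edge per token.

    Each position is a node. A token starts at its position and points
    to the node `position_length` positions later.
    """
    return [(t.position, t.position + t.position_length, t.term) for t in tokens]


# Assumed sample stream for illustration.
stream = [Token("quick", 0), Token("brown", 1), Token("fox", 2)]
edges = build_edges(stream)
print(edges)
```

Each token contributes exactly one edge, so a stream without synonyms or
multi-position tokens forms a single chain of nodes.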

[[token-graphs-synonyms]]
==== Synonyms

Some <<analyzer-anatomy-token-filters,token filters>> can add new tokens, like
synonyms, to an existing token stream. These synonyms often span the same
positions as existing tokens.

In the following graph, `quick` and its synonym `fast` both have a position of
`0`. They span the same positions.

image::images/analysis/token-graph-qbf-synonym-ex.svg[align="center"]

[[token-graphs-multi-position-tokens]]
==== Multi-position tokens

Some token filters can add tokens that span multiple positions. These can
include tokens for multi-word synonyms, such as using "atm" as a synonym for
"automatic teller machine."

However, only some token filters, known as _graph token filters_, accurately
record the `positionLength` for multi-position tokens. These filters include:

* <<analysis-synonym-graph-tokenfilter,`synonym_graph`>>
* <<analysis-word-delimiter-graph-tokenfilter,`word_delimiter_graph`>>

In the following graph, `domain name system` and its synonym, `dns`, both have a
position of `0`. However, `dns` has a `positionLength` of `3`. Other tokens in
the graph have a default `positionLength` of `1`.

image::images/analysis/token-graph-dns-synonym-ex.svg[align="center"]

[[token-graphs-token-graphs-search]]
===== Using token graphs for search

<<analysis-index-search-time,Indexing>> ignores the `positionLength` attribute
and does not support token graphs containing multi-position tokens.

However, queries, such as the <<query-dsl-match-query,`match`>> or
<<query-dsl-match-query-phrase,`match_phrase`>> query, can use these graphs to
generate multiple sub-queries from a single query string.

.*Example*
[%collapsible]
====
A user runs a search for the following phrase using the `match_phrase` query:

`domain name system is fragile`

During <<analysis-index-search-time,search analysis>>, `dns`, a synonym for
`domain name system`, is added to the query string's token stream. The `dns`
token has a `positionLength` of `3`.

image::images/analysis/token-graph-dns-synonym-ex.svg[align="center"]

The `match_phrase` query uses this graph to generate sub-queries for the
following phrases:

[source,text]
------
dns is fragile
domain name system is fragile
------

This means the query matches documents containing either `dns is fragile` _or_
`domain name system is fragile`.
====
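
One way to see where the two phrases in the example come from is to list every
path through the graph. The Python sketch below does that for the example's
tokens. It only illustrates the idea: the `phrases` function and the
`(term, position, positionLength)` tuples are hypothetical, and the code does
not reflect how the `match_phrase` query is implemented.

```python
from collections import defaultdict


def phrases(tokens):
    """Enumerate every path through a token graph as a phrase.

    Each token is a (term, position, position_length) tuple.
    """
    # Map each node (position) to the edges (tokens) that leave it.
    outgoing = defaultdict(list)
    for term, position, length in tokens:
        outgoing[position].append((term, position + length))

    # The final node is the furthest position any token points to.
    end = max(position + length for _, position, length in tokens)

    def walk(node):
        if node == end:
            return [[]]
        paths = []
        for term, next_node in outgoing[node]:
            for rest in walk(next_node):
                paths.append([term] + rest)
        return paths

    return sorted(" ".join(path) for path in walk(0))


tokens = [
    ("dns", 0, 3),      # multi-position synonym added during search analysis
    ("domain", 0, 1),
    ("name", 1, 1),
    ("system", 2, 1),
    ("is", 3, 1),
    ("fragile", 4, 1),
]
print(phrases(tokens))
```

Because `dns` spans three positions, its edge rejoins the stream at `is`, so
the graph has exactly two paths.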

[[token-graphs-invalid-token-graphs]]
===== Invalid token graphs

The following token filters can add tokens that span multiple positions but
only record a default `positionLength` of `1`:

* <<analysis-synonym-tokenfilter,`synonym`>>
* <<analysis-word-delimiter-tokenfilter,`word_delimiter`>>

This means these filters will produce invalid token graphs for streams
containing such tokens.

In the following graph, `dns` is a multi-position synonym for `domain name
system`. However, `dns` has the default `positionLength` value of `1`, resulting
in an invalid graph.

image::images/analysis/token-graph-dns-invalid-ex.svg[align="center"]

Avoid using invalid token graphs for search. Invalid graphs can cause unexpected
search results.
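
To illustrate why such a graph is invalid, the Python sketch below reads every
phrase from a graph in which `dns` keeps the default `positionLength` of `1`.
The `phrases` helper and the token tuples are hypothetical and only model the
graph. They are not part of {es}.

```python
def phrases(tokens, start=0):
    """List every phrase readable from node `start` to the last node.

    Each token is a (term, position, position_length) tuple.
    """
    end = max(position + length for _, position, length in tokens)
    if start == end:
        return [""]
    found = []
    for term, position, length in tokens:
        if position == start:
            # Follow the edge, then append every phrase from the next node.
            for rest in phrases(tokens, position + length):
                found.append((term + " " + rest).strip())
    return sorted(found)


invalid = [
    ("dns", 0, 1),      # should span 3 positions, but records only 1
    ("domain", 0, 1),
    ("name", 1, 1),
    ("system", 2, 1),
    ("is", 3, 1),
    ("fragile", 4, 1),
]
result = phrases(invalid)
print(result)
```

In this model, `dns` is read as an alternative to `domain` only. The graph
yields `dns name system is fragile`, and the intended phrase `dns is fragile`
never appears.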
@@ -8,8 +8,8 @@ The `synonym_graph` token filter allows to easily handle synonyms,
including multi-word synonyms correctly during the analysis process.

In order to properly handle multi-word synonyms this token filter
creates a <<token-graphs,graph token stream>> during processing. For more
information on this topic and its various complexities, please read the
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html[Lucene's TokenStreams are actually graphs] blog post.

["NOTE",id="synonym-graph-index-note"]
@@ -440,8 +440,8 @@ that span multiple positions when any of the following parameters are `true`:

However, only the `word_delimiter_graph` filter assigns multi-position tokens a
`positionLength` attribute, which indicates the number of positions a token
spans. This ensures the `word_delimiter_graph` filter always produces valid
<<token-graphs,token graphs>>.

The `word_delimiter` filter does not assign multi-position tokens a
`positionLength` attribute. This means it produces invalid graphs for streams
65 changes: 65 additions & 0 deletions docs/reference/images/analysis/token-graph-dns-ex.svg
72 changes: 72 additions & 0 deletions docs/reference/images/analysis/token-graph-dns-invalid-ex.svg
72 changes: 72 additions & 0 deletions docs/reference/images/analysis/token-graph-dns-synonym-ex.svg
45 changes: 45 additions & 0 deletions docs/reference/images/analysis/token-graph-qbf-ex.svg
52 changes: 52 additions & 0 deletions docs/reference/images/analysis/token-graph-qbf-synonym-ex.svg
