From 8f4a3eb07fbb482e0b05d055f1fae27dbcc910cd Mon Sep 17 00:00:00 2001
From: James Rodewig
Date: Thu, 19 Mar 2020 07:42:26 -0400
Subject: [PATCH] [DOCS] Add token graph concept docs (#53339)

Adds conceptual docs for token graphs.

These docs cover:

* How a token graph is constructed from a token stream
* How synonyms and multi-position tokens impact token graphs
* How token graphs are used during search
* Why some token filters produce invalid token graphs

Also makes the following supporting changes:

* Adds anchors to the 'Anatomy of an Analyzer' docs for cross-linking
* Adds several SVGs for token graph diagrams
---
 docs/reference/analysis/anatomy.asciidoc | 3 +
 docs/reference/analysis/concepts.asciidoc | 4 +-
 docs/reference/analysis/token-graphs.asciidoc | 104 ++++++++++++++++++
 .../synonym-graph-tokenfilter.asciidoc | 4 +-
 .../word-delimiter-graph-tokenfilter.asciidoc | 4 +-
 .../images/analysis/token-graph-dns-ex.svg | 65 +++++++++++
 .../analysis/token-graph-dns-invalid-ex.svg | 72 ++++++++++++
 .../analysis/token-graph-dns-synonym-ex.svg | 72 ++++++++++++
 .../images/analysis/token-graph-qbf-ex.svg | 45 ++++++++
 .../analysis/token-graph-qbf-synonym-ex.svg | 52 +++++++++
 10 files changed, 420 insertions(+), 5 deletions(-)
 create mode 100644 docs/reference/analysis/token-graphs.asciidoc
 create mode 100644 docs/reference/images/analysis/token-graph-dns-ex.svg
 create mode 100644 docs/reference/images/analysis/token-graph-dns-invalid-ex.svg
 create mode 100644 docs/reference/images/analysis/token-graph-dns-synonym-ex.svg
 create mode 100644 docs/reference/images/analysis/token-graph-qbf-ex.svg
 create mode 100644 docs/reference/images/analysis/token-graph-qbf-synonym-ex.svg

diff --git a/docs/reference/analysis/anatomy.asciidoc b/docs/reference/analysis/anatomy.asciidoc
index 1db14e787a54a..22e7ffda667d4 100644
--- a/docs/reference/analysis/anatomy.asciidoc
+++ b/docs/reference/analysis/anatomy.asciidoc
@@ -10,6 +10,7 @@ blocks into analyzers suitable for different
 languages and types of text. Elasticsearch also exposes the individual building blocks so that they can be combined to define new <> analyzers.
+[[analyzer-anatomy-character-filters]]
 ==== Character filters
 A _character filter_ receives the original text as a stream of characters and
@@ -21,6 +22,7 @@ elements like `` from the stream.
 An analyzer may have *zero or more* <>, which are applied in order.
+[[analyzer-anatomy-tokenizer]]
 ==== Tokenizer
 A _tokenizer_ receives a stream of characters, breaks it up into individual
@@ -35,6 +37,7 @@ the term represents.
 An analyzer must have *exactly one* <>.
+[[analyzer-anatomy-token-filters]]
 ==== Token filters
 A _token filter_ receives the token stream and may add, remove, or change
diff --git a/docs/reference/analysis/concepts.asciidoc b/docs/reference/analysis/concepts.asciidoc
index 2468286e3a719..2e431efcd5fec 100644
--- a/docs/reference/analysis/concepts.asciidoc
+++ b/docs/reference/analysis/concepts.asciidoc
@@ -8,6 +8,8 @@ This section explains the fundamental concepts of text analysis in {es}.
 * <>
 * <>
+* <>
 include::anatomy.asciidoc[]
-include::index-search-time.asciidoc[]
\ No newline at end of file
+include::index-search-time.asciidoc[]
+include::token-graphs.asciidoc[]
\ No newline at end of file
diff --git a/docs/reference/analysis/token-graphs.asciidoc b/docs/reference/analysis/token-graphs.asciidoc
new file mode 100644
index 0000000000000..ab1dc52f5131b
--- /dev/null
+++ b/docs/reference/analysis/token-graphs.asciidoc
@@ -0,0 +1,104 @@
+[[token-graphs]]
+=== Token graphs
+
+When a <> converts a text into a stream of
+tokens, it also records the following:
+
+* The `position` of each token in the stream
+* The `positionLength`, the number of positions that a token spans
+
+Using these, you can create a
+https://en.wikipedia.org/wiki/Directed_acyclic_graph[directed acyclic graph],
+called a _token graph_, for a stream. In a token graph, each position represents
+a node. Each token represents an edge or arc, pointing to the next position.
+
+image::images/analysis/token-graph-qbf-ex.svg[align="center"]
+
+[[token-graphs-synonyms]]
+==== Synonyms
+
+Some <> can add new tokens, like
+synonyms, to an existing token stream. These synonyms often span the same
+positions as existing tokens.
+
+In the following graph, `quick` and its synonym `fast` both have a position of
+`0`. They span the same positions.
+
+image::images/analysis/token-graph-qbf-synonym-ex.svg[align="center"]
+
+[[token-graphs-multi-position-tokens]]
+==== Multi-position tokens
+
+Some token filters can add tokens that span multiple positions. These can
+include tokens for multi-word synonyms, such as using "atm" as a synonym for
+"automatic teller machine."
+
+However, only some token filters, known as _graph token filters_, accurately
+record the `positionLength` for multi-position tokens. These filters include:
+
+* <>
+* <>
+
+In the following graph, `domain name system` and its synonym, `dns`, both have a
+position of `0`. However, `dns` has a `positionLength` of `3`. Other tokens in
+the graph have a default `positionLength` of `1`.
+
+image::images/analysis/token-graph-dns-synonym-ex.svg[align="center"]
+
+[[token-graphs-token-graphs-search]]
+===== Using token graphs for search
+
+<> ignores the `positionLength` attribute
+and does not support token graphs containing multi-position tokens.
+
+However, queries, such as the <> or
+<> query, can use these graphs to
+generate multiple sub-queries from a single query string.
+
+.*Example*
+[%collapsible]
+====
+
+A user runs a search for the following phrase using the `match_phrase` query:
+
+`domain name system is fragile`
+
+During <>, `dns`, a synonym for
+`domain name system`, is added to the query string's token stream. The `dns`
+token has a `positionLength` of `3`.
+
+image::images/analysis/token-graph-dns-synonym-ex.svg[align="center"]
+
+The `match_phrase` query uses this graph to generate sub-queries for the
+following phrases:
+
+[source,text]
+------
+dns is fragile
+domain name system is fragile
+------
+
+This means the query matches documents containing either `dns is fragile` _or_
+`domain name system is fragile`.
+====
+
+[[token-graphs-invalid-token-graphs]]
+===== Invalid token graphs
+
+The following token filters can add tokens that span multiple positions but
+only record a default `positionLength` of `1`:
+
+* <>
+* <>
+
+This means these filters will produce invalid token graphs for streams
+containing such tokens.
+
+In the following graph, `dns` is a multi-position synonym for `domain name
+system`. However, `dns` has the default `positionLength` value of `1`, resulting
+in an invalid graph.
+
+image::images/analysis/token-graph-dns-invalid-ex.svg[align="center"]
+
+Avoid using invalid token graphs for search. Invalid graphs can cause unexpected
+search results.
\ No newline at end of file
diff --git a/docs/reference/analysis/tokenfilters/synonym-graph-tokenfilter.asciidoc b/docs/reference/analysis/tokenfilters/synonym-graph-tokenfilter.asciidoc
index e6bc76e408f23..582ce99b20bf7 100644
--- a/docs/reference/analysis/tokenfilters/synonym-graph-tokenfilter.asciidoc
+++ b/docs/reference/analysis/tokenfilters/synonym-graph-tokenfilter.asciidoc
@@ -8,8 +8,8 @@ The `synonym_graph` token filter allows to easily handle synonyms,
 including multi-word synonyms correctly during the analysis process.
 In order to properly handle multi-word synonyms this token filter
-creates a "graph token stream" during processing. For more information
-on this topic and its various complexities, please read the
+creates a <> during processing. For more
+information on this topic and its various complexities, please read the
 http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html[Lucene's TokenStreams are actually graphs] blog post.
 ["NOTE",id="synonym-graph-index-note"]
diff --git a/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc b/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc
index 8581d8cb7ec17..1f2f61a5071dc 100644
--- a/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc
+++ b/docs/reference/analysis/tokenfilters/word-delimiter-graph-tokenfilter.asciidoc
@@ -440,8 +440,8 @@ that span multiple positions when any of the following parameters are `true`:
 However, only the `word_delimiter_graph` filter assigns multi-position tokens a `positionLength` attribute, which indicates the number of positions a token
-spans. This ensures the `word_delimiter_graph` filter always produces valid token
-https://en.wikipedia.org/wiki/Directed_acyclic_graph[graphs].
+spans. This ensures the `word_delimiter_graph` filter always produces valid
+<>.
 The `word_delimiter` filter does not assign multi-position tokens a `positionLength` attribute. This means it produces invalid graphs for streams
diff --git a/docs/reference/images/analysis/token-graph-dns-ex.svg b/docs/reference/images/analysis/token-graph-dns-ex.svg
new file mode 100644
index 0000000000000..0eda4fa54bd20
--- /dev/null
+++ b/docs/reference/images/analysis/token-graph-dns-ex.svg
@@ -0,0 +1,65 @@
+
+
+
+    Slice 1
+    Created with Sketch.
+[remaining SVG markup omitted]
\ No newline at end of file
diff --git a/docs/reference/images/analysis/token-graph-dns-invalid-ex.svg b/docs/reference/images/analysis/token-graph-dns-invalid-ex.svg
new file mode 100644
index 0000000000000..5614f39bfe35c
--- /dev/null
+++ b/docs/reference/images/analysis/token-graph-dns-invalid-ex.svg
@@ -0,0 +1,72 @@
+
+
+
+    Slice 1
+    Created with Sketch.
+[remaining SVG markup omitted]
\ No newline at end of file
diff --git a/docs/reference/images/analysis/token-graph-dns-synonym-ex.svg b/docs/reference/images/analysis/token-graph-dns-synonym-ex.svg
new file mode 100644
index 0000000000000..cff5b1306b73b
--- /dev/null
+++ b/docs/reference/images/analysis/token-graph-dns-synonym-ex.svg
@@ -0,0 +1,72 @@
+
+
+
+    Slice 1
+    Created with Sketch.
+[remaining SVG markup omitted]
\ No newline at end of file
diff --git a/docs/reference/images/analysis/token-graph-qbf-ex.svg b/docs/reference/images/analysis/token-graph-qbf-ex.svg
new file mode 100644
index 0000000000000..63970673092d4
--- /dev/null
+++ b/docs/reference/images/analysis/token-graph-qbf-ex.svg
@@ -0,0 +1,45 @@
+
+
+
+    Slice 1
+    Created with Sketch.
+[remaining SVG markup omitted]
\ No newline at end of file
diff --git a/docs/reference/images/analysis/token-graph-qbf-synonym-ex.svg b/docs/reference/images/analysis/token-graph-qbf-synonym-ex.svg
new file mode 100644
index 0000000000000..2baa3d9e63cb5
--- /dev/null
+++ b/docs/reference/images/analysis/token-graph-qbf-synonym-ex.svg
@@ -0,0 +1,52 @@
+
+
+
+    Slice 1
+    Created with Sketch.
+[remaining SVG markup omitted]
\ No newline at end of file
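
The token graph model described in `token-graphs.asciidoc` can be sketched in a few lines of Python. This is only an illustration, not part of the patch and not how Lucene or {es} implements it. The `Token` record and the `phrases` function are hypothetical names, and the sketch assumes each token is an edge from `position` to `position + positionLength`. It shows why a correct `positionLength` for `dns` yields the two phrases from the `match_phrase` example, while the default value of `1` yields a phrase the user never typed.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    """Hypothetical token record; field names mirror the docs, not a real API."""
    term: str
    position: int
    position_length: int = 1


def phrases(tokens: list[Token]) -> list[str]:
    """Enumerate every path through the token graph as a space-joined phrase."""
    # Each position is a node; each token is an edge from `position`
    # to `position + position_length`.
    edges = defaultdict(list)
    for tok in tokens:
        edges[tok.position].append(tok)
    start = min(tok.position for tok in tokens)
    end = max(tok.position + tok.position_length for tok in tokens)

    def walk(node: int) -> list[list[str]]:
        if node == end:
            return [[]]  # reached the final node: one empty suffix
        return [
            [tok.term] + rest
            for tok in edges[node]
            for rest in walk(tok.position + tok.position_length)
        ]

    return [" ".join(path) for path in walk(start)]


# Valid graph: `dns` spans three positions, as a graph token filter would record.
valid = [
    Token("domain", 0), Token("dns", 0, 3), Token("name", 1),
    Token("system", 2), Token("is", 3), Token("fragile", 4),
]
# Invalid graph: `dns` keeps the default position length of 1.
invalid = [
    tok._replace(position_length=1) if tok.term == "dns" else tok
    for tok in valid
]

print(phrases(valid))    # ['domain name system is fragile', 'dns is fragile']
print(phrases(invalid))  # ['domain name system is fragile', 'dns name system is fragile']
```

In the invalid case, the `dns` edge ends at position `1`, so the only way to continue is through `name`, which is one way an invalid graph can lead to unexpected search results.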