[DOCS] Add token graph concept docs (#53339)
Adds conceptual docs for token graphs. These docs cover:

* How a token graph is constructed from a token stream
* How synonyms and multi-position tokens impact token graphs
* How token graphs are used during search
* Why some token filters produce invalid token graphs

Also makes the following supporting changes:

* Adds anchors to the 'Anatomy of an Analyzer' docs for cross-linking
* Adds several SVGs for token graph diagrams
Showing 10 changed files with 420 additions and 5 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,104 @@ | ||
[[token-graphs]] | ||
=== Token graphs | ||
|
||
When a <<analyzer-anatomy-tokenizer,tokenizer>> converts a text into a stream of | ||
tokens, it also records the following: | ||
|
||
* The `position` of each token in the stream | ||
* The `positionLength`, the number of positions that a token spans | ||
|
||
Using these, you can create a | ||
https://en.wikipedia.org/wiki/Directed_acyclic_graph[directed acyclic graph], | ||
called a _token graph_, for a stream. In a token graph, each position represents | ||
a node. Each token represents an edge or arc, pointing to the next position. | ||
|
||
image::images/analysis/token-graph-qbf-ex.svg[align="center"] | ||
|
||
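The construction above can be sketched in a few lines of Python. This is a simplified model for illustration, not the analyzer's actual implementation: the `Token` type, the `token_graph_edges` helper, and the example stream for "quick brown fox" are all hypothetical names and data. Each token becomes one arc from its `position` to `position + positionLength`.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for a token and its position attributes."""
    term: str
    position: int
    position_length: int = 1  # default: the token spans one position


def token_graph_edges(tokens):
    """Return one (from_node, to_node, term) arc per token."""
    return [
        (t.position, t.position + t.position_length, t.term)
        for t in tokens
    ]


# Assumed example stream for the text "quick brown fox".
stream = [Token("quick", 0), Token("brown", 1), Token("fox", 2)]
edges = token_graph_edges(stream)

for src, dst, term in edges:
    print(f"{src} --{term}--> {dst}")
```

Because every arc points from a lower position to a higher one, the resulting graph cannot contain a cycle, which is why it is a directed acyclic graph.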
[[token-graphs-synonyms]] | ||
==== Synonyms | ||
|
||
Some <<analyzer-anatomy-token-filters,token filters>> can add new tokens, like | ||
synonyms, to an existing token stream. These synonyms often span the same | ||
positions as existing tokens. | ||
|
||
In the following graph, `quick` and its synonym `fast` both have a position of | ||
`0`. They span the same positions. | ||
|
||
image::images/analysis/token-graph-qbf-synonym-ex.svg[align="center"] | ||
|
||
[[token-graphs-multi-position-tokens]] | ||
==== Multi-position tokens | ||
|
||
Some token filters can add tokens that span multiple positions. These can | ||
include tokens for multi-word synonyms, such as using "atm" as a synonym for | ||
"automatic teller machine." | ||
|
||
However, only some token filters, known as _graph token filters_, accurately | ||
record the `positionLength` for multi-position tokens. These filters include: | ||
|
||
* <<analysis-synonym-graph-tokenfilter,`synonym_graph`>> | ||
* <<analysis-word-delimiter-graph-tokenfilter,`word_delimiter_graph`>> | ||
|
||
In the following graph, the phrase `domain name system` and its synonym, `dns`, | ||
both start at position `0`. However, `dns` has a `positionLength` of `3`, so it | ||
spans all three positions of the phrase. Other tokens in the graph have a | ||
default `positionLength` of `1`. | ||
|
||
image::images/analysis/token-graph-dns-synonym-ex.svg[align="center"] | ||
|
||
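As a rough illustration of how a multi-word synonym can be recorded, the following Python sketch adds a single synonym token that starts where the phrase starts and spans as many positions as the phrase has words. It is a toy model under stated assumptions, not how `synonym_graph` is implemented: the `add_multiword_synonym` helper and the `(term, position, position_length)` tuple layout are hypothetical.

```python
def add_multiword_synonym(tokens, phrase, synonym):
    """Add `synonym` alongside each occurrence of the word sequence `phrase`.

    `tokens` is a list of (term, position, position_length) tuples, assumed
    to hold one single-position token per position, ordered by position.
    """
    terms = [term for term, _, _ in tokens]
    width = len(phrase)
    result = list(tokens)
    for start in range(len(terms) - width + 1):
        if terms[start:start + width] == phrase:
            # The synonym shares the phrase's first position and spans as
            # many positions as the phrase has words.
            result.append((synonym, tokens[start][1], width))
    # sorted() is stable, so original tokens stay ahead of added synonyms
    # that share the same position.
    return sorted(result, key=lambda token: token[1])


stream = [
    ("domain", 0, 1), ("name", 1, 1), ("system", 2, 1),
    ("is", 3, 1), ("fragile", 4, 1),
]
with_synonym = add_multiword_synonym(
    stream, ["domain", "name", "system"], "dns"
)

for token in with_synonym:
    print(token)
```

In this model, the added `dns` token carries a position of `0` and a `positionLength` of `3`, matching the graph described above.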
[[token-graphs-token-graphs-search]] | ||
===== Using token graphs for search | ||
|
||
<<analysis-index-search-time,Indexing>> ignores the `positionLength` attribute | ||
and does not support token graphs containing multi-position tokens. | ||
|
||
However, queries, such as the <<query-dsl-match-query,`match`>> or | ||
<<query-dsl-match-query-phrase,`match_phrase`>> query, can use these graphs to | ||
generate multiple sub-queries from a single query string. | ||
|
||
.*Example* | ||
[%collapsible] | ||
==== | ||
A user runs a search for the following phrase using the `match_phrase` query: | ||
`domain name system is fragile` | ||
During <<analysis-index-search-time,search analysis>>, `dns`, a synonym for | ||
`domain name system`, is added to the query string's token stream. The `dns` | ||
token has a `positionLength` of `3`. | ||
image::images/analysis/token-graph-dns-synonym-ex.svg[align="center"] | ||
The `match_phrase` query uses this graph to generate sub-queries for the | ||
following phrases: | ||
[source,text] | ||
------ | ||
dns is fragile | ||
domain name system is fragile | ||
------ | ||
This means the query matches documents containing either `dns is fragile` _or_ | ||
`domain name system is fragile`. | ||
==== | ||
|
||
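One way to picture how a graph yields several phrases is to enumerate every path from the first node to the last. The Python sketch below does that for the example stream. It is only a conceptual model: the `phrases_from_graph` helper and the tuple layout are hypothetical, and the real query builder works on query objects rather than strings.

```python
from collections import defaultdict


def phrases_from_graph(tokens):
    """Enumerate every path from the first to the last node of a token graph.

    `tokens` is a list of (term, position, position_length) tuples.
    """
    # Map each node (position) to the arcs (tokens) that leave it.
    arcs = defaultdict(list)
    for term, position, position_length in tokens:
        arcs[position].append((term, position + position_length))

    start = min(position for _, position, _ in tokens)
    end = max(position + length for _, position, length in tokens)

    def walk(node):
        if node == end:
            yield []
            return
        for term, next_node in arcs[node]:
            for rest in walk(next_node):
                yield [term] + rest

    return sorted(" ".join(path) for path in walk(start))


# `dns` spans three positions, so it bypasses `domain`, `name`, and `system`.
stream = [
    ("domain", 0, 1), ("dns", 0, 3), ("name", 1, 1), ("system", 2, 1),
    ("is", 3, 1), ("fragile", 4, 1),
]
phrases = phrases_from_graph(stream)

for phrase in phrases:
    print(phrase)
```

Each path corresponds to one of the phrases listed in the example above, because the `dns` arc skips directly from node `0` to node `3`.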
[[token-graphs-invalid-token-graphs]] | ||
===== Invalid token graphs | ||
|
||
The following token filters can add tokens that span multiple positions but | ||
only record a default `positionLength` of `1`: | ||
|
||
* <<analysis-synonym-tokenfilter,`synonym`>> | ||
* <<analysis-word-delimiter-tokenfilter,`word_delimiter`>> | ||
|
||
This means these filters will produce invalid token graphs for streams | ||
containing such tokens. | ||
|
||
In the following graph, `dns` is a multi-position synonym for `domain name | ||
system`. However, `dns` has the default `positionLength` value of `1`, resulting | ||
in an invalid graph. | ||
|
||
image::images/analysis/token-graph-dns-invalid-ex.svg[align="center"] | ||
|
||
Avoid using invalid token graphs for search. Invalid graphs can cause unexpected | ||
search results. |
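To see why an invalid graph can give unexpected results, consider what happens when every token is treated as spanning exactly one position. The Python sketch below is a simplified model, not the actual query logic: the `phrases_ignoring_position_length` helper is hypothetical. With every `positionLength` fixed at `1`, the candidate phrases are just the Cartesian product of the terms found at each position.

```python
from itertools import product


def phrases_ignoring_position_length(tokens):
    """Build phrases when every token is assumed to span one position.

    `tokens` is a list of (term, position) tuples. Tokens that share a
    position are treated as interchangeable alternatives for that slot.
    """
    by_position = {}
    for term, position in tokens:
        by_position.setdefault(position, []).append(term)
    slots = [by_position[position] for position in sorted(by_position)]
    return [" ".join(choice) for choice in product(*slots)]


# `dns` is recorded at position 0 with the default positionLength of 1.
stream = [
    ("domain", 0), ("dns", 0), ("name", 1), ("system", 2),
    ("is", 3), ("fragile", 4),
]
phrases = phrases_ignoring_position_length(stream)

for phrase in phrases:
    print(phrase)
```

In this model, `dns` is only an alternative for `domain`, so the phrase `dns is fragile` is never produced, while a phrase starting with `dns name system` is.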
docs/reference/images/analysis/token-graph-dns-invalid-ex.svg (72 additions, 0 deletions)
docs/reference/images/analysis/token-graph-dns-synonym-ex.svg (72 additions, 0 deletions)
docs/reference/images/analysis/token-graph-qbf-synonym-ex.svg (52 additions, 0 deletions)