From 8daf793cf148eaffb116d875b64041a562d502c5 Mon Sep 17 00:00:00 2001
From: Darren Meiss
Date: Wed, 20 Feb 2019 04:36:37 -0500
Subject: [PATCH] Edits to text in Phrase Suggester doc (#38966)

---
 .../search/suggesters/phrase-suggest.asciidoc | 77 +++++++++----------
 1 file changed, 38 insertions(+), 39 deletions(-)

diff --git a/docs/reference/search/suggesters/phrase-suggest.asciidoc b/docs/reference/search/suggesters/phrase-suggest.asciidoc
index 9c2c56cc40fec..d92c32eddf033 100644
--- a/docs/reference/search/suggesters/phrase-suggest.asciidoc
+++ b/docs/reference/search/suggesters/phrase-suggest.asciidoc
@@ -139,21 +139,21 @@ The response contains suggestions scored by the most likely spell correction fir
 [horizontal]
 `field`::
- the name of the field used to do n-gram lookups for the
+ The name of the field used to do n-gram lookups for the
  language model, the suggester will use this field to gain statistics
  to score corrections. This field is mandatory.

 `gram_size`::
- sets max size of the n-grams (shingles) in the `field`.
- If the field doesn't contain n-grams (shingles) this should be omitted
+ Sets max size of the n-grams (shingles) in the `field`.
+ If the field doesn't contain n-grams (shingles), this should be omitted
  or set to `1`. Note that Elasticsearch tries to detect the gram size
- based on the specified `field`. If the field uses a `shingle` filter the
+ based on the specified `field`. If the field uses a `shingle` filter, the
  `gram_size` is set to the `max_shingle_size` if not explicitly set.

 `real_word_error_likelihood`::
- the likelihood of a term being a
+ The likelihood of a term being a
  misspelled even if the term exists in the dictionary. The default is
- `0.95` corresponding to 5% of the real words are misspelled.
+ `0.95`, meaning 5% of the real words are misspelled.

 `confidence`::
@@ -165,33 +165,33 @@ The response contains suggestions scored by the most likely spell correction fir
  to `0.0` the top N candidates are returned.
  The default is `1.0`.

 `max_errors`::
- the maximum percentage of the terms that at most
+ The maximum percentage of the terms
  considered to be misspellings in order to form a correction. This method
  accepts a float value in the range `[0..1)` as a fraction of the actual
  query terms or a number `>=1` as an absolute number of query terms. The
- default is set to `1.0` which corresponds to that only corrections with
- at most 1 misspelled term are returned. Note that setting this too high
- can negatively impact performance. Low values like `1` or `2` are recommended
+ default is set to `1.0`, meaning only corrections with
+ at most one misspelled term are returned. Note that setting this too high
+ can negatively impact performance. Low values like `1` or `2` are recommended;
  otherwise the time spend in suggest calls might exceed the time spend in
  query execution.

 `separator`::
- the separator that is used to separate terms in the
+ The separator that is used to separate terms in the
  bigram field. If not set the whitespace character is used as a separator.

 `size`::
- the number of candidates that are generated for each
- individual query term Low numbers like `3` or `5` typically produce good
+ The number of candidates that are generated for each
+ individual query term. Low numbers like `3` or `5` typically produce good
  results. Raising this can bring up terms with higher edit distances. The
  default is `5`.

 `analyzer`::
- Sets the analyzer to analyse to suggest text with.
+ Sets the analyzer to analyze to suggest text with.
  Defaults to the search analyzer of the suggest field passed via `field`.

 `shard_size`::
- Sets the maximum number of suggested term to be
+ Sets the maximum number of suggested terms to be
  retrieved from each individual shard. During the reduce phase, only the
  top N suggestions are returned based on the `size` option. Defaults to `5`.
@@ -202,7 +202,7 @@ The response contains suggestions scored by the most likely spell correction fir
 `highlight`::
  Sets up suggestion highlighting. If not provided then
  no `highlighted` field is returned. If provided must
- contain exactly `pre_tag` and `post_tag` which are
+ contain exactly `pre_tag` and `post_tag`, which are
  wrapped around the changed tokens. If multiple tokens
  in a row are changed the entire phrase of changed tokens
  is wrapped rather than each token.
@@ -217,7 +217,7 @@ The response contains suggestions scored by the most likely spell correction fir
  variable, which should be used in your query. You can still specify
  your own template `params` -- the `suggestion` value will be added to the
  variables you specify. Additionally, you can specify a `prune` to control
- if all phrase suggestions will be returned, when set to `true` the suggestions
+ if all phrase suggestions will be returned; when set to `true` the suggestions
  will have an additional option `collate_match`, which will be `true` if
  matching documents for the phrase was found, `false` otherwise.
  The default value for `prune` is `false`.
@@ -271,19 +271,19 @@ the index) and frequent grams (appear at least once in the index).
 [horizontal]
 `stupid_backoff`::
- a simple backoff model that backs off to lower
+ A simple backoff model that backs off to lower
  order n-gram models if the higher order count is `0` and discounts the
  lower order n-gram model by a constant factor. The default `discount` is
  `0.4`. Stupid Backoff is the default model.

 `laplace`::
- a smoothing model that uses an additive smoothing where a
+ A smoothing model that uses an additive smoothing where a
  constant (typically `1.0` or smaller) is added to all counts to balance
- weights, The default `alpha` is `0.5`.
+ weights. The default `alpha` is `0.5`.
 `linear_interpolation`::
- a smoothing model that takes the weighted
- mean of the unigrams, bigrams and trigrams based on user supplied
+ A smoothing model that takes the weighted
+ mean of the unigrams, bigrams, and trigrams based on user supplied
  weights (lambdas). Linear Interpolation doesn't have any default values.
  All parameters (`trigram_lambda`, `bigram_lambda`, `unigram_lambda`)
  must be supplied.
@@ -294,11 +294,11 @@ The `phrase` suggester uses candidate generators to produce a list of
 possible terms per term in the given text. A single candidate generator
 is similar to a `term` suggester called for each individual term in the
 text. The output of the generators is subsequently scored in combination
-with the candidates from the other terms to for suggestion candidates.
+with the candidates from the other terms for suggestion candidates.

 Currently only one type of candidate generator is supported, the
 `direct_generator`. The Phrase suggest API accepts a list of generators
-under the key `direct_generator` each of the generators in the list are
+under the key `direct_generator`; each of the generators in the list is
 called per term in the original text.

 ==== Direct Generators
@@ -320,7 +320,7 @@ The direct generators support the following parameters:
  as an optimization to generate fewer suggestions to test on each shard and
  are not rechecked when combining the suggestions generated on each
  shard. Thus `missing` will generate suggestions for terms on shards that do
- not contain them even other shards do contain them. Those should be
+ not contain them even if other shards do contain them. Those should be
  filtered out using `confidence`. Three possible values can be specified:
  ** `missing`: Only generate suggestions for terms that are not in the
  shard. This is the default.
@@ -332,7 +332,7 @@ The direct generators support the following parameters:
 `max_edits`::
  The maximum edit distance candidate suggestions can
  have in order to be considered as a suggestion. Can only be a value between 1
- and 2. Any other value result in an bad request error being thrown.
+ and 2. Any other value results in a bad request error being thrown.
  Defaults to 2.

 `prefix_length`::
@@ -347,7 +347,7 @@ The direct generators support the following parameters:
 `max_inspections`::
  A factor that is used to multiply with the
- `shards_size` in order to inspect more candidate spell corrections on
+ `shards_size` in order to inspect more candidate spelling corrections on
  the shard level. Can improve accuracy at the cost of performance.
  Defaults to 5.
@@ -356,32 +356,31 @@ The direct generators support the following parameters:
  suggestion should appear in. This can be specified as an absolute number
  or as a relative percentage of number of documents. This can improve
  quality by only suggesting high frequency terms. Defaults to 0f and is
- not enabled. If a value higher than 1 is specified then the number
+ not enabled. If a value higher than 1 is specified, then the number
  cannot be fractional. The shard level document frequencies are used for
  this option.

 `max_term_freq`::
- The maximum threshold in number of documents a
+ The maximum threshold in number of documents in which a
  suggest text token can exist in order to be included. Can be a relative
- percentage number (e.g 0.4) or an absolute number to represent document
- frequencies. If an value higher than 1 is specified then fractional can
+ percentage number (e.g., 0.4) or an absolute number to represent document
+ frequencies. If a value higher than 1 is specified, then fractional can
  not be specified. Defaults to 0.01f. This can be used to exclude high
- frequency terms from being spellchecked. High frequency terms are
- usually spelled correctly on top of this also improves the spellcheck
+ frequency terms -- which are usually spelled correctly -- from being spellchecked. This also improves the spellcheck
  performance. The shard level document frequencies are used for this
  option.

 `pre_filter`::
- a filter (analyzer) that is applied to each of the
+ A filter (analyzer) that is applied to each of the
  tokens passed to this candidate generator. This filter is applied to the
  original token before candidates are generated.

 `post_filter`::
- a filter (analyzer) that is applied to each of the
+ A filter (analyzer) that is applied to each of the
  generated tokens before they are passed to the actual phrase scorer.

-The following example shows a `phrase` suggest call with two generators,
-the first one is using a field containing ordinary indexed terms and the
+The following example shows a `phrase` suggest call with two generators:
+the first one is using a field containing ordinary indexed terms, and the
 second one uses a field that uses terms indexed with a `reverse` filter
 (tokens are index in reverse order). This is used to overcome the limitation
 of the direct generators to require a constant prefix to provide
@@ -416,6 +415,6 @@ POST _search
 `pre_filter` and `post_filter` can also be used to inject synonyms after
 candidates are generated. For instance for the query `captain usq` we
-might generate a candidate `usa` for term `usq` which is a synonym for
-`america` which allows to present `captain america` to the user if this
+might generate a candidate `usa` for the term `usq`, which is a synonym for
+`america`. This allows us to present `captain america` to the user if this
 phrase scores high enough.
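
Reviewer note, not part of the patch: the request body that the patched text elides between the last two hunks is not shown in this diff. The two-generator call it describes can be sketched roughly as below. This is a minimal illustration in Python, not the example from the documentation. The function name, the suggestion name `simple_phrase`, and the field names `title.trigram`, `title`, and `title.reverse` are all hypothetical. The `reverse` analyzer named in `pre_filter` and `post_filter` is assumed to be defined in the index settings.

```python
import json


def build_phrase_suggest_body(text, ngram_field, term_field, reversed_field):
    """Build a `phrase` suggester request body with two direct generators.

    All field names are supplied by the caller; nothing here talks to a cluster.
    """
    return {
        "suggest": {
            "text": text,
            # "simple_phrase" is an arbitrary, caller-chosen suggestion name.
            "simple_phrase": {
                "phrase": {
                    # Field holding the n-grams (shingles) for the language model.
                    "field": ngram_field,
                    "direct_generator": [
                        # First generator: ordinary indexed terms.
                        {"field": term_field, "suggest_mode": "always"},
                        # Second generator: terms indexed in reverse order.
                        # "reverse" is assumed to be an analyzer defined on
                        # the index; it is applied before and after candidate
                        # generation.
                        {
                            "field": reversed_field,
                            "suggest_mode": "always",
                            "pre_filter": "reverse",
                            "post_filter": "reverse",
                        },
                    ],
                }
            },
        }
    }


if __name__ == "__main__":
    body = build_phrase_suggest_body(
        "captain usq", "title.trigram", "title", "title.reverse"
    )
    # The printed JSON is what would be sent as the body of `POST _search`.
    print(json.dumps(body, indent=2))
```

Building the body as a plain dictionary keeps the sketch independent of any client library; the printed JSON can be pasted into whatever tool issues the search request.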