Fix threshold frequency computation in Suggesters #34312

jimczi · 2018-10-04T19:14:57Z

The term and phrase suggesters have different options to filter candidates
based on their frequencies. The popular mode, for instance, filters out candidate
terms that occur in fewer docs than the original term. However, when we compute this threshold
we use the total term frequency of a term instead of the document frequency. This is not in line
with the actual filtering, which is always based on the document frequency. This change fixes
this discrepancy and clarifies the meaning of the different frequencies in use in the suggesters.
It also ensures that the threshold doesn't overflow the maximum allowed value (Integer.MAX_VALUE).

Closes #34282
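
To make the discrepancy concrete, here is a small sketch on a toy in-memory corpus. It does not use the Lucene or Elasticsearch APIs; the class, the helper names and the corpus are hypothetical, and the handling of a candidate whose frequency exactly equals the threshold is an assumption. It only shows how a threshold derived from the total term frequency can reject a candidate that a document-frequency threshold would keep.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: a toy model of the two frequencies, not the suggester code.
final class FrequencySketch {
    // Toy corpus: the misspelling "teh" is repeated four times in a single document.
    static final List<List<String>> DOCS = Arrays.asList(
        Arrays.asList("teh", "teh", "teh", "teh"),
        Arrays.asList("the", "cat"),
        Arrays.asList("the", "dog"));

    // Document frequency: number of documents that contain the term at least once.
    static int docFreq(List<List<String>> docs, String term) {
        int count = 0;
        for (List<String> doc : docs) {
            if (doc.contains(term)) {
                count++;
            }
        }
        return count;
    }

    // Total term frequency: number of occurrences of the term across all documents.
    static long totalTermFreq(List<List<String>> docs, String term) {
        long count = 0;
        for (List<String> doc : docs) {
            for (String token : doc) {
                if (token.equals(term)) {
                    count++;
                }
            }
        }
        return count;
    }

    // Assumed filter: a candidate is kept when it occurs in at least `threshold` docs.
    static boolean keepsCandidate(int candidateDocFreq, long threshold) {
        return candidateDocFreq >= threshold;
    }
}
```

In this corpus "teh" has a document frequency of 1 but a total term frequency of 4, while the candidate "the" has a document frequency of 2. Comparing the candidate's document frequency against a threshold of 4 drops it; comparing against a threshold of 1 keeps it.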

elasticmachine · 2018-10-04T19:14:59Z

Pinging @elastic/es-search-aggs

nik9000

Thanks for fixing the problem! I don't really know if it is ok to make this change in 6.x though, because it is kind of a breaking change to the "popular" suggest_mode.

For what it is worth, I find that the phrase suggester makes much better suggestions when you use the "always" suggest_mode.

nik9000 · 2018-10-04T19:33:18Z

...r/src/main/java/org/elasticsearch/search/suggest/phrase/DirectCandidateGeneratorBuilder.java

@@ -268,8 +268,8 @@ Integer maxInspections() {
     * frequencies. If an value higher than 1 is specified then fractional
     * can not be specified. Defaults to {@code 0.01}.
     * <p>
-     * This can be used to exclude high frequency terms from being
-     * suggested. High frequency terms are usually spelled correctly on top
+     * This can be used to exclude high totalTermFrequency terms from being


This might be a bit of an overzealous copy and replace.

Oops, thanks, I'll fix.

nik9000 · 2018-10-04T19:34:16Z

server/src/main/java/org/elasticsearch/search/suggest/phrase/WordScorer.java

@@ -62,7 +62,8 @@ public WordScorer(IndexReader reader, Terms terms, String field, double realWord
        // division by zero, by scoreUnigram.
        final long nTerms = terms.size();
        this.numTerms = nTerms == -1 ? reader.maxDoc() : nTerms;
-        this.termsEnum = new FreqTermsEnum(reader, field, !useTotalTermFreq, useTotalTermFreq, null, BigArrays.NON_RECYCLING_INSTANCE); // non recycling for now
+        this.termsEnum = new FreqTermsEnum(reader, field, !useTotalTermFreq, useTotalTermFreq, null,
+            BigArrays.NON_RECYCLING_INSTANCE); // non recycling for now


That is an old comment!

nik9000 · 2018-10-04T19:36:12Z

server/src/main/java/org/elasticsearch/search/suggest/phrase/DirectCandidateGenerator.java

-            return max(0, round(termFrequency * (log10(termFrequency - frequencyPlateau) * (1.0 / log10(LOG_BASE))) + 1));
+    protected int thresholdTermFrequency(int docFreq) {
+        if (docFreq > 0) {
+            return (int) min(


I'd love to have a test that calls this with big numbers and validates that it returns Integer.MAX_VALUE.

I'll add one
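
A minimal sketch of what such a check could look like. The hunk above is truncated, so the method body here is a guess at its shape (the previous expression wrapped in a min against Integer.MAX_VALUE), not the merged code. The class name, passing `frequencyPlateau` as a parameter and the `LOG_BASE` value of 5 are assumptions made for illustration.

```java
// Illustrative only: a standalone approximation, not DirectCandidateGenerator itself.
final class ThresholdSketch {
    // Hypothetical value, chosen only so the example runs.
    static final double LOG_BASE = 5;

    // Computes the threshold in floating point, rounds to a long, then clamps
    // into [0, Integer.MAX_VALUE] before narrowing to int so the cast cannot wrap.
    static int thresholdTermFrequency(int docFreq, long frequencyPlateau) {
        if (docFreq > 0) {
            double scaled = docFreq
                * (Math.log10(docFreq - frequencyPlateau) * (1.0 / Math.log10(LOG_BASE))) + 1;
            return (int) Math.min(Math.max(0, Math.round(scaled)), Integer.MAX_VALUE);
        }
        return 0;
    }
}
```

With `docFreq = Integer.MAX_VALUE` and a plateau of 0, the unclamped value is on the order of 10^10, well above what an int can hold, so the clamped result is Integer.MAX_VALUE. Without the min, narrowing that long to int would wrap.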

jimczi · 2018-10-05T17:57:13Z

> I don't really know if it is ok to make this change in 6.x though because it is kind of a breaking change to the "popular" suggest_mode.

Yes I know, a bug is a feature ;). We can merge the switch to docFreq in master and only backport the protection against overflow in 6.x, WDYT @nik9000?

nik9000 · 2018-10-05T18:09:44Z

> We can merge the switch to docFreq in master and only backport the protection against overflow in 6.x, WDYT @nik9000?

Sounds good to me! Could you add a breaking change note for this then?

jimczi · 2018-10-05T18:14:38Z

Done, and I'll open a separate PR for 6.x.

jimczi · 2018-10-16T09:09:03Z

I just realized that I forgot to push the breaking change note. @nik9000 can you take another look when you have time?

nik9000

LGTM

This change ensures that the term frequency threshold computed by the term/phrase suggesters doesn't overflow the maximum allowed value (Integer.MAX_VALUE). Closes #34282 Relates #34312

jimczi added >bug :Search Relevance/Suggesters "Did you mean" and suggestions as you type v7.0.0 v6.5.0 labels Oct 4, 2018

jimczi requested a review from nik9000 October 4, 2018 19:14

line len

3abb06c

nik9000 reviewed Oct 4, 2018

jimczi added >breaking and removed v6.5.0 labels Oct 5, 2018

jimczi added 4 commits October 5, 2018 20:29

address review

0e1f797

Merge branch 'master' into bug/suggest_threshold

ccbd74c

Merge branch 'master' into bug/suggest_threshold

6c2ab59

add missing entry in breaking change

a16ba99

nik9000 approved these changes Oct 16, 2018

jimczi added 3 commits October 17, 2018 11:16

Merge branch 'master' into bug/suggest_threshold

755c6ad

Merge branch 'master' into bug/suggest_threshold

eec73d5

Merge branch 'master' into bug/suggest_threshold

93aa806

jimczi merged commit 7b49beb into elastic:master Oct 19, 2018

jimczi deleted the bug/suggest_threshold branch October 19, 2018 11:33

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
