-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix threshold frequency computation in Suggesters #34312
Conversation
The `term` and `phrase` suggesters have different options to filter candidates based on their frequencies. The `popular` mode for instance filters candidate terms that occur in less docs than the original term. However when we compute this threshold we use the total term frequency of a term instead of the document frequency. This is not inline with the actual filtering which is always based on the document frequency. This change fixes this discrepancy and clarifies the meaning of the different frequencies in use in the suggesters. It also ensures that the threshold doesn't overflow the maximum allowed value (Integer.MAX_VALUE). Closes elastic#34282
Pinging @elastic/es-search-aggs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing the problem! I don't really know the if it is ok to make this change in 6.x though because it is kind of a breaking change to the "popular" suggest_mode.
For what it is worth I find that the phrase suggester make much better suggestions when you use the "always" suggest_mode.
@@ -268,8 +268,8 @@ Integer maxInspections() { | |||
* frequencies. If an value higher than 1 is specified then fractional | |||
* can not be specified. Defaults to {@code 0.01}. | |||
* <p> | |||
* This can be used to exclude high frequency terms from being | |||
* suggested. High frequency terms are usually spelled correctly on top | |||
* This can be used to exclude high totalTermFrequency terms from being |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might to a bit overzealous copy and replace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oups, thanks I'll fix
@@ -62,7 +62,8 @@ public WordScorer(IndexReader reader, Terms terms, String field, double realWord | |||
// division by zero, by scoreUnigram. | |||
final long nTerms = terms.size(); | |||
this.numTerms = nTerms == -1 ? reader.maxDoc() : nTerms; | |||
this.termsEnum = new FreqTermsEnum(reader, field, !useTotalTermFreq, useTotalTermFreq, null, BigArrays.NON_RECYCLING_INSTANCE); // non recycling for now | |||
this.termsEnum = new FreqTermsEnum(reader, field, !useTotalTermFreq, useTotalTermFreq, null, | |||
BigArrays.NON_RECYCLING_INSTANCE); // non recycling for now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is an old comment!
return max(0, round(termFrequency * (log10(termFrequency - frequencyPlateau) * (1.0 / log10(LOG_BASE))) + 1)); | ||
protected int thresholdTermFrequency(int docFreq) { | ||
if (docFreq > 0) { | ||
return (int) min( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd love to have a test that calls this with big numbers and validates that it returns Integer.MAX_VALUE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add one
Yes I know, a bug is a feature ;). We can merge the switch to |
Sounds good to me! Could you add a breaking change note for this then? |
Done and I'll open a separate pr for 6.x |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
The `term` and `phrase` suggesters have different options to filter candidates based on their frequencies. The `popular` mode for instance filters candidate terms that occur in less docs than the original term. However when we compute this threshold we use the total term frequency of a term instead of the document frequency. This is not inline with the actual filtering which is always based on the document frequency. This change fixes this discrepancy and clarifies the meaning of the different frequencies in use in the suggesters. It also ensures that the threshold doesn't overflow the maximum allowed value (Integer.MAX_VALUE). Closes #34282
The
term
andphrase
suggesters have different options to filter candidatesbased on their frequencies. The
popular
mode for instance filters candidateterms that occur in less docs than the original term. However when we compute this threshold
we use the total term frequency of a term instead of the document frequency. This is not inline
with the actual filtering which is always based on the document frequency. This change fixes
this discrepancy and clarifies the meaning of the different frequencies in use in the suggesters.
It also ensures that the threshold doesn't overflow the maximum allowed value (Integer.MAX_VALUE).
Closes #34282