Fix threshold frequency computation in Suggesters #34312

Merged: jimczi merged 9 commits into elastic:master from bug/suggest_threshold on Oct 19, 2018

Contributor

@jimczi jimczi commented Oct 4, 2018

The `term` and `phrase` suggesters have different options to filter candidates
based on their frequencies. The `popular` mode, for instance, filters out candidate
terms that occur in fewer docs than the original term. However, when we compute this threshold
we use the total term frequency of a term instead of the document frequency. This is not in line
with the actual filtering, which is always based on the document frequency. This change fixes
this discrepancy and clarifies the meaning of the different frequencies in use in the suggesters.
It also ensures that the threshold doesn't overflow the maximum allowed value (Integer.MAX_VALUE).
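
To make the distinction between the two frequencies concrete, here is a minimal sketch in plain Java. It does not use Lucene or the suggester code; the class `FrequencySketch`, its methods, and the whitespace tokenization are assumptions made purely for illustration. The document frequency counts how many documents contain a term, while the total term frequency counts every occurrence, so the two diverge as soon as a term repeats within a document.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: not the Elasticsearch or Lucene implementation.
final class FrequencySketch {

    /** Document frequency: the number of documents that contain the term at least once. */
    static int docFreq(List<String> docs, String term) {
        int count = 0;
        for (String doc : docs) {
            // Naive whitespace tokenization, assumed for the sketch.
            if (Arrays.asList(doc.split("\\s+")).contains(term)) {
                count++;
            }
        }
        return count;
    }

    /** Total term frequency: the number of occurrences of the term across all documents. */
    static long totalTermFreq(List<String> docs, String term) {
        long count = 0;
        for (String doc : docs) {
            for (String token : doc.split("\\s+")) {
                if (token.equals(term)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat and the hat", "the dog", "a bird");
        System.out.println("docFreq(the) = " + docFreq(docs, "the"));
        System.out.println("totalTermFreq(the) = " + totalTermFreq(docs, "the"));
    }
}
```

A threshold derived from the total term frequency can therefore be larger than any document frequency a candidate could reach, which is the mismatch described above.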

Closes #34282

@jimczi jimczi added >bug :Search Relevance/Suggesters "Did you mean" and suggestions as you type v7.0.0 v6.5.0 labels Oct 4, 2018
@jimczi jimczi requested a review from nik9000 October 4, 2018 19:14
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

Member

@nik9000 nik9000 left a comment

Thanks for fixing the problem! I don't really know if it is OK to make this change in 6.x though, because it is kind of a breaking change to the "popular" suggest_mode.

For what it is worth, I find that the phrase suggester makes much better suggestions when you use the "always" suggest_mode.

@@ -268,8 +268,8 @@ Integer maxInspections() {
* frequencies. If an value higher than 1 is specified then fractional
* can not be specified. Defaults to {@code 0.01}.
* <p>
- * This can be used to exclude high frequency terms from being
- * suggested. High frequency terms are usually spelled correctly on top
+ * This can be used to exclude high totalTermFrequency terms from being
Member

This might be a bit of an overzealous copy and replace.

Contributor Author

Oops, thanks, I'll fix it.

@@ -62,7 +62,8 @@ public WordScorer(IndexReader reader, Terms terms, String field, double realWord
// division by zero, by scoreUnigram.
final long nTerms = terms.size();
this.numTerms = nTerms == -1 ? reader.maxDoc() : nTerms;
- this.termsEnum = new FreqTermsEnum(reader, field, !useTotalTermFreq, useTotalTermFreq, null, BigArrays.NON_RECYCLING_INSTANCE); // non recycling for now
+ this.termsEnum = new FreqTermsEnum(reader, field, !useTotalTermFreq, useTotalTermFreq, null,
+     BigArrays.NON_RECYCLING_INSTANCE); // non recycling for now
Member

That is an old comment!

- return max(0, round(termFrequency * (log10(termFrequency - frequencyPlateau) * (1.0 / log10(LOG_BASE))) + 1));
+ protected int thresholdTermFrequency(int docFreq) {
+ if (docFreq > 0) {
+ return (int) min(
Member

I'd love to have a test that calls this with big numbers and validates that it returns Integer.MAX_VALUE.

Contributor Author

I'll add one
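
A check along the lines requested above could look like the following sketch. It is a standalone approximation assembled from the diff fragments quoted in this review, not the merged code: the class name `ThresholdSketch`, the value `LOG_BASE = 5`, and passing `frequencyPlateau` as a parameter (it appears to be a field in the original) are all assumptions.

```java
// Hypothetical standalone sketch: not the merged Elasticsearch implementation.
final class ThresholdSketch {

    // Assumed value for illustration; the review only shows the name LOG_BASE.
    private static final int LOG_BASE = 5;

    /**
     * Scales the document frequency logarithmically and clamps the result to
     * Integer.MAX_VALUE so that the final narrowing cast cannot wrap around.
     */
    static int thresholdTermFrequency(int docFreq, int frequencyPlateau) {
        if (docFreq <= 0) {
            return 0;
        }
        // Subtract in long arithmetic so the difference itself cannot overflow.
        double logScaled = Math.log10((double) ((long) docFreq - frequencyPlateau)) / Math.log10(LOG_BASE);
        // Math.round(double) returns a long; NaN rounds to 0 and negative values are floored at 0.
        long rounded = Math.max(0L, Math.round(docFreq * logScaled + 1));
        // Without this clamp, casting a long above Integer.MAX_VALUE to int would wrap.
        return (int) Math.min(rounded, (long) Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        System.out.println(thresholdTermFrequency(5, 0));
        System.out.println(thresholdTermFrequency(Integer.MAX_VALUE, 0));
    }
}
```

Keeping the intermediate result as a `long` and clamping before the cast is the design choice that matters here; the exact scaling formula is secondary to the overflow protection.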

@jimczi
Contributor Author

jimczi commented Oct 5, 2018

> I don't really know if it is OK to make this change in 6.x though, because it is kind of a breaking change to the "popular" suggest_mode.

Yes I know, a bug is a feature ;). We can merge the switch to docFreq in master and only backport the protection against overflow in 6.x, WDYT @nik9000?

@nik9000
Member

nik9000 commented Oct 5, 2018

> We can merge the switch to docFreq in master and only backport the protection against overflow in 6.x, WDYT @nik9000?

Sounds good to me! Could you add a breaking change note for this then?

@jimczi jimczi added >breaking and removed v6.5.0 labels Oct 5, 2018
@jimczi
Contributor Author

jimczi commented Oct 5, 2018

Done, and I'll open a separate PR for 6.x.

@jimczi
Contributor Author

jimczi commented Oct 16, 2018

I just realized that I forgot to push the breaking change note. @nik9000, can you take another look when you have time?

Member

@nik9000 nik9000 left a comment

LGTM

@jimczi jimczi merged commit 7b49beb into elastic:master Oct 19, 2018
@jimczi jimczi deleted the bug/suggest_threshold branch October 19, 2018 11:33
jimczi added a commit that referenced this pull request Oct 19, 2018
This change ensures that the term frequency threshold computed by the term/phrase
suggesters doesn't overflow the maximum allowed value (Integer.MAX_VALUE).

Closes #34282
Relates #34312
kcm pushed a commit that referenced this pull request Oct 30, 2018
Successfully merging this pull request may close these issues.

Phrase Suggester: Suggesting on very frequent words can cause request failures
4 participants