Phrase Suggester: Suggesting on very frequent words can cause request failures #34282
Comments
Pinging @elastic/es-search-aggs
I think there is a discrepancy in our usage of the Lucene suggester.
I think regardless of how we compute the threshold we should clamp it to Integer values so we don't trigger this.
The `term` and `phrase` suggesters have different options to filter candidates based on their frequencies. The `popular` mode, for instance, filters out candidate terms that occur in fewer docs than the original term. However, when we compute this threshold we use the total term frequency of a term instead of the document frequency. This is not in line with the actual filtering, which is always based on the document frequency. This change fixes this discrepancy and clarifies the meaning of the different frequencies in use in the suggesters. It also ensures that the threshold doesn't overflow the maximum allowed value (Integer.MAX_VALUE). Closes elastic#34282
I have a stack trace that looks like:
I do not have and cannot get the index that causes this failure, but it looks to me like the failure is caused by this series of events:

1. `DirectCandidateGenerator#thresholdFrequency` spits out a frequency that is bigger than `Integer.MAX_VALUE`. This looks to be possible using the default configuration for common words like "the" when the corpus is a couple of million documents and each document is large, say, as big as a Wikipedia page.
2. We call `DirectSpellChecker#setThresholdFrequency` with that number. The JVM helpfully casts the `long` returned by step 1 into a `float`, losing precision but keeping the magnitude of the number largely intact.
3. Lucene checks that the `float` is either less than 1 or a whole number. The "is it a whole number" check looks like `thresholdFrequency != (int) thresholdFrequency`. That will consider floats that don't fit into `int`s as not whole numbers. Most of the time, anyway.

There is a workaround: set `"suggest_mode": "always"`. We'll skip the math and just pick 0 for the frequency, which is both less than one and a whole number, so Lucene is quite happy with it.

It looks like we should either clamp the value to `Integer.MAX_VALUE` in Elasticsearch or Lucene should use something else to check for fractional numbers.
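A minimal sketch of the cast and the whole-number check described above, assuming the check has the form quoted in this issue. The class name, the helpers `isAccepted` and `clamp`, and the sample frequency are illustrative only and are not taken from the Elasticsearch or Lucene sources.

```java
class ThresholdOverflowDemo {

    // Assumed shape of the validation: accept values below 1 (treated as a
    // fraction) or values that survive a round trip through int.
    static boolean isAccepted(float thresholdFrequency) {
        return thresholdFrequency < 1f
                || thresholdFrequency == (int) thresholdFrequency;
    }

    // Hypothetical helper: clamp a long frequency into the int range
    // before it is ever widened to float.
    static int clamp(long frequency) {
        return (int) Math.min(frequency, (long) Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        // Assumed frequency for a very common term; larger than Integer.MAX_VALUE.
        long hugeFrequency = 3_000_000_000L;

        // Implicit long -> float widening keeps the magnitude.
        float asFloat = hugeFrequency;

        // (int) asFloat saturates at Integer.MAX_VALUE, so the round trip
        // no longer matches and the value looks "fractional".
        System.out.println(isAccepted(asFloat));              // false

        // Clamping first yields a value that passes the same check.
        System.out.println(isAccepted(clamp(hugeFrequency))); // true

        // The workaround's threshold of 0 is below 1, so it is accepted.
        System.out.println(isAccepted(0f));                   // true
    }
}
```

The clamped value passes because `Integer.MAX_VALUE` rounds to 2^31 as a `float`, and casting that back to `int` saturates at `Integer.MAX_VALUE` again, so both sides of the comparison agree. That is also one of the few out-of-range cases the check accepts, which fits the "most of the time" caveat.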