Phrase Suggester: Suggesting on very frequent words can cause request failures #34282

nik9000 · 2018-10-03T20:02:13Z

I have a stack trace that looks like:

Caused by: java.lang.IllegalArgumentException: Fractional absolute document frequencies are not allowed
    at org.apache.lucene.search.spell.DirectSpellChecker.setThresholdFrequency(DirectSpellChecker.java:182)
    at org.elasticsearch.search.suggest.phrase.DirectCandidateGenerator.drawCandidates(DirectCandidateGenerator.java:131)
    at org.elasticsearch.search.suggest.phrase.MultiCandidateGeneratorWrapper.drawCandidates(MultiCandidateGeneratorWrapper.java:52)

I do not have and cannot get the index that causes this failure. But it looks to me like the failure is caused by this series of events:

DirectCandidateGenerator#thresholdFrequency spits out a frequency that is bigger than Integer.MAX_VALUE. This looks to be possible using the default configuration for common words like "the" when the corpus is a couple of million documents and each document is large, like, say, as big as a wikipedia page.
We call DirectSpellChecker#setThresholdFrequency with that number. The JVM helpfully casts the long returned by step 1 into a float, losing precision but keeping the magnitude of the number largely intact.
Lucene attempts to validate that the float is either less than 0 or a whole number. The "is it a whole number" check looks like thresholdFrequency != (int) thresholdFrequency. That will consider floats that don't fit into ints as not whole numbers. Most of the time, anyway.

There is a work around: set "suggest_mode": "always". We'll skip the math and just pick 0 for the frequency. Which is both less than one and whole number so Lucene is quite happy with it.

It looks like we should either clamp the value to Integer.MAX_VALUE in Elasticsearch or Lucene should use something else to check for fractional numbers.

The text was updated successfully, but these errors were encountered:

elasticmachine · 2018-10-03T20:02:14Z

Pinging @elastic/es-search-aggs

jimczi · 2018-10-04T06:32:00Z

It looks like we should either clamp the value to Integer.MAX_VALUE in Elasticsearch or Lucene should use something else to check for fractional numbers.

I think there is a discrepancy in our usage of the Lucene suggester. DirectSpellChecker#suggestSimilar uses the docFreq (which is an int) of the candidate to compare with the threshold computed by DirectCandidateGenerator#thresholdFrequency. However in Elasticsearch we compute the threshold from the totalTermFreq and not the docFreq so this explains why the value can be greater than Integer.MAX_VALUE. This also means that we filter some terms that should not be filtered because we don't use the same metric (docFreq vs totalTermFreq). I think we should "just" switch to docFreq when computing the threshold.

nik9000 · 2018-10-04T12:08:58Z

I think regardless of how we compute the threshold we should clamp it to Integer values so we don't trigger this.

The `term` and `phrase` suggesters have different options to filter candidates based on their frequencies. The `popular` mode for instance filters candidate terms that occur in less docs than the original term. However when we compute this threshold we use the total term frequency of a term instead of the document frequency. This is not inline with the actual filtering which is always based on the document frequency. This change fixes this discrepancy and clarifies the meaning of the different frequencies in use in the suggesters. It also ensures that the threshold doesn't overflow the maximum allowed value (Integer.MAX_VALUE). Closes elastic#34282

The `term` and `phrase` suggesters have different options to filter candidates based on their frequencies. The `popular` mode for instance filters candidate terms that occur in less docs than the original term. However when we compute this threshold we use the total term frequency of a term instead of the document frequency. This is not inline with the actual filtering which is always based on the document frequency. This change fixes this discrepancy and clarifies the meaning of the different frequencies in use in the suggesters. It also ensures that the threshold doesn't overflow the maximum allowed value (Integer.MAX_VALUE). Closes #34282

This change ensures that the term frequency threshold computed by the term/phrase suggesters doesn't overflow the maximum allowed value (Integer.MAX_VALUE). Closes #34282 Relates #34312

The `term` and `phrase` suggesters have different options to filter candidates based on their frequencies. The `popular` mode for instance filters candidate terms that occur in less docs than the original term. However when we compute this threshold we use the total term frequency of a term instead of the document frequency. This is not inline with the actual filtering which is always based on the document frequency. This change fixes this discrepancy and clarifies the meaning of the different frequencies in use in the suggesters. It also ensures that the threshold doesn't overflow the maximum allowed value (Integer.MAX_VALUE). Closes #34282

nik9000 added >bug :Search Relevance/Suggesters "Did you mean" and suggestions as you type labels Oct 3, 2018

jimczi self-assigned this Oct 4, 2018

jimczi mentioned this issue Oct 4, 2018

Fix threshold frequency computation in Suggesters #34312

Merged

jimczi closed this as completed in #34312 Oct 19, 2018

javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phrase Suggester: Suggesting on very frequent words can cause request failures #34282

Phrase Suggester: Suggesting on very frequent words can cause request failures #34282

nik9000 commented Oct 3, 2018

elasticmachine commented Oct 3, 2018

jimczi commented Oct 4, 2018

nik9000 commented Oct 4, 2018

Phrase Suggester: Suggesting on very frequent words can cause request failures #34282

Phrase Suggester: Suggesting on very frequent words can cause request failures #34282

Comments

nik9000 commented Oct 3, 2018

elasticmachine commented Oct 3, 2018

jimczi commented Oct 4, 2018

nik9000 commented Oct 4, 2018