
Fix Laplace scorer to multiply by alpha (and not add) #27125

Merged: 2 commits into elastic:master on Oct 31, 2017

Conversation

@shaie (Contributor) commented Oct 26, 2017

The Laplace scorer seems to apply alpha incorrectly. According to the Wikipedia article on additive smoothing (https://en.wikipedia.org/wiki/Additive_smoothing), the formula should be (C_i + alpha) / (N + alpha * V), where in our case C_i denotes the frequency of the term, and N and V denote sum_ttf and num_terms respectively.

I've also fixed the implementation to add numTerms and not vocabSize, per the additive smoothing formula (numTerms == num_terms and vocabSize == sum_ttf). This is also supported by other research papers I found on the web -- you should add one per unique term, and not one (or alpha) per term occurrence.

In that regard, shouldn't LaplaceScorer also override scoreUnigram? Currently it inherits the implementation from WordScorer, which implements add-one smoothing, but in Laplace it seems we should implement add-k.
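For illustration only, a minimal sketch of the corrected formula described above (the class and method names are made up for this example; this is not the actual LaplaceScorer code):

```java
// Illustrative sketch of additive (add-k) smoothing as described above; not the
// actual LaplaceScorer implementation, just the formula from the Wikipedia article.
final class AdditiveSmoothingSketch {
    // P(t_i) = (C_i + alpha) / (N + alpha * V)
    // C_i = frequency of the term, N = sum_ttf, V = num_terms (number of unique terms)
    static double smoothedProbability(long termFreq, long sumTtf, long numTerms, double alpha) {
        // alpha is multiplied by the number of unique terms, not simply added once
        return (termFreq + alpha) / (sumTtf + alpha * numTerms);
    }
}
```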

@elasticmachine (Collaborator) commented

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@dakrone added the :Search/Search (Search-related issues that do not fall into other categories) label on Oct 26, 2017
@dakrone requested a review from jpountz on October 26, 2017 16:55
@jpountz (Contributor) commented Oct 27, 2017

@s1monw Do you have an opinion on those changes?

@jimczi (Contributor) commented Oct 27, 2017

Multiplying by alpha seems like the right thing to do, but I don't think that numTerms should be used. The frequency for each term is computed as the number of occurrences of this term across all documents, so using the total term frequency for the entire corpus seems appropriate.
Also, numTerms is unknown when the shard contains more than one segment, so the value will be -1 most of the time (see #27149). I think we should always use the total term frequency and remove numTerms entirely from the formula.
I am not sure about the original intentions though, so I am also interested in what @s1monw thinks of this ;)

@shaie (Contributor, Author) commented Oct 27, 2017

> Multiplying by alpha seems like the right thing to do, but I don't think that numTerms should be used. The frequency for each term is computed as the number of occurrences of this term across all documents, so using the total term frequency for the entire corpus seems appropriate.

The idea of this smoothing comes from MLE (Maximum Likelihood Estimation). We compute the probability of a term P(t1) as ttf(t1) / sum_ttf. In order to account for non-existing terms, which should still have some small probability, we smooth the probability. In Laplace, aka additive, smoothing you add k, where k is between 0 and 1. If you add 1 to a term in the numerator, you have to account for that in the denominator by adding one for each unique term (so that the probabilities still sum to 1). If you add k < 1, then you add k*numTerms to the denominator. That's the probability theory behind it. If we add k * vocabSize (== sum_ttf) instead, we add a much larger value to the denominator than we should, and therefore lower the probability of all other terms more than we want to.

> Also, numTerms is unknown when the shard contains more than one segment, so the value will be -1 most of the time (see #27149).

I know that, and that's why I've also fixed numTerms to default to reader.maxDoc() as a (very low) approximation. But I think it's better than adding k * vocabSize, since it will discount the probability of existing terms less.
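Spelling out the normalization argument above, with the names used in this thread:

```latex
P_{\mathrm{MLE}}(t) = \frac{\mathrm{ttf}(t)}{\mathrm{sum\_ttf}}
\qquad\longrightarrow\qquad
P_{\mathrm{add}\text{-}k}(t) = \frac{\mathrm{ttf}(t) + k}{\mathrm{sum\_ttf} + k \cdot \mathrm{numTerms}}
```

Summing the smoothed estimate over all numTerms unique terms gives (sum_ttf + k * numTerms) / (sum_ttf + k * numTerms) = 1, which is exactly why the denominator has to grow by k per unique term (k * numTerms) rather than by k per term occurrence (k * sum_ttf).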

@s1monw (Contributor) commented Oct 29, 2017

> I am not sure about the original intentions though, so I am also interested in what @s1monw thinks of this ;)

I don't necessarily know much about this anymore. I can totally look into it but that might be as good as your opinions.

> Multiplying by alpha seems like the right thing to do, but I don't think that numTerms should be used. The frequency for each term is computed as the number of occurrences of this term across all documents, so using the total term frequency for the entire corpus seems appropriate.
> Also, numTerms is unknown when the shard contains more than one segment, so the value will be -1 most of the time (see #27149). I think we should always use the total term frequency and remove numTerms entirely from the formula.

I tend to agree with this. The frequency for the bi-/tri-grams is calculated using a proportional value, i.e. docfreq, or TTF if available. We use a proportional value for the vocabulary size as well; that seems correct to me.

@s1monw (Contributor) commented Oct 29, 2017

@elasticmachine ok to test

@shaie (Contributor, Author) commented Oct 29, 2017

Thanks @s1monw. Do you also have an opinion about my other question -- should LaplaceScorer also override scoreUnigram to implement k-additive smoothing (instead of falling back to add-one)?

Regarding whether to smooth by sum_ttf (aka vocabulary size) or numTerms, I still think it's wrong to divide by sum_ttf, as it shaves off much more of the probability than you want in add-k smoothing. E.g. take a look here: https://lagunita.stanford.edu/c4x/Engineering/CS-224N/asset/slp4.pdf#subsection.4.4.2, where V is previously defined as the "total extra observations", in our case the total number of unique terms.

Dividing by vocabSize == sum_ttf means dividing by far too high a number. On a not-so-large Wikipedia index (w/ English analysis, 5.6M docs), there are 5.4 billion term occurrences. I don't know what API exists to get the number of unique terms, but I believe it's much smaller, and dividing by such a large number is not the right thing to do.
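As a back-of-the-envelope illustration with the numbers above (k = 0.5 is just an example value), the denominator of the smoothed estimate becomes:

```latex
\underbrace{5.4\times10^{9} + 0.5\cdot 5.4\times10^{9}}_{V=\mathrm{sum\_ttf}} = 8.1\times10^{9}
\qquad\text{vs.}\qquad
\underbrace{5.4\times10^{9} + 0.5\cdot 5.6\times10^{6}}_{V\approx\mathrm{maxDoc}} \approx 5.4\times10^{9}
```

i.e. with V = sum_ttf every existing term's probability is scaled down by roughly a factor of 1.5, while with V ≈ maxDoc the extra discount is negligible.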

That numTerms cannot always be computed is OK, as this patch also fixes it to default to reader.maxDoc() (which is needed for WordScorer.scoreUnigram() anyway), which means it will shave off less probability.

Thoughts?

@jimczi (Contributor) commented Oct 30, 2017

> Regarding whether to smooth by sum_ttf (aka vocabulary size) or numTerms, I still think it's wrong to divide by sum_ttf, as it shaves off much more of the probability than you want in add-k smoothing. E.g. take a look here: https://lagunita.stanford.edu/c4x/Engineering/CS-224N/asset/slp4.pdf#subsection.4.4.2, where V is previously defined as the "total extra observations", in our case the total number of unique terms.

For the bigram/trigram case we use the relative frequency, which for the bigram x_y is computed as sum_ttf(x_y) / sum_ttf(x) when ttf is available and tf(x_y) / tf(x) when it's not.
With add-k smoothing the formula becomes (sum_ttf(x_y) + k) / (sum_ttf(x) + kV).
For this case I agree that V can be maxDocs or numTerms; it's just an augmentation of the denominator. The important part is that we divide by the total term frequency of the prefix (the n-1 gram), which is always greater than the sum_ttf of the entire ngram.
Sorry it was unclear, but my comment was about the unigram case, where we need to divide by the total term frequency of the entire corpus in order to get the probability of the term: sum_ttf(x) / ttf. It's not about the smoothing, which can be any external observation as long as it's the same for all terms. So the formula for the unigram case would have to keep the ttf denominator:
(sum_ttf(x) + k) / (ttf + kV), where V could be numTerms or maxDocs. Since numTerms is only available when a single segment is used, I think it would be preferable to always use maxDocs, but that's just for clarity. I am OK with changing this formula if we fix the bigram/trigram case with your diff, but the denominator would be vocabularySize + (k * maxDocs); keeping the total term frequency is important to make sure that the computed value is always <= 1.

The REST tests are failing due to the new formula:

14:40:44 FAILURE 0.09s | DocsClientYamlTestSuiteIT.test {yaml=reference/search/suggesters/phrase-suggest/line_86} <<< FAILURES!
14:40:44    > Throwable #1: java.lang.AssertionError: Failure at [reference/search/suggesters/phrase-suggest:103]: $body didn't match expected value:

Can you adapt these tests to the new formula?

@shaie (Contributor, Author) commented Oct 30, 2017

I am pretty sure that we are roughly on the same page, but since I'm not sure whether you had typos in your response, I would like to reiterate what you wrote:

> for the bigram x_y is computed as sum_ttf(x_y) / sum_ttf(x) when ttf is available and tf(x_y) / tf(x) when it's not.

The formula that we use is (k + freq(x_y)) / (freq(x) + kV). freq() is computed either as ttf = total_term_freq or df = document frequency. Just to be clear, sum_ttf denotes the sum of all term frequencies in a field, and ttf denotes the total term frequency of a single term.

> but the important part is that we divide by the total term frequency of the prefix (the n-1 gram), which is always greater than the sum_ttf of the entire ngram.

Agreed. Except that it's divided by the frequency of the (n-1 gram), which is either ttf or df.

With that, I think we are on the same page, and I'll go fix LaplaceScorer to also override scoreUnigram and implement it with add-k smoothing.
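As a rough sketch of that change (the field and helper names below are assumptions for illustration, not the exact WordScorer/LaplaceScorer API), the add-k unigram score follows the same shape as the bigram/trigram formula discussed above:

```java
// Hypothetical illustration only -- not the actual LaplaceScorer code.
// alpha is the add-k constant, vocabularySize plays the role of sum_ttf, and
// numTerms (falling back to reader.maxDoc() when it cannot be computed) plays the role of V.
@Override
protected double scoreUnigram(Term word) throws IOException {
    // frequency(word) is an assumed helper returning the ttf (or df) of the term
    return (alpha + frequency(word)) / (vocabularySize + alpha * numTerms);
}
```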

> The REST tests are failing due to the new formula:

Hmm, it didn't fail when I ran gradle test. Should I run something else (maybe gradle itest :) )?

@jimczi (Contributor) commented Oct 30, 2017

> The formula that we use is (k + freq(x_y)) / (freq(x) + kV). freq() is computed either as ttf = total_term_freq or df = document frequency. Just to be clear, sum_ttf denotes the sum of all term frequencies in a field, and ttf denotes the total term frequency of a single term.

We are on the same page; OK with the naming.

> Agreed. Except that it's divided by the frequency of the (n-1 gram), which is either ttf or df.

Same here, naming issue. Thanks for using the right one ;).

> Hmm, it didn't fail when I ran gradle test

To run all the tests you should use gradle check; gradle test only runs the unit tests.

> With that, I think we are on the same page, and I'll go fix LaplaceScorer to also override scoreUnigram and implement it with add-k smoothing.

OK. I am a bit concerned because it's the default scoring that we're changing, but I am fine with it if we consider this a breaking change (same for the bigram/trigram change, actually). The phrase_suggester is not easy to tune right, so some users may rely on the current formula even though it was a bit buggy.

@shaie force-pushed the fix-laplace-smoothing branch from 5f05f64 to 404ffe8 on October 30, 2017 19:24
@shaie (Contributor, Author) commented Oct 30, 2017

@jimczi I've fixed the code as we discussed. I also hope that I've addressed all test failures.

@jimczi (Contributor) left a comment

Thanks @shaie!
As discussed, I'll merge the change into master only (7.0).

@shaie (Contributor, Author) commented Oct 31, 2017

Thanks @jimczi!

@jimczi merged commit bd02619 into elastic:master on Oct 31, 2017
@shaie deleted the fix-laplace-smoothing branch on November 1, 2017 07:15
@s1monw (Contributor) commented Nov 1, 2017

Looks good guys, thanks for fixing this @shaie.

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Nov 1, 2017
* master:
  Enhances exists queries to reduce need for `_field_names` (elastic#26930)
  Added new terms_set query
  Set request body to required to reflect the code base (elastic#27188)
  Update Docker docs for 6.0.0-rc2 (elastic#27166)
  Add version 6.0.0
  Docs: restore now fails if it encounters incompatible settings (elastic#26933)
  Convert index blocks to cluster block exceptions (elastic#27050)
  [DOCS] Link remote info API in Cross Cluster Search docs page
  Fix Laplace scorer to multiply by alpha (and not add) (elastic#27125)
  [DOCS] Clarify migrate guide and search request validation
  Raise IllegalArgumentException if query validation failed (elastic#26811)
  prevent duplicate fields when mixing parent and root nested includes (elastic#27072)
  TopHitsAggregator must propagate calls to `setScorer`. (elastic#27138)
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Nov 2, 2017
jasontedor added a commit that referenced this pull request Nov 2, 2017
Labels: >breaking, :Search/Search (Search-related issues that do not fall into other categories), v7.0.0-beta1

7 participants