-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Laplace scorer to multiply by alpha (and not add) #27125
Conversation
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually? |
@s1monw Do you have an opinion on those changes? |
Multiply by alpha seems the right thing to do but I don't think that |
The idea of this smoothing comes from the MLE (Maximum Likelihood Estimation). We compute the probability of a term
I know that, and that's why I've also fixed |
I don't necessarily know much about this anymore. I can totally look into it but that might be as good as your opinions.
I tend to agree with this. The frequency for the bi-/tri-grams is calculated using a proportional value ie. docfreq or TTF if available. We use a proportional value for the vocabulary size, that seems correct to me. |
@elasticmachine ok to test |
Thanks @s1monw. Do you also have an opinion about my other question -- should Regarding whether to smooth by Dividing by That that Thoughts? |
For the bigram/trigram case we use the The rest tests are failing due to the new formula:
Can you adapt these tests to the new formula ? |
I am pretty sure that we are roughly on the same page, and since I'm not sure if you had typos in your response, I would like to re-iterate what you wrote:
The formula that we use is
Agreed. Except that it's divided by the frequency of the (n-1 gram), which is either With that, I think we are on the same page, and I'll go fix LaplaceScorer to also override
Hmm, it didn't fail when I ran |
We are on the same page, ok with the naming.
Same here, naming issue. Thanks for using the right one ;).
To run all the tests you should use
Ok. I am a bit concerned because it's the default scoring that we're changing but I am good if we consider this as a breaking change (same for the bigram/trigram change actually). The |
5f05f64
to
404ffe8
Compare
@jimczi I've fixed the code as we discussed. I also hope that I've addressed all test failures. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @shaie !
As discussed I'll merge the change in master only (7.0).
Thanks @jimczi! |
Looks good guys thanks for fixing this @shaie |
* master: Enhances exists queries to reduce need for `_field_names` (elastic#26930) Added new terms_set query Set request body to required to reflect the code base (elastic#27188) Update Docker docs for 6.0.0-rc2 (elastic#27166) Add version 6.0.0 Docs: restore now fails if it encounters incompatible settings (elastic#26933) Convert index blocks to cluster block exceptions (elastic#27050) [DOCS] Link remote info API in Cross Cluster Search docs page Fix Laplace scorer to multiply by alpha (and not add) (elastic#27125) [DOCS] Clarify migrate guide and search request validation Raise IllegalArgumentException if query validation failed (elastic#26811) prevent duplicate fields when mixing parent and root nested includes (elastic#27072) TopHitsAggregator must propagate calls to `setScorer`. (elastic#27138)
* master: Remove checkpoint tracker bit sets setting Fix stable BWC branch detection logic Fix logic detecting unreleased versions Enhances exists queries to reduce need for `_field_names` (elastic#26930) Added new terms_set query Set request body to required to reflect the code base (elastic#27188) Update Docker docs for 6.0.0-rc2 (elastic#27166) Add version 6.0.0 Docs: restore now fails if it encounters incompatible settings (elastic#26933) Convert index blocks to cluster block exceptions (elastic#27050) [DOCS] Link remote info API in Cross Cluster Search docs page Fix Laplace scorer to multiply by alpha (and not add) (elastic#27125) [DOCS] Clarify migrate guide and search request validation Raise IllegalArgumentException if query validation failed (elastic#26811) prevent duplicate fields when mixing parent and root nested includes (elastic#27072) TopHitsAggregator must propagate calls to `setScorer`. (elastic#27138)
* master: Lazy initialize checkpoint tracker bit sets Remove checkpoint tracker bit sets setting Fix stable BWC branch detection logic Fix logic detecting unreleased versions Enhances exists queries to reduce need for `_field_names` (#26930) Added new terms_set query Set request body to required to reflect the code base (#27188) Update Docker docs for 6.0.0-rc2 (#27166) Add version 6.0.0 Docs: restore now fails if it encounters incompatible settings (#26933) Convert index blocks to cluster block exceptions (#27050) [DOCS] Link remote info API in Cross Cluster Search docs page Fix Laplace scorer to multiply by alpha (and not add) (#27125) [DOCS] Clarify migrate guide and search request validation Raise IllegalArgumentException if query validation failed (#26811) prevent duplicate fields when mixing parent and root nested includes (#27072) TopHitsAggregator must propagate calls to `setScorer`. (#27138)
Laplace scorer seems to incorrectly apply
alpha
. According to Wikipedia https://en.wikipedia.org/wiki/Additive_smoothing, the formula should be(C_i + alpha) / (N + alpha * V)
, where in our caseC_i
denotes the frequency of the term, andN
andV
denote thesum_ttf
andnum_terms
respectively.I've also fixed the implementation to add
numTerms
and notvocabSize
per the additive smoothing formula (numTerms == num_terms
andvocabSize == sum_ttf
). This is also supported by other research papers I find on the web -- you should add one per unique term, and not one (or alpha) per term occurrence.In that regard, shouldn't
LaplaceScorer
also overridescoreUnigram
? Currently it inherits the implementation fromWordScorer
which implements add-one smoothing, but in Laplace, seems that we should implement add-k.