Issue #37303 - Invalid variance fix #37384
vishnugt
commented
Jan 12, 2019
- Fixes the invalid variance computed in the extended_stats aggregation (#37303).
- The problem comes from the finite precision of floating-point numbers combined with the way the variance is calculated: the result sometimes comes out negative, although a variance is always non-negative.
- NaN is handled: if the computed variance is NaN, the comparison in the code evaluates to false, so NaN is returned unchanged (which is what we need).
- This closes #37303.
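To make the failure mode described above concrete, here is a minimal Python sketch rather than the Elasticsearch code itself. The function names `naive_variance` and `clamped_variance` are hypothetical, and the sum-of-squares formula is assumed from the description in this thread. The input is three identical values, whose true variance is exactly 0.

```python
import math


def naive_variance(values):
    """Population variance via the sum-of-squares formula (illustrative only)."""
    count = len(values)
    if count == 0:
        return math.nan  # assumption: an empty input has no defined variance
    total = 0.0
    total_sq = 0.0
    for v in values:
        total += v
        total_sq += v * v
    # Subtracting two nearly equal large numbers is where precision is lost.
    return (total_sq - (total * total) / count) / count


def clamped_variance(values):
    """Same formula, but a negative result is replaced by zero."""
    variance = naive_variance(values)
    # NaN < 0 is False, so a NaN variance passes through unchanged.
    return 0.0 if variance < 0 else variance


# Three identical values: mathematically the variance is exactly 0.
identical = [float(2**27 + 3)] * 3
print(naive_variance(identical), clamped_variance(identical))
```

On IEEE 754 doubles the unclamped result for `identical` is slightly negative, because the rounding of the sum of squares and of the squared sum do not cancel. The clamp returns 0, and a NaN input still yields NaN.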
Pinging @elastic/es-analytics-geo |
Thanks @vishnugt! Just had a quick skim, would you mind adding a few tests that shows each aggregator does the right thing in this scenario? That way we don't accidentally break this functionality in the future. After the tests are added I'll kick off a CI build and give it a more thorough review. Thanks again! |
Hey @polyfractal, I have added a test case for the getVariance method in InternalExtendedStats class, which seems to be the core part. Regarding the test case, I had to hardcode a couple of values, because I couldn't reliably generate the expected result in any particular range. |
Great, thanks @vishnugt! I think hard-coding the values is perfectly fine in this case, it's a very specific scenario that's being fixed. Kicking off a CI build! @elasticmachine test this please |
@elasticmachine test this please |
Heya @vishnugt, we're having some test troubles in CI unrelated to this PR. I'll kick off another build once we resolve the CI stuff, because otherwise it's going to just keep failing for unrelated reasons. Sorry for the delay! |
Hey @polyfractal, thanks for letting me know that it's an issue unrelated to this PR.
@elasticmachine test this please |
Heya @vishnugt, CI has cleared up a bit. Would you mind merging master into the PR and I'll kick off another set of builds? Thanks! |
Hey @polyfractal, I just merged the master into the PR, would you mind kicking off a CI build? Thanks! |
@elasticmachine test this please |
Hmm, I agree, I think it's related to #37792. Would you mind merging in the most recent master once more? On that note, if you'd like, I can help out with merging master into your branch in the future, since CI can be finicky at times. It would help expedite getting this in :)
Hey @polyfractal, I just merged this with the recent master. I would be more than happy to let you help me with merging, I sent you a collaborator request for my fork of elasticsearch, though I'm not quite sure if this is what you meant. Please let me know if I have to do anything else on my side to let you help merge. |
@elasticmachine test this please
Ah oops, my fault. All that's needed is a checkbox on the PR itself that allows repo maintainers to push to the PR (it should be checked by default; I just like to double-check with contributors so they know what's going on when new commits show up). Will help with merging and getting this passed, thanks!
Ah okay, thanks for clarifying! |
Due to floating point error, it was possible for variances to become negative which should never happen. This bugfix sets variance to zero if it becomes negative as a result of fp error.
Thank you so much, it's my first PR, hope to send more :) Thanks for helping till the end! |
No problem, you did all the hard work :) Looking forward to future PRs! |
* elastic/master: (68 commits) Fix potential IllegalCapacityException in LLRC when selecting nodes (elastic#37821) Mute CcrRepositoryIT#testFollowerMappingIsUpdated Fix S3 Repository ITs When Docker is not Available (elastic#37878) Pass distribution type through to docs tests (elastic#37885) Mute SharedClusterSnapshotRestoreIT#testSnapshotCanceledOnRemovedShard SQL: Fix casting from date to numeric type to use millis (elastic#37869) Migrate o.e.i.r.RecoveryState to Writeable (elastic#37380) ML: removing unnecessary upgrade code (elastic#37879) Relax cluster metadata version check (elastic#37834) Mute TransformIntegrationTests#testSearchTransform Refactored GeoHashGrid unit tests (elastic#37832) Fixes for a few randomized agg tests that fail hasValue() checks Geo: replace intermediate geo objects with libs/geo (elastic#37721) Remove NOREPLACE for /etc/elasticsearch in rpm and deb (elastic#37839) Remove "reinstall" packaging tests (elastic#37851) Add unit tests for ShardStateAction's ShardStartedClusterStateTaskExecutor (elastic#37756) Exit batch files explictly using ERRORLEVEL (elastic#29583) TransportUnfollowAction should increase settings version (elastic#37859) AsyncTwoPhaseIndexerTests race condition fixed (elastic#37830) Do not allow negative variances (elastic#37384) ...
Hey folks, I am sorry to say this is just a patch over a fundamentally flawed variance calculation method. The sum-of-squares method used here should not be used for floating-point calculations, because it suffers from catastrophic cancellation, hence the negative values. Instead, the calculation should use the simple Welford algorithm, as presented here: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm
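For readers unfamiliar with the algorithm mentioned above, a minimal Python sketch of Welford's online update follows. It is an illustration adapted from the general description on the linked Wikipedia page, not code from Elasticsearch or from this PR. The name `welford_variance` is hypothetical, and returning NaN for an empty input is an assumption.

```python
def welford_variance(values):
    """Population variance computed in one pass with Welford's update."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean           # deviation from the old mean
        mean += delta / count      # update the running mean
        m2 += delta * (x - mean)   # old deviation times new deviation
    if count == 0:
        return float("nan")  # assumption: no data means no defined variance
    return m2 / count


print(welford_variance([float(2**27 + 3)] * 3))
```

Because `m2` only accumulates products of deviations instead of subtracting two large totals, three identical values give a variance of exactly 0 here.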
Heya @mysticaltech, thanks for the note! I mentioned this in a different ticket (#50416 (comment)), but forgot to open a followup ticket to change the algo. Thanks for the ping, I'll open one to track this particular enhancement 👍 |
Thanks @polyfractal, that's great. If I knew enough about Elasticsearch I would try to help; the Welford algorithm is quite simple and numerically very stable. If it helps, here's a JavaScript implementation adapted from the Python one on the Wikipedia page: https://gist.github.com/mysticaltech/7ceb16eaa6ad00209e6fd87a55785991
Thanks @mysticaltech :) I suspect the hard part will actually be unrelated to the algo entirely, it'll be around dealing with backwards compat with older versions (since we'll need to support the old algo serialization, etc). Cheers! |
@polyfractal Yes, I understand, it makes sense. The variance should end up being the same, except in cases where the error was large and folks have adapted to it. That said, I fully agree that such a change would be too risky (even though in theory it shouldn't be). So I would suggest keeping the sum-of-squares method as the default for all variance calculations and adding an optional flag to switch to Welford. This way, folks would have to opt in to Welford explicitly. And if the M2 moment aggregate calculation is added, this can be done in one pass, if I'm not mistaken.
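As a rough illustration of the one-pass M2 idea, the sketch below keeps a (count, mean, M2) triple per partition and merges partial results with the pairwise update from the same Wikipedia page. The `Moments` class and its methods are hypothetical names. How Elasticsearch would actually carry or serialize such state across shards is not covered in this thread, so that part is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Moments:
    """Hypothetical partial aggregate: count, mean, and M2."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def add(self, x):
        # Welford's single-value update.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        # Pairwise combination of two partial aggregates.
        if other.count == 0:
            return Moments(self.count, self.mean, self.m2)
        if self.count == 0:
            return Moments(other.count, other.mean, other.m2)
        n = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return Moments(n, mean, m2)

    def variance(self):
        # Assumption: NaN when there is no data.
        return self.m2 / self.count if self.count else float("nan")


def collect(values):
    """Build a partial aggregate from one partition of the data."""
    moments = Moments()
    for v in values:
        moments.add(v)
    return moments


# Two partitions standing in for per-shard partial results.
merged = collect([2.0, 4.0, 4.0, 4.0]).merge(collect([5.0, 5.0, 7.0, 9.0]))
print(merged.count, merged.mean, merged.variance())
```

Each partition is read once, and only the three numbers per partition need to be combined at the end.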