AVLTreeDigest with a lot of datas : integer overflow #81

Closed
Nikko78 opened this issue Apr 11, 2017 · 4 comments

@Nikko78

Nikko78 commented Apr 11, 2017

For a space project, we use TDigest to compute statistics on 5 billion stars. Thanks for your work, it's very useful!

With integer fields, AVLTreeDigest gives wrong results while TreeDigest gives correct ones.

The problem comes from "int[] aggregatedCounts" in AVLGroupTree: the sum sometimes overflows the int range and becomes negative.

I changed "int" to "long", and AVLTreeDigest then gave correct results (the same as TreeDigest).
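
To illustrate the failure mode, here is a minimal sketch of summing counts into an int versus a long accumulator. The class and method names are hypothetical, and this is not the actual AVLGroupTree code; it only assumes that per-node counts are added together as described above.

```java
// Hypothetical stand-in for the aggregated-count bookkeeping; not the real AVLGroupTree.
class AggregatedCountOverflow {
    // Sums counts into an int accumulator, which wraps silently past Integer.MAX_VALUE.
    static int sumAsInt(int[] counts) {
        int total = 0;
        for (int c : counts) {
            total += c; // no exception is thrown on overflow, the value just wraps
        }
        return total;
    }

    // Same sum with a long accumulator, which has room for billions of points.
    static long sumAsLong(int[] counts) {
        long total = 0L;
        for (int c : counts) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        // Two subtrees holding 1.5 billion points each (made-up numbers).
        int[] counts = {1_500_000_000, 1_500_000_000};
        System.out.println(sumAsInt(counts));  // prints -1294967296
        System.out.println(sumAsLong(counts)); // prints 3000000000
    }
}
```

The int version returns a negative total as soon as the sum passes 2,147,483,647, which matches the negative results seen here.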

AVLGroupTree_patch.txt

@tdunning
Owner

tdunning commented Apr 11, 2017 via email

@Nikko78
Author

Nikko78 commented Apr 11, 2017

Thanks for suggesting MergingDigest. We will have many more statistics to compute in the future, with new data that the satellite will send.
I'm working on GAIA (http://sci.esa.int/gaia/) at the Paris Observatory (http://www.obspm.fr), in CU9 on WP940 (data validation) (http://gaia.ub.edu/?page_id=4320).
Nicolas

@tdunning
Owner

I just checked and the MergingDigest accumulates the weights in double. AVLTreeDigest uses int. This means that MergingDigest will work for your case with many counts. I will document the limitation for AVLTreeDigest, but I don't expect that I will change it in the near future.
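
As a rough illustration of why a double accumulator copes with this many counts, the sketch below adds up 5 billion units of weight in a double and checks where unit increments would start to be lost. The class and method names are hypothetical and nothing here is taken from MergingDigest itself; it only relies on the fact that a double represents every integer up to 2^53 exactly.

```java
// Hypothetical illustration of double-valued weight accumulation; not MergingDigest code.
class DoubleWeightAccumulation {
    // Adds `batches` weights of `weightPerBatch` each into a double accumulator.
    static double accumulate(int batches, double weightPerBatch) {
        double total = 0.0;
        for (int i = 0; i < batches; i++) {
            total += weightPerBatch;
        }
        return total;
    }

    // True when adding a single unit of weight no longer changes the total.
    static boolean unitWeightIsLost(double total) {
        return total + 1.0 == total;
    }

    public static void main(String[] args) {
        // 5,000 batches of one million points each: 5 billion in total.
        double total = accumulate(5_000, 1_000_000.0);
        System.out.println((long) total);                            // prints 5000000000
        System.out.println(unitWeightIsLost(total));                 // prints false
        System.out.println(unitWeightIsLost((double) (1L << 53)));   // prints true
    }
}
```

Five billion is far below 2^53 (about 9 × 10^15), so the total stays exact at this scale.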

Thanks for the references!

@tdunning
Owner

Also, check out the MegaMergeTest for an example of merging lots of digests at once. This is what you need for parallelism.
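
The general pattern of building one summary per partition in parallel and merging them all at the end can be sketched as follows. This does not reproduce MegaMergeTest and does not use the t-digest API: `Summary` is a hypothetical, trivial stand-in for a digest (count, min, max only), and the partition layout is made up for the example.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class PartitionedMergeSketch {
    // Hypothetical mergeable summary standing in for a digest.
    static final class Summary {
        long count = 0L; // long, so billions of points do not overflow
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;

        void add(double x) {
            count++;
            min = Math.min(min, x);
            max = Math.max(max, x);
        }

        // Folds all partial summaries into this one in a single pass.
        void addAll(List<Summary> others) {
            for (Summary s : others) {
                count += s.count;
                min = Math.min(min, s.min);
                max = Math.max(max, s.max);
            }
        }
    }

    // Summarizes partition p, which holds the values p*size .. p*size + size - 1.
    static Summary summarizePartition(int p, int size) {
        Summary s = new Summary();
        for (int i = 0; i < size; i++) {
            s.add((double) p * size + i);
        }
        return s;
    }

    // Builds the partial summaries in parallel, then merges them at once.
    static Summary summarizeInParallel(int partitions, int size) {
        List<Summary> partials = IntStream.range(0, partitions)
                .parallel()
                .mapToObj(p -> summarizePartition(p, size))
                .collect(Collectors.toList());
        Summary merged = new Summary();
        merged.addAll(partials);
        return merged;
    }
}
```

Each worker touches only its own summary, so no locking is needed until the final merge. With real digests, the partial summaries would be digests and the last step would be the library's own merge.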

Let me know if you publish any papers that describe your use of t-digest so I can make sure to reference them!
