-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
approximate histograms #58
Comments
I've prototyped a new implementation of the Ben-Haim streaming histogram in pure Python, no package dependencies. Given a max bin count of 64 (streamhist default), adding a value takes 12 usec vs. 36 usec for streamhist (on 5 year old MacBook Air). That's a compelling speed given that the algorithm dynamically adjusts bins on every insert at steady state. It's possible to maintain that insert speed at much larger max bin counts (say 1,000), but the implementation gets significantly more complex (maintain a sorted container of bin distances). More interesting is how flexible the algorithm is regarding bin merging strategies. The strategy in the Ben-Haim paper merges nearest bins, which I suppose preserves the most information about a distribution. However with a simple tweak it can instead maintain equal bin sizes-- ideal for arbitrary percentile queries. Then with some cheating I've made a "best effort" equal bin size strategy that is amortized O(1) regardless of max bin count, with fill running at 3 usec. Other merge strategies are possible, such as providing specific percentiles using a minim number of bins (e.g. accurate 50% and 90% using only 5 bins). |
implementation update I improved the fill performance of the simple Python implementation and added a numba implementation:
(I wonder if physt would take on numba as I've implemented equivalent of numpy.quantile(). It matches quantile() output exactly until max_bins is reached, then transitions gracefully to approximate quantiles. |
Sorry, this will probably never be implemented :-/ |
I'm following the paper (http://jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf) implemented by https://github.com/carsonfarmer/streamhist, and the notion of approximate histograms seems elegant and efficient.
After seeing the internals of streamhist (trying to fix bugs) and reading the paper, I can imagine ways to make a better implementation: e.g. much more efficient discovery of bins to be joined, and avoiding temporary lists when possible. Also the code seems overly complex, partially due to features like "bin freezing" which try to workaround poor bin joining performance.
Anyway since streamhist is defunct, I'm thinking about trying an implementation. I wonder if this kind of histogram would fit into physt (and if sortedcollections would be reasonable as a dependency).
The text was updated successfully, but these errors were encountered: