You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The current algorithm defines a threshold for the number of values any centroid can hold as proportional to 4q(1-q). This means that computing quantiles at extreme values (near 0 or 1) gives the most accurate values, and computing the median value will gives the least accurate value (since the threshold is highest in the middle of the histogram.
One thought I had was that if the algorithm was told up-front what quantiles are the most interesting to the user could this distribution be modified to concentrate the accuracy (i.e. lower the threshold) at these values? So for example if the user is most interested in the median value the thresholds equation used could be proportional to 4q(q-1)+1.
I can see two complications coming from this approach:
If the user is interested in multiple quantile values the calculation of the threshold value could become complex. Of course there is nothing stopping the user from asking the algorithm for other quantile values, they would just receive lower accuracy results for these other values compared with the value the distribution is centred on. So maybe to start it could be restricted to a single 'interesting' quantile.
For use-cases where the values recorded in the histogram follow a normal Gaussian distribution (or approximate to this distribution) the number of compressions required will be higher than with the current distribution. The worst case of this would be when the user is most interested in the median value and the distribution is normal Gaussian. In this case the values recorded will be concentrated in the area that has the lowest threshold so will cause a lot of new centroids to be created and therefore more merge events (on the flip side of this the algorithm should preform much less merges in the ordered values case). This may just be the cost that needs to be paid to have high accuracy at the median?
I would appreciate any thoughts on this approach. Does it sound like something thats worth exploring? Have I missed something that would mean this wouldn't be a good idea? If this idea seems to make sense I can work on trying it out and seeing how it performs with different distributions of data.
The text was updated successfully, but these errors were encountered:
In the current implementations, you should observe some constraints on your mapping function. The basic idea is that you need to make sure that the total space for the digest is bounded. The condition that allows this is that the bounding has to have the right properties. The gotcha is that the minimum count for a centroid is 1 so for some functions the number of centroids with a single sample grows unboundedly. This is true, for instance, for the original q (1-q) constraint. On the other hand, if you use sqrt(q (1-q)) as your constraint, the size is bounded.
In the current forms of the algorithm itself, the way that this works is that there is something called an index function that maps q to an index k. A centroid is legal if it goes from q_0 to q_1 and if N(k(q_1) - k(q_0)) < 1 except that we always allow centroids with a single element. If you look at the copy of the paper in the master branch, you will see a bit of what I mean. It isn't finished, but it gets pretty far down that line.
The normal index curve looks like this:
For your example, you could use an index curve where the steep part is in the middle instead of near the edges. Kind of like this:
The current algorithm defines a threshold for the number of values any centroid can hold as proportional to
4q(1-q)
. This means that computing quantiles at extreme values (near 0 or 1) gives the most accurate values, and computing the median value will gives the least accurate value (since the threshold is highest in the middle of the histogram.One thought I had was that if the algorithm was told up-front what quantiles are the most interesting to the user could this distribution be modified to concentrate the accuracy (i.e. lower the threshold) at these values? So for example if the user is most interested in the median value the thresholds equation used could be proportional to
4q(q-1)+1
.I can see two complications coming from this approach:
I would appreciate any thoughts on this approach. Does it sound like something thats worth exploring? Have I missed something that would mean this wouldn't be a good idea? If this idea seems to make sense I can work on trying it out and seeing how it performs with different distributions of data.
The text was updated successfully, but these errors were encountered: