Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjusting the centroid threshold values to obtain better accuracy at interesting values #75

Closed
colings86 opened this issue Oct 26, 2016 · 1 comment

Comments

@colings86
Copy link

The current algorithm defines a threshold for the number of values any centroid can hold as proportional to 4q(1-q). This means that computing quantiles at extreme values (near 0 or 1) gives the most accurate values, and computing the median value will gives the least accurate value (since the threshold is highest in the middle of the histogram.

4q 1-q

One thought I had was that if the algorithm was told up-front what quantiles are the most interesting to the user could this distribution be modified to concentrate the accuracy (i.e. lower the threshold) at these values? So for example if the user is most interested in the median value the thresholds equation used could be proportional to 4q(q-1)+1.

4q q-1 1

I can see two complications coming from this approach:

  1. If the user is interested in multiple quantile values the calculation of the threshold value could become complex. Of course there is nothing stopping the user from asking the algorithm for other quantile values, they would just receive lower accuracy results for these other values compared with the value the distribution is centred on. So maybe to start it could be restricted to a single 'interesting' quantile.
  2. For use-cases where the values recorded in the histogram follow a normal Gaussian distribution (or approximate to this distribution) the number of compressions required will be higher than with the current distribution. The worst case of this would be when the user is most interested in the median value and the distribution is normal Gaussian. In this case the values recorded will be concentrated in the area that has the lowest threshold so will cause a lot of new centroids to be created and therefore more merge events (on the flip side of this the algorithm should preform much less merges in the ordered values case). This may just be the cost that needs to be paid to have high accuracy at the median?

I would appreciate any thoughts on this approach. Does it sound like something thats worth exploring? Have I missed something that would mean this wouldn't be a good idea? If this idea seems to make sense I can work on trying it out and seeing how it performs with different distributions of data.

@tdunning
Copy link
Owner

tdunning commented Apr 20, 2017

Sorry for the delay in responding.

Yes. You can do exactly what you say.

In the current implementations, you should observe some constraints on your mapping function. The basic idea is that you need to make sure that the total space for the digest is bounded. The condition that allows this is that the bounding has to have the right properties. The gotcha is that the minimum count for a centroid is 1 so for some functions the number of centroids with a single sample grows unboundedly. This is true, for instance, for the original q (1-q) constraint. On the other hand, if you use sqrt(q (1-q)) as your constraint, the size is bounded.

In the current forms of the algorithm itself, the way that this works is that there is something called an index function that maps q to an index k. A centroid is legal if it goes from q_0 to q_1 and if N(k(q_1) - k(q_0)) < 1 except that we always allow centroids with a single element. If you look at the copy of the paper in the master branch, you will see a bit of what I mean. It isn't finished, but it gets pretty far down that line.

The normal index curve looks like this:
image

For your example, you could use an index curve where the steep part is in the middle instead of near the edges. Kind of like this:

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants