
Improvements to the histogram field and its interactions with aggregations #74213

Open
benwtrent opened this issue Jun 16, 2021 · 5 comments
Labels
:Analytics/Aggregations Aggregations >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

@benwtrent
Member

In the documentation of histogram fields we have the following snippet:

When using a histogram as part of an aggregation, the accuracy of the results will depend on how the histogram was constructed. It is important to consider the percentiles aggregation mode that will be used to build it.
...<snip> description of t-digest and HDRHistogram </snip>
The histogram field is "algorithm agnostic" and does not store data specific to either T-Digest or HDRHistogram. While this means the field can technically be aggregated with either algorithm, in practice the user should choose one algorithm and index data in that manner (e.g. centroids for T-Digest or intervals for HDRHistogram) to ensure the best accuracy.

This is very flexible but may cause worse aggregation results in the long run.
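For reference, this is roughly how such a histogram is indexed today (index and field names here are hypothetical; per the histogram field docs, `values` must be ascending and `counts` the same length):

```
PUT histo-index
{
  "mappings": {
    "properties": {
      "latency_histo": { "type": "histogram" }
    }
  }
}

PUT histo-index/_doc/1
{
  "latency_histo": {
    "values": [0.2, 0.4, 0.6, 0.8],
    "counts": [4, 3, 5, 10]
  }
}
```

Nothing in the mapping records whether those values are T-Digest centroids or HDR-style interval bounds.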

An example of this is the range aggregation.

A naive way (and possibly the best way, I'm unsure) to implement a range aggregation over a histogram field is to:

  • iterate the histogram values
  • check if the bucket value is in a range
  • increment the range with that count
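The steps above can be sketched in plain Java (this is not actual Elasticsearch code; class and method names are made up for illustration, and range buckets follow the usual inclusive-`from` / exclusive-`to` convention):

```java
// Naive range counting over a histogram field's parallel arrays:
// count every histogram value that falls in [from, to), weighted by its count.
public class NaiveHistogramRange {
    static long countInRange(double[] values, long[] counts, double from, double to) {
        long total = 0;
        for (int i = 0; i < values.length; i++) {
            // 'from' is inclusive, 'to' is exclusive, matching range agg semantics
            if (values[i] >= from && values[i] < to) {
                total += counts[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        double[] values = {0.2, 0.4, 0.6, 0.8};
        long[] counts = {4, 3, 5, 10};
        // values 0.4 and 0.6 fall in [0.3, 0.7): 3 + 5
        System.out.println(countInRange(values, counts, 0.3, 0.7)); // 8
    }
}
```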

A different way would be to attempt to rebuild the appropriate statistical distribution from the histogram values. If we knew the histogram was built utilizing the HDR structure, could we implement ranges similarly?

  • Iterate the histogram values
  • Add values to an HDR data structure
  • Look in the HDR structure by seeing the count "between values" for the ranges
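As a rough stand-in for that flow, here is the "rebuild a structure, then query counts between values" shape using a `TreeMap` (an assumption for illustration only: a real implementation would record into an `org.HdrHistogram.DoubleHistogram` via `recordValueWithCount` and query with `getCountBetweenValues`, which adds the equivalent-value bucketing a `TreeMap` lacks):

```java
import java.util.TreeMap;

// Stand-in sketch: rebuild a queryable structure from the histogram
// field's parallel arrays, then ask it for the count between two values.
public class RebuiltHistogramRange {
    static long countBetween(double[] values, long[] counts, double from, double to) {
        // step 1-2: iterate the histogram values and add them to the structure
        TreeMap<Double, Long> rebuilt = new TreeMap<>();
        for (int i = 0; i < values.length; i++) {
            rebuilt.merge(values[i], counts[i], Long::sum);
        }
        // step 3: count "between values" for the range [from, to)
        long total = 0;
        for (long c : rebuilt.subMap(from, true, to, false).values()) {
            total += c;
        }
        return total;
    }
}
```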

The HDR methodology MAY provide better results. I am specifically thinking of the following situation:

"histogram": {
  "values": [0.2, 0.4, 0.6, 0.8],
  "counts": [4, 3, 5, 10]
}

With a range like:

"range": {
  "ranges": [{ "from": 0.3, "to": 0.4 }]
}

Should this range return any values? Would interpolating the values, or seeing where the range endpoints would fit in the HDR structure, help?
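To make the question concrete, here is one hypothetical interpolation scheme (purely illustrative; this is not what Elasticsearch or HDRHistogram actually does): spread each histogram value's count uniformly over the interval reaching halfway to its neighbors, then sum the overlap with the range. Under that assumption, [0.3, 0.4) would be credited with half of the 0.4 value's mass, an estimate of 1.5 docs, rather than the naive answer of 0.

```java
public class InterpolatedHistogramRange {
    // Hypothetical: treat each value's count as uniform over the span from
    // halfway to the previous value up to halfway to the next value, then
    // credit each range with its fractional overlap of that span.
    static double interpolatedCount(double[] values, long[] counts, double from, double to) {
        double estimate = 0;
        for (int i = 0; i < values.length; i++) {
            double lo = i > 0 ? (values[i] + values[i - 1]) / 2 : values[i];
            double hi = i < values.length - 1 ? (values[i] + values[i + 1]) / 2 : values[i];
            double width = hi - lo;
            if (width <= 0) {
                continue; // degenerate single-value histogram
            }
            double overlap = Math.min(to, hi) - Math.max(from, lo);
            if (overlap > 0) {
                estimate += counts[i] * overlap / width;
            }
        }
        return estimate;
    }
}
```

For the data above, the 0.4 value's span is [0.3, 0.5), of which [0.3, 0.4) covers half, so the estimate is 3 × 0.5 = 1.5.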

@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jun 16, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@benwtrent
Member Author

In an effort to see if this would be worth it (specifically for range aggregations; it may still be valuable for other reasons, like automatically choosing the correct percentiles config), here is some data and the tests I have run.
Terminology:

naive: no interpolation of range values. Meaning, if the histogram's mapped value is in the range, we count its doc_count.
hdr: the aggregation rebuilds an HDR histogram and attempts to interpolate values that may not exactly cover the histogram values.

Methodology and data

I built multiple "ranges" of random double values, then ran multiple test passes of hdr and naive range bucketing over both the raw range values and the histogram values. I created one histogram doc for each double "range".

To compare, I checked the difference in document count between the range buckets over the raw docs and the range buckets over the histogram docs for the same ranges.

Here is the resulting data

Here is the gist of the test files

It seems to show that there is NO significant difference between interpolating the histogram values and using the naive way. If anything, the hdr interpolation provides WORSE results than the naive implementation for all the test cases (though the difference is small).

This indicates that interpolation is not useful for range aggs over histogram fields.

Let me know if any of this seems off...

Visualization of absolute error for each range bucket for all the test runs. Smaller is better:

The key method for the HDR interpolation is below (this is not particularly production ready; I was just trying to put something together to see if interpolation gave us better results):

public InternalAggregation[] buildAggregations(long[] owningBucketOrds) throws IOException {
    InternalAggregation[] results = new InternalAggregation[owningBucketOrds.length];
    for (int owningOrdIdx = 0; owningOrdIdx < owningBucketOrds.length; owningOrdIdx++) {
        List<org.elasticsearch.search.aggregations.bucket.range.Range.Bucket> buckets = new ArrayList<>(ranges.length);
        DoubleHistogram hdrHisto = hdrHistos.get(0);
        final double min = hdrHisto.getMinValue();
        final double max = hdrHisto.getMaxValue();
        for (Range range : ranges) {
            long count = 0;
            try {
                if (range.getFrom() <= min && range.getTo() >= max) {
                    // range covers the whole histogram
                    count = hdrHisto.getTotalCount();
                } else if (range.getFrom() > max || range.getTo() < min) {
                    // range entirely outside the histogram; count stays 0
                } else if (range.getFrom() != range.getTo()) {
                    double from = Math.max(range.getFrom(), min);
                    double to = Math.min(range.getTo(), max);
                    double fromNext = hdrHisto.highestEquivalentValue(from);
                    double toDown = hdrHisto.lowestEquivalentValue(to);
                    double fullyCapturedBuckets = hdrHisto.getCountBetweenValues(fromNext, toDown);

                    // linearly interpolate the partially covered HDR bucket at each end
                    double fromSize = hdrHisto.sizeOfEquivalentValueRange(from);
                    double fromIntersection = (fromNext - from) / fromSize;
                    double fromIntersectionCount = Math.max(
                        (hdrHisto.getCountAtValue(from) - hdrHisto.getCountAtValue(fromNext)) * fromIntersection,
                        0.0
                    );

                    double toSize = hdrHisto.sizeOfEquivalentValueRange(to);
                    double toIntersection = (to - toDown) / toSize;
                    double toIntersectionCount = Math.max(
                        (hdrHisto.getCountAtValue(to) - hdrHisto.getCountAtValue(toDown)) * toIntersection,
                        0.0
                    );

                    // I am not sure why I continually have fence post errors
                    // Without this, the aggregated value is usually too high :(
                    count = Math.round(fullyCapturedBuckets + fromIntersectionCount + toIntersectionCount) - 1;
                }
            } catch (ArrayIndexOutOfBoundsException ex) {
                // ???
                count = 0L;
            }
            buckets.add(rangeFactory.createBucket(
                range.getKey(),
                range.getFrom(),
                range.getTo(),
                count,
                InternalAggregations.EMPTY, keyed, format
            ));
        }
        results[owningOrdIdx] = rangeFactory.create(name, buckets, format, keyed, metadata());
    }
    return results;
}

Folks who might be interested:

@tveasey @csoulios

@wchaparro
Member

@benwtrent given your analysis (thanks btw) showing no real advantage to interpolation... we could close this one out - or would you like to have a team-discuss on this? thx

@benwtrent
Member Author

@wchaparro interpolation when it comes to range is probably not useful.

But, something that would be useful is percentiles automatically applying the appropriate settings based on some indexed values.

Right now, to make sure the histogram values return sane results, you have to make sure that:

  • Your percentiles agg is the right kind (t-digest vs. hdr)
  • and the internal settings are the same

This is problematic as the USER of the histogram data may not be the same individual/org that set it up and may not know the internals of how it was created.
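For illustration, these are the two knobs that currently have to be matched by hand (field and index names are hypothetical). If the histogram was indexed from HDR data recorded with 3 significant digits, the query has to spell that out:

```
GET histo-index/_search
{
  "aggs": {
    "latency_pcts": {
      "percentiles": {
        "field": "latency_histo",
        "percents": [95, 99],
        "hdr": {
          "number_of_significant_value_digits": 3
        }
      }
    }
  }
}
```

If the field's mapping or metadata recorded these settings at index time, the aggregation could pick them up automatically instead of relying on the user to know them.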

@axw
Member

axw commented Feb 23, 2022

++ recording the histogram field params in field metadata, or something along those lines, would be helpful for APM. Histograms might come from some third-party instrumentation which a user won't know the details of.

This would help provide a sensible default for Lens, maybe making the UI selection in elastic/kibana#98499 unnecessary.
