Improvements to the histogram field and its interactions with aggregations
#74213
Comments
Pinging @elastic/es-analytics-geo (Team:Analytics)
In an effort to see if this would be worth it (specifically for
Methodology and data
I built multiple "ranges" of random double values. Then ran multiple test passes on
Then to compare, I checked the difference in document count between the
Here is the gist of the test files
The results seem to indicate that there is NO significant difference between the way I am interpolating the
This indicates that interpolation is not useful for
Let me know if any of this seems off...
Visualization of absolute error for each range bucket for all the test runs. Smaller is better:
The key method for the HDR interpolation is (this is not particularly production ready, I was just trying to put something together to see if interpolation gave us better results):
public InternalAggregation[] buildAggregations(long[] owningBucketOrds) throws IOException {
    InternalAggregation[] results = new InternalAggregation[owningBucketOrds.length];
    for (int owningOrdIdx = 0; owningOrdIdx < owningBucketOrds.length; owningOrdIdx++) {
        List<org.elasticsearch.search.aggregations.bucket.range.Range.Bucket> buckets = new ArrayList<>(ranges.length);
        // Prototype shortcut: this always reads the histogram at ordinal 0.
        DoubleHistogram hdrHisto = hdrHistos.get(0);
        final double min = hdrHisto.getMinValue();
        final double max = hdrHisto.getMaxValue();
        for (Range range : ranges) {
            long count = 0;
            try {
                if (range.getFrom() <= min && range.getTo() >= max) {
                    // Range covers every recorded value.
                    count = hdrHisto.getTotalCount();
                } else if (range.getFrom() > max || range.getTo() < min) {
                    // Range lies entirely outside the recorded values: count stays 0.
                } else if (range.getFrom() != range.getTo()) {
                    double from = Math.max(range.getFrom(), min);
                    double to = Math.min(range.getTo(), max);
                    double fromNext = hdrHisto.highestEquivalentValue(from);
                    double toDown = hdrHisto.lowestEquivalentValue(to);
                    double fullyCapturedBuckets = hdrHisto.getCountBetweenValues(fromNext, toDown);
                    // Fraction of the HDR bucket containing `from` that the range overlaps.
                    double fromSize = hdrHisto.sizeOfEquivalentValueRange(from);
                    double fromIntersection = (fromNext - from) / fromSize;
                    double fromIntersectionCount = Math.max(
                        (hdrHisto.getCountAtValue(from) - hdrHisto.getCountAtValue(fromNext)) * fromIntersection,
                        0.0
                    );
                    // Fraction of the HDR bucket containing `to` that the range overlaps.
                    double toSize = hdrHisto.sizeOfEquivalentValueRange(to);
                    double toIntersection = (to - toDown) / toSize;
                    double toIntersectionCount = Math.max(
                        (hdrHisto.getCountAtValue(to) - hdrHisto.getCountAtValue(toDown)) * toIntersection,
                        0.0
                    );
                    // I am not sure why I continually have fence post errors
                    // Without this, the aggregated value is usually too high :(
                    count = Math.round((fullyCapturedBuckets + fromIntersectionCount + toIntersectionCount)) - 1;
                }
            } catch (ArrayIndexOutOfBoundsException ex) {
                //???
                count = 0L;
            }
            buckets.add(rangeFactory.createBucket(
                range.getKey(),
                range.getFrom(),
                range.getTo(),
                count,
                InternalAggregations.EMPTY, keyed, format)
            );
        }
        results[owningOrdIdx] = rangeFactory.create(name, buckets, format, keyed, metadata());
    }
    return results;
}
}
Folks who might be interested:
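For readers who want the interpolation idea from the method above in isolation, here is a minimal, self-contained sketch. It is not the prototype's code and does not use the HdrHistogram API: the class name `InterpolatedRangeCount`, the explicit `lower`/`upper` bucket bounds, and the assumption that values are spread uniformly inside each bucket are all hypothetical choices made for illustration.

```java
// Hypothetical sketch: estimate a range count from bucketed data by
// linear interpolation, assuming a uniform spread inside each bucket.
final class InterpolatedRangeCount {
    private InterpolatedRangeCount() {}

    /**
     * Estimates how many recorded values fall in [from, to).
     * Bucket i covers [lower[i], upper[i]) and holds counts[i] values.
     */
    static double estimate(double[] lower, double[] upper, long[] counts, double from, double to) {
        double total = 0.0;
        for (int i = 0; i < counts.length; i++) {
            double width = upper[i] - lower[i];
            if (width <= 0) {
                continue; // skip degenerate buckets
            }
            // Length of the intersection between the bucket and the requested range.
            double overlap = Math.min(to, upper[i]) - Math.max(from, lower[i]);
            if (overlap > 0) {
                // Fully covered buckets contribute their whole count,
                // partially covered ones a proportional share.
                total += counts[i] * (overlap / width);
            }
        }
        return total;
    }
}
```

With buckets [0, 10), [10, 20), [20, 30) holding 4, 10 and 6 values, a range of [5, 25) takes half of the first bucket, all of the second and half of the third. Whether that proportional share is closer to the truth than simply counting whole buckets depends on how the data is distributed inside each bucket.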
@benwtrent given your analysis (thanks btw) showing no real advantage to interpolation... we could close this one out - or would you like to have a team-discuss on this? thx
@wchaparro interpolation when it comes to
But, something that would be useful is
Right now, to make sure the histogram values return sane results, you have to make sure that:
This is problematic as the USER of the histogram data may not be the same individual/org that set it up and may not know the internals of how it was created.
++ recording the histogram field params in field metadata, or something along those lines, would be helpful for APM. Histograms might come from some third-party instrumentation which a user won't know the details of. This would help provide a sensible default for Lens, maybe making the UI selection in elastic/kibana#98499 unnecessary.
In the documentation of `histogram` fields we have the following snippet:
This is very flexible but may cause worse aggregation results in the long run.
An example of this is the `range` aggregation.
A naive way (and possibly the best way, unsure) to implement a `range` aggregation over a `histogram` field is to:
A different way would be an attempt to rebuild the appropriate statistical distribution from the histogram results. If we knew the histogram was built utilizing the HDR structure, could we implement ranges similarly?
The HDR methodology MAY provide better results. I am specifically thinking of the following situation:
With a range like:
Should this range return values? Would an interpolation of values or seeing where the range values would fit in the HDR help?
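To make the question concrete, here is a minimal sketch of one plausible reading of the naive approach: treat each entry of the histogram field's `values` array as a single point and add its entry from `counts` whenever the point falls inside the range (`from` inclusive, `to` exclusive, as in the `range` aggregation). The class name `NaiveHistogramRange` and the sample numbers in the note below are hypothetical and are not taken from the issue.

```java
// Hypothetical sketch of a naive range count over a histogram field,
// where values[i] is a representative value and counts[i] its count.
final class NaiveHistogramRange {
    private NaiveHistogramRange() {}

    /** Sums the counts of every histogram value v with from <= v < to. */
    static long count(double[] values, long[] counts, double from, double to) {
        long total = 0;
        for (int i = 0; i < values.length; i++) {
            if (values[i] >= from && values[i] < to) {
                total += counts[i];
            }
        }
        return total;
    }
}
```

With made-up values `{1.0, 5.0, 10.0}` and counts `{3, 7, 2}`, a range of [1, 6) picks up the first two entries, while a range of [2, 4) falls between two histogram values and matches nothing, even if the original data contained points in that interval. That gap is where interpolation or knowledge of the HDR layout might change the answer.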