Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make TDigestState configurable #96787

Closed
wants to merge 137 commits into from
Closed

Conversation

kkrik-es
Copy link
Contributor

Add SortingDigest as a simple structure for percentile calculations that
tracks all data points in a sorted array. This is a fast and perfectly
accurate solution that leads to bloated memory allocation.

Add HybridDigest that uses SortingDigest for small sample counts, then
switches to MergingDigest. This approach delivers extreme
performance and accuracy for small populations while scaling
indefinitely and maintaining acceptable performance and accuracy with
constant memory allocation (15kB by default).

Make TDigestState a TDigest decorator instead of a subclass. Its
factories hide the details of the underlying TDigest implementation,
offering a default and an optimizer for accuracy version. Both point
to AVLTreeDigest for now; the default will be switched to HybridDigest
in a follow-up PR.

Related to #96086.

kkrik-es and others added 30 commits May 9, 2023 10:23
More work needed for TDigestPercentile*Tests and the TDigestTest (and
the rest of the tests) in the tdigest lib to pass.
Remove wrong asserts from tests and MergingDigest.
Remove redundant serializing interfaces from the library.
These tests don't address compatibility issues in mixed cluster tests as
the latter contain a mix of older and newer nodes, so the output depends
on which node is picked as a data node since the forked TDigest library
is not backwards compatible (produces slightly different results).
kkrik-es added 28 commits June 13, 2023 09:53
The latest library version is 50% slower and less accurate, as verified
by ComparisonTests.
This helps with mixed cluster tests, where some of the tests where
blocked.
Update quantile calculations between centroids and min/max values to
match v.3.2.
The SortingDigest tracks all samples in an ArrayList that
gets sorted for quantile calculations. This approach
provides perfectly accurate results and is the most
efficient implementation for up to millions of samples,
at the cost of bloated memory footprint.

The HybridDigest uses a SortingDigest for small sample
populations, then switches to a MergingDigest. This
approach combines to the best performance and results for
small sample counts with very good performance and
acceptable accuracy for effectively unbounded sample
counts.
These will be submitted in a follow-up PR for enabling MergingDigest.
Delete dead and commented out code, make the remaining tests run
reasonably fast. Remove unused annotations, esp. SuppressWarnings.
Add SortingDigest as a simple structure for percentile calculations that
tracks all data points in a sorted array. This is a fast and perfectly
accurate solution that leads to bloated memory allocation.

Add HybridDigest that uses SortingDigest for small sample counts, then
switches to MergingDigest. This approach delivers extreme
performance and accuracy for small populations while scaling
indefinitely and maintaining acceptable performance and accuracy with
constant memory allocation (15kB by default).

Provide knobs to switch back to AVLTreeDigest, either per query or
through ClusterSettings.
# Conflicts:
#	benchmarks/src/main/java/org/elasticsearch/benchmark/tdigest/TDigestBench.java
#	libs/tdigest/src/main/java/org/elasticsearch/tdigest/AVLGroupTree.java
#	libs/tdigest/src/main/java/org/elasticsearch/tdigest/AVLTreeDigest.java
#	libs/tdigest/src/main/java/org/elasticsearch/tdigest/AbstractTDigest.java
#	libs/tdigest/src/main/java/org/elasticsearch/tdigest/Dist.java
#	libs/tdigest/src/main/java/org/elasticsearch/tdigest/MergingDigest.java
#	libs/tdigest/src/main/java/org/elasticsearch/tdigest/TDigest.java
#	libs/tdigest/src/test/java/org/elasticsearch/tdigest/ComparisonTests.java
#	libs/tdigest/src/test/java/org/elasticsearch/tdigest/MedianTests.java
#	libs/tdigest/src/test/java/org/elasticsearch/tdigest/MergingDigestTests.java
#	libs/tdigest/src/test/java/org/elasticsearch/tdigest/ScaleFunctionTests.java
#	libs/tdigest/src/test/java/org/elasticsearch/tdigest/TDigestTests.java
#	modules/aggregations/build.gradle
#	modules/aggregations/src/yamlRestTest/resources/rest-api-spec/test/aggregations/percentile_ranks_tdigest_metric.yml
#	modules/aggregations/src/yamlRestTest/resources/rest-api-spec/test/aggregations/percentiles_tdigest_metric.yml
#	server/src/main/java/org/elasticsearch/search/aggregations/metrics/EmptyTDigestState.java
#	server/src/main/java/org/elasticsearch/search/aggregations/metrics/TDigestState.java
#	x-pack/plugin/analytics/src/test/java/org/elasticsearch/xpack/analytics/AnalyticsTestsUtils.java
#	x-pack/plugin/build.gradle
#	x-pack/plugin/sql/qa/server/src/main/resources/unsigned-long.csv-spec
#	x-pack/plugin/src/yamlRestTest/resources/rest-api-spec/test/ml/evaluate_data_frame.yml
#	x-pack/qa/xpack-prefix-rest-compat/build.gradle
…903-merging

# Conflicts:
#	server/src/test/java/org/elasticsearch/search/aggregations/metrics/InternalTDigestPercentilesRanksTests.java
#	server/src/test/java/org/elasticsearch/search/aggregations/metrics/InternalTDigestPercentilesTests.java
@kkrik-es kkrik-es closed this Jun 13, 2023
@kkrik-es kkrik-es deleted the fix/95903-merging branch June 13, 2023 12:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants