Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[8.8] Add note about random sampler consistency (#107479) #107525

Merged
merged 1 commit into from
Apr 16, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -94,6 +94,17 @@ higher sampling rates, the relative error is still low.

NOTE: This represents the result of aggregations against a typical positively skewed APM data set which also has outliers in the upper tail. The linear dependence of the relative error on the sample size is found to hold widely, but the slope depends on the variation in the quantity being aggregated. As such, the variance in your own data may
cause relative error rates to increase or decrease at a different rate.
[[random-sampler-consistency]]
==== Random sampler consistency

For a given `probability` and `seed`, the random sampler aggregation is consistent when sampling unchanged data from the same shard.
However, this is background random sampling if a particular document is included in the sampled set or not is dependent on current number of segments.

Meaning, replica vs. primary shards could return different values as different particular documents are sampled.

If the shard changes in via doc addition, update, deletion, or segment merging, the particular documents sampled could change, and thus the resulting statistics could change.

The resulting statistics used from the random sampler aggregation are approximate and should be treated as such.

[[random-sampler-special-cases]]
==== Random sampling special cases
Expand All @@ -105,6 +116,6 @@ for a bucket is `10,000` with `probability: 0.1`, the actual number of documents

An exception to this is <<search-aggregations-metrics-cardinality-aggregation, cardinality aggregation>>. Unique item
counts are not suitable for automatic scaling. When interpreting the cardinality count, compare it
to the number of sampled docs provided in the top level `doc_count` within the random_sampler aggregation. It gives
to the number of sampled docs provided in the top level `doc_count` within the random_sampler aggregation. It gives
you an idea of unique values as a percentage of total values. It may not reflect, however, the exact number of unique values
for the given field.
Loading