-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
schedulerlatency: implement linear interpolation for percentiles #94193
schedulerlatency: implement linear interpolation for percentiles #94193
Conversation
pkg/util/schedulerlatency/sampler.go
Outdated
// computed off of go.scheduler_latency vs. | ||
// admission.scheduler_latency_listener.p99_nanos. | ||
// Find the target rank within bucket boundaries. | ||
// NB: The 0.5 is added for rounding purposes; it helps in cases where |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@irfansharif After doing this here, I am wondering whether my decision in util/metric.go was correct and if this is the strategy we want. By rounding it to the nearest int, I think we are assuming the data within the bucket is equally spaced. For example, if we have 5 samples in a bucket with a lower bound of 0 and an upper bound of 10. We assume the data is [0, 2.5, 5, 7.5, 10]. Here, 99th and 100th percentile would return 10 instead of 9.9 for the 99th. I initially thought this was ok, but not so sure anymore. It probably doesn't matter when buckets have large counts as this smooths itself out, but something to consider and we should keep both parts consistent with each other.
FWIW prometheus client does not do this: https://github.com/prometheus/prometheus/blob/d9162189/promql/quantile.go#L100
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't need this rounding (didn't understand it even back in #87883 (review)). Also, the uint64 type cast is having you lose precision. Pulled this branch down and tried the following, which worked better with the existing unit tests:
rank := float64(total)*p - float64(total-cumulative)
interpolator := rank / float64(h.Counts[i])
return start + (end-start)*interpolator
Is rank
the best/most precise name here? Given the return values are floating points, use require.InDelta
instead of require.Equal
in the supporting test. Could you also improve the commentary for this math? I found it difficult to follow. It looks correct to me after the following (verbose, sorry!) derivation:
// Want the `p * total`-th value, using something like:
// bucket-start + (bucket-end - bucket-start) * linear_interoplator
//
// For this next example:
// - Percentile of value at end-of-bucket: p100
// - Percentile of value at start-of-bucket: p90
// - Target percentile: p95
//
// In our example:
// linear interpolated p95 = bucket-start + (bucket-end - bucket-start) * 0.5
//
// Where 0.5 == (95 - 90) / (100 - 90) == 5 / 10 = 0.5. So:
// linear_interpolator = (p - pstart) / (pend - pstart)
//
// Taking each term:
//
// 1.
// (pend - pstart) == (total - cumulative + h.Counts[i] - total + cumulative) / total
// (pend - pstart) == h.Counts[i] / total
//
// 2.
// (p - pstart) == p - (total-cumulative) / total
// (p - pstart) == (total*p - (total-cumulative)) / total
//
// Simplifying 1 / 2:
//
// (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) / total ) / (h.Counts[i] / total)
// (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) / total ) * (total / h.Counts[i])
// (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) ) * (1 / h.Counts[i])
// (p - pstart) / (pend - pstart) == (total*p - (total-cumulative)) / h.Counts[i]
I'm likely missing some more obvious way you got to this form, but could you add it as a comment?
5a9896b
to
f1683b2
Compare
pkg/util/schedulerlatency/sampler.go
Outdated
// computed off of go.scheduler_latency vs. | ||
// admission.scheduler_latency_listener.p99_nanos. | ||
// Find the target rank within bucket boundaries. | ||
// NB: The 0.5 is added for rounding purposes; it helps in cases where |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't need this rounding (didn't understand it even back in #87883 (review)). Also, the uint64 type cast is having you lose precision. Pulled this branch down and tried the following, which worked better with the existing unit tests:
rank := float64(total)*p - float64(total-cumulative)
interpolator := rank / float64(h.Counts[i])
return start + (end-start)*interpolator
Is rank
the best/most precise name here? Given the return values are floating points, use require.InDelta
instead of require.Equal
in the supporting test. Could you also improve the commentary for this math? I found it difficult to follow. It looks correct to me after the following (verbose, sorry!) derivation:
// Want the `p * total`-th value, using something like:
// bucket-start + (bucket-end - bucket-start) * linear_interoplator
//
// For this next example:
// - Percentile of value at end-of-bucket: p100
// - Percentile of value at start-of-bucket: p90
// - Target percentile: p95
//
// In our example:
// linear interpolated p95 = bucket-start + (bucket-end - bucket-start) * 0.5
//
// Where 0.5 == (95 - 90) / (100 - 90) == 5 / 10 = 0.5. So:
// linear_interpolator = (p - pstart) / (pend - pstart)
//
// Taking each term:
//
// 1.
// (pend - pstart) == (total - cumulative + h.Counts[i] - total + cumulative) / total
// (pend - pstart) == h.Counts[i] / total
//
// 2.
// (p - pstart) == p - (total-cumulative) / total
// (p - pstart) == (total*p - (total-cumulative)) / total
//
// Simplifying 1 / 2:
//
// (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) / total ) / (h.Counts[i] / total)
// (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) / total ) * (total / h.Counts[i])
// (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) ) * (1 / h.Counts[i])
// (p - pstart) / (pend - pstart) == (total*p - (total-cumulative)) / h.Counts[i]
I'm likely missing some more obvious way you got to this form, but could you add it as a comment?
require.Equal(t, 100.0, percentile(&hist, 0.99)) // p99 | ||
|
||
// Compare values against metric.Histogram (prometheus-based implementation) | ||
promhist := metric.NewHistogram(metric.Metadata{}, time.Hour, hist.Buckets) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea to compare this to the prometheus impl. Could you take this out into a separate unit test? Either in this file+package, or in the prometheus one.
require.Equal(t, 95.0, percentile(&hist, 0.99)) // p99 | ||
require.Equal(t, 100.0, percentile(&hist, 1.00)) // pmax | ||
require.Equal(t, 0.0, percentile(&hist, 0.00)) // pmin | ||
require.Equal(t, 50.0, percentile(&hist, 0.50)) // p50 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above. After dropping the 0.5, use require.InDelta
instead. After the uint casting change, some of these values will change.
f1683b2
to
d2fe573
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some cleanup based on your suggestions. PTAL, and let me know if there are other things I should fix.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)
pkg/util/schedulerlatency/sampler.go
line 337 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
We shouldn't need this rounding (didn't understand it even back in #87883 (review)). Also, the uint64 type cast is having you lose precision. Pulled this branch down and tried the following, which worked better with the existing unit tests:
rank := float64(total)*p - float64(total-cumulative) interpolator := rank / float64(h.Counts[i]) return start + (end-start)*interpolator
Is
rank
the best/most precise name here? Given the return values are floating points, userequire.InDelta
instead ofrequire.Equal
in the supporting test. Could you also improve the commentary for this math? I found it difficult to follow. It looks correct to me after the following (verbose, sorry!) derivation:// Want the `p * total`-th value, using something like: // bucket-start + (bucket-end - bucket-start) * linear_interoplator // // For this next example: // - Percentile of value at end-of-bucket: p100 // - Percentile of value at start-of-bucket: p90 // - Target percentile: p95 // // In our example: // linear interpolated p95 = bucket-start + (bucket-end - bucket-start) * 0.5 // // Where 0.5 == (95 - 90) / (100 - 90) == 5 / 10 = 0.5. So: // linear_interpolator = (p - pstart) / (pend - pstart) // // Taking each term: // // 1. // (pend - pstart) == (total - cumulative + h.Counts[i] - total + cumulative) / total // (pend - pstart) == h.Counts[i] / total // // 2. // (p - pstart) == p - (total-cumulative) / total // (p - pstart) == (total*p - (total-cumulative)) / total // // Simplifying 1 / 2: // // (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) / total ) / (h.Counts[i] / total) // (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) / total ) * (total / h.Counts[i]) // (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) ) * (1 / h.Counts[i]) // (p - pstart) / (pend - pstart) == (total*p - (total-cumulative)) / h.Counts[i]
I'm likely missing some more obvious way you got to this form, but could you add it as a comment?
Removed casting and rounding. Also added an explanation of how I derived the formula. It is less mathematical than what you did, but if you prefer your version, I am happy to include that instead.
pkg/util/schedulerlatency/scheduler_latency_test.go
line 124 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
See comment above. After dropping the 0.5, use
require.InDelta
instead. After the uint casting change, some of these values will change.
Done.
pkg/util/schedulerlatency/scheduler_latency_test.go
line 130 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Good idea to compare this to the prometheus impl. Could you take this out into a separate unit test? Either in this file+package, or in the prometheus one.
Done.
d2fe573
to
f345d82
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the comment suggested below.
Reviewed all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi)
pkg/util/schedulerlatency/sampler.go
line 337 at r1 (raw file):
Previously, aadityasondhi (Aaditya Sondhi) wrote…
Removed casting and rounding. Also added an explanation of how I derived the formula. It is less mathematical than what you did, but if you prefer your version, I am happy to include that instead.
I think the commentary is still slightly ambiguous. We have this first formula: rank = percentile * count
, which when "scoped" to just a bucket, gives us subset_percentile = subset_rank / subset_count
. But then we're using subset_percentile
as an interpolation term in bucket_size * subset_percentile
(bucket_size
here being different from bucket_count
). Maybe I'm the only confused reader (something about "subset percentile" reads oddly), but I recommend this small amendment instead.
// Goal: We want to linear approximate the "p * total"-th value from the histogram.
// - p * total is the ordinal rank (or simply, rank) of the target value
// within the histogram bounds.
//
// Step 1: Find the bucket[i] which contains the target value.
// - This will be the largest bucket[i] where the left-bound cumulative
// count of bucket[i] <= p*total.
//
// Step 2: Determine the rank within the bucket (subset of histogram).
// - Since the bucket is a subset of the original data set, we can simply
// subtract left-bound cumulative to get the "subset_rank".
//
// Step 3: Use the "subset_rank" to interpolate the target value.
// - So we want to interpolate the "subset_rank"-th value of the subset_count
// that lies between [bucket_start, bucket_end). This is:
// bucket_start + (bucket_start - bucket_end) * linear_interpolator
// Where linear_interpolator = subset_rank / subset_count
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, the last rubber stamp didn't go through.
f345d82
to
0250c25
Compare
This patch improves the percentile calculation by using linear interpolation between bucket boundaries instead of using the mid-point. The patch also adds tests to compare values of the scheduler histograms to those found in util/metric.go which are based on the prometheus implementation to maintain consistency among metrics. Fixes cockroachdb#89829. Release note: None
0250c25
to
d82e4ac
Compare
bors r+ |
bors r- |
Canceled. |
bors r+ |
Build failed (retrying...): |
Build failed (retrying...): |
Build succeeded: |
This patch improves the percentile calculation by using linear interpolation between bucket boundaries instead of using the mid-point. The patch also adds tests to compare values of the scheduler histograms to those found in util/metric.go which are based on the prometheus implementation to maintain consistency among metrics.
Fixes #89829
Release note: None