schedulerlatency: implement linear interpolation for percentiles #94193

aadityasondhi · 2022-12-22T22:51:27Z

This patch improves the percentile calculation by using linear interpolation between bucket boundaries instead of using the mid-point. The patch also adds tests to compare values of the scheduler histograms to those found in util/metric.go which are based on the prometheus implementation to maintain consistency among metrics.

Fixes #89829

Release note: None

cockroach-teamcity · 2022-12-22T22:51:35Z

This change is

aadityasondhi · 2022-12-22T23:02:02Z

pkg/util/schedulerlatency/sampler.go

-	// computed off of go.scheduler_latency vs.
-	// admission.scheduler_latency_listener.p99_nanos.
+	// Find the target rank within bucket boundaries.
+	// NB: The 0.5 is added for rounding purposes; it helps in cases where


@irfansharif After doing this here, I am wondering whether my decision in util/metric.go was correct and if this is the strategy we want. By rounding it to the nearest int, I think we are assuming the data within the bucket is equally spaced. For example, if we have 5 samples in a bucket with a lower bound of 0 and an upper bound of 10. We assume the data is [0, 2.5, 5, 7.5, 10]. Here, 99th and 100th percentile would return 10 instead of 9.9 for the 99th. I initially thought this was ok, but not so sure anymore. It probably doesn't matter when buckets have large counts as this smooths itself out, but something to consider and we should keep both parts consistent with each other.

FWIW prometheus client does not do this: https://github.com/prometheus/prometheus/blob/d9162189/promql/quantile.go#L100

We shouldn't need this rounding (didn't understand it even back in #87883 (review)). Also, the uint64 type cast is having you lose precision. Pulled this branch down and tried the following, which worked better with the existing unit tests:

rank := float64(total)*p - float64(total-cumulative) interpolator := rank / float64(h.Counts[i]) return start + (end-start)*interpolator

Is rank the best/most precise name here? Given the return values are floating points, use require.InDelta instead of require.Equal in the supporting test. Could you also improve the commentary for this math? I found it difficult to follow. It looks correct to me after the following (verbose, sorry!) derivation:

// Want the `p * total`-th value, using something like: // bucket-start + (bucket-end - bucket-start) * linear_interoplator // // For this next example: // - Percentile of value at end-of-bucket: p100 // - Percentile of value at start-of-bucket: p90 // - Target percentile: p95 // // In our example: // linear interpolated p95 = bucket-start + (bucket-end - bucket-start) * 0.5 // // Where 0.5 == (95 - 90) / (100 - 90) == 5 / 10 = 0.5. So: // linear_interpolator = (p - pstart) / (pend - pstart) // // Taking each term: // // 1. // (pend - pstart) == (total - cumulative + h.Counts[i] - total + cumulative) / total // (pend - pstart) == h.Counts[i] / total // // 2. // (p - pstart) == p - (total-cumulative) / total // (p - pstart) == (total*p - (total-cumulative)) / total // // Simplifying 1 / 2: // // (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) / total ) / (h.Counts[i] / total) // (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) / total ) * (total / h.Counts[i]) // (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) ) * (1 / h.Counts[i]) // (p - pstart) / (pend - pstart) == (total*p - (total-cumulative)) / h.Counts[i]

I'm likely missing some more obvious way you got to this form, but could you add it as a comment?

irfansharif · 2023-01-03T22:33:08Z

pkg/util/schedulerlatency/sampler.go

-	// computed off of go.scheduler_latency vs.
-	// admission.scheduler_latency_listener.p99_nanos.
+	// Find the target rank within bucket boundaries.
+	// NB: The 0.5 is added for rounding purposes; it helps in cases where


We shouldn't need this rounding (didn't understand it even back in #87883 (review)). Also, the uint64 type cast is having you lose precision. Pulled this branch down and tried the following, which worked better with the existing unit tests:

rank := float64(total)*p - float64(total-cumulative) interpolator := rank / float64(h.Counts[i]) return start + (end-start)*interpolator

Is rank the best/most precise name here? Given the return values are floating points, use require.InDelta instead of require.Equal in the supporting test. Could you also improve the commentary for this math? I found it difficult to follow. It looks correct to me after the following (verbose, sorry!) derivation:

// Want the `p * total`-th value, using something like: // bucket-start + (bucket-end - bucket-start) * linear_interoplator // // For this next example: // - Percentile of value at end-of-bucket: p100 // - Percentile of value at start-of-bucket: p90 // - Target percentile: p95 // // In our example: // linear interpolated p95 = bucket-start + (bucket-end - bucket-start) * 0.5 // // Where 0.5 == (95 - 90) / (100 - 90) == 5 / 10 = 0.5. So: // linear_interpolator = (p - pstart) / (pend - pstart) // // Taking each term: // // 1. // (pend - pstart) == (total - cumulative + h.Counts[i] - total + cumulative) / total // (pend - pstart) == h.Counts[i] / total // // 2. // (p - pstart) == p - (total-cumulative) / total // (p - pstart) == (total*p - (total-cumulative)) / total // // Simplifying 1 / 2: // // (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) / total ) / (h.Counts[i] / total) // (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) / total ) * (total / h.Counts[i]) // (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) ) * (1 / h.Counts[i]) // (p - pstart) / (pend - pstart) == (total*p - (total-cumulative)) / h.Counts[i]

I'm likely missing some more obvious way you got to this form, but could you add it as a comment?

irfansharif · 2023-01-03T22:34:12Z

pkg/util/schedulerlatency/scheduler_latency_test.go

+		require.Equal(t, 100.0, percentile(&hist, 0.99)) // p99
+
+		// Compare values against metric.Histogram (prometheus-based implementation)
+		promhist := metric.NewHistogram(metric.Metadata{}, time.Hour, hist.Buckets)


Good idea to compare this to the prometheus impl. Could you take this out into a separate unit test? Either in this file+package, or in the prometheus one.

irfansharif · 2023-01-03T22:34:51Z

pkg/util/schedulerlatency/scheduler_latency_test.go

-		require.Equal(t, 95.0, percentile(&hist, 0.99)) // p99
+		require.Equal(t, 100.0, percentile(&hist, 1.00)) // pmax
+		require.Equal(t, 0.0, percentile(&hist, 0.00))   // pmin
+		require.Equal(t, 50.0, percentile(&hist, 0.50))  // p50


See comment above. After dropping the 0.5, use require.InDelta instead. After the uint casting change, some of these values will change.

aadityasondhi

I did some cleanup based on your suggestions. PTAL, and let me know if there are other things I should fix.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)

pkg/util/schedulerlatency/sampler.go line 337 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

We shouldn't need this rounding (didn't understand it even back in #87883 (review)). Also, the uint64 type cast is having you lose precision. Pulled this branch down and tried the following, which worked better with the existing unit tests:
rank := float64(total)*p - float64(total-cumulative)
interpolator := rank / float64(h.Counts[i])
return start + (end-start)*interpolator
Is rank the best/most precise name here? Given the return values are floating points, use require.InDelta instead of require.Equal in the supporting test. Could you also improve the commentary for this math? I found it difficult to follow. It looks correct to me after the following (verbose, sorry!) derivation:
	// Want the `p * total`-th value, using something like:
	//    bucket-start + (bucket-end - bucket-start) * linear_interoplator
	//
	// For this next example:
	//  - Percentile of value at end-of-bucket: p100
	//  - Percentile of value at start-of-bucket: p90
	//  - Target percentile: p95
	//
	// In our example:
	//    linear interpolated p95 = bucket-start + (bucket-end - bucket-start) * 0.5
	//
	// Where 0.5 == (95 - 90) / (100 - 90) == 5 / 10 = 0.5. So:
	//    linear_interpolator = (p - pstart) / (pend - pstart)
	//
	// Taking each term:
	//
	// 1.
	//   (pend - pstart) == (total - cumulative + h.Counts[i] - total + cumulative) / total
	//   (pend - pstart) == h.Counts[i] / total
	//
	// 2.
	//   (p - pstart) == p - (total-cumulative) / total
	//   (p - pstart) == (total*p - (total-cumulative)) / total
	//
	// Simplifying 1 / 2:
	//
	//   (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) / total ) / (h.Counts[i] / total)
	//   (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) / total ) * (total / h.Counts[i])
	//   (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) ) * (1 / h.Counts[i])
	//   (p - pstart) / (pend - pstart) == (total*p - (total-cumulative)) / h.Counts[i]
I'm likely missing some more obvious way you got to this form, but could you add it as a comment?

Removed casting and rounding. Also added an explanation of how I derived the formula. It is less mathematical than what you did, but if you prefer your version, I am happy to include that instead.

pkg/util/schedulerlatency/scheduler_latency_test.go line 124 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

See comment above. After dropping the 0.5, use require.InDelta instead. After the uint casting change, some of these values will change.

Done.

pkg/util/schedulerlatency/scheduler_latency_test.go line 130 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Good idea to compare this to the prometheus impl. Could you take this out into a separate unit test? Either in this file+package, or in the prometheus one.

Done.

irfansharif

With the comment suggested below.

Reviewed all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi)

pkg/util/schedulerlatency/sampler.go line 337 at r1 (raw file):

Previously, aadityasondhi (Aaditya Sondhi) wrote…

Removed casting and rounding. Also added an explanation of how I derived the formula. It is less mathematical than what you did, but if you prefer your version, I am happy to include that instead.

I think the commentary is still slightly ambiguous. We have this first formula: rank = percentile * count, which when "scoped" to just a bucket, gives us subset_percentile = subset_rank / subset_count. But then we're using subset_percentile as an interpolation term in bucket_size * subset_percentile (bucket_size here being different from bucket_count). Maybe I'm the only confused reader (something about "subset percentile" reads oddly), but I recommend this small amendment instead.

// Goal: We want to linear approximate the "p * total"-th value from the histogram.
//      - p * total is the ordinal rank (or simply, rank) of the target value
//        within the histogram bounds.
//
// Step 1: Find the bucket[i] which contains the target value.
//      - This will be the largest bucket[i] where the left-bound cumulative
//        count of bucket[i] <= p*total.
//
// Step 2: Determine the rank within the bucket (subset of histogram).
//      - Since the bucket is a subset of the original data set, we can simply
//        subtract left-bound cumulative to get the "subset_rank".
//
// Step 3: Use the "subset_rank" to interpolate the target value.
//      - So we want to interpolate the "subset_rank"-th value of the subset_count
//        that lies between [bucket_start, bucket_end). This is:
//        bucket_start + (bucket_start - bucket_end) * linear_interpolator
//        Where linear_interpolator = subset_rank / subset_count

irfansharif

Oops, the last rubber stamp didn't go through.

This patch improves the percentile calculation by using linear interpolation between bucket boundaries instead of using the mid-point. The patch also adds tests to compare values of the scheduler histograms to those found in util/metric.go which are based on the prometheus implementation to maintain consistency among metrics. Fixes cockroachdb#89829. Release note: None

aadityasondhi · 2023-01-18T20:48:43Z

bors r+

aadityasondhi · 2023-01-18T22:19:58Z

bors r-

craig · 2023-01-18T22:20:00Z

Canceled.

aadityasondhi · 2023-01-18T22:20:17Z

bors r+

craig · 2023-01-19T00:33:04Z

Build failed (retrying...):

Bazel Essential CI (Cockroach)

craig · 2023-01-19T01:50:22Z

Build failed (retrying...):

Bazel Essential CI (Cockroach)

craig · 2023-01-19T04:30:54Z

Build succeeded:

Bazel Essential CI (Cockroach)

aadityasondhi requested a review from irfansharif December 22, 2022 22:51

aadityasondhi commented Dec 22, 2022

View reviewed changes

aadityasondhi marked this pull request as ready for review December 22, 2022 23:03

aadityasondhi requested a review from a team as a code owner December 22, 2022 23:03

aadityasondhi force-pushed the aaditya/scheduler-latency-buckets-interpolation branch from 5a9896b to f1683b2 Compare December 22, 2022 23:06

irfansharif suggested changes Jan 3, 2023

View reviewed changes

aadityasondhi force-pushed the aaditya/scheduler-latency-buckets-interpolation branch from f1683b2 to d2fe573 Compare January 13, 2023 20:35

blathers-crl bot requested a review from irfansharif January 13, 2023 20:35

aadityasondhi commented Jan 13, 2023

View reviewed changes

aadityasondhi force-pushed the aaditya/scheduler-latency-buckets-interpolation branch from d2fe573 to f345d82 Compare January 13, 2023 20:39

irfansharif reviewed Jan 16, 2023

View reviewed changes

irfansharif approved these changes Jan 17, 2023

View reviewed changes

aadityasondhi force-pushed the aaditya/scheduler-latency-buckets-interpolation branch from f345d82 to 0250c25 Compare January 18, 2023 18:39

aadityasondhi force-pushed the aaditya/scheduler-latency-buckets-interpolation branch from 0250c25 to d82e4ac Compare January 18, 2023 19:13

craig bot merged commit 942b55e into cockroachdb:master Jan 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

schedulerlatency: implement linear interpolation for percentiles #94193

schedulerlatency: implement linear interpolation for percentiles #94193

aadityasondhi commented Dec 22, 2022

cockroach-teamcity commented Dec 22, 2022

aadityasondhi Dec 22, 2022

irfansharif Jan 3, 2023 •

edited

Loading

irfansharif Jan 3, 2023 •

edited

Loading

irfansharif Jan 3, 2023

irfansharif Jan 3, 2023

aadityasondhi left a comment

irfansharif left a comment

irfansharif left a comment

aadityasondhi commented Jan 18, 2023

aadityasondhi commented Jan 18, 2023

craig bot commented Jan 18, 2023

aadityasondhi commented Jan 18, 2023

craig bot commented Jan 19, 2023

craig bot commented Jan 19, 2023

craig bot commented Jan 19, 2023

schedulerlatency: implement linear interpolation for percentiles #94193

schedulerlatency: implement linear interpolation for percentiles #94193

Conversation

aadityasondhi commented Dec 22, 2022

cockroach-teamcity commented Dec 22, 2022

aadityasondhi Dec 22, 2022

Choose a reason for hiding this comment

irfansharif Jan 3, 2023 • edited Loading

Choose a reason for hiding this comment

irfansharif Jan 3, 2023 • edited Loading

Choose a reason for hiding this comment

irfansharif Jan 3, 2023

Choose a reason for hiding this comment

irfansharif Jan 3, 2023

Choose a reason for hiding this comment

aadityasondhi left a comment

Choose a reason for hiding this comment

irfansharif left a comment

Choose a reason for hiding this comment

irfansharif left a comment

Choose a reason for hiding this comment

aadityasondhi commented Jan 18, 2023

aadityasondhi commented Jan 18, 2023

craig bot commented Jan 18, 2023

aadityasondhi commented Jan 18, 2023

craig bot commented Jan 19, 2023

craig bot commented Jan 19, 2023

craig bot commented Jan 19, 2023

irfansharif Jan 3, 2023 •

edited

Loading

irfansharif Jan 3, 2023 •

edited

Loading