Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

schedulerlatency: implement linear interpolation for percentiles #94193

Conversation

aadityasondhi
Copy link
Collaborator

This patch improves the percentile calculation by using linear interpolation between bucket boundaries instead of using the mid-point. The patch also adds tests to compare values of the scheduler histograms to those found in util/metric.go which are based on the prometheus implementation to maintain consistency among metrics.

Fixes #89829

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

// computed off of go.scheduler_latency vs.
// admission.scheduler_latency_listener.p99_nanos.
// Find the target rank within bucket boundaries.
// NB: The 0.5 is added for rounding purposes; it helps in cases where
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@irfansharif After doing this here, I am wondering whether my decision in util/metric.go was correct and if this is the strategy we want. By rounding it to the nearest int, I think we are assuming the data within the bucket is equally spaced. For example, if we have 5 samples in a bucket with a lower bound of 0 and an upper bound of 10. We assume the data is [0, 2.5, 5, 7.5, 10]. Here, 99th and 100th percentile would return 10 instead of 9.9 for the 99th. I initially thought this was ok, but not so sure anymore. It probably doesn't matter when buckets have large counts as this smooths itself out, but something to consider and we should keep both parts consistent with each other.

FWIW prometheus client does not do this: https://github.com/prometheus/prometheus/blob/d9162189/promql/quantile.go#L100

Copy link
Contributor

@irfansharif irfansharif Jan 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't need this rounding (didn't understand it even back in #87883 (review)). Also, the uint64 type cast is having you lose precision. Pulled this branch down and tried the following, which worked better with the existing unit tests:

rank := float64(total)*p - float64(total-cumulative)
interpolator := rank / float64(h.Counts[i])
return start + (end-start)*interpolator

Is rank the best/most precise name here? Given the return values are floating points, use require.InDelta instead of require.Equal in the supporting test. Could you also improve the commentary for this math? I found it difficult to follow. It looks correct to me after the following (verbose, sorry!) derivation:

	// Want the `p * total`-th value, using something like:
	//    bucket-start + (bucket-end - bucket-start) * linear_interoplator
	//
	// For this next example:
	//  - Percentile of value at end-of-bucket: p100
	//  - Percentile of value at start-of-bucket: p90
	//  - Target percentile: p95
	//
	// In our example:
	//    linear interpolated p95 = bucket-start + (bucket-end - bucket-start) * 0.5
	//
	// Where 0.5 == (95 - 90) / (100 - 90) == 5 / 10 = 0.5. So:
	//    linear_interpolator = (p - pstart) / (pend - pstart)
	//
	// Taking each term:
	//
	// 1.
	//   (pend - pstart) == (total - cumulative + h.Counts[i] - total + cumulative) / total
	//   (pend - pstart) == h.Counts[i] / total
	//
	// 2.
	//   (p - pstart) == p - (total-cumulative) / total
	//   (p - pstart) == (total*p - (total-cumulative)) / total
	//
	// Simplifying 1 / 2:
	//
	//   (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) / total ) / (h.Counts[i] / total)
	//   (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) / total ) * (total / h.Counts[i])
	//   (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) ) * (1 / h.Counts[i])
	//   (p - pstart) / (pend - pstart) == (total*p - (total-cumulative)) / h.Counts[i]

I'm likely missing some more obvious way you got to this form, but could you add it as a comment?

@aadityasondhi aadityasondhi marked this pull request as ready for review December 22, 2022 23:03
@aadityasondhi aadityasondhi requested a review from a team as a code owner December 22, 2022 23:03
@aadityasondhi aadityasondhi force-pushed the aaditya/scheduler-latency-buckets-interpolation branch from 5a9896b to f1683b2 Compare December 22, 2022 23:06
// computed off of go.scheduler_latency vs.
// admission.scheduler_latency_listener.p99_nanos.
// Find the target rank within bucket boundaries.
// NB: The 0.5 is added for rounding purposes; it helps in cases where
Copy link
Contributor

@irfansharif irfansharif Jan 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't need this rounding (didn't understand it even back in #87883 (review)). Also, the uint64 type cast is having you lose precision. Pulled this branch down and tried the following, which worked better with the existing unit tests:

rank := float64(total)*p - float64(total-cumulative)
interpolator := rank / float64(h.Counts[i])
return start + (end-start)*interpolator

Is rank the best/most precise name here? Given the return values are floating points, use require.InDelta instead of require.Equal in the supporting test. Could you also improve the commentary for this math? I found it difficult to follow. It looks correct to me after the following (verbose, sorry!) derivation:

	// Want the `p * total`-th value, using something like:
	//    bucket-start + (bucket-end - bucket-start) * linear_interoplator
	//
	// For this next example:
	//  - Percentile of value at end-of-bucket: p100
	//  - Percentile of value at start-of-bucket: p90
	//  - Target percentile: p95
	//
	// In our example:
	//    linear interpolated p95 = bucket-start + (bucket-end - bucket-start) * 0.5
	//
	// Where 0.5 == (95 - 90) / (100 - 90) == 5 / 10 = 0.5. So:
	//    linear_interpolator = (p - pstart) / (pend - pstart)
	//
	// Taking each term:
	//
	// 1.
	//   (pend - pstart) == (total - cumulative + h.Counts[i] - total + cumulative) / total
	//   (pend - pstart) == h.Counts[i] / total
	//
	// 2.
	//   (p - pstart) == p - (total-cumulative) / total
	//   (p - pstart) == (total*p - (total-cumulative)) / total
	//
	// Simplifying 1 / 2:
	//
	//   (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) / total ) / (h.Counts[i] / total)
	//   (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) / total ) * (total / h.Counts[i])
	//   (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) ) * (1 / h.Counts[i])
	//   (p - pstart) / (pend - pstart) == (total*p - (total-cumulative)) / h.Counts[i]

I'm likely missing some more obvious way you got to this form, but could you add it as a comment?

require.Equal(t, 100.0, percentile(&hist, 0.99)) // p99

// Compare values against metric.Histogram (prometheus-based implementation)
promhist := metric.NewHistogram(metric.Metadata{}, time.Hour, hist.Buckets)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea to compare this to the prometheus impl. Could you take this out into a separate unit test? Either in this file+package, or in the prometheus one.

require.Equal(t, 95.0, percentile(&hist, 0.99)) // p99
require.Equal(t, 100.0, percentile(&hist, 1.00)) // pmax
require.Equal(t, 0.0, percentile(&hist, 0.00)) // pmin
require.Equal(t, 50.0, percentile(&hist, 0.50)) // p50
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above. After dropping the 0.5, use require.InDelta instead. After the uint casting change, some of these values will change.

@aadityasondhi aadityasondhi force-pushed the aaditya/scheduler-latency-buckets-interpolation branch from f1683b2 to d2fe573 Compare January 13, 2023 20:35
@blathers-crl blathers-crl bot requested a review from irfansharif January 13, 2023 20:35
Copy link
Collaborator Author

@aadityasondhi aadityasondhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some cleanup based on your suggestions. PTAL, and let me know if there are other things I should fix.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)


pkg/util/schedulerlatency/sampler.go line 337 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

We shouldn't need this rounding (didn't understand it even back in #87883 (review)). Also, the uint64 type cast is having you lose precision. Pulled this branch down and tried the following, which worked better with the existing unit tests:

rank := float64(total)*p - float64(total-cumulative)
interpolator := rank / float64(h.Counts[i])
return start + (end-start)*interpolator

Is rank the best/most precise name here? Given the return values are floating points, use require.InDelta instead of require.Equal in the supporting test. Could you also improve the commentary for this math? I found it difficult to follow. It looks correct to me after the following (verbose, sorry!) derivation:

	// Want the `p * total`-th value, using something like:
	//    bucket-start + (bucket-end - bucket-start) * linear_interoplator
	//
	// For this next example:
	//  - Percentile of value at end-of-bucket: p100
	//  - Percentile of value at start-of-bucket: p90
	//  - Target percentile: p95
	//
	// In our example:
	//    linear interpolated p95 = bucket-start + (bucket-end - bucket-start) * 0.5
	//
	// Where 0.5 == (95 - 90) / (100 - 90) == 5 / 10 = 0.5. So:
	//    linear_interpolator = (p - pstart) / (pend - pstart)
	//
	// Taking each term:
	//
	// 1.
	//   (pend - pstart) == (total - cumulative + h.Counts[i] - total + cumulative) / total
	//   (pend - pstart) == h.Counts[i] / total
	//
	// 2.
	//   (p - pstart) == p - (total-cumulative) / total
	//   (p - pstart) == (total*p - (total-cumulative)) / total
	//
	// Simplifying 1 / 2:
	//
	//   (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) / total ) / (h.Counts[i] / total)
	//   (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) / total ) * (total / h.Counts[i])
	//   (p - pstart) / (pend - pstart) == ((total*p - (total-cumulative)) ) * (1 / h.Counts[i])
	//   (p - pstart) / (pend - pstart) == (total*p - (total-cumulative)) / h.Counts[i]

I'm likely missing some more obvious way you got to this form, but could you add it as a comment?

Removed casting and rounding. Also added an explanation of how I derived the formula. It is less mathematical than what you did, but if you prefer your version, I am happy to include that instead.


pkg/util/schedulerlatency/scheduler_latency_test.go line 124 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

See comment above. After dropping the 0.5, use require.InDelta instead. After the uint casting change, some of these values will change.

Done.


pkg/util/schedulerlatency/scheduler_latency_test.go line 130 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Good idea to compare this to the prometheus impl. Could you take this out into a separate unit test? Either in this file+package, or in the prometheus one.

Done.

@aadityasondhi aadityasondhi force-pushed the aaditya/scheduler-latency-buckets-interpolation branch from d2fe573 to f345d82 Compare January 13, 2023 20:39
Copy link
Contributor

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: With the comment suggested below.

Reviewed all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi)


pkg/util/schedulerlatency/sampler.go line 337 at r1 (raw file):

Previously, aadityasondhi (Aaditya Sondhi) wrote…

Removed casting and rounding. Also added an explanation of how I derived the formula. It is less mathematical than what you did, but if you prefer your version, I am happy to include that instead.

I think the commentary is still slightly ambiguous. We have this first formula: rank = percentile * count, which when "scoped" to just a bucket, gives us subset_percentile = subset_rank / subset_count. But then we're using subset_percentile as an interpolation term in bucket_size * subset_percentile (bucket_size here being different from bucket_count). Maybe I'm the only confused reader (something about "subset percentile" reads oddly), but I recommend this small amendment instead.

// Goal: We want to linear approximate the "p * total"-th value from the histogram.
//      - p * total is the ordinal rank (or simply, rank) of the target value
//        within the histogram bounds.
//
// Step 1: Find the bucket[i] which contains the target value.
//      - This will be the largest bucket[i] where the left-bound cumulative
//        count of bucket[i] <= p*total.
//
// Step 2: Determine the rank within the bucket (subset of histogram).
//      - Since the bucket is a subset of the original data set, we can simply
//        subtract left-bound cumulative to get the "subset_rank".
//
// Step 3: Use the "subset_rank" to interpolate the target value.
//      - So we want to interpolate the "subset_rank"-th value of the subset_count
//        that lies between [bucket_start, bucket_end). This is:
//        bucket_start + (bucket_start - bucket_end) * linear_interpolator
//        Where linear_interpolator = subset_rank / subset_count

Copy link
Contributor

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, the last rubber stamp didn't go through.

@aadityasondhi aadityasondhi force-pushed the aaditya/scheduler-latency-buckets-interpolation branch from f345d82 to 0250c25 Compare January 18, 2023 18:39
This patch improves the percentile calculation by using linear
interpolation between bucket boundaries instead of using the mid-point.
The patch also adds tests to compare values of the scheduler histograms
to those found in util/metric.go which are based on the prometheus
implementation to maintain consistency among metrics.

Fixes cockroachdb#89829.

Release note: None
@aadityasondhi aadityasondhi force-pushed the aaditya/scheduler-latency-buckets-interpolation branch from 0250c25 to d82e4ac Compare January 18, 2023 19:13
@aadityasondhi
Copy link
Collaborator Author

bors r+

@aadityasondhi
Copy link
Collaborator Author

bors r-

@craig
Copy link
Contributor

craig bot commented Jan 18, 2023

Canceled.

@aadityasondhi
Copy link
Collaborator Author

bors r+

@craig
Copy link
Contributor

craig bot commented Jan 19, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jan 19, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jan 19, 2023

Build succeeded:

@craig craig bot merged commit 942b55e into cockroachdb:master Jan 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

schedulerlatency: improve percentile computation
3 participants