Add profile label to existing scheduling metrics #92189
Comments
cc @ahg-g @Huang-Wei
cc @logicalhan
cc @ingvagabund
Both: the code under clusterloader2/pkg/measurement/common/scheduler_latency.go does not take histogram labels into account. Also, the schedulingMetrics type will need to be extended to support multiple profiles (at least for the default one), including changes in the perf dashboard as well.
[1] https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/metrics/metrics.go#L52
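For context, adding a profile label splits what used to be one histogram series into one series per profile, so a measurement that ignores labels has to sum the per-profile bucket counts back together. A minimal sketch of that aggregation (illustrative only; `bucketCounts` and `mergeAcrossLabels` are hypothetical names, not the actual scheduler_latency.go code):

```go
package main

import "fmt"

// bucketCounts holds the cumulative histogram bucket counts for one
// label combination (all series are assumed to share the same buckets).
type bucketCounts []uint64

// mergeAcrossLabels sums per-profile bucket counts into a single series,
// which is what a label-unaware measurement effectively needs once the
// metric gains a profile label. Summing cumulative bucket counts across
// series is valid because histogram buckets are additive.
func mergeAcrossLabels(perProfile map[string]bucketCounts) bucketCounts {
	var merged bucketCounts
	for _, counts := range perProfile {
		if merged == nil {
			merged = make(bucketCounts, len(counts))
		}
		for i, c := range counts {
			merged[i] += c
		}
	}
	return merged
}

func main() {
	perProfile := map[string]bucketCounts{
		"default-scheduler": {5, 12, 20},
		"gpu-scheduler":     {1, 3, 7},
	}
	fmt.Println(mergeAcrossLabels(perProfile)) // prints [6 15 27]
}
```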
I'm migrating to the Vec counterparts. Thanks for the pointers. What's the easiest way to run the perf tests to make sure they work?
Yeah, that is more natural.
Through a PR against perf-tests and checking the CI artifacts. Feel free to let me know if you need any help.
It looks like the measurements will continue to work with the new profile label. Basically, the code would merge all the values together at this point. That is fine even if we chose to have multiple profiles by default, unless we really needed to analyze the data per profile; but today we only have one, so no problem. The problem could be with the new

Note that successful scheduling attempts run through pretty much the same steps (although some pods might go through preemption). Unschedulable attempts could finish after filtering, after running preemption, or during the Permit extension point, which no default plugin uses. Errors could happen at any time, except that we shouldn't actually expect them to happen.

So the question is: in the perf dashboard, do we want to track (1) the latency of all attempts regardless of the result, (2) each attempt result separately, or (3) only the successful attempts? If we do nothing, we are going with (1). If we want to (roughly) preserve the current behavior, we need to do (3). I say roughly because we were still recording latency for the cases where binding failed, but I don't think we were expecting that to happen during perf tests. Not sure if we expected unschedulable pods. Thoughts? @ahg-g @ingvagabund. Personally, given the nature of perf tests, I say (1) is fine.
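To make the trade-off concrete, here is a small sketch of options (1) and (3): aggregating all samples versus selecting only the ones labeled result="scheduled" before aggregating. The types and function names are hypothetical, assuming the metric gains a result label; this is not the actual perf-tests code.

```go
package main

import "fmt"

// sample is one metric sample with its label set.
type sample struct {
	labels map[string]string
	value  float64
}

// sumMatching aggregates the values of samples whose labels match the
// selector. A nil selector matches everything (option 1); passing
// {"result": "scheduled"} roughly preserves the old behavior (option 3).
func sumMatching(samples []sample, selector map[string]string) float64 {
	total := 0.0
	for _, s := range samples {
		match := true
		for k, v := range selector {
			if s.labels[k] != v {
				match = false
				break
			}
		}
		if match {
			total += s.value
		}
	}
	return total
}

func main() {
	samples := []sample{
		{map[string]string{"result": "scheduled"}, 10},
		{map[string]string{"result": "unschedulable"}, 4},
		{map[string]string{"result": "error"}, 1},
	}
	fmt.Println(sumMatching(samples, nil))                                      // prints 15
	fmt.Println(sumMatching(samples, map[string]string{"result": "scheduled"})) // prints 10
}
```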
Aggregating all attempts under one graph is fine; we have the PostFilter extension point latency, which should give us a good idea of the impact of preemption when it gets executed. I would like to expand the perf tests we have for kubemark to include cases with preemption, so once those are in, we can reevaluate the metrics.
If I read the code correctly, scheduler_latency.go will still accept the histogram, though you will not be able to distinguish extension points of individual profiles.
Which is the case, so I'm taking all my concerns back.
By "We are now tracking" you mean your PR is, right? Do you need to add the result label and have the metric updated for the failed attempts as well?
Correct
If we leave the perf code as is, they will be aggregated together. Do you think we should distinguish?
I don't think that's to be discussed in this PR.
My interpretation of
Recording all latencies is important to understand and estimate the throughput of the scheduler in a running cluster. If you only check successful attempts and you have a significant number of unschedulable pods, the latency reported by the metric is not representative.
@ingvagabund you will be able to break it down by result (scheduled, unschedulable and error), so the information is not lost.
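With a result label in place, a dashboard can aggregate either way. The queries below are illustrative; the metric name is a placeholder, not taken from this thread:

```promql
# p99 latency over all attempts, regardless of result (option 1);
# scheduling_duration_seconds is a placeholder metric name.
histogram_quantile(0.99,
  sum(rate(scheduling_duration_seconds_bucket[5m])) by (le))

# p99 latency over successful attempts only (option 3)
histogram_quantile(0.99,
  sum(rate(scheduling_duration_seconds_bucket{result="scheduled"}[5m])) by (le))
```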
Given that #92202 merged, it might be useful to update the help text of [1].
[1] https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/metrics/metrics.go#L75
The work done for 1.19 is enough.
/close
@alculquicondor: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/sig scheduling
Ref kubernetes/enhancements#1451
As described in the KEP for Multiple Profiles, we decided to add a profile label to the following metrics:
@dashpole pointed me to metrics stability classes, which indicate that such a change wouldn't be backwards compatible.
All our metrics are ALPHA, so we should be good to add the label, with the appropriate release note.
Any concerns? GKE foresees no major risk.