
Add profile label to existing scheduling metrics #92189

Closed
alculquicondor opened this issue Jun 16, 2020 · 17 comments
Labels
sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@alculquicondor
Member

alculquicondor commented Jun 16, 2020

/sig scheduling

Ref kubernetes/enhancements#1451

As described in the KEP for Multiple Profiles, we decided to add a profile label to the following metrics:

  • "schedule_attempts_total"
  • "e2e_scheduling_duration_seconds"
  • "scheduling_algorithm_duration_seconds"
  • "scheduling_algorithm_preemption_evaluation_seconds"
  • "binding_duration_seconds"

@dashpole pointed me to the metrics stability classes, which indicate that such a change wouldn't be backwards compatible for non-ALPHA metrics.

All of our metrics are ALPHA, so we should be good to add the label, with an appropriate release note.
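
For concreteness, here is a rough sketch of the shape of the change for one of these metrics (a sketch only, not the actual diff; schedule_attempts_total already has a "result" label today, so only "profile" is new, and the help text and helper names below are illustrative):

    // Sketch: the metric becomes a *Vec so each profile gets its own series.
    package metrics

    import "k8s.io/component-base/metrics"

    const SchedulerSubsystem = "scheduler"

    var scheduleAttempts = metrics.NewCounterVec(
        &metrics.CounterOpts{
            Subsystem:      SchedulerSubsystem,
            Name:           "schedule_attempts_total",
            Help:           "Number of attempts to schedule pods, by result and profile.",
            StabilityLevel: metrics.ALPHA,
        },
        []string{"result", "profile"})

    // At recording time the scheduler picks the series for the profile that
    // handled the pod, e.g.:
    //   scheduleAttempts.WithLabelValues("scheduled", profileName).Inc()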

Any concerns? GKE foresees no major risk.

@k8s-ci-robot added the sig/scheduling label Jun 16, 2020
@alculquicondor
Member Author

cc @ahg-g @Huang-Wei

@liggitt
Member

liggitt commented Jun 16, 2020

cc @logicalhan

@alculquicondor
Member Author

cc @ingvagabund

@ingvagabund
Contributor

Both CounterOpts [1] and HistogramOpts have a ConstLabels field, which I presume you plan to use to set the profile label?

The code under clusterloader2/pkg/measurement/common/scheduler_latency.go does not take histogram labels into account. Also, the schedulingMetrics type will need to be extended to support multiple profiles (at least for the default one). That includes changes in the perf dashboard as well.

[1] https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/metrics/metrics.go#L52

@alculquicondor
Member Author

Both CounterOpts [1] and HistogramOpts have a ConstLabels field, which I presume you plan to use to set the profile label?

I'm migrating to the Vec counterparts, like HistogramVec
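
To sketch the difference (hypothetical code, not the actual PR): ConstLabels fixes the label value when the collector is created, so a single collector can only ever describe one profile, whereas a Vec lets each observation pick the profile. The metric name and label value below are made up for illustration:

    package main

    import (
        "time"

        "k8s.io/component-base/metrics"
        "k8s.io/component-base/metrics/legacyregistry"
    )

    var exampleDuration = metrics.NewHistogramVec(
        &metrics.HistogramOpts{
            Subsystem:      "scheduler",
            Name:           "example_duration_seconds", // hypothetical name
            Help:           "Example latency, partitioned by profile.",
            Buckets:        metrics.ExponentialBuckets(0.001, 2, 15),
            StabilityLevel: metrics.ALPHA,
        },
        []string{"profile"})

    func main() {
        legacyregistry.MustRegister(exampleDuration)

        start := time.Now()
        // ... run a scheduling attempt for a given profile ...

        // The profile is chosen at observation time, unlike with ConstLabels,
        // where it would be fixed once at registration time.
        exampleDuration.WithLabelValues("default-scheduler").Observe(time.Since(start).Seconds())
    }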

Thanks for the pointers. What's the easiest way to run the perf tests to make sure they work?

@ingvagabund
Contributor

I'm migrating to the Vec counterparts, like HistogramVec

Yeah, that is more natural.

Thanks for the pointers. What's the easiest way to run the perf tests to make sure they work?

Through a PR against perf-tests and checking the CI artifacts. Feel free to let me know if you need any help.

@alculquicondor
Member Author

It looks like the measurements will continue to work with the new profile label. Basically, the code would merge all the values together at this point:

https://github.com/kubernetes/perf-tests/blob/f3cf096d437805dd51e27b406f32f1af2e42307d/clusterloader2/pkg/measurement/common/scheduler_latency.go#L179

And that is fine even if we choose to have multiple profiles by default, unless we really need to analyze the data per profile. But today we only have one, so no problem.

The problem could be with the new result label for e2e_scheduling_duration_seconds
We are now tracking latency for any scheduling result, whereas before we only recorded it for successful scheduling attempts.

Note that successful scheduling attempts run through pretty much the same steps (although some pods might go through preemption). Unschedulable attempts could finish after filtering, after running preemption, or during the Permit extension point, which no default plugin uses. Errors could happen at any time, except that we shouldn't actually expect them to happen.

So the question is: in the perf dashboard, do we want to track (1) the latency of all attempts regardless of the result, (2) each attempt result separately, or (3) only the successful attempts?

If we do nothing, we are going with (1). If we want to (roughly) preserve the current behavior, we need to do (3). Note that I say roughly because we were still recording latency for the cases where binding failed. But I don't think we were expecting that to happen during perf tests. I'm not sure if we expected unschedulable pods.
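
If we ever wanted (3), the perf-tests side could filter on the new result label, roughly like this (a hypothetical sketch; it assumes the measurement works with Prometheus samples the way scheduler_latency.go does, and that the label value for successful attempts is "scheduled"):

    // Hypothetical helper: keep only samples from successful attempts.
    package measurement

    import "github.com/prometheus/common/model"

    func onlyScheduled(samples model.Samples) model.Samples {
        var scheduled model.Samples
        for _, sample := range samples {
            if sample.Metric["result"] == "scheduled" {
                scheduled = append(scheduled, sample)
            }
        }
        return scheduled
    }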

Thoughts, @ahg-g @ingvagabund? Personally, given the nature of perf-tests, I say (1) is fine.

@ahg-g
Member

ahg-g commented Jun 18, 2020

Aggregating all attempts under one graph is fine; we have the postfilter extension point latency, which should give us a good idea of the impact of preemption if it gets executed.

I would like to expand the perf tests we have for kubemark to cover cases that include preemption, so once those are in, we can reevaluate the metrics.

@ingvagabund
Contributor

It looks like the measurements will continue to work with the new profile label. Basically the code would merge all the values together at this point

If I read the github.com/prometheus/client_golang/prometheus code correctly, after extending the labels with the profile label, i.e.:

FrameworkExtensionPointDuration = metrics.NewHistogramVec(
		&metrics.HistogramOpts{
			Subsystem: SchedulerSubsystem,
			Name:      "framework_extension_point_duration_seconds",
			Help:      "Latency for running all plugins of a specific extension point.",
			// Start with 0.1ms with the last bucket being [~200ms, Inf)
			Buckets:        metrics.ExponentialBuckets(0.0001, 2, 12),
			StabilityLevel: metrics.ALPHA,
		},
		[]string{"extension_point", "status", "profile"})

scheduler_latency.go will still accept the histogram, though you will not be able to distinguish the extension points of individual profiles.
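
In other words, after the change the measurement would effectively collapse the profile dimension, something like this hypothetical sketch that sums sample values per extension point across all profiles:

    // Hypothetical sketch: aggregate per extension_point, ignoring "profile".
    package measurement

    import "github.com/prometheus/common/model"

    func sumByExtensionPoint(samples model.Samples) map[string]float64 {
        totals := map[string]float64{}
        for _, sample := range samples {
            ep := string(sample.Metric["extension_point"])
            totals[ep] += float64(sample.Value)
        }
        return totals
    }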

And that is fine even if we choose to have multiple profiles by default, unless we really need to analyze the data per profile. But today we only have one, so no problem.

Which is the case, so I'm taking all my concerns back.

The problem could be with the new result label for e2e_scheduling_duration_seconds
We are now tracking latency for any scheduling result, whereas before we only recorded it for successful scheduling attempts.

"We are now tracking" you mean your PR is, right? Do you need to add the result label and have the metric updated for the failed attempts as well?

@alculquicondor
Member Author

"We are now tracking" you mean your PR is, right?

Correct

Do you need to add the result label and have the metric updated for the failed attempts as well?

If we leave the perf code as is, they will be aggregated together. Do you think we should distinguish?
This describes our options:

So the question is: in the perf dashboard, do we want to track (1) the latency of all attempts regardless of the result, (2) each attempt result separately, or (3) only the successful attempts?

If we do nothing, we are going with (1). If we want to (roughly) preserve the current behavior, we need to do (3). Note that I say roughly because we were still recording latency for the cases where binding failed. But I don't think we were expecting that to happen during perf tests. I'm not sure if we expected unschedulable pods.

@ingvagabund
Contributor

If we leave the perf code as is, they will be aggregated together.

I don't think that's to be discussed in this PR (Add profile label to existing scheduling metrics).
Though, aggregating the attempts assumes you will observe even the failed attempts. Is there a specific case where those are useful? That said, I am not opposed to it, in which case the result label is needed.

@ingvagabund
Contributor

ingvagabund commented Jun 22, 2020

My interpretation of e2eSchedulingLatency is the time spent on successfully scheduling a pod plus the time spent on evaluating the bind operation (no matter whether the bind operation succeeds or not), in order to know the worst/average/min time of a single, completed scheduling cycle. If the scheduling fails in the middle (e.g. due to a score/filter error), we don't care much unless the errors happen too frequently, in which case there's probably a bug or some transient/permanent error. So I don't see much use in recording latency for the failed cases.

@alculquicondor
Member Author

Recording all latencies is important to understand and estimate the throughput of the scheduler in a running cluster.

If you only check successful attempts and you have a significant number of unschedulable pods, the latency reported by the metric is not representative.

@ahg-g
Member

ahg-g commented Jun 22, 2020

@ingvagabund you will be able to break it down by result (scheduled, unschedulable, and error), so the information is not lost.

@ingvagabund
Contributor

Given that #92202 merged, it might be useful to update the help text of e2e_scheduling_duration_seconds. Right now it is "E2e scheduling latency in seconds (scheduling algorithm + binding)" [1], which historically meant scheduling_algorithm_duration_seconds + binding_duration_seconds. That is no longer accurate after #92202, given that scheduling_algorithm_duration_seconds is reported only when the scheduling algorithm returns a nil error, but e2e_scheduling_duration_seconds is now reported even when the algorithm returns a non-nil error.

[1] https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/metrics/metrics.go#L75
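
For illustration, the definition might be updated along these lines (wording is a sketch only, not proposed final text; only the Help field changes):

    // Hypothetical help-text update; everything except Help stays as-is.
    e2eSchedulingLatency = metrics.NewHistogramVec(
        &metrics.HistogramOpts{
            Subsystem: SchedulerSubsystem,
            Name:      "e2e_scheduling_duration_seconds",
            Help:      "E2e scheduling latency in seconds (scheduling algorithm + binding), recorded for all scheduling results",
            // Buckets, StabilityLevel, etc. unchanged.
        },
        []string{"result", "profile"})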

@alculquicondor
Member Author

The work done for 1.19 is enough

/close

@k8s-ci-robot
Contributor

@alculquicondor: Closing this issue.

In response to this:

The work done for 1.19 is enough

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
