Add Pod Topology Spread KEP to new template #1796
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: Huang-Wei. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.

/assign @alculquicondor
Force-pushed from f6a5c5b to 27e28a4.
> * **How can an operator determine if the feature is in use by workloads?**
>
>   Operator can query `pod.spec.topologySpreadConstraints` field and identify if …
Also the metric `plugin_execution_duration_seconds`, which has a "plugin" label.
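As a concrete illustration of that `pod.spec.topologySpreadConstraints` query, here is a minimal client-go sketch; the kubeconfig path and output format are assumptions, not part of the KEP text:

```go
// List all pods and report the ones that set topologySpreadConstraints,
// i.e. the workloads actually using this feature.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes the operator's default kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	pods, err := cs.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		if len(p.Spec.TopologySpreadConstraints) > 0 {
			fmt.Printf("%s/%s: %d constraint(s)\n",
				p.Namespace, p.Name, len(p.Spec.TopologySpreadConstraints))
		}
	}
}
```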
> * **What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?**
>
>   N/A.
The metric `plugin_execution_duration_seconds` is the perfect candidate.
If we are to mention `plugin_execution_duration_seconds`, clarify that it is an indicator for latency. Ideally, to measure the "health of the service", we also want a "functionality" indicator: an indicator for the spread of pods that use the feature, which we don't have.
That might not be possible to do, since this is a Pod-level feature.
I think it is possible, but needs work, and should likely be done by an external component. One can also simplify things by, for example, monitoring only Deployments that use topology spread in their template (see the sketch below).
In that case, we can put it as a suggestion for operators.
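A rough sketch of that suggestion, assuming Go and client-go: scan Deployments whose pod template sets `topologySpreadConstraints` and compute the observed skew of their scheduled pods across each constraint's topology key. The names and the simplistic skew computation are illustrative only, not part of the KEP.

```go
// Hypothetical external component: for each Deployment using topology spread,
// report the observed skew of its pods across the constraint's topology domains.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	ctx := context.TODO()

	nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	deps, err := cs.AppsV1().Deployments(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, d := range deps.Items {
		for _, tsc := range d.Spec.Template.Spec.TopologySpreadConstraints {
			// Map each node to its topology domain (e.g. zone) for this constraint.
			domainOf := map[string]string{}
			for _, n := range nodes.Items {
				domainOf[n.Name] = n.Labels[tsc.TopologyKey]
			}
			sel, err := metav1.LabelSelectorAsSelector(d.Spec.Selector)
			if err != nil {
				continue
			}
			pods, err := cs.CoreV1().Pods(d.Namespace).List(ctx,
				metav1.ListOptions{LabelSelector: sel.String()})
			if err != nil {
				continue
			}
			// Count scheduled pods per observed domain, then derive the skew.
			counts := map[string]int{}
			for _, p := range pods.Items {
				if p.Spec.NodeName != "" {
					counts[domainOf[p.Spec.NodeName]]++
				}
			}
			minC, maxC := int(^uint(0)>>1), 0
			for _, c := range counts {
				if c < minC {
					minC = c
				}
				if c > maxC {
					maxC = c
				}
			}
			if len(counts) > 0 {
				fmt.Printf("%s/%s key=%q observed skew=%d (maxSkew=%d)\n",
					d.Namespace, d.Name, tsc.TopologyKey, maxC-minC, tsc.MaxSkew)
			}
		}
	}
}
```

Note this only counts domains that currently host a pod; a faithful skew computation would also count eligible but empty domains, as the scheduler does.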
> * **Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs][]?**
>
>   No.
`framework_extension_point_duration_seconds` and other latency SLIs.
I would say an SLI/SLO can be set up for pod scheduling latency. Since this feature is replacing default spreading, it is unlikely to increase pod scheduling latency. `framework_extension_point_duration_seconds` will not be a metric that we use to measure an SLO; it is more for debugging purposes.
That is true, but there is a difference: default spreading uses only the nodes that pass filters, whereas Topology Spreading uses all nodes. It might be worth saying that some increase in latency might happen, and that it can be monitored with the above metric.
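If the KEP does point operators at that metric, a monitoring sketch could look like the following, using the Prometheus Go client. Assumptions (not confirmed by this thread): the metric is exported with the scheduler prefix as `scheduler_framework_extension_point_duration_seconds`, and Prometheus is reachable at the placeholder address.

```go
// Query the p90 latency of the scheduler's Score extension point, which the
// PodTopologySpread plugin contributes to. Filter can be watched the same way.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// p90 Score latency over the last 5 minutes.
	query := `histogram_quantile(0.90, sum(rate(` +
		`scheduler_framework_extension_point_duration_seconds_bucket{extension_point="Score"}[5m]` +
		`)) by (le))`

	result, warnings, err := promAPI.Query(context.TODO(), query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```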
> * **How does this feature react if the API server and/or etcd is unavailable?**
>
>   Running workloads won't be impacted. Submissions of new workloads using this …
I would say something like: "no new scheduling impact", but PRR to confirm.
Force-pushed from 84d2919 to 3885fbe.
Thanks for the review. Comments addressed. @alculquicondor @ahg-g PTAL.
You need a Production Readiness reviewer, according to new guidelines :(
> * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
>
>   - Metric `plugin_execution_duration_seconds{plugin="PodTopologySpread"}` < 500ms.
That's a very high number. Our benchmarks put the plugin at around 2ms (score) and 3ms (filter) for 10k pods (#89487). But I'm not sure how to state an SLO from those numbers.
How about <= 100ms at the 90th percentile?
That's still high, but more reasonable. I think we should also mention the number of pods for that SLO to hold.
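For reference, a hedged sketch of checking the proposed SLO against live data, again with the Prometheus Go client; the threshold, the address, and the `scheduler_` metric prefix are assumptions drawn from this thread:

```go
// Evaluate the proposed SLO: p90 of the PodTopologySpread plugin's execution
// duration should stay at or below 100ms.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

const sloSeconds = 0.100 // p90 target proposed in this thread

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	query := `histogram_quantile(0.90, sum(rate(` +
		`scheduler_plugin_execution_duration_seconds_bucket{plugin="PodTopologySpread"}[5m]` +
		`)) by (le))`
	result, _, err := promAPI.Query(context.TODO(), query, time.Now())
	if err != nil {
		panic(err)
	}
	vec, ok := result.(model.Vector)
	if !ok || len(vec) == 0 {
		fmt.Println("no samples")
		return
	}
	p90 := float64(vec[0].Value)
	fmt.Printf("p90=%.4fs, SLO met: %v\n", p90, p90 <= sloSeconds)
}
```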
@alculquicondor I'm not sure it's mandatory to have a PRR reviewer for each KEP. If so, @wojtek-t could you kindly review the PRR part? Thanks. /assign @wojtek-t
It's not enforced for 1.19, but it will be in the future. So I think I could LGTM, given that the freeze is today. Could you squash?
Force-pushed from efe24c1 to e695a8b.
@alculquicondor squashed.
/lgtm
ref #895
/sig scheduling