Add Pod Topology Spread KEP to new template #1796
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: Huang-Wei. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.

/assign @alculquicondor
Force-pushed from f6a5c5b to 27e28a4.
> * **How can an operator determine if the feature is in use by workloads?**
>
>   Operator can query `pod.spec.topologySpreadConstraints` field and identify if …
Also the metric `plugin_execution_duration_seconds`, which has a "plugin" label.
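As a concrete illustration of that `pod.spec.topologySpreadConstraints` query, here is a minimal client-go sketch; the kubeconfig path and output format are assumptions, not part of the KEP text:

```go
// List all pods and report the ones that set topologySpreadConstraints,
// i.e. the workloads actually using this feature.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes the operator's default kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	pods, err := cs.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		if len(p.Spec.TopologySpreadConstraints) > 0 {
			fmt.Printf("%s/%s: %d constraint(s)\n",
				p.Namespace, p.Name, len(p.Spec.TopologySpreadConstraints))
		}
	}
}
```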
> * **What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?**
>
>   N/A.
The metric `plugin_execution_duration_seconds` is the perfect candidate.
If we are to mention `plugin_execution_duration_seconds`, clarify that it is an indicator for latency. Ideally, to measure the "health of the service", we also want a "functionality" indicator: an indicator for the spread of pods that use the feature, which we don't have.
That might not be possible to do, since this is a Pod-level feature.
I think it is possible, but needs work, and should likely be done by an external component. One can also simplify things by, for example, monitoring only Deployments that use topology spread in their template (see the sketch below).
In that case, we can put it as a suggestion for operators.
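A rough sketch of that suggestion, assuming Go and client-go: scan Deployments whose pod template sets `topologySpreadConstraints` and compute the observed skew of their scheduled pods across each constraint's topology key. The names and the simplistic skew computation are illustrative only, not part of the KEP.

```go
// Hypothetical external component: for each Deployment using topology spread,
// report the observed skew of its pods across the constraint's topology domains.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	ctx := context.TODO()

	nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	deps, err := cs.AppsV1().Deployments(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, d := range deps.Items {
		for _, tsc := range d.Spec.Template.Spec.TopologySpreadConstraints {
			// Map each node to its topology domain (e.g. zone) for this constraint.
			domainOf := map[string]string{}
			for _, n := range nodes.Items {
				domainOf[n.Name] = n.Labels[tsc.TopologyKey]
			}
			sel, err := metav1.LabelSelectorAsSelector(d.Spec.Selector)
			if err != nil {
				continue
			}
			pods, err := cs.CoreV1().Pods(d.Namespace).List(ctx,
				metav1.ListOptions{LabelSelector: sel.String()})
			if err != nil {
				continue
			}
			// Count scheduled pods per observed domain, then derive the skew.
			counts := map[string]int{}
			for _, p := range pods.Items {
				if p.Spec.NodeName != "" {
					counts[domainOf[p.Spec.NodeName]]++
				}
			}
			minC, maxC := int(^uint(0)>>1), 0
			for _, c := range counts {
				if c < minC {
					minC = c
				}
				if c > maxC {
					maxC = c
				}
			}
			if len(counts) > 0 {
				fmt.Printf("%s/%s key=%q observed skew=%d (maxSkew=%d)\n",
					d.Namespace, d.Name, tsc.TopologyKey, maxC-minC, tsc.MaxSkew)
			}
		}
	}
}
```

Note this only counts domains that currently host a pod; a faithful skew computation would also count eligible but empty domains, as the scheduler does.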
> * **Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs][]?**
>
>   No.
`framework_extension_point_duration_seconds` and other latency SLIs.
I would say an SLI/SLO can be set up for pod scheduling latency. Since this feature is replacing default spreading, it is unlikely to increase pod scheduling latency. `framework_extension_point_duration_seconds` will not be a metric that we use to measure an SLO; it is more for debugging purposes.
That is true, but there is a difference: default spreading uses only the nodes that pass filters, whereas Topology Spreading uses all nodes. It might be worth saying that some increase in latency might happen, and that it can be monitored with the above metric.
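If the KEP does point operators at that metric, a monitoring sketch could look like the following, using the Prometheus Go client. Assumptions (not confirmed by this thread): the metric is exported with the scheduler prefix as `scheduler_framework_extension_point_duration_seconds`, and Prometheus is reachable at the placeholder address.

```go
// Query the p90 latency of the scheduler's Score extension point, which the
// PodTopologySpread plugin contributes to. Filter can be watched the same way.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// p90 Score latency over the last 5 minutes.
	query := `histogram_quantile(0.90, sum(rate(` +
		`scheduler_framework_extension_point_duration_seconds_bucket{extension_point="Score"}[5m]` +
		`)) by (le))`

	result, warnings, err := promAPI.Query(context.TODO(), query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```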
> * **How does this feature react if the API server and/or etcd is unavailable?**
>
>   Running workloads won't be impacted. Submissions of new workloads using this …
I would say something like: "no new scheduling impact", but PRR to confirm.
Force-pushed from 84d2919 to 3885fbe.
Thanks for the review. Comments addressed. @alculquicondor @ahg-g PTAL.
You need a Production Readiness reviewer, according to new guidelines :(
> * **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
>
>   - Metric `plugin_execution_duration_seconds{plugin="PodTopologySpread"}` < 500ms.
That's a very high number. Our benchmarks put the plugin at around 2ms (score) and 3ms (filter) for 10k pods (#89487). But I'm not sure how to state an SLO from those numbers.
How about <= 100ms at the 90th percentile?
That's still high, but more reasonable. I think we should also mention the number of pods for that SLO to hold.
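For reference, a hedged sketch of checking the proposed SLO against live data, again with the Prometheus Go client; the threshold, the address, and the `scheduler_` metric prefix are assumptions drawn from this thread:

```go
// Evaluate the proposed SLO: p90 of the PodTopologySpread plugin's execution
// duration should stay at or below 100ms.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

const sloSeconds = 0.100 // p90 target proposed in this thread

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	query := `histogram_quantile(0.90, sum(rate(` +
		`scheduler_plugin_execution_duration_seconds_bucket{plugin="PodTopologySpread"}[5m]` +
		`)) by (le))`
	result, _, err := promAPI.Query(context.TODO(), query, time.Now())
	if err != nil {
		panic(err)
	}
	vec, ok := result.(model.Vector)
	if !ok || len(vec) == 0 {
		fmt.Println("no samples")
		return
	}
	p90 := float64(vec[0].Value)
	fmt.Printf("p90=%.4fs, SLO met: %v\n", p90, p90 <= sloSeconds)
}
```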
@alculquicondor I'm not sure it's mandatory to have a PRR reviewer for each KEP. If so, @wojtek-t could you kindly review the PRR part? Thanks. /assign @wojtek-t
It's not enforced for 1.19, but it will be in the future. So I think I could LGTM, given that the freeze is today. Could you squash?
Force-pushed from efe24c1 to e695a8b.
@alculquicondor squashed.
/lgtm
ref #895
/sig scheduling