KEP of even pods spreading #851
Conversation
Force-pushed from 0a6a44f to f92d9a6
"Even spreading is a predicate (hard requirement) instead of a priority" |
@Fei-Guo I agree that hard requirement (predicate) could probably place a pod onto a high load node. So the "maxSkew" option is brought up - setting it to a reasonable value instead of 1 (default value, which can be thought of "double hard") can give you higher odds to pick up a low load node. It will firstly tolerate more candidate nodes in predicate phase, and then proceed to rely on priority strategies to choose which node is best fit. |
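To make that maxSkew trade-off concrete, here is a toy sketch (illustrative names only, not the KEP's implementation): a larger maxSkew admits more topology domains in the predicate phase, leaving the priority functions to break the tie.

```go
package main

import "fmt"

// feasibleDomains returns the topology domains where one more matching pod
// could land without pushing that domain's skew (its pod count minus the
// global minimum) above maxSkew. Purely illustrative.
func feasibleDomains(podsPerDomain map[string]int, maxSkew int) []string {
	min := -1
	for _, c := range podsPerDomain {
		if min == -1 || c < min {
			min = c
		}
	}
	var out []string
	for d, c := range podsPerDomain {
		if c+1-min <= maxSkew { // placing the pod here raises the count to c+1
			out = append(out, d)
		}
	}
	return out
}

func main() {
	counts := map[string]int{"zone-a": 3, "zone-b": 2, "zone-c": 2}
	fmt.Println(feasibleDomains(counts, 1)) // only the least-loaded zones b and c
	fmt.Println(feasibleDomains(counts, 2)) // maxSkew=2 also admits zone-a
}
```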
Force-pushed from bcd375a to befdd94
Force-pushed from 6ea59dc to abb5d63
A new commit has been pushed with the details of Design B. PTAL.
Thanks, @Huang-Wei. I think Design B makes more sense. I have a few minor comments.
- "@Huang-Wei" | ||
owning-sig: sig-scheduling | ||
reviewers: | ||
- "@wojtek-t" |
Not sure if Wojtek will have the time to review this, but please add @MaciekPytel to ensure that this will not cause any issues for the autoscaler.
```go
RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm
// Similar to the same field in PodAffinity/PodAntiAffinity
// +optional
PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm
```
IIUC, the "Preferred" version will be the same as our current preferred inter-pod anti-affinity. Existence of this item is only justified if we plan to remove preferred anti-affinity.
the "Preferred" version will be the same as our current preferred inter-pod anti-affinity
It's more similar to podAffinity, for both the "required" and "preferred" versions. It's just that we additionally need to implement MaxSkew.
The whole Design B works independently of any other predicate/priority.
Oh wait, I missed a top level TopologyKey along with MaxSkew. Without it, we can't identify which specific topology we're going to spread the matching pods evenly across (because we support []PodAffinityTerms, and each term can have a different topology key).
BTW: the top level TopologyKey can be renamed to DistributedBy (or DistributionKey), so as not to cause confusion with the TopologyKey that works inside each affinityTerm.
IMO we should remove PodAffinityTerm and WeightedPodAffinityTerm and add a NodeSelector. NodeSelector allows users to specify where pods should be scheduled and what subset of nodes even spreading is applied to. For example, a user should be able to specify that their pods should be scheduled in 3 out of 4 available zones and that pods should be spread evenly among those 3 zones.
We don't need to specify pod affinity here at all. It complicates the design, and I think the API becomes hard to use. Besides, if pod affinity is needed, users can add inter-pod affinity rules to their pod spec. Instead, we should add a pod selector, which is a label selector and a namespace only. Unlike PodAffinityTerm, this one does not have a topology key.
OMG, I think I just now understood the presence of two different topology keys, one in the podAffinityTerms and one outside. I think it's getting too complicated to use. I think we need to think harder here.
@Huang-Wei Sorry if I was not clear. I think the API should specify all the following pieces of information:
- Which pods should be spread (needs a pod selector).
- What topology pods should be spread to (e.g., node, zone, region, etc.)
- What set of nodes should be considered. This is needed to filter out some nodes. For example, topology may be "zone", but a user may want to filter out one of the zones.
- MaxSkew as described in the KEP.
- Whether this is preferred or required.
This is what I have in mind:
```go
type PodSelectorTerm struct {
	// A label query over a set of resources, in this case pods.
	// +optional
	LabelSelector *metav1.LabelSelector
	// namespaces specifies which namespaces the labelSelector applies to (matches against);
	// null or empty list means "this pod's namespace"
	// +optional
	Namespaces []string
}

type EvenSpreading struct {
	MaxSkew     int32
	TopologyKey string
	PodSelector []PodSelectorTerm
	Required    bool // vs. preferred
	// +optional
	NodeSelector v1.NodeSelector
}
```
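For illustration, here is a hypothetical instantiation of the draft API above (field names follow this thread's sketch and may change; the NodeSelector field is omitted, since it is dropped later in the discussion):

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Local copies of the draft types above, declared here only so this sketch is
// self-contained; the real definitions would live in the pod API.
type PodSelectorTerm struct {
	LabelSelector *metav1.LabelSelector
	Namespaces    []string
}

type EvenSpreading struct {
	MaxSkew     int32
	TopologyKey string
	PodSelector []PodSelectorTerm
	Required    bool
}

func main() {
	// Spread pods labeled app=web evenly across zones, tolerating a skew of 1.
	spread := EvenSpreading{
		MaxSkew:     1,
		TopologyKey: "failure-domain.beta.kubernetes.io/zone",
		PodSelector: []PodSelectorTerm{{
			LabelSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "web"},
			},
		}},
		Required: true,
	}
	fmt.Printf("%+v\n", spread)
}
```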
After talking with @Huang-Wei offline, I think we can drop the NodeSelector from EvenSpreading and let the predicate process an external NodeAffinity if there is one in the pod spec.
@bsalamat do you mean the NodeAffinity specified in Affinity *Affinity?
@krmayankk Yes. NodeAffinity and NodeSelector are two other predicates that the scheduler has. This feature will take those into account.
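A toy illustration of that ordering, with made-up types and labels (not the scheduler's code): the pod's node-selection constraints narrow the node set first, and even spreading is then evaluated only over the surviving nodes.

```go
package main

import "fmt"

// A toy node: just a name and labels.
type node struct {
	name   string
	labels map[string]string
}

// matches reports whether the node carries every required label; a stand-in
// for the NodeSelector/NodeAffinity predicates mentioned above.
func matches(n node, required map[string]string) bool {
	for k, v := range required {
		if n.labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	nodes := []node{
		{"n1", map[string]string{"zone": "a", "disk": "ssd"}},
		{"n2", map[string]string{"zone": "b", "disk": "hdd"}},
		{"n3", map[string]string{"zone": "b", "disk": "ssd"}},
	}
	// Only nodes passing the pod's node selector take part in even spreading.
	var candidates []node
	for _, n := range nodes {
		if matches(n, map[string]string{"disk": "ssd"}) {
			candidates = append(candidates, n)
		}
	}
	fmt.Println(len(candidates)) // 2: skew is then computed over these nodes' zones only
}
```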
Force-pushed from 0401a63 to 4e36308
/cc @wojtek-t for reviewing.
```go
// - DoNotSchedule (default) tells the scheduler not to schedule it
// - ScheduleAnyway tells the scheduler to still schedule it
// Note: it's considered "Unsatisfiable" only when ActualSkew on all nodes exceeds "MaxSkew".
WhenUnsatisfiable ScheduleDecision
```
Why is this not a bool like required?
+1 to Mayank's question. Do you think you want to support other choices in the future? Do you have any examples in mind?
Right now, nope :)
(I was thinking of a third option to respect NodeSelector/NodeAffinity, but it turns out to be more of a reasonable implicit assumption.)
@lavalamp are you OK with changing it back to a bool: ScheduleWhenUnsatisfiable bool?
I think things like this are more understandable if the consequences of the action are in the enum value rather than true/false.
E.g., as an outsider, reading "ScheduleWhenUnsatisfiable" I have a lot of questions, like, "does that mean it doesn't schedule unless it's unsatisfiable?"
And I do think that you might need an additional parameter in the future, which is how strong the preference is.
And I do think that you might need an additional parameter in the future, which is how strong the preference is.
That reminds me of a possible scenario called "FallBackToAvailableNodes". In between "ScheduleAnyway" and "DoNotSchedule", maybe it's possible to compute the internal ActualSkew and MinimalSkew among the nodes which passed the previous predicates. This is related to an earlier discussion with @bsalamat.
This is just an immature thought, though; we definitely won't consider it in the initial implementation.
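Purely to pin down the terms used here, a toy sketch (not a design commitment; names are illustrative) of computing the actual skew per domain and the minimal achievable skew over the surviving nodes' topology domains:

```go
package main

import "fmt"

// skewStats returns, for each topology domain among the surviving nodes, the
// skew that placing the incoming pod there would produce, plus the minimal
// achievable skew. Hypothetical helper, only to illustrate "ActualSkew" and
// "MinimalSkew" as used in this thread.
func skewStats(matchingPodsPerDomain map[string]int) (perDomain map[string]int, minimal int) {
	min := -1
	for _, c := range matchingPodsPerDomain {
		if min == -1 || c < min {
			min = c
		}
	}
	perDomain = make(map[string]int)
	minimal = -1
	for d, c := range matchingPodsPerDomain {
		skew := c + 1 - min // skew if the new pod lands in domain d
		perDomain[d] = skew
		if minimal == -1 || skew < minimal {
			minimal = skew
		}
	}
	return perDomain, minimal
}

func main() {
	perDomain, minimal := skewStats(map[string]int{"zone-a": 4, "zone-b": 1})
	fmt.Println(perDomain, minimal) // map[zone-a:4 zone-b:1] 1
}
```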
I agree with the readability part.
Regarding another parameter to indicate the strength of preference, we already have a weight parameter for all preference (priority) functions.
For this particular case, I feel fairly strongly that the strings in the current draft communicate the purpose and implications of the decision much more clearly than the bool.
/hold
I can appreciate that, but I do think part of the point of a KEP is to show how a feature fits into Kubernetes holistically. Showing how it is complementary to existing features, or how the use cases are redistributed between features, seems very relevant to that end. Basically I think you answered the question; it might be nice to say what you wrote somewhere in the KEP so future readers understand more exactly what is changing.
eliminate embedded "TopologyKey":

```go
type ScheduleDecision string
```
This is a little too broad of a name. "UnsatisfiableConstraintResponse"?
Done.
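For reference, the renamed type could read roughly like this (a sketch using the draft names from this thread; the final API may still change):

```go
// UnsatisfiableConstraintResponse says what to do with a pod whose spreading
// constraint cannot be satisfied. Draft names from this discussion.
type UnsatisfiableConstraintResponse string

const (
	// DoNotSchedule (default) keeps the pod pending until the constraint can be satisfied.
	DoNotSchedule UnsatisfiableConstraintResponse = "DoNotSchedule"
	// ScheduleAnyway treats the constraint as a soft preference and schedules the pod regardless.
	ScheduleAnyway UnsatisfiableConstraintResponse = "ScheduleAnyway"
)
```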
That makes sense. I will add a paragraph addressing this. Updated: done.
```go
// namespaces specifies which namespaces the labelSelector applies to (matches against);
// null or empty list means "this pod's namespace"
// +optional
Namespaces []string
```
I still don't see a user story or something explaining why you would ever want a different namespace than the pod's own namespace?
I originally included it as a derivation from PodAffinity. I agree we can remove it and default to "this pod's namespace".
cc @bsalamat in case Bobby happens to know some history or a particular user story.
As long as you go for option 2, I think this is good enough from an API review perspective. I'll likely nitpick the exact comment phrasing more when the actual API change PR happens. :)
To be clear, I said we should postpone making the final decision on anti-affinity until we have analyzed our code base and have even pod spreading, but at this point we know that we will either limit the topology of anti-affinity to node only or make "anti-affinity on node" a fast path and any other topology in anti-affinity a slow path. I think we need to point this out in the KEP so that readers know that inter-pod anti-affinity in any topology other than node will either be dropped or will cause a scheduling slowdown.
/lgtm
/approve
Thanks, @Huang-Wei!
Obviously, we may change some names of types when you send the actual PRs to implement the API. When those names are finalized we can come back and update this KEP.
Force-pushed from 8c72537 to 7b24e34
Commits have been squashed. Thanks all!
@bsalamat Absolutely. PS: need another /lgtm.
/hold cancel
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: bsalamat, Huang-Wei, wgliang.
As a follow-up to the discussion in kubernetes/kubernetes#72479 (comment)
Tracking issue: #895
/sig scheduling
/kind design
/priority important-longterm