KEP of even pods spreading #851
Conversation
Force-pushed from 0a6a44f to f92d9a6
"Even spreading is a predicate (hard requirement) instead of a priority" |
@Fei-Guo I agree that hard requirement (predicate) could probably place a pod onto a high load node. So the "maxSkew" option is brought up - setting it to a reasonable value instead of 1 (default value, which can be thought of "double hard") can give you higher odds to pick up a low load node. It will firstly tolerate more candidate nodes in predicate phase, and then proceed to rely on priority strategies to choose which node is best fit. |
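To make that maxSkew trade-off concrete, here is a toy sketch (illustrative names only, not the KEP's implementation): a larger maxSkew admits more topology domains in the predicate phase, leaving the priority functions to break the tie.

```go
package main

import "fmt"

// feasibleDomains returns the topology domains where one more matching pod
// could land without pushing that domain's skew (its pod count minus the
// global minimum) above maxSkew. Purely illustrative.
func feasibleDomains(podsPerDomain map[string]int, maxSkew int) []string {
	min := -1
	for _, c := range podsPerDomain {
		if min == -1 || c < min {
			min = c
		}
	}
	var out []string
	for d, c := range podsPerDomain {
		if c+1-min <= maxSkew { // placing the pod here raises the count to c+1
			out = append(out, d)
		}
	}
	return out
}

func main() {
	counts := map[string]int{"zone-a": 3, "zone-b": 2, "zone-c": 2}
	fmt.Println(feasibleDomains(counts, 1)) // only the least-loaded zones b and c
	fmt.Println(feasibleDomains(counts, 2)) // maxSkew=2 also admits zone-a
}
```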
Force-pushed from bcd375a to befdd94
Force-pushed from 6ea59dc to abb5d63
A new commit has been pushed with the details of Design B. PTAL.
Thanks, @Huang-Wei. I think Design B makes more sense. I have a few minor comments.
- "@Huang-Wei" | ||
owning-sig: sig-scheduling | ||
reviewers: | ||
- "@wojtek-t" |
Not sure if Wojtek will have the time to review this, but please add @MaciekPytel to ensure that this will not cause any issues for the autoscaler.
```go
RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm
// Similar to the same field in PodAffinity/PodAntiAffinity
// +optional
PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm
```
IIUC, the "Preferred" version will be the same as our current preferred inter-pod anti-affinity. Existence of this item is only justified if we plan to remove preferred anti-affinity.
the "Preferred" version will be the same as our current preferred inter-pod anti-affinity
It's more similar to podAffinity, for both the "required" and "preferred" versions. It's just that we additionally need to implement MaxSkew.
The whole Design B works independently of any other predicate/priority.
Oh wait, I missed a top level TopologyKey along with MaxSkew. Without it, we can't identify which specific topology we're going to spread the matching pods evenly across (because we support []PodAffinityTerms, and each term can have a different topology key).
BTW: the top level TopologyKey can be renamed to DistributedBy (or DistributionKey), so as not to cause confusion with the TopologyKey that works inside each affinityTerm.
IMO we should remove PodAffinityTerm and WeightedPodAffinityTerm and add a NodeSelector. NodeSelector allows users to specify where pods should be scheduled and what subset of nodes even spreading is applied to. For example, a user should be able to specify that their pods should be scheduled in 3 out of 4 available zones and that pods should be spread evenly among those 3 zones.
We don't need to specify pod affinity here at all. It complicates the design, and I think the API becomes hard to use. Besides, if pod affinity is needed, users can add inter-pod affinity rules to their pod spec. Instead, we should add a pod selector, which is a label selector and a namespace only. Unlike PodAffinityTerm, this one does not have a topology key.
OMG, I think I just now understood the presence of two different topology keys, one in the podAffinityTerms and one outside. I think it's getting too complicated to use. I think we need to think harder here.
@Huang-Wei Sorry if I was not clear. I think the API should specify all the following pieces of information:
- Which pods should be spread (needs a pod selector).
- What topology pods should be spread to (e.g., node, zone, region, etc.)
- What set of nodes should be considered. This is needed to filter out some nodes. For example, topology may be "zone", but a user may want to filter out one of the zones.
- MaxSkew as described in the KEP.
- Whether this is preferred or required.
This is what I have in mind:
```go
type PodSelectorTerm struct {
	// A label query over a set of resources, in this case pods.
	// +optional
	LabelSelector *metav1.LabelSelector
	// namespaces specifies which namespaces the labelSelector applies to (matches against);
	// null or empty list means "this pod's namespace"
	// +optional
	Namespaces []string
}

type EvenSpreading struct {
	MaxSkew     int32
	TopologyKey string
	PodSelector []PodSelectorTerm
	Required    bool // vs. preferred
	// +optional
	NodeSelector v1.NodeSelector
}
```
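For illustration, here is a hypothetical instantiation of the draft API above (field names follow this thread's sketch and may change; the NodeSelector field is omitted, since it is dropped later in the discussion):

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Local copies of the draft types above, declared here only so this sketch is
// self-contained; the real definitions would live in the pod API.
type PodSelectorTerm struct {
	LabelSelector *metav1.LabelSelector
	Namespaces    []string
}

type EvenSpreading struct {
	MaxSkew     int32
	TopologyKey string
	PodSelector []PodSelectorTerm
	Required    bool
}

func main() {
	// Spread pods labeled app=web evenly across zones, tolerating a skew of 1.
	spread := EvenSpreading{
		MaxSkew:     1,
		TopologyKey: "failure-domain.beta.kubernetes.io/zone",
		PodSelector: []PodSelectorTerm{{
			LabelSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "web"},
			},
		}},
		Required: true,
	}
	fmt.Printf("%+v\n", spread)
}
```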
After talking with @Huang-Wei offline, I think we can drop the NodeSelector from EvenSpreading and let the predicate process an external NodeAffinity if there is one in the pod spec.
@bsalamat do you mean the NodeAffinity specified in Affinity *Affinity?
@krmayankk Yes. NodeAffinity and NodeSelector are two other predicates that the scheduler has. This feature will take those into account.
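A toy illustration of that ordering, with made-up types and labels (not the scheduler's code): the pod's node-selection constraints narrow the node set first, and even spreading is then evaluated only over the surviving nodes.

```go
package main

import "fmt"

// A toy node: just a name and labels.
type node struct {
	name   string
	labels map[string]string
}

// matches reports whether the node carries every required label; a stand-in
// for the NodeSelector/NodeAffinity predicates mentioned above.
func matches(n node, required map[string]string) bool {
	for k, v := range required {
		if n.labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	nodes := []node{
		{"n1", map[string]string{"zone": "a", "disk": "ssd"}},
		{"n2", map[string]string{"zone": "b", "disk": "hdd"}},
		{"n3", map[string]string{"zone": "b", "disk": "ssd"}},
	}
	// Only nodes passing the pod's node selector take part in even spreading.
	var candidates []node
	for _, n := range nodes {
		if matches(n, map[string]string{"disk": "ssd"}) {
			candidates = append(candidates, n)
		}
	}
	fmt.Println(len(candidates)) // 2: skew is then computed over these nodes' zones only
}
```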
Force-pushed from 0401a63 to 4e36308
/cc @wojtek-t for reviewing.
```go
// - DoNotSchedule (default) tells the scheduler not to schedule it
// - ScheduleAnyway tells the scheduler to still schedule it
// Note: it's considered "Unsatisfiable" only when ActualSkew on all nodes exceeds "MaxSkew".
WhenUnsatisfiable ScheduleDecision
```
Why is this not a bool like required?
+1 to Mayank's question. Do you think you want to support other choices in the future? Do you have any examples in mind?
Right now, nope :)
(I was thinking of a third option to respect NodeSelector/NodeAffinity, but it turns out to be more of a reasonable implicit assumption.)
@lavalamp are you OK with changing it back to a bool: ScheduleWhenUnsatisfiable bool?
I think things like this are more understandable if the consequences of the action are in the enum value rather than true/false.
E.g., as an outsider, reading "ScheduleWhenUnsatisfiable" I have a lot of questions, like, "does that mean it doesn't schedule unless it's unsatisfiable?"
And I do think that you might need an additional parameter in the future, which is how strong the preference is.
And I do think that you might need an additional parameter in the future, which is how strong the preference is.
That reminds me of a possible scenario called "FallBackToAvailableNodes". In between "ScheduleAnyway" and "DoNotSchedule", maybe it's possible to compute the internal ActualSkew and MinimalSkew among the nodes which passed the previous predicates. This is related to an earlier discussion with @bsalamat.
This is just an immature thought, though; we definitely won't consider it in the initial implementation.
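Purely to pin down the terms used here, a toy sketch (not a design commitment; names are illustrative) of computing the actual skew per domain and the minimal achievable skew over the surviving nodes' topology domains:

```go
package main

import "fmt"

// skewStats returns, for each topology domain among the surviving nodes, the
// skew that placing the incoming pod there would produce, plus the minimal
// achievable skew. Hypothetical helper, only to illustrate "ActualSkew" and
// "MinimalSkew" as used in this thread.
func skewStats(matchingPodsPerDomain map[string]int) (perDomain map[string]int, minimal int) {
	min := -1
	for _, c := range matchingPodsPerDomain {
		if min == -1 || c < min {
			min = c
		}
	}
	perDomain = make(map[string]int)
	minimal = -1
	for d, c := range matchingPodsPerDomain {
		skew := c + 1 - min // skew if the new pod lands in domain d
		perDomain[d] = skew
		if minimal == -1 || skew < minimal {
			minimal = skew
		}
	}
	return perDomain, minimal
}

func main() {
	perDomain, minimal := skewStats(map[string]int{"zone-a": 4, "zone-b": 1})
	fmt.Println(perDomain, minimal) // map[zone-a:4 zone-b:1] 1
}
```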
I agree with the readability part.
Regarding another parameter to indicate the strength of preference, we already have a weight parameter for all preference (priority) functions.
For this particular case, I feel fairly strongly that the strings in the current draft communicate the purpose and implications of the decision much more clearly than the bool.
/hold
I can appreciate that, but I do think part of the point of a KEP is to show how a feature fits into Kubernetes holistically. Showing how it is complementary to existing features, or how the use cases are redistributed between features, seems very relevant to that end. Basically I think you answered the question; it might be nice to say what you wrote somewhere in the KEP so future readers understand more exactly what is changing.
eliminate embedded "TopologyKey":

```go
type ScheduleDecision string
```
This is a little too broad of a name. "UnsatisfiableConstraintResponse"?
Done.
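For reference, the renamed type could read roughly like this (a sketch using the draft names from this thread; the final API may still change):

```go
// UnsatisfiableConstraintResponse says what to do with a pod whose spreading
// constraint cannot be satisfied. Draft names from this discussion.
type UnsatisfiableConstraintResponse string

const (
	// DoNotSchedule (default) keeps the pod pending until the constraint can be satisfied.
	DoNotSchedule UnsatisfiableConstraintResponse = "DoNotSchedule"
	// ScheduleAnyway treats the constraint as a soft preference and schedules the pod regardless.
	ScheduleAnyway UnsatisfiableConstraintResponse = "ScheduleAnyway"
)
```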
That makes sense. I will add a paragraph addressing this. Updated: done.
```go
// namespaces specifies which namespaces the labelSelector applies to (matches against);
// null or empty list means "this pod's namespace"
// +optional
Namespaces []string
```
I still don't see a user story or something explaining why you would ever want a different namespace than the pod's own namespace?
I originally included it as a derivation from PodAffinity. I agree we can remove it and default to "this pod's namespace".
cc @bsalamat in case Bobby happens to know some history or a particular user story.
As long as you go for option 2, I think this is good enough from an API review perspective. I'll likely nitpick the exact comment phrasing more when the actual API change PR happens. :)
To be clear, I said we should postpone making the final decision on anti-affinity until we have analyzed our code base and have even pod spreading, but at this point we know that we will either limit the topology of anti-affinity to node only or make "anti-affinity on node" a fast path and any other topology in anti-affinity a slow path. I think we need to point this out in the KEP so that readers know that inter-pod anti-affinity in any topology other than node will either be dropped or will cause a scheduling slowdown.
/lgtm
/approve
Thanks, @Huang-Wei!
Obviously, we may change some names of types when you send the actual PRs to implement the API. When those names are finalized we can come back and update this KEP.
Force-pushed from 8c72537 to 7b24e34
Commits have been squashed. Thanks all!
@bsalamat Absolutely. PS: need another /lgtm.
/hold cancel
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: bsalamat, Huang-Wei, wgliang.
As a follow-up to the discussion in kubernetes/kubernetes#72479 (comment)
Tracking issue: #895
/sig scheduling
/kind design
/priority important-longterm