
KEP of even pods spreading #851

Merged
merged 1 commit into kubernetes:master on Apr 17, 2019

Conversation

Huang-Wei
Member

@Huang-Wei Huang-Wei commented Feb 22, 2019

As a follow-up of discussion kubernetes/kubernetes#72479 (comment)

Tracking issue: #895

/sig scheduling
/kind design
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. kind/design Categorizes issue or PR as related to design. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 22, 2019
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 22, 2019
@Huang-Wei
Member Author

@Fei-Guo

Fei-Guo commented Feb 22, 2019

"Even spreading is a predicate (hard requirement) instead of a priority"
For a feature like this, a typical concern is what happens if spreading causes application performance issues, i.e., do you want to place a Pod onto a heavily loaded node for the sake of spreading? Different people may have different opinions, but a hard rule seems very restrictive here. In practice, I feel AZ-level spreading can be a hard rule, but within an AZ a soft rule may be more suitable for spreading.

@Huang-Wei
Member Author

@Fei-Guo I agree that a hard requirement (predicate) could place a pod onto a heavily loaded node. That is why the "maxSkew" option is introduced - setting it to a reasonable value instead of 1 (the default, which can be thought of as "doubly hard") gives you better odds of picking a lightly loaded node. It first tolerates more candidate nodes in the predicate phase, and then relies on priority strategies to choose the best-fitting node.
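
To make the maxSkew trade-off concrete, here is a minimal, self-contained sketch (not the KEP's actual algorithm or the scheduler's code; fitsMaxSkew and the per-zone counts are invented for illustration) showing how a larger maxSkew admits more candidate zones in the predicate phase:

```go
package main

import "fmt"

// fitsMaxSkew is a simplified illustration: it reports whether placing one
// more matching pod into the given zone keeps the skew -- the zone's new
// count minus the minimum count across zones -- within maxSkew.
func fitsMaxSkew(matchingPodsPerZone map[string]int, zone string, maxSkew int) bool {
	min := -1
	for _, c := range matchingPodsPerZone {
		if min == -1 || c < min {
			min = c
		}
	}
	if min == -1 {
		return true // no topology data yet; trivially feasible
	}
	return matchingPodsPerZone[zone]+1-min <= maxSkew
}

func main() {
	counts := map[string]int{"zone-a": 3, "zone-b": 1, "zone-c": 1}
	fmt.Println(fitsMaxSkew(counts, "zone-a", 1)) // false: skew would become 3
	fmt.Println(fitsMaxSkew(counts, "zone-b", 1)) // true
	fmt.Println(fitsMaxSkew(counts, "zone-a", 3)) // true: a larger maxSkew tolerates more nodes
}
```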

@Huang-Wei Huang-Wei force-pushed the even-pods-spreading branch 2 times, most recently from bcd375a to befdd94 Compare February 22, 2019 20:27
@Huang-Wei Huang-Wei force-pushed the even-pods-spreading branch 2 times, most recently from 6ea59dc to abb5d63 Compare February 27, 2019 18:10
@Huang-Wei
Member Author

A new commit has been pushed which contains the details of Design B. PTAL.

Member

@bsalamat bsalamat left a comment


Thanks, @Huang-Wei. I think Design B makes more sense. I have a few minor comments.

- "@Huang-Wei"
owning-sig: sig-scheduling
reviewers:
- "@wojtek-t"
Member


Not sure if Wojtek will have the time to review this, but please add @MaciekPytel to ensure that this will not cause any issues for the autoscaler.

keps/sig-scheduling/20190221-even-pods-spreading.md (outdated; resolved)
```go
RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm
// Similar to the same field in PodAffinity/PodAntiAffinity
// +optional
PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm
```
Member


IIUC, the "Preferred" version will be the same as our current preferred inter-pod anti-affinity. Existence of this item is only justified if we plan to remove preferred anti-affinity.

Member Author


the "Preferred" version will be the same as our current preferred inter-pod anti-affinity

It's more similar to podAffinity, for both the "required" and "preferred" versions. It's just that we additionally need to implement MaxSkew.

Design B as a whole works independently of any other predicate/priority.

Member Author


Oh wait, I missed a top-level TopologyKey alongside MaxSkew. Without it, we can't identify in which specific topology we're going to place the matching pods evenly. (b/c we support []PodAffinityTerm, and each term can have a different topology key)

Member Author


BTW: the top-level TopologyKey can be renamed to DistributedBy (or DistributionKey), so as not to cause confusion with the TopologyKey that works inside each affinityTerm.
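
As a rough illustration of this comment (the type name EvenPodsSpreading and the DistributionKey field are hypothetical, combining the snippet quoted earlier with this suggestion), Design B with a top-level key might look like:

```go
// Hypothetical sketch only: a single top-level key that spreading is computed
// over, kept separate from the TopologyKey embedded in each PodAffinityTerm.
type EvenPodsSpreading struct {
	MaxSkew         int32
	DistributionKey string // e.g. "kubernetes.io/hostname" or a zone label
	// Similar to the same fields in PodAffinity/PodAntiAffinity
	RequiredDuringSchedulingIgnoredDuringExecution  []PodAffinityTerm
	PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm
}
```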

Member


IMO we should remove PodAffinityTerm and WeightedPodAffinityTerm and add a NodeSelector. NodeSelector allows users to specify where pods should be scheduled and to what subset of nodes even spreading is applied. For example, a user should be able to specify that their pods should be scheduled in 3 out of 4 available zones and that pods should be spread evenly among those 3 zones.
We don't need to specify pod affinity here at all. It complicates the design and I think the API becomes hard to use. Besides, if pod affinity is needed, users can add inter-pod affinity rules to their pod spec. Instead, we should add a pod selector which is a label selector and a namespace only. Unlike PodAffinityTerm, this one does not have a topology key.


omg, I think I just now understood the presence of two different topology keys, one in the podAffinityTerms and one outside. I think it's getting too complicated to use. I think we need to think harder here.

Member


@Huang-Wei Sorry if I was not clear. I think the API should specify all the following pieces of information:

  1. Which pods should be spread (needs a pod selector).
  2. What topology pods should be spread to (e.g., node, zone, region, etc.)
  3. What set of nodes should be considered. This is needed to filter out some nodes. For example, topology may be "zone", but a user may want to filter out one of the zones.
  4. MaxSkew as described in the KEP.
  5. Whether this is preferred or required.

This is what I have in mind:

```go
type PodSelectorTerm struct {
	// A label query over a set of resources, in this case pods.
	// +optional
	LabelSelector *metav1.LabelSelector
	// namespaces specifies which namespaces the labelSelector applies to (matches against);
	// null or empty list means "this pod's namespace"
	// +optional
	Namespaces []string
}

type EvenSpreading struct {
	MaxSkew     int32
	TopologyKey string
	PodSelector []PodSelectorTerm
	Required    bool // vs. preferred
	// +optional
	NodeSelector v1.NodeSelector
}
```
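
For a concrete feel of the proposal above (field values invented for illustration; this assumes the metav1 import and the types just defined), a spreading requirement over zones could be expressed as:

```go
// Hypothetical example value of the proposed EvenSpreading type; not part of the KEP.
spread := EvenSpreading{
	MaxSkew:     1,
	TopologyKey: "failure-domain.beta.kubernetes.io/zone",
	PodSelector: []PodSelectorTerm{{
		LabelSelector: &metav1.LabelSelector{
			MatchLabels: map[string]string{"app": "web"},
		},
		// Namespaces left empty: applies to "this pod's namespace".
	}},
	Required: true, // hard requirement (predicate) rather than a preference
	// NodeSelector omitted: spread across all otherwise-feasible nodes.
}
_ = spread
```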

Member


After talking with @Huang-Wei offline, I think we can drop the NodeSelector from EvenSpreading and let the predicate process an external NodeAffinity if there is one in the pod spec.


@bsalamat do you mean the NodeAffinity specified in Affinity *Affinity?

Member


@krmayankk Yes. NodeAffinity and NodeSelector are two other predicates that the scheduler has. This feature will take those into account.
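
A rough sketch of that interaction (the helper names below are hypothetical, not the scheduler's real predicate functions): even spreading would only count and consider nodes that the pod could land on after its own NodeSelector/NodeAffinity are applied:

```go
// Hypothetical sketch: restrict the node set before computing per-topology
// counts and skew, so even spreading composes with the existing predicates.
func nodesForSpreading(pod *v1.Pod, allNodes []*v1.Node) []*v1.Node {
	var feasible []*v1.Node
	for _, node := range allNodes {
		// matchesNodeSelector and matchesNodeAffinity stand in for the
		// scheduler's existing checks; they are not real function names.
		if matchesNodeSelector(pod, node) && matchesNodeAffinity(pod, node) {
			feasible = append(feasible, node)
		}
	}
	return feasible
}
```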

keps/sig-scheduling/20190221-even-pods-spreading.md (outdated; resolved)
@Huang-Wei Huang-Wei force-pushed the even-pods-spreading branch 2 times, most recently from 0401a63 to 4e36308 Compare March 13, 2019 01:55
@Huang-Wei
Member Author

/cc @wojtek-t for reviewing.

```go
// - DoNotSchedule (default) tells the scheduler not to schedule it
// - ScheduleAnyway tells the scheduler to still schedule it
// Note: it's considered "Unsatisfiable" only when the ActualSkew on all nodes exceeds "MaxSkew".
WhenUnsatisfiable ScheduleDecision
```


Why is this not a bool, like required?

Member


+1 to Mayank's question. Do you think you want to support other choices in the future? Do you have any examples in mind?

Member Author


Right now, nope :)

(I was thinking of a third option to respect NodeSelector/NodeAffinity, but it turns out to be better treated as a reasonable implicit assumption.)

@lavalamp are you OK with changing it back to a bool: ScheduleWhenUnsatisfiable bool?

Member


I think things like this are more understandable if the consequences of the action are in the enum value rather than true/false.

E.g., as an outsider, reading "ScheduleWhenUnsatisfiable" I have a lot of questions, like, "does that mean it doesn't schedule unless it's unsatisfiable?"

And I do think that you might need an additional parameter in the future, which is how strong the preference is.

Member Author


And I do think that you might need an additional parameter in the future, which is how strong the preference is.

That reminds me of a possible scenario called "FallBackToAvailableNodes". In between "ScheduleAnyway" and "DoNotSchedule", maybe it's possible to compute the internal ActualSkew and MinimalSkew among the nodes which passed the previous Predicates. This is related to an earlier discussion with @bsalamat.

This is just a rough thought though; we definitely won't consider it in the initial implementation.

Member


I agree with the readability part.
Regarding another parameter to indicate the strength of preference, we already have a weight parameter for all preference (priority) functions.

Member


For this particular case, I feel fairly strongly that the strings in the current draft communicate the purpose and implications of the decision much more clearly than the bool.

@Huang-Wei
Member Author

/hold
before squashing commits.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 11, 2019
@lavalamp
Member

@Huang-Wei

Bobby suggested earlier to focus on the KEP itself... As the design is a standalone predicate/priority, users can do whatever kind of combination as they wish.

I can appreciate that, but I do think part of the point of a KEP is to show how a feature fits into Kubernetes holistically. Showing how it is complementary to existing features, or how the use cases are redistributed between features, seems very relevant to that end.

Basically I think you answered the question; it might be nice to put what you wrote somewhere in the KEP so future readers understand more exactly what is changing.

eliminate embedded "TopologyKey":

```go
type ScheduleDecision string
```
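
For reference, the two values named in the earlier WhenUnsatisfiable snippet would presumably be declared as constants of this string type (a sketch only; the exact declarations belong to the API PR, and the type name itself gets revised below):

```go
const (
	// DoNotSchedule (default): do not schedule the pod when the constraint
	// cannot be satisfied.
	DoNotSchedule ScheduleDecision = "DoNotSchedule"
	// ScheduleAnyway: schedule the pod even if the constraint is unsatisfiable.
	ScheduleAnyway ScheduleDecision = "ScheduleAnyway"
)
```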
Member


This is a little too broad of a name. "UnsatisfiableConstraintResponse"?

Member Author


Done.

@Huang-Wei
Member Author

Huang-Wei commented Apr 11, 2019

I can appreciate that, but I do think part of the point of a KEP is to show how a feature fits into Kubernetes holistically. Showing how it is complementary to existing features, or how the use cases are redistributed between features, seems very relevant to that end.

That makes sense. I will add a paragraph addressing this.

Updated: done.

```go
// namespaces specifies which namespaces the labelSelector applies to (matches against);
// null or empty list means "this pod's namespace"
// +optional
Namespaces []string
```
Member


I still don't see a user story or something explaining why you would ever want a different namespace than the pod's own namespace?

Member Author


I originally put it in as a carry-over from PodAffinity. I agree we can remove it and default to "this pod's namespace".

cc @bsalamat in case Bobby happens to know some history or a particular user story.

@lavalamp
Member

As long as you go for option 2, I think this is good enough from an API review perspective. I'll likely nitpick the exact comment phrasing more when the actual API change PR happens. :)

@bsalamat
Member

Bobby suggested earlier to focus on the KEP itself... As the design is a standalone predicate/priority, users can do whatever kind of combination as they wish.

To be clear, I said we should postpone making the final decision on anti-affinity until after we have analyzed our code-base and after we have even pod spreading, but at this point we know that we will either limit the topology of anti-affinity to node only, or make "anti-affinity on node" a fast path and any other topology in anti-affinity a slow path. I think we need to point this out in the KEP so that readers know that inter-pod anti-affinity in any topology other than node will either be dropped or will cause a scheduling slowdown.

Member

@bsalamat bsalamat left a comment


/lgtm
/approve

Thanks, @Huang-Wei!
Obviously, we may change some names of types when you send the actual PRs to implement the API. When those names are finalized we can come back and update this KEP.

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Apr 17, 2019
@Huang-Wei Huang-Wei force-pushed the even-pods-spreading branch from 8c72537 to 7b24e34 Compare April 17, 2019 00:15
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 17, 2019
@Huang-Wei
Member Author

Commits have been squashed.

Thanks all!

Obviously, we may change some names of types when you send the actual PRs to implement the API. When those names are finalized we can come back and update this KEP.

@bsalamat Absolutely. PS: need another /lgtm.

@Huang-Wei
Member Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 17, 2019
@wgliang
Contributor

wgliang commented Apr 17, 2019

/lgtm
Thanks, @Huang-Wei

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 17, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, Huang-Wei, wgliang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 809819e into kubernetes:master Apr 17, 2019
@Huang-Wei Huang-Wei deleted the even-pods-spreading branch April 17, 2019 06:17
Labels
api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/design Categorizes issue or PR as related to design. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.