Topology Aware Scheduling KEP #2725
Conversation
Tracking of nodes by Kueue will increase its code complexity and memory
consumption.

## Alternatives
Could you add some comments about how this would interface with scheduler- or kubelet-based policies?
https://scheduler-plugins.sigs.k8s.io/docs/kep/119-node-resource-topology-aware-scheduling/readme/
At the end of the day, users can specify this at the Kueue level, but if the scheduler or kubelet are not set up accordingly, I don't see how you can guarantee this at the Kueue level.
Maybe there is an unnecessary naming coincidence, but I don't think the feature in Kueue is related to the plugin.
We don't require it to run. Also, this feature is not related to NUMA nodes. Let me know if I'm missing something.
The aim of this feature is to provide compact placement of pods on nodes, by using information about node placement in the datacenter.
So this feature is not really about topology on a node (which Topology Manager or topology-aware scheduling would take into account). You are proposing inter-node topology-aware scheduling (node-to-node placement). I guess that this is the distinction.
Wearing my Kueue-follower hat, I think this is a neat feature.
Wearing my OpenShift hat, I am concerned that this feature supports a particular cloud vendor without really publishing how one can label nodes so that the feature can be used elsewhere.
Are you planning on proposing official labels/annotations in k8s to walk through how other cloud providers (or on-prem admins) can actually set up their cluster so that this feature is valid?
And don't take this as blocking, but I would like to see some work done to formalize how a cluster should be labeled. Otherwise I don't know how other vendors/on-prem could use this feature.
I attended a wg-serving meeting earlier today where someone was talking about the types of ways data centers are laid out and what is necessary for higher speed interconnects. And I think the point of that meeting was to start to define what kinds of labels/information would be necessary.
I think this is a hard problem.
Networking in data centers is complex, for sure. Agreed, this proposal may not reflect every nuance of data center networking, especially since the details of networking and the depth of the hierarchy will differ between data centers of different cloud providers. The details may even differ between multiple data centers of the same cloud provider, and, I believe, may also differ between machine types: TPU vs GPU.
This is the first iteration of support for node-to-node Topology Aware Scheduling in Kueue. Every journey needs to start with a first step.
Thus, the proposed model is deliberately simplified so that it can work as a common denominator between different cloud providers.
Note that the label names cloud-provider.com/topology-block and cloud-provider.com/topology-rack used in the Hierarchy organization section are just examples; they can be customized in the admin-facing API for a ResourceFlavor.
The only thing that matters for the model is that there is some hierarchy of nodes in a data center, which is almost certainly true. Since almost every data center has some concept of racks and blocks, I use such labels to appeal to intuition, but it is flexible. The mechanics of exposing this information as labels are left to cloud providers.
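For illustration, here is a minimal sketch of how an admin could express such a hierarchy with the `Levels` field quoted later in this thread; the `TopologyLevel` layout is inferred from those API snippets, and the label values are the example names above, not official ones:

```go
package main

import "fmt"

// TopologyLevel mirrors the API fragment quoted later in this review;
// only the NodeLabel field is assumed here.
type TopologyLevel struct {
	NodeLabel string `json:"nodeLabel,omitempty"`
}

func main() {
	// Levels ordered from the broadest domain (block) down to the
	// narrowest (rack); the label names are customizable examples.
	levels := []TopologyLevel{
		{NodeLabel: "cloud-provider.com/topology-block"},
		{NodeLabel: "cloud-provider.com/topology-rack"},
	}
	for i, l := range levels {
		fmt.Printf("level %d: %s\n", i, l.NodeLabel)
	}
}
```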
LGTM
FYI to all reviewers: I have updated a couple of sections of the KEP based on the offline review and discussions with @alculquicondor and @mwielgus. PTAL.
Then, we compute the assignments per pod set corresponding to
ReplicatedJobs, rather than for the entire PodSet. Finally, the `PodSetUngater`
ungates pods per ReplicatedJob assignment.
Don't we also need changes to the `TopologyAssignment`?
How can the user specify that they want each replicated job to be in a particular topology domain?
Don't we also need changes to the `TopologyAssignment`?

I don't think this is needed, as the TopologyAssignment contains a list of PodSetAssignments. If there are 2 ReplicatedJobs, there would be two items on the list. An alternative API would be a ReplicatedJobAssignment holding the ReplicatedJobKey and the PodSetAssignment. However, we would need to do it that way from day 0. So, I proposed the incremental approach.
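For illustration only, a hypothetical sketch of that alternative (it was not adopted; `PodSetAssignment` stands in for the existing Kueue type, with fields elided):

```go
package kueue

// PodSetAssignment stands in for the existing Kueue type; fields elided.
type PodSetAssignment struct{}

// ReplicatedJobAssignment is the alternative API floated above and not
// adopted, in favor of the incremental approach.
type ReplicatedJobAssignment struct {
	// ReplicatedJobKey identifies the ReplicatedJob within the JobSet.
	ReplicatedJobKey string `json:"replicatedJobKey"`
	// Assignment holds the topology assignment for that ReplicatedJob.
	Assignment PodSetAssignment `json:"assignment"`
}
```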
How can the user specify that they want each replicated job to be in a particular topology domain?

A user can use "require: rack" or "prefer: rack". In the case of "require: rack" we will require that each instance of a ReplicatedJob is fully contained within a rack, but there might be two jobs landing on the same rack. When using "prefer", a single ReplicatedJob may span 2 or more racks, but we try to pack it as compactly as possible.
So, in any other job, if I say `require: rack`, it means that the entire job will land on one rack.
But in JobSet, it means that each replica will land on one rack. Or is the annotation only going to be in the Job templates? That's not clearly explained.
This is related to the other question about the scope of topology. There are actually 3 possible levels where we want to require/prefer a topology domain:
L1. Workload level (comprises multiple PodSets for MPIJob, JobSet and others)
L2. PodSet (comprises multiple ReplicatedJobs in case of JobSet)
L3. ReplicatedJob
I discussed with @mwielgus that all are valid use cases we may want to do at some point. The open questions are:
Q1: what are the APIs to request topology at every such level
Q2: which we prioritize for Alpha and which we can defer for Beta
Regarding Q1 I think we can have:
L1: "preferred-topology / required-topology" annotations at the workload level
L2: "replicaset-preferred-topology / replicaset-required-topology" annotations at the PodTemplate level
L3: "replica-preferred-topology / replica-required-topology" annotations at the PodTemplate level
I'm open to other/better names. Marcin suggests starting with whatever is simplest to deliver. Looking at how the flavorassigner works in Kueue, I think L2 is the simplest, even if not the most functional. WDYT @alculquicondor @tenzen-y? I will update the KEP accordingly.
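To make the naming proposal above concrete, a hypothetical sketch of the annotation keys; the `kueue.x-k8s.io` prefix and the Go constant names are assumptions, since the exact names are explicitly left open here:

```go
package kueue

// Hypothetical annotation keys for the three levels listed above; the
// prefix and final names were still open at this point in the review.
const (
	// L1: workload level.
	PreferredTopologyAnnotation = "kueue.x-k8s.io/preferred-topology"
	RequiredTopologyAnnotation  = "kueue.x-k8s.io/required-topology"
	// L2: PodSet level, set on the PodTemplate.
	ReplicaSetPreferredTopologyAnnotation = "kueue.x-k8s.io/replicaset-preferred-topology"
	ReplicaSetRequiredTopologyAnnotation  = "kueue.x-k8s.io/replicaset-required-topology"
	// L3: ReplicatedJob level, set on the PodTemplate.
	ReplicaPreferredTopologyAnnotation = "kueue.x-k8s.io/replica-preferred-topology"
	ReplicaRequiredTopologyAnnotation  = "kueue.x-k8s.io/replica-required-topology"
)
```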
I think we should start with the most useful scenario, one that we know is already in use. And that's L3.
I agree that L3 is probably the most useful. However, at the same time it is significantly more complex than L2, which matches how Kueue already works right now.
L3 requires some non-trivial refactoring. It has to be done, but for the first step I would suggest implementing the most basic functionality to make sure the thing works, more or less, E2E, and THEN focusing on the refactoring that enables the most valuable use cases.
Ok, please update user stories and design accordingly.
sgtm, the Support for ReplicatedJobs in JobSet section focuses on L3, but it still has some open questions.
The options I see:
- figure out the open questions there before Alpha
- move the section to alternatives or notes (as a backup for the work already done) and add a graduation point about determining the remaining design points then

I would lean towards (2); any preference @alculquicondor?
(2) seems fine. But leave notes for whatever JobSet work is going to happen in alpha.
Ok, moved the API work started so far to alternatives. I don't think we need any special work for JobSet in alpha then, except for this: https://github.com/kubernetes-sigs/kueue/pull/2725/files#diff-40027534eab1ffd244e3c960acbf91b55b9556bebd8cd56077ba5f3270648284R163-R167, which is already released, so we may consider dropping the note.
Ok, please update user stories and design accordingly.
Done, PTAL.
/cc @tenzen-y @alculquicondor @KPostOffice @PBundyra I believe I have addressed all outstanding comments, PTAL.
@mimowo: GitHub didn't allow me to request PR reviews from the following users: KPostOffice. Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Basically, lgtm!
I left some small questions.
```go
// +listType=atomic
// +kubebuilder:validation:MinItems=1
// +kubebuilder:validation:MaxItems=5
Levels []TopologyLevel `json:"levels,omitempty"`
```
When we use the "podset-preferred-topology" policy, are these levels evaluated in order? Or is there some other strategy to evaluate these levels?
Yes, they get evaluated in order: from the one specified, going up to the top. At each step we check if it is possible to satisfy the resource demand within a single topology domain. Say the user specifies rack: then we check rack; if that is not possible, we fall back to block; if that is not possible either, we admit using multiple blocks.
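A minimal sketch of that fallback, assuming a deliberately simplified capacity model (the `level`/`assign` names and the free-slot counting are hypothetical; the real algorithm also has to respect quota and per-node capacity):

```go
package main

import "fmt"

// level is one rung of the topology hierarchy, ordered from the level
// the user specified (e.g. rack) up to the top (e.g. block).
type level struct {
	name    string
	domains map[string]int // domain name -> pods that still fit there
}

// assign returns the first domain, walking up the hierarchy, that can
// hold the whole pod set; if none can, the workload is admitted across
// multiple top-level domains.
func assign(levels []level, pods int) string {
	for _, l := range levels {
		for name, free := range l.domains {
			if free >= pods {
				return fmt.Sprintf("fits in %s %q", l.name, name)
			}
		}
	}
	return "admitted across multiple top-level domains"
}

func main() {
	levels := []level{
		{name: "rack", domains: map[string]int{"rack-a": 4, "rack-b": 6}},
		{name: "block", domains: map[string]int{"block-1": 12}},
	}
	// 8 pods don't fit in any single rack, so we fall back to a block.
	fmt.Println(assign(levels, 8))
}
```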
That makes sense. It would be great to mention the 3-step evaluation mechanism in an API comment or a dedicated section.
Additionally, I'm wondering if we need to evolve the internal PodSetInfo mechanism to support TAS.
Maybe we need to have nodeSelector information for every Pod instead of per PodSet, right?
Lines 42 to 49 in ac77dea:

```go
type PodSetInfo struct {
	Name         string
	Count        int32
	Annotations  map[string]string
	Labels       map[string]string
	NodeSelector map[string]string
	Tolerations  []corev1.Toleration
}
```
I don't think we need to extend the PodSetInfo mechanism. This mechanism is used to restore the nodeSelectors (and other properties of a Job) on suspend / eviction. However, we can restore it to the same values regardless of the assignment during admission.
Also, PodSetInfo is an internal structure rather than part of the API, so I think we can figure it out during the implementation phase, but I don't expect a need to change it.
It would be great to mention the 3 step evaluation mechanism in API comment or dedicated section.
Good point, I have added explanation in the documentation for the "preferred" annotation. PTAL.
I don't think we need to extend the PodSetInfo mechanism. This mechanism is used to restore the nodeSelectors (and other proporties of a Job) on suspend / eviction. However, we can restore it to the same values regardless of the assignment during admission.
That makes sense. Thank you for the clarifications.
Also, the PodSetInfo is an internal structure, rather than part of the API so I think we can figure it out during the implementation phase, but I don't expect a need to change it.
Yeah, I think so too; I just wondered what we will do with PodSetInfo after we introduce this feature. I agree we don't need to mention that here.
It would be great to mention the 3 step evaluation mechanism in API comment or dedicated section.
Good point, I have added explanation in the documentation for the "preferred" annotation. PTAL.
Great to see. Thank you!
`maxPrefixLength` in `workload_names.go`. However, we can also facilitate
fast lookups based on labels if we index the workloads.

### Implement it in ClusterAutoscaler or kube-scheduler
This section resolves my core questions. Thank you!
Also, both of the components don't have a concept of quota, which we want to be
supported with this feature. So, if Kueue admits a Job with "required: rack"
just based on quota, the Pods might be created, but the components wouldn't
be able to schedule the Pods if there is not enough capacity in the topology
domain.
This is just a question: does this mean that the kube-scheduler and cluster-autoscaler do not have a retry mechanism when a Pod cannot be scheduled based on its PodGroup?
They do have a retry mechanism. AFAIK the components don't have a native concept of a Pod group. You can simulate Pod groups using Volcano with kube-scheduler, but kube-scheduler won't be able to provision nodes by itself if they don't exist.
That makes sense. Maybe we can delegate TAS scheduling to kube-scheduler after we support the native PodGroup feature: kubernetes/enhancements#4671.
But it is not progressing in k/k right now.
Even then, I think there are other blockers, like the lack of an API to support topology. I think it makes sense to incubate such an API in CRDs before moving it to the core.
Yes, I agree with you.
As an alternative, we could provide a kube-scheduler plugin for TAS on the Kueue side instead of watching Node resources in Kueue, but I guess a custom Kueue kube-scheduler plugin may create obstacles for batch admins installing Kueue.
So, +1 to this node-watch mechanism on the Kueue side for now.
But if we add more and more features depending on node information, I wonder if we should revisit this and discuss whether to provide a custom kube-scheduler plugin for Kueue.
```go
// +kubebuilder:validation:MaxLength=316
// +kubebuilder:validation:Pattern=`^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$`
NodeLabel string `json:"nodeLabel,omitempty"`
}
```
Will we want to support tolerations here based on user feedback?
I agree with supporting only nodeLabel during the alpha stage.
I think having a different set of tolerations depending on the hierarchy level would be a complication. I don't see it as necessary, but yes, if there are use cases we will have a place to introduce them.
That makes sense. Thanks for the clarifications.
Thank you for driving this significantly valuable feature!
I'm looking forward to releasing this feature!
/lgtm
/approve
LGTM label has been added. Git tree hash: 7bf5553fb927b2cf1562c60a784833cc7bb27639
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mimowo, tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
* Topology Aware Scheduling KEP
* Review remarks

Update keps/2724-topology-aware-schedling/README.md

Co-authored-by: Yuki Iwai <[email protected]>
Co-authored-by: Aldo Culquicondor <[email protected]>
What type of PR is this?
/kind documentation
What this PR does / why we need it:
To propose the design for TAS.
Which issue(s) this PR fixes:
KEP for #2724
Special notes for your reviewer:
Does this PR introduce a user-facing change?