-
Notifications
You must be signed in to change notification settings - Fork 262
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TAS: PodSet label and Workload annotation for PodTemplates #3228
Conversation
Skipping CI for Draft Pull Request. |
/test all |
✅ Deploy Preview for kubernetes-sigs-kueue ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
a9b55cd
to
c028214
Compare
cc @PBundyra |
/cc |
@@ -39,4 +39,14 @@ const ( | |||
|
|||
// ProvReqAnnotationPrefix is the prefix for annotations that should be pass to ProvisioningRequest as Parameters. | |||
ProvReqAnnotationPrefix = "provreq.kueue.x-k8s.io/" | |||
|
|||
// WorkloadAnnotation is an annotation set on the Job's PodTemplate to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So reading this I think its limiting to a K8s job. I think you mean a general kueue supported job but not sure what term we pick to reference this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is supported for any Job (can be MPIJob, JobSet, etc.). In Kueue we say Job in many contexts meaning any of the CRDs, see the user-facing docs: https://kueue.sigs.k8s.io/docs/overview/.
@@ -772,6 +774,8 @@ func TestReconciler(t *testing.T) { | |||
*basePodWrapper. | |||
Clone(). | |||
Label(constants.ManagedByKueueLabel, "true"). | |||
Label(controllerconsts.PodSetLabel, "dc85db45"). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you define these as constants and reuse that here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is the best course of action, because:
- this string is used 105 times in the file already:
*utiltesting.MakePodSet("dc85db45", 2). - if I introduce a const like
podSetNameForTest
at the top of the file it could suggest it should be used in all tests, but it is derived from the Pods actually, it was just copy-pasted many times.
So, I don't see this would be very clean, but let me know your thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind if I do a cleanup?
I agree with you that this should not be in scope for this feature then. I didn't see how many occurences this has.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, feel free to follow up! maybe instead of hard-coding a const we could compute the hash in tests, but not sure at the moment.
/hold Since author has approval rights, we should unhold when it is ready to merge. |
// WorkloadAnnotation is an annotation set on the Job's PodTemplate to | ||
// indicate the name of the admitted Workload corresponding to the Job. The | ||
// annotation is set when starting the Job, and removed on stopping the Job. | ||
WorkloadAnnotation = "kueue.x-k8s.io/workload" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference between this annotation and the Workload name hash?
func GetWorkloadNameForOwnerWithGVK(ownerName string, ownerUID types.UID, ownerGVK schema.GroupVersionKind) string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, is there any case where this annotation and pre-build workload annotation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference between this annotation and the Workload name hash?
This new annotation contains the full workload name, for example job-sample-job-12345
, where 12345 is the hash. The annotation is set on PodTemplate so that newly created Pods carry the information to lookup the Workload.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, is there any case where this annotation and pre-build workload annotation?
First, note that the two annotations are set at different levels. Prebuilt-workload is the Job-level annotations, whilst the new annotation is set at the PodTemplate level, so that the new Pods "know" their workload.
This might happen on MultiKueue worker cluster. The MK would set the prebuit-workload annotation, and the admission process on the worker will set the new "workload" annotation. However, the new annotation is set on the PodTemplate rather than at the Job level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This new annotation contains the full workload name, for example job-sample-job-12345, where 12345 is the hash. The annotation is set on PodTemplate so that newly created Pods carry the information to lookup the Workload.
In that case, we can remove the hash mechanism after we graduate this feature to GA?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might happen on MultiKueue worker cluster. The MK would set the prebuit-workload annotation, and the admission process on the worker will set the new "workload" annotation. However, the new annotation is set on the PodTemplate rather than at the Job level.
I see, so the Pod has only new "workload" annotation. In that case, there seems no to be conflict.
Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, we can remove the hash mechanism after we graduate this feature to GA?
I think the hash mechanism is here to stay, and was introduced prior to TAS. The hash in the workload name is to ensure Workload names don't exceed the max length (currently defined as 253 chars). This is needed because workload names add the prefix to Job names. For example for an MPIJob "test-job", we have "mpijob-test-job-".
info.Labels[controllerconsts.PodSetLabel] = podSetFlavor.Name | ||
info.Annotations[controllerconsts.WorkloadAnnotation] = w.Name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we guard these labels and annotations insertion by the feature gates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, let's summarize the options:
- no feature gate (as currently)
- use the TopologyAwareScheduling feature-gate
- use another dedicated feature gate like PodTemplatePodSetLabelAndWorkloadAnnotation
TBH, I would be leaning towards 1. as the code is simple and low risk.
My second choice would be (2.) as it is introduced for the needs for TopologyAwareScheduling. if we do (2.) we can always enable it by or with other feature gates. For example TopologyAwareScheduling || AnotherFeatureNeedingTheLabels
. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I discussed this with @mimowo offline.
We decided to add the TAS gate here so that we can break this labels and annotations specification during the beta / GA graduation. For example, we may want to move the "Workload" annotation to labels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right. After a second thought I think it is safer. For the purpose of fast lookup of the workload from Pod it would be better to have a Workload label rather than annotation. For now it is an annotation as the workload names can be longer that 63 characters, so it requires some preparatory work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to gate it for 0.9, but this has the potential to be useful beyond TAS. An efficient mapping from Pods back to their Workloads is a very useful building block for general fault tolerance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you.
So, the during beta graduation, we will consider if we move all to labels or change key name. After beta graduation, these labels are added by default.
c028214
to
b245734
Compare
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mimowo The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
FYI following the discussion here. I have updated the PR to introduce the TopologyAwareScheduling gate which I will also reuse in the follow up PRs for the feature. PTAL. |
b245734
to
9ce743b
Compare
/lgtm |
LGTM label has been added. Git tree hash: 413450b80b29c498e21873c3a408249d0ce8c058
|
What type of PR is this?
/kind feature
What this PR does / why we need it:
Which issue(s) this PR fixes:
Part of #2724
Special notes for your reviewer:
The annotation and label are used only for TAS now, but the names are generic since they can be used in other contexts too. For TAS they are used by TopologyUngater to lookup The admitted Workload from a Pod.
Does this PR introduce a user-facing change?