[KubeRay] Support NumOfHosts when calculating PodSet assignments #3384
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
@@ -150,6 +150,11 @@ func (j *ClusterWrapper) WithWorkerPriorityClassName(value string) *ClusterWrapp
	return j
}

func (j *ClusterWrapper) WithNumOfHosts(value int32) *ClusterWrapper {
	j.Spec.WorkerGroupSpecs[0].NumOfHosts = value
Why "0" index?
It's just for testing, and it follows the assumption made elsewhere that the generated object has only one worker group.
I don't like that we're hardcoding this index. Maybe something like this?
func (j *ClusterWrapper) WithNumOfHosts(groupName string, value int32) *ClusterWrapper {
	for index, group := range j.Spec.WorkerGroupSpecs {
		if group.GroupName == groupName {
			j.Spec.WorkerGroupSpecs[index].NumOfHosts = value
		}
	}
	return j
}
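To make the behavior of the name-based lookup concrete, here is a minimal, self-contained sketch. The `workerGroupSpec` and `clusterWrapper` types are hypothetical, trimmed-down stand-ins for the real KubeRay and test-wrapper types, not the actual definitions; only the lookup logic mirrors the suggestion above.

```go
package main

import "fmt"

// workerGroupSpec is a hypothetical stand-in for the two fields of
// rayv1.WorkerGroupSpec that the wrapper touches.
type workerGroupSpec struct {
	GroupName  string
	NumOfHosts int32
}

// clusterWrapper is a hypothetical stand-in for the test wrapper.
type clusterWrapper struct {
	WorkerGroupSpecs []workerGroupSpec
}

// withNumOfHosts sets NumOfHosts on every worker group whose name matches
// groupName. A name that matches nothing leaves the wrapper unchanged.
func (w *clusterWrapper) withNumOfHosts(groupName string, value int32) *clusterWrapper {
	for i := range w.WorkerGroupSpecs {
		if w.WorkerGroupSpecs[i].GroupName == groupName {
			w.WorkerGroupSpecs[i].NumOfHosts = value
		}
	}
	return w
}

func main() {
	w := &clusterWrapper{WorkerGroupSpecs: []workerGroupSpec{
		{GroupName: "group1", NumOfHosts: 1},
		{GroupName: "group2", NumOfHosts: 1},
	}}
	w.withNumOfHosts("group2", 4)
	fmt.Println(w.WorkerGroupSpecs[0].NumOfHosts, w.WorkerGroupSpecs[1].NumOfHosts) // prints "1 4"
}
```

One trade-off worth noting: with this shape a misspelled group name is silently ignored, so a test that depends on the value should assert on it.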
func (j *JobWrapper) WithNumOfHosts(value int32) *JobWrapper {
	j.Spec.RayClusterSpec.WorkerGroupSpecs[0].NumOfHosts = value
	return j
}
Maybe it's better to set it directly on the rayv1.WorkerGroupSpec in the test case?
Ah, good catch
/test pull-kueue-test-e2e-main-1-30 pull-kueue-test-e2e-main-1-31
Due to #3368.
8a12ceb
to
eebd0b4
Compare
	count = *wgs.Replicas
}
if wgs.NumOfHosts > 1 {
	count = count * wgs.NumOfHosts
Suggested change:
-	count = count * wgs.NumOfHosts
+	count *= wgs.NumOfHosts
	count = *wgs.Replicas
}
if wgs.NumOfHosts > 1 {
	count = count * wgs.NumOfHosts
Suggested change:
-	count = count * wgs.NumOfHosts
+	count *= wgs.NumOfHosts
},
{
	Name: "group1",
	Count: 4,
Suggested change:
-	Count: 4,
+	Count: 1,
},
{
	Name: "group2",
	Count: 3,
Suggested change:
-	Count: 3,
+	Count: 12,
eebd0b4
to
d14ebd4
Compare
9d46183
to
86ac93e
Compare
Thanks @mbobrovskyi, addressed your comments.
Please check this one: #3384 (comment).
86ac93e
to
e98217b
Compare
Thanks, I incorporated your suggestion.
Also FYI @tenzen-y @ryanaoleary @kevin85421
/lgtm Thanks!
LGTM label has been added. Git tree hash: e09b5279525c032596f5ea5b99c7fcad5f1de22f
I would like to leave the final decision to @mimowo since I have no bandwidth.
My primary question is whether numOfHosts is conceptually similar to the JobSet replicatedJobs[*].replicas field,
because JobSet replicas have some limitations for TAS.
/approve
@mimowo: once the present PR merges, I will cherry-pick it on top of In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andrewsykim, mimowo The full list of commands accepted by this bot can be found here. The pull request process is described here
/retest
Due to #3406.
@mimowo: #3384 failed to apply on top of branch "release-0.8":
Ugh, this is my bad. I must have messed up the commit when I amended it.
I manually opened the cherry-pick PR with the conflicts resolved: #3408
I think it's not bad, since users can get the properly updated information from the release note :)
/release-note-edit
@tenzen-y it looks like the v0.8.2 tag never included this change (https://github.com/kubernetes-sigs/kueue/commits/v0.8.2), despite it being included in release-0.8 (https://github.com/kubernetes-sigs/kueue/commits/release-0.8).
Oh, that looks like my mistake. Let's release v0.8.3 for the fix.
Co-authored-by: Mykhailo Bobrovskyi <[email protected]>
What type of PR is this?
/kind bug
What this PR does / why we need it:
NumOfHosts is a new field we added to the KubeRay RayCluster to support multi-host training and inference with accelerators like TPUs. Kueue needs to check NumOfHosts and adjust the PodSet assignments accordingly when NumOfHosts is greater than 1.
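As a rough illustration of the adjustment described above, the sketch below computes the pod count for one worker group as replicas multiplied by NumOfHosts. It is not the code from this PR: the `workerGroup` type and the `podCount` helper are hypothetical, and falling back to a count of 1 when `Replicas` is unset is an assumption made for the example.

```go
package main

import "fmt"

// workerGroup is a hypothetical stand-in for the fields of
// rayv1.WorkerGroupSpec that matter for the calculation.
type workerGroup struct {
	GroupName  string
	Replicas   *int32
	NumOfHosts int32
}

// podCount returns how many pods the group needs. Each replica spans
// NumOfHosts pods when NumOfHosts is greater than 1.
// Assumption: an unset Replicas is treated as 1.
func podCount(wg workerGroup) int32 {
	count := int32(1)
	if wg.Replicas != nil {
		count = *wg.Replicas
	}
	if wg.NumOfHosts > 1 {
		count *= wg.NumOfHosts
	}
	return count
}

func main() {
	replicas := int32(3)
	wg := workerGroup{GroupName: "group2", Replicas: &replicas, NumOfHosts: 4}
	fmt.Println(podCount(wg)) // prints 12
}
```

With 3 replicas and 4 hosts per replica, the PodSet for the group should request 12 pods rather than 3, which matches the updated test expectation discussed in the review.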
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?