Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KubeRay] Support NumOfHosts when calculating PodSet assignments #3384

Merged
merged 1 commit into from
Oct 31, 2024

Conversation

andrewsykim
Copy link
Member

@andrewsykim andrewsykim commented Oct 31, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

NumOfHosts is a new field we added to KubeRay RayCluster to support multi-host training / inference with accelerators like TPUs. Kueue needs to be updated to check NumOfHosts and appropriately adjust podset assignments if NumOfHosts is greater than 1

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Account for NumOfHosts when calculating PodSet assignments for RayJob and RayCluster

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels Oct 31, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 31, 2024
Copy link

netlify bot commented Oct 31, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit e98217b
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/6723b5dfbdef940008b7db8f

@andrewsykim andrewsykim changed the title [KubeRay] Support NumOfHosts when calculating PodSet assignments for KubeRay jobs [KubeRay] Support NumOfHosts when calculating PodSet assignments Oct 31, 2024
@@ -150,6 +150,11 @@ func (j *ClusterWrapper) WithWorkerPriorityClassName(value string) *ClusterWrapp
return j
}

func (j *ClusterWrapper) WithNumOfHosts(value int32) *ClusterWrapper {
j.Spec.WorkerGroupSpecs[0].NumOfHosts = value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why "0" index?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just for testing and following the assumption in other places that the generated object only has 1 worker group

Copy link
Contributor

@mbobrovskyi mbobrovskyi Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like that we hardcoding this index. Maybe like this?

func (j *ClusterWrapper) WithNumOfHosts(groupName string, value int32) *ClusterWrapper {
	for index, group := range j.Spec.WorkerGroupSpecs {
		if group.GroupName == groupName {
			j.Spec.WorkerGroupSpecs[index].NumOfHosts = value
		}
	}
	return j
}

Comment on lines 154 to 158
func (j *JobWrapper) WithNumOfHosts(value int32) *JobWrapper {
j.Spec.RayClusterSpec.WorkerGroupSpecs[0].NumOfHosts = value
return j
}

Copy link
Contributor

@mbobrovskyi mbobrovskyi Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's better to add it directly to rayv1.WorkerGroupSpec on the test case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good catch

@mbobrovskyi
Copy link
Contributor

/test pull-kueue-test-e2e-main-1-30 pull-kueue-test-e2e-main-1-31

Due to #3368.

count = *wgs.Replicas
}
if wgs.NumOfHosts > 1 {
count = count * wgs.NumOfHosts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
count = count * wgs.NumOfHosts
count *= wgs.NumOfHosts

count = *wgs.Replicas
}
if wgs.NumOfHosts > 1 {
count = count * wgs.NumOfHosts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
count = count * wgs.NumOfHosts
count *= wgs.NumOfHosts

},
{
Name: "group1",
Count: 4,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Count: 4,
Count: 1,

},
{
Name: "group2",
Count: 3,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Count: 3,
Count: 12,

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 31, 2024
@andrewsykim andrewsykim force-pushed the kuberay-num-of-hosts branch 3 times, most recently from 9d46183 to 86ac93e Compare October 31, 2024 16:23
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 31, 2024
@andrewsykim
Copy link
Member Author

Thanks @mbobrovskyi, addressed your comments

@mbobrovskyi
Copy link
Contributor

Thanks @mbobrovskyi, addressed your comments

Please check this one #3384 (comment).

@andrewsykim
Copy link
Member Author

Please check this one #3384 (comment).

Thanks, I incorporated your suggestion

@andrewsykim
Copy link
Member Author

Also FYI @tenzen-y @ryanaoleary @kevin85421

@mbobrovskyi
Copy link
Contributor

/lgtm

Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 31, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: e09b5279525c032596f5ea5b99c7fcad5f1de22f

@mbobrovskyi
Copy link
Contributor

/cc @mimowo @tenzen-y

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to leave the final decision to @mimowo since there is no bandwidth.
My primary question is if this numOfHosts concept has a similar concept for the JobSet replicatedJobs[*].replicas.

Because the JobSet replicas have some limitations for TAS

@mimowo
Copy link
Contributor

mimowo commented Oct 31, 2024

/approve
/cherry-pick release-0.8

@k8s-infra-cherrypick-robot
Copy link
Contributor

@mimowo: once the present PR merges, I will cherry-pick it on top of release-0.8 in a new PR and assign it to you.

In response to this:

/approve
/cherry-pick release-0.8

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 31, 2024
@mbobrovskyi
Copy link
Contributor

/retest

Due to #3406.

@k8s-ci-robot k8s-ci-robot merged commit 6047afe into kubernetes-sigs:main Oct 31, 2024
16 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.9 milestone Oct 31, 2024
@k8s-infra-cherrypick-robot
Copy link
Contributor

@mimowo: #3384 failed to apply on top of branch "release-0.8":

Applying: TAS: Support RayCluster. (#3386)
Using index info to reconstruct a base tree...
M	pkg/controller/jobs/raycluster/raycluster_controller.go
M	pkg/controller/jobs/raycluster/raycluster_controller_test.go
M	pkg/controller/jobs/rayjob/rayjob_controller.go
M	pkg/controller/jobs/rayjob/rayjob_controller_test.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/controller/jobs/rayjob/rayjob_controller_test.go
Auto-merging pkg/controller/jobs/rayjob/rayjob_controller.go
CONFLICT (content): Merge conflict in pkg/controller/jobs/rayjob/rayjob_controller.go
Auto-merging pkg/controller/jobs/raycluster/raycluster_controller_test.go
Auto-merging pkg/controller/jobs/raycluster/raycluster_controller.go
CONFLICT (content): Merge conflict in pkg/controller/jobs/raycluster/raycluster_controller.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 TAS: Support RayCluster. (#3386)

In response to this:

/approve
/cherry-pick release-0.8

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@andrewsykim
Copy link
Member Author

The merge conflict is really odd.. also no idea why Github is considering this PR and TAS support for RayCluster as one commit:
image

@andrewsykim
Copy link
Member Author

ugh. this is my bad, I must have messed up the commit when I ammended it

@andrewsykim
Copy link
Member Author

manually opened the cherry-pick PR with the conflicts resolved #3408

@tenzen-y
Copy link
Member

tenzen-y commented Nov 1, 2024

ugh. this is my bad, I must have messed up the commit when I ammended it

I think that It's not bad since users can obtain proper updated information from Release Note :)

@tenzen-y
Copy link
Member

tenzen-y commented Nov 4, 2024

/release-note-edit

Account for NumOfHosts when calculating PodSet assignments for RayJob and RayCluster

@andrewsykim
Copy link
Member Author

@tenzen-y looks like the v0.8.2 tag never included this change https://github.com/kubernetes-sigs/kueue/commits/v0.8.2

Despite it being included in release-0.8: https://github.com/kubernetes-sigs/kueue/commits/release-0.8

@tenzen-y
Copy link
Member

tenzen-y commented Nov 5, 2024

@tenzen-y looks like the v0.8.2 tag never included this change https://github.com/kubernetes-sigs/kueue/commits/v0.8.2

Oh, it looks like my bad. Let us release v0.8.3 for the fix.
Thank you for reporting this!

@tenzen-y tenzen-y mentioned this pull request Nov 5, 2024
36 tasks
@mimowo mimowo mentioned this pull request Nov 5, 2024
34 tasks
kannon92 pushed a commit to openshift-kannon92/kubernetes-sigs-kueue that referenced this pull request Nov 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants