Add field to expose entrypoint num cpus in rayjob #1359
Conversation
@architkulkarni I think the code changes are ready for review. I have tried to complete all the steps given in the development guide. Apologies if something is missing. Also I am not sure why the tests are failing, I don't seem to have access to the logs. Thanks in advance for the help!
Looks good to me, but there's a typo. To catch this we would have needed an integration test with Ray. Can you add one? Here's one idea.
- Add a new sample YAML file which has the job specify some CPUs, GPUs, and resources. I think there should be some way to also specify these logical resources in the RayCluster spec for the job (since we don't have physical GPUs, we don't want to autodetect the number of GPUs, but we can just define that the cluster has 4 GPUs or something.)
- In the entrypoint script, use ray.available_resources() and ray.cluster_resources() to ensure the expected number of resources are taken up by the currently running script.
- Add the name of the new YAML file to test_sample_rayjob_yamls.py so that it's tested in CI.
What do you think?
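Here is a rough sketch of what that entrypoint check could look like. The expected counts (4 logical GPUs on the cluster, 1 CPU / 0.5 GPU for the entrypoint) are placeholders taken from the values discussed in this thread and would need to match whatever the sample YAML actually declares:

```python
# Hypothetical entrypoint script for the new sample RayJob.
# The expected numbers below are placeholders; keep them in sync with the YAML.
import ray

ray.init()  # inside a submitted job this attaches to the running cluster

cluster = ray.cluster_resources()
available = ray.available_resources()

# The RayCluster spec declares 4 logical GPUs, even though no physical GPU exists.
assert cluster.get("GPU", 0) == 4, f"unexpected cluster resources: {cluster}"

# The entrypoint itself requested 1 CPU and 0.5 GPU, so at least that much
# should be unavailable while this script is running.
assert cluster["CPU"] - available.get("CPU", 0) >= 1, f"CPU not reserved: {available}"
assert cluster["GPU"] - available.get("GPU", 0) >= 0.5, f"GPU not reserved: {available}"

print("Entrypoint resources look correct.")
```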
"--entrypoint_num_cpus", "1.000000", | ||
"--entrypoint_num_gpus", "0.500000", | ||
"--entrypoint_resources", `{"Custom_1": 1, "Custom_2": 5.5}`, |
"--entrypoint_num_cpus", "1.000000", | |
"--entrypoint_num_gpus", "0.500000", | |
"--entrypoint_resources", `{"Custom_1": 1, "Custom_2": 5.5}`, | |
"--entrypoint-num-cpus", "1.000000", | |
"--entrypoint-num-gpus", "0.500000", | |
"--entrypoint-resources", `{"Custom_1": 1, "Custom_2": 5.5}`, |
if entrypointNumCpus > 0 {
	k8sJobCommand = append(k8sJobCommand, "--entrypoint_num_cpus", fmt.Sprintf("%f", entrypointNumCpus))
}

if entrypointNumGpus > 0 {
	k8sJobCommand = append(k8sJobCommand, "--entrypoint_num_gpus", fmt.Sprintf("%f", entrypointNumGpus))
}

if len(entrypointResources) > 0 {
	k8sJobCommand = append(k8sJobCommand, "--entrypoint_resources", entrypointResources)
Suggested change:
if entrypointNumCpus > 0 {
	k8sJobCommand = append(k8sJobCommand, "--entrypoint-num-cpus", fmt.Sprintf("%f", entrypointNumCpus))
}
if entrypointNumGpus > 0 {
	k8sJobCommand = append(k8sJobCommand, "--entrypoint-num-gpus", fmt.Sprintf("%f", entrypointNumGpus))
}
if len(entrypointResources) > 0 {
	k8sJobCommand = append(k8sJobCommand, "--entrypoint-resources", entrypointResources)
Can you say more about "don't have access to the logs"? You should be able to run the tests locally.
Yes! That sounds great, let me try adding that test!
Great, let me know if you have any questions!
Actually, what do you mean by this: "I think there should be some way to also specify these logical resources in the RayCluster spec for the job (since we don't have physical GPUs, we don't want to autodetect the number of GPUs, but we can just define that the cluster has 4 GPUs or something.)"? Following is my understanding:
In that case, the infra on which our tests run would need to have a GPU (which I assume we don't have). Then won't the cluster creation remain in a pending state? I understand we are somehow trying to mock the presence of a GPU by specifying it in the cluster spec, but I am not getting exactly how. It would be very helpful if you could clarify this! Thanks for your help!
Yeah exactly, we don't have physical GPUs, but from Ray's perspective the resources are logical, not physical. So we can just tell Ray that a certain node has 4 GPUs even if it doesn't, and it will schedule tasks and actors (and in this case, the entrypoint script) accordingly. https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#specifying-node-resources (you can see the "KubeRay" tab)
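As a minimal local illustration of that point (this is just core Ray, not the KubeRay YAML path), you can declare logical GPUs and custom resources that the machine does not physically have, and Ray will schedule against them; the resource names and counts below are arbitrary:

```python
# Minimal sketch of logical resources: this machine has no physical GPU,
# but Ray schedules against whatever counts we declare at startup.
import ray

ray.init(num_gpus=4, resources={"Custom_1": 1})

print(ray.cluster_resources())  # reports GPU: 4.0 and Custom_1: 1.0

@ray.remote(num_gpus=1)
def needs_a_gpu():
    # Nothing here touches a real GPU; Ray only tracks the logical count.
    return "scheduled against a logical GPU"

print(ray.get(needs_a_gpu.remote()))
```

In a KubeRay sample YAML the same idea would be expressed in the RayCluster spec (e.g. via rayStartParams), as described on the linked docs page.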
@architkulkarni I have pushed the tests. However, I was facing one issue while running them: the RayJob used to start and fail because the Ray cluster was not ready (at least that is what I interpreted from the logs). However, the next attempt (a resubmission, I guess) used to pass, maybe because the cluster was up and running by that time. I think the intended behavior is that the cluster comes up, it becomes ready, and only then do we start/submit the job. Do you think this is a bug?
@shubhscoder Ah thanks, that sounds like a bug. If you see it again, or if you have the logs from last time, would you mind filing an issue? Thanks!
Please also add the new YAML file to
@architkulkarni I tried running make sync locally and I keep getting this error:
Looks like I am missing something.
Oh weird, I'm not super familiar with this part... cc @kevin85421 in case you have any ideas. Worst case, I can just check out your PR, run the command, and push to your branch, if you have that enabled.
Would you mind creating an issue for the failure you're seeing? Also, another option is to "manually" sync the files, using ray-project/ray#38857 as an example.
The code and the test LGTM otherwise!
@architkulkarni Sure, I can file a bug for the Kustomize failure. This looks related to kubernetes-sigs/kustomize#3618, specifically this comment: kubernetes-sigs/kustomize#3618 (comment), where the user tried to install Kustomize 4.0.1 and got the exact same error that I am getting (Kustomize in KubeRay seems to be 3.10.0). My guess is that this issue started surfacing after upgrading to Go 1.19, and maybe users who installed Kustomize with older versions of Go did not face this issue and their old installations still seem to be working (just a guess). I got around it for now by changing the Kustomize version to 4.5.2 locally. However, I have created this issue to investigate other effects of upgrading the Kustomize version: #1368
@architkulkarni Thanks for all the help on this one! Also, apologies for so much back and forth on this fairly simple issue! I will try my best to make future code changes more concise and better tested.
Not at all, I think it's normal. Thanks for the contribution!
Saw it a couple of times, filed an issue here: #1381
Why are these changes needed?
Adds the fields entrypoint_num_cpus, entrypoint_num_gpus, and entrypoint_resources to the KubeRay RayJob spec. The Ray Job API supports specifying these resources, but before this change the KubeRay RayJob spec did not expose these fields.
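For reference, here is a sketch of the underlying Ray Job Submission call that these fields correspond to. Treat the exact SDK parameter names, the address, and the entrypoint as assumptions to double-check against the Ray version in use:

```python
# Hedged sketch: the Ray Job Submission SDK parameters that the new
# RayJob fields map onto. The address and entrypoint are placeholders.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://raycluster-head-svc:8265")

client.submit_job(
    entrypoint="python my_script.py",
    entrypoint_num_cpus=1,                 # new RayJob field for entrypoint CPUs
    entrypoint_num_gpus=0.5,               # new RayJob field for entrypoint GPUs
    entrypoint_resources={"Custom_1": 1},  # new RayJob field for custom resources
)
```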
Related issue number
Closes #1266
Checks