Add global Job index label/annotation to provide a global index for each job across the entire JobSet #649

Closed
Tracked by #523
danielvegamyhre opened this issue Aug 13, 2024 · 0 comments · Fixed by #650

What would you like to be added:
Add a label and annotation, jobset.sigs.k8s.io/job-id, containing an integer value from 0 to N-1, where N is the total number of Jobs in the JobSet. This assigns each Job a globally unique index within the JobSet.

Why is this needed:
Currently, the jobset.sigs.k8s.io/job-index label contains the local job index within its parent replicatedJob (values range from 0 to N-1, where N=replicatedJob.replicas).

This means that in a JobSet with multiple replicatedJobs, multiple Jobs may share the same job index (for example, two replicatedJobs of 1 replica each will produce 2 Jobs, each with a job-index of 0).
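
To illustrate the collision, here is a minimal Python sketch that models only the indexing rule described above. The replicatedJob names (`slice-a`, `slice-b`) and the helper `local_job_indexes` are hypothetical and are not part of the JobSet API.

```python
from collections import Counter

# Hypothetical JobSet shape: replicatedJob name -> replicas.
replicated_jobs = {"slice-a": 1, "slice-b": 1}


def local_job_indexes(replicated_jobs):
    """Return (replicatedJob name, job-index) for every child Job.

    job-index restarts at 0 inside each replicatedJob, as described above.
    """
    return [
        (name, idx)
        for name, replicas in replicated_jobs.items()
        for idx in range(replicas)
    ]


jobs = local_job_indexes(replicated_jobs)

# Count how many Jobs share each job-index value.
index_counts = Counter(idx for _, idx in jobs)
duplicates = {idx: n for idx, n in index_counts.items() if n > 1}
```

In this model both Jobs end up with job-index 0, so the value cannot serve as a unique identifier across the JobSet.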

In TPU multislice training we have used a JobSet with a single replicated job and exclusive job placement per slice (node pool), to assign 1 job replica exclusive usage of each TPU Slice. The job-index is then a natural and convenient way of assigning a unique TPU slice ID at the TPU runtime layer, which is required by TPU driver/runtime libraries for multislice training.

However, some users want to run multislice training workloads using a JobSet with multiple replicated jobs that have different templates. This is currently not possible because the job-index annotations from different replicatedJobs are not unique (as described above), and the TPU runtime requires unique slice IDs.

Therefore, we can add a new annotation, "jobset.sigs.k8s.io/job-id", which sets a global job index that is unique across the JobSet.
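
One possible way to derive such an index is sketched below in Python. It assumes Jobs are numbered in the order their replicatedJobs appear in the spec, with each replicatedJob offset by the total replicas of those before it. This issue does not specify the ordering, so that choice, along with the helper name `global_job_indexes` and the example names, is an assumption and may differ from the actual implementation.

```python
def global_job_indexes(replicated_jobs):
    """Map (replicatedJob name, local job-index) -> global index.

    replicated_jobs: list of (name, replicas) tuples in spec order.
    Assumption: global index = replicas of all preceding replicatedJobs
    plus the local job-index.
    """
    result = {}
    offset = 0
    for name, replicas in replicated_jobs:
        for local_idx in range(replicas):
            result[(name, local_idx)] = offset + local_idx
        offset += replicas
    return result


# Hypothetical JobSet with two replicatedJobs of different sizes.
ids = global_job_indexes([("workers-a", 2), ("workers-b", 3)])
```

Under this scheme the values cover 0 to N-1 exactly once, where N is the total number of Jobs, so each value could be used directly as a unique TPU slice ID.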

@danielvegamyhre danielvegamyhre self-assigned this Aug 13, 2024
@danielvegamyhre danielvegamyhre changed the title Add Job ID label/annotation to provide a global index for each job across the entire JobSet Add global Job index label/annotation to provide a global index for each job across the entire JobSet Aug 19, 2024
@danielvegamyhre danielvegamyhre mentioned this issue Aug 19, 2024