What would you like to be added:
Add a label and annotation, jobset.sigs.k8s.io/job-id, containing an integer value from 0 to N-1, where N is the total number of Jobs in the JobSet, so that each Job is assigned a globally unique index within the JobSet.
Why is this needed:
Currently the jobset.sigs.k8s.io/job-index label contains the local job index within its parent replicatedJob (values range from 0 to N-1 where N=replicatedJob.replicas).
This means for a JobSet with multiple replicatedJobs, multiple jobs may have the same job index (for example, two replicated jobs of 1 replica each will result in 2 Jobs each with job-index of 0).
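To make the collision concrete, here is a minimal sketch that models how the per-replicatedJob index is assigned. The replicatedJob names (`workers-a`, `workers-b`) and the helper `local_job_indexes` are hypothetical and exist only for illustration; this is not the JobSet controller code.

```python
from collections import Counter

# Hypothetical JobSet shape: a list of (replicatedJob name, replicas) pairs.
replicated_jobs = [("workers-a", 1), ("workers-b", 1)]


def local_job_indexes(replicated_jobs):
    """Return (replicatedJob name, job-index) pairs.

    The index restarts at 0 for every replicatedJob, mirroring the
    behaviour of the job-index label described above.
    """
    return [
        (name, index)
        for name, replicas in replicated_jobs
        for index in range(replicas)
    ]


jobs = local_job_indexes(replicated_jobs)
index_counts = Counter(index for _, index in jobs)

print(jobs)          # [('workers-a', 0), ('workers-b', 0)]
print(index_counts)  # Counter({0: 2}) -> two Jobs share job-index 0
```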
In TPU multislice training we have used a JobSet with a single replicated job and exclusive job placement per slice (node pool), to assign 1 job replica exclusive usage of each TPU Slice. The job-index is then a natural and convenient way of assigning a unique TPU slice ID at the TPU runtime layer, which is required by TPU driver/runtime libraries for multislice training.
However, some users want to run multislice training workloads using a JobSet with multiple replicated jobs that have different templates. This is currently not possible, because the job-index values from different replicatedJobs are not unique (as described above) and the TPU runtime requires unique slice IDs.
Therefore, we can add a new annotation, "jobset.sigs.k8s.io/job-id", which sets a global job index that is unique across the entire JobSet.
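One possible way to derive such a global index is sketched below. It assumes (this is not stated in the issue) that Jobs are numbered in the order their replicatedJobs are listed in the JobSet spec, with each replicatedJob's local indexes offset by the replica counts of the replicatedJobs before it. The function name `global_job_indexes` and the replicatedJob names are hypothetical.

```python
def global_job_indexes(replicated_jobs):
    """Map (replicatedJob name, local job-index) to a global index.

    Assumption: global indexes follow the listing order of the
    replicatedJobs, so each replicatedJob starts at an offset equal to
    the total replicas of all replicatedJobs listed before it.
    """
    mapping = {}
    offset = 0
    for name, replicas in replicated_jobs:
        for local_index in range(replicas):
            mapping[(name, local_index)] = offset + local_index
        offset += replicas
    return mapping


# Hypothetical JobSet with two replicatedJobs of 2 and 3 replicas.
ids = global_job_indexes([("workers-a", 2), ("workers-b", 3)])
for (name, local_index), global_index in ids.items():
    print(f"{name} job-index={local_index} -> job-id={global_index}")
```

Under this assumption, the values cover 0 to N-1 exactly once, where N is the total number of Jobs (5 in this example), so each value could serve directly as a TPU slice ID.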
danielvegamyhre changed the title from "Add Job ID label/annotation to provide a global index for each job across the entire JobSet" to "Add global Job index label/annotation to provide a global index for each job across the entire JobSet" on Aug 19, 2024