-
Notifications
You must be signed in to change notification settings - Fork 71
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/cc |
Signed-off-by: yowenter <[email protected]>
Will it break the existing code? E.g. the TFConfig generation. |
Tensorflow has override |
hi @gaocegege , I found the mxnet implementation relies on the pod host,
So, the mxnet job pod name must be specified. Maybe I need add a implementation mxnet |
/cc @kubeflow/wg-training-leads |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this break any existing elastic training functionality?
We are considering fully consolidating the tf-operator to the training-operator. So, this change will be affected on the TFJob. |
Either way, we can improve handling the terminating Pod once we introduce batch/job API. |
@tenzen-y It's good if training operator reuse the kubernetes batchjob api. |
As the training operator needs the elastic indexed job feature available since k8s v1.27, we will introduce the batch/job after K8s v1.26 reaches EoL. |
Also, I'm working on adding the success policy feature similar to the TFJob's success policy to the batch/job. Moreover, I'm thinking of introducing the JobSet API instead of batch/job, although I think we need discussion whether we should introduce the JobSet API. |
Good, I'm closing this pr for now. |
If a job is recreated using the deleted job name, while the deleted job pods are still in
Terminating
state, the new job pod will create fail. So we'd better using generated pod name. I found the kubernetes Job implementation is also using the generated pod name.