This repository has been archived by the owner on Sep 12, 2023. It is now read-only.

Pod name using generated name #215

Closed
wants to merge 1 commit into from

Conversation

yowenter
Contributor

If a job is recreated with the same name as a deleted job while the deleted job's pods are still in the Terminating state, creation of the new job's pods will fail. So we'd better use generated pod names. I found that the Kubernetes Job implementation also uses generated pod names.
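For context, this is roughly how `metadata.generateName` avoids the collision: the API server appends a short random suffix to the supplied prefix, so a recreated job never reuses the exact name of a pod that is still terminating. A minimal sketch of that behavior (the suffix alphabet and length here mirror what Kubernetes uses, but `generatePodName` itself is a hypothetical stand-in, not the real implementation):

```go
package main

import (
	"fmt"
	"math/rand"
)

// Restricted alphabet similar to the one Kubernetes uses for generated
// name suffixes (vowels and ambiguous characters are excluded).
const alphanums = "bcdfghjklmnpqrstvwxz2456789"

// generatePodName sketches generateName-style naming: prefix plus a
// 5-character random suffix, so two pods from the same prefix collide
// only with negligible probability.
func generatePodName(prefix string) string {
	suffix := make([]byte, 5)
	for i := range suffix {
		suffix[i] = alphanums[rand.Intn(len(alphanums))]
	}
	return prefix + "-" + string(suffix)
}

func main() {
	// Two pods created from the same prefix get distinct names, so a
	// recreated job cannot clash with a leftover Terminating pod.
	fmt.Println(generatePodName("myjob-worker"))
	fmt.Println(generatePodName("myjob-worker"))
}
```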

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tenzen-y
Member

/cc

@gaocegege
Member

Will it break the existing code? E.g. the TFConfig generation.

@yowenter
Contributor Author

Will it break the existing code? E.g. the TFConfig generation.

TensorFlow overrides the ReconcilePods func, and its pod names are not generated, so it will be OK.
However, the MXNet job depends on the pod name. Let me see how to fix it.

@yowenter
Contributor Author

yowenter commented Apr 27, 2023

Hi @gaocegege, I found that the MXNet implementation relies on the pod hostname:

// genClusterSpec will generate ClusterSpec.
func genClusterSpec(mxjob *kubeflowv1.MXJob) (ClusterSpec, error) {
	clusterSpec := make(ClusterSpec)

	for rtype, spec := range mxjob.Spec.MXReplicaSpecs {
		rt := strings.ToLower(string(rtype))
		replicaNames := make([]UrlPort, 0, *spec.Replicas)

		port, err := getPortFromMXJob(mxjob, rtype)
		if err != nil {
			return nil, err
		}
		for i := int32(0); i < *spec.Replicas; i++ {
			host := UrlPort{
				Url:  common.GenGeneralName(mxjob.Name, rt, fmt.Sprintf("%d", i)),
				Port: int(port),
			}
			replicaNames = append(replicaNames, host)
		}

		clusterSpec[rt] = replicaNames
	}

	return clusterSpec, nil
}

So the MXNet job pod name must be specified deterministically. Maybe I need to add an MXNet-specific CreateNewPod implementation.
What do you think about it?
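To make the coupling concrete: the ClusterSpec above precomputes replica hostnames by index, which only works if pod names are deterministic. A minimal sketch of the naming the spec assumes (the real `common.GenGeneralName` may differ in detail; `genGeneralName` here is an illustrative stand-in):

```go
package main

import "fmt"

// genGeneralName sketches the deterministic <jobName>-<replicaType>-<index>
// naming that the MXNet ClusterSpec depends on. This is an assumption about
// common.GenGeneralName's output shape, not the actual implementation.
func genGeneralName(jobName, rtype, index string) string {
	return fmt.Sprintf("%s-%s-%s", jobName, rtype, index)
}

func main() {
	// The ClusterSpec enumerates replica hosts by index, e.g. for 2 servers:
	for i := 0; i < 2; i++ {
		fmt.Println(genGeneralName("mxjob", "server", fmt.Sprintf("%d", i)))
	}
	// A pod created via metadata.generateName would instead carry a random
	// suffix, so these precomputed hostnames would no longer match any pod.
}
```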

@gaocegege
Member

/cc @kubeflow/wg-training-leads

@google-oss-prow google-oss-prow bot requested a review from a team April 27, 2023 07:22
Member

@terrytangyuan left a comment


Will this break any existing elastic training functionality?

@tenzen-y
Member

tenzen-y commented May 2, 2023

Will it break the existing code? E.g. the TFConfig generation.

TensorFlow overrides the ReconcilePods func, and its pod names are not generated, so it will be OK. However, the MXNet job depends on the pod name. Let me see how to fix it.

We are considering fully consolidating the tf-operator into the training-operator. So this change will affect the TFJob.

kubeflow/training-operator#1727

@tenzen-y
Member

tenzen-y commented May 2, 2023

Either way, we can improve handling of terminating Pods once we introduce the batch/Job API.

kubeflow/training-operator#1718

@yowenter
Contributor Author

yowenter commented May 4, 2023

Either way, we can improve handling of terminating Pods once we introduce the batch/Job API.

kubeflow/training-operator#1718

@tenzen-y It would be good if the training operator reused the Kubernetes batch/Job API.
By the way, when will the refactored job feature be released?

@tenzen-y
Member

tenzen-y commented May 4, 2023

By the way, when will the refactored job feature be released?

As the training operator needs the elastic Indexed Job feature, available since Kubernetes v1.27, we will introduce batch/Job after Kubernetes v1.26 reaches EOL.

@tenzen-y
Member

tenzen-y commented May 4, 2023

Also, I'm working on adding a success policy feature, similar to the TFJob's success policy, to batch/Job.
However, I'm not sure when the success policy feature will graduate to beta. So, as a first step toward introducing batch/Job, it would be good to implement the feature the same way as the current TFJob on the training-operator side.

Moreover, I'm thinking of introducing the JobSet API instead of batch/Job, although I think we need to discuss whether we should introduce the JobSet API.

@yowenter
Contributor Author

yowenter commented May 5, 2023

Also, I'm working on adding a success policy feature, similar to the TFJob's success policy, to batch/Job. However, I'm not sure when the success policy feature will graduate to beta. So, as a first step toward introducing batch/Job, it would be good to implement the feature the same way as the current TFJob on the training-operator side.

Moreover, I'm thinking of introducing the JobSet API instead of batch/Job, although I think we need to discuss whether we should introduce the JobSet API.

Good, I'm closing this PR for now.

@yowenter yowenter closed this May 5, 2023


4 participants