This repository has been archived by the owner on Sep 12, 2023. It is now read-only.

Pod name using generated name #215

Closed
wants to merge 1 commit into from

Conversation

yowenter
Contributor

If a job is recreated with the same name as a deleted job while the deleted job's pods are still in the Terminating state, creation of the new job's pods will fail. So we'd better use generated pod names. I found that the Kubernetes Job implementation also uses generated pod names.
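For context, this is roughly how `metadata.generateName` avoids the collision: the API server appends a short random suffix to the supplied prefix, so a recreated job never reuses the exact name of a pod that is still terminating. A minimal sketch of that behavior (the suffix alphabet and length here mirror what Kubernetes uses, but `generatePodName` itself is a hypothetical stand-in, not the real implementation):

```go
package main

import (
	"fmt"
	"math/rand"
)

// Restricted alphabet similar to the one Kubernetes uses for generated
// name suffixes (vowels and ambiguous characters are excluded).
const alphanums = "bcdfghjklmnpqrstvwxz2456789"

// generatePodName sketches generateName-style naming: prefix plus a
// 5-character random suffix, so two pods from the same prefix collide
// only with negligible probability.
func generatePodName(prefix string) string {
	suffix := make([]byte, 5)
	for i := range suffix {
		suffix[i] = alphanums[rand.Intn(len(alphanums))]
	}
	return prefix + "-" + string(suffix)
}

func main() {
	// Two pods created from the same prefix get distinct names, so a
	// recreated job cannot clash with a leftover Terminating pod.
	fmt.Println(generatePodName("myjob-worker"))
	fmt.Println(generatePodName("myjob-worker"))
}
```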

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tenzen-y
Member

/cc

@gaocegege
Member

Will it break the existing code? E.g. the TFConfig generation.

@yowenter
Contributor Author

Will it break the existing code? E.g. the TFConfig generation.

TensorFlow overrides the ReconcilePods func, and its pod names are not generated, so it will be OK.
However, the MXNet job depends on the pod name. Let me see how to fix it.

@yowenter
Contributor Author

yowenter commented Apr 27, 2023

Hi @gaocegege, I found that the MXNet implementation relies on the pod hostname:

// genClusterSpec will generate ClusterSpec.
func genClusterSpec(mxjob *kubeflowv1.MXJob) (ClusterSpec, error) {
	clusterSpec := make(ClusterSpec)

	for rtype, spec := range mxjob.Spec.MXReplicaSpecs {
		rt := strings.ToLower(string(rtype))
		replicaNames := make([]UrlPort, 0, *spec.Replicas)

		port, err := getPortFromMXJob(mxjob, rtype)
		if err != nil {
			return nil, err
		}
		for i := int32(0); i < *spec.Replicas; i++ {
			host := UrlPort{
				Url:  common.GenGeneralName(mxjob.Name, rt, fmt.Sprintf("%d", i)),
				Port: int(port),
			}
			replicaNames = append(replicaNames, host)
		}

		clusterSpec[rt] = replicaNames
	}

	return clusterSpec, nil
}

So the MXNet job pod name must be specified deterministically. Maybe I need to add an MXNet-specific CreateNewPod implementation.
What do you think about it?
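To make the coupling concrete: the ClusterSpec above precomputes replica hostnames by index, which only works if pod names are deterministic. A minimal sketch of the naming the spec assumes (the real `common.GenGeneralName` may differ in detail; `genGeneralName` here is an illustrative stand-in):

```go
package main

import "fmt"

// genGeneralName sketches the deterministic <jobName>-<replicaType>-<index>
// naming that the MXNet ClusterSpec depends on. This is an assumption about
// common.GenGeneralName's output shape, not the actual implementation.
func genGeneralName(jobName, rtype, index string) string {
	return fmt.Sprintf("%s-%s-%s", jobName, rtype, index)
}

func main() {
	// The ClusterSpec enumerates replica hosts by index, e.g. for 2 servers:
	for i := 0; i < 2; i++ {
		fmt.Println(genGeneralName("mxjob", "server", fmt.Sprintf("%d", i)))
	}
	// A pod created via metadata.generateName would instead carry a random
	// suffix, so these precomputed hostnames would no longer match any pod.
}
```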

@gaocegege
Member

/cc @kubeflow/wg-training-leads

@google-oss-prow google-oss-prow bot requested a review from a team April 27, 2023 07:22
Member

@terrytangyuan left a comment


Will this break any existing elastic training functionality?

@tenzen-y
Member

tenzen-y commented May 2, 2023

Will it break the existing code? E.g. the TFConfig generation.

TensorFlow overrides the ReconcilePods func, and its pod names are not generated, so it will be OK. However, the MXNet job depends on the pod name. Let me see how to fix it.

We are considering fully consolidating the tf-operator into the training-operator. So this change will affect the TFJob.

kubeflow/training-operator#1727

@tenzen-y
Member

tenzen-y commented May 2, 2023

Either way, we can improve handling of terminating Pods once we introduce the batch/Job API.

kubeflow/training-operator#1718

@yowenter
Contributor Author

yowenter commented May 4, 2023

Either way, we can improve handling of terminating Pods once we introduce the batch/Job API.

kubeflow/training-operator#1718

@tenzen-y It would be good if the training operator reused the Kubernetes batch/Job API.
By the way, when will the refactored job feature be released?

@tenzen-y
Member

tenzen-y commented May 4, 2023

By the way, when will the refactored job feature be released?

As the training operator needs the elastic Indexed Job feature, available since Kubernetes v1.27, we will introduce batch/Job after Kubernetes v1.26 reaches EOL.

@tenzen-y
Member

tenzen-y commented May 4, 2023

Also, I'm working on adding a success policy feature, similar to the TFJob's success policy, to batch/Job.
However, I'm not sure when the success policy feature will graduate to beta. So, as a first step toward introducing batch/Job, it would be good to implement the feature the same way as the current TFJob on the training-operator side.

Moreover, I'm thinking of introducing the JobSet API instead of batch/Job, although I think we need to discuss whether we should introduce the JobSet API.

@yowenter
Contributor Author

yowenter commented May 5, 2023

Also, I'm working on adding a success policy feature, similar to the TFJob's success policy, to batch/Job. However, I'm not sure when the success policy feature will graduate to beta. So, as a first step toward introducing batch/Job, it would be good to implement the feature the same way as the current TFJob on the training-operator side.

Moreover, I'm thinking of introducing the JobSet API instead of batch/Job, although I think we need to discuss whether we should introduce the JobSet API.

Good, I'm closing this PR for now.

@yowenter yowenter closed this May 5, 2023


4 participants