
HPA support for PyTorch Elastic #1751

Closed
tsuiot opened this issue Feb 6, 2023 · 15 comments · Fixed by #1752

Comments

@tsuiot

tsuiot commented Feb 6, 2023

1. Background

Use training-operator/examples/pytorch/elastic/imagenet/imagenet.yaml with an HPA for PyTorch Elastic.
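
For reference, the relevant part of that example manifest looks roughly like the sketch below (the image tag, HPA bounds, and resource values are illustrative and may differ from the file in the repo). The operator derives an HPA for the Worker replicas from elasticPolicy, scaling between minReplicas and maxReplicas based on the listed metrics:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: elastic-example-imagenet
spec:
  elasticPolicy:
    rdzvBackend: c10d            # the c10d rendezvous is hosted by worker-0
    minReplicas: 1               # illustrative HPA bounds
    maxReplicas: 4
    maxRestarts: 100
    metrics:                     # drives the HPA created by the operator
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 80
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-elastic-example-imagenet:latest   # illustrative image tag
              resources:
                requests:
                  cpu: 1         # illustrative; CPU requests are needed for utilization-based scaling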

2. Scale-up leaves the PyTorchJob stuck in the Running state

Jobs list: (screenshot)
PytorchJob.Status: (screenshot)
HPA.status: (screenshot)

In this case, worker-3, which was scaled up by the HPA, is pending at first due to NotEnoughResources and only gets scheduled after worker-0 through worker-2 have succeeded. This leaves the PyTorchJob permanently in the Running state, while worker-3 keeps running and restarting as it tries to communicate with worker-0.

@johnugeorge
Member

Which image are you using? Is this always reproducible?

@tsuiot
Author

tsuiot commented Feb 6, 2023

I built the image myself from the master branch. Yes, it is always reproducible.

@johnugeorge
Member

Does this only happen when you run into a "NotEnoughResources" situation? What about normal situations? Is it the case that the underlying elastic job completed but the PyTorchJob doesn't show Succeeded?

@tsuiot
Author

tsuiot commented Feb 7, 2023

1. Yes.
2. If all pods are scheduled and complete, the PyTorchJob changes to Completed.
3. No. It appears when some of the pods have completed and others are pending or failed; the PyTorchJob doesn't show Succeeded even though the training has actually completed.

@Syulin7
Contributor

Syulin7 commented Feb 7, 2023

When we use PyTorch Elastic (e.g. https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/elastic/imagenet/imagenet.yaml):

  1. The worker replica count starts at two.
  2. The HPA scales the workers to 4.
  3. For some reason (e.g. NotEnoughResources), worker-2 is not running.
  4. The other workers complete (including worker-0), so worker-2 can no longer connect to worker-0.
# kubectl get pod
elastic-example-imagenet-worker-0   0/1     Completed           0                3h16m
elastic-example-imagenet-worker-1   0/1     Completed           0                3h16m
elastic-example-imagenet-worker-2   1/1     Running             20 (5m13s ago)   3h15m
elastic-example-imagenet-worker-3   0/1     Completed           0                3h15m

# kubectl logs elastic-example-imagenet-worker-2 -p
[E socket.cpp:860] [c10d] The client socket has timed out after 60s while trying to connect to (elastic-example-imagenet-worker-0, 23456).
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 155, in _create_tcp_store
    store = TCPStore(
TimeoutError: The client socket has timed out after 60s while trying to connect to (elastic-example-imagenet-worker-0, 23456).
  5. The PyTorchJob is still running because the success criteria are not met:
    if rtype == kubeflowv1.PyTorchJobReplicaTypeWorker {
        // TODO(gaocegege): Support SuccessPolicy
        if expected == 0 {
            msg := fmt.Sprintf("PyTorchJob %s/%s successfully completed.",
                pytorchjob.Namespace, pytorchjob.Name)
            r.recorder.Event(pytorchjob, corev1.EventTypeNormal, commonutil.JobSucceededReason, msg)
            if jobStatus.CompletionTime == nil {
                now := metav1.Now()
                jobStatus.CompletionTime = &now
            }
            err := commonutil.UpdateJobConditions(jobStatus,
                commonv1.JobSucceeded, commonutil.JobSucceededReason, msg)
            if err != nil {
                commonutil.LoggerForJob(pytorchjob).Infof("Append pytorchjob condition error: %v", err)
                return err
            }
            trainingoperatorcommon.SuccessfulJobsCounterInc(pytorchjob.Namespace, kubeflowv1.PytorchJobFrameworkName)
        } else if running > 0 {
            // Some workers are still running, leave a running condition.
            msg := fmt.Sprintf("PyTorchJob %s/%s is running.",
                pytorchjob.Namespace, pytorchjob.Name)
            err := commonutil.UpdateJobConditions(jobStatus, commonv1.JobRunning, commonutil.JobRunningReason, msg)
            if err != nil {
                commonutil.LoggerForJob(pytorchjob).Infof("Append pytorchjob condition error: %v", err)
                return err
            }

Supporting a successPolicy on PyTorchJob could fix this (see the hypothetical spec sketch below):

  1. SuccessPolicyAllWorkers - what PyTorchJob effectively does today by default.
  2. SuccessPolicyChiefWorker - the job succeeds once worker 0 completes (same as TFJob's SuccessPolicyDefault).

@johnugeorge @tsuiot WDYT?
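
Purely as a hypothetical sketch of this proposal (PyTorchJob has no successPolicy field today; the field name and value below are made up to mirror the two policies listed above):

# Hypothetical: successPolicy is only a proposal, not part of the current PyTorchJob API.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: elastic-example-imagenet
spec:
  successPolicy: ChiefWorker   # proposed: job succeeds once worker 0 completes
  elasticPolicy:
    rdzvBackend: c10d
    minReplicas: 2
    maxReplicas: 4
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-elastic-example-imagenet:latest   # illustrative image tag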

@johnugeorge
Member

This is related to a bug in reporting success for Elastic runs.

Related: #1711 (comment)

@Syulin7
Contributor

Syulin7 commented Feb 7, 2023

For Elastic mode, if worker-0 completes, should we default to setting the PyTorchJob to Succeeded?

@johnugeorge
Member

johnugeorge commented Feb 7, 2023

Yes. I think if any worker succeeds, we can mark the job succeeded. Worker 0 is safer, as it runs the c10d rendezvous.

@johnugeorge
Member

@tsuiot Can you try the fix from #1752?

@tsuiot
Author

tsuiot commented Feb 8, 2023

Yes.

@Syulin7
Contributor

Syulin7 commented Feb 8, 2023

@tsuiot Did you run into any problems during testing?
I tested it myself and the job shows Succeeded. @johnugeorge

$ kubectl get pod
NAME                                READY   STATUS             RESTARTS          AGE
elastic-example-imagenet-worker-0   0/1     Completed          0                 19h
elastic-example-imagenet-worker-1   0/1     CrashLoopBackOff   188 (2m39s ago)   19h
elastic-example-imagenet-worker-2   0/1     Completed          0                 19h
elastic-example-imagenet-worker-3   0/1     Completed          0                 19h
$ kubectl get pytorchjob
NAME                       STATE       AGE
elastic-example-imagenet   Succeeded   19h

@tsuiot
Author

tsuiot commented Feb 8, 2023

1. The PyTorchJob changes to Succeeded with #1752.
2. The new worker pod is still running and restarting. Should we delete scaled-up pods that are not in a normal state once the PyTorchJob has succeeded?

@Syulin7
Contributor

Syulin7 commented Feb 8, 2023

2. The new worker pod is still running and restarting. Should we delete scaled-up pods that are not in a normal state once the PyTorchJob has succeeded?

PyTorchJob.RunPolicy.CleanPodPolicy defines the policy for killing pods after the job completes (the default value is None for PyTorchJob).

// SetDefaults_PyTorchJob sets any unspecified values to defaults.
func SetDefaults_PyTorchJob(job *PyTorchJob) {
	// Set default cleanpod policy to None.
	if job.Spec.RunPolicy.CleanPodPolicy == nil {
		policy := commonv1.CleanPodPolicyNone
		job.Spec.RunPolicy.CleanPodPolicy = &policy
	}

But the comment in common documents the default as Running, which is inconsistent:

type RunPolicy struct {
	// CleanPodPolicy defines the policy to kill pods after the job completes.
	// Default to Running.
	CleanPodPolicy *CleanPodPolicy `json:"cleanPodPolicy,omitempty"`

@tsuiot you can set PytorchJob.RunPolicy.CleanPodPolicy to Running and try again.
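
For example, a minimal sketch of where that setting goes in the spec (only the runPolicy block is relevant here; the rest of the manifest stays as in the example):

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: elastic-example-imagenet
spec:
  runPolicy:
    cleanPodPolicy: Running      # delete pods that are still running once the job completes
  # elasticPolicy and pytorchReplicaSpecs unchanged from the example manifest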

@johnugeorge should we change the PyTorchJob CleanPodPolicy default to Running?

@tsuiot
Author

tsuiot commented Feb 8, 2023

/LGTM

@johnugeorge
Member

We can fix the comment for now. It is better to keep the behavior consistent across all jobs.
