Handling cases where pod is stuck in Terminating state #576
Comments
We created the PodReplacementPolicy in the Job API for this reason. It is a beta feature in 1.29 and only recreates a pod once it is fully terminated. Rereading this, I'm not sure this KEP would help here. It sounds like you want the Job to be marked as failed as soon as a pod goes into Terminating. |
Yeah, I don't see how this KEP would help. |
Yeah I have experienced pods staying in terminating state for a long time when doing training on TPUs as well. One way we could get around this is setting some timeout on the Job foreground deletion call, and then forcibly delete all pods once we hit that timeout. However, this is not great since forcibly deleting the pod objects from etcd doesn't guarantee the underlying container process has been cleaned up - a problematic container process could still be holding a GPU/TPU resource for example, preventing a newly scheduled pod from using it. |
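The timeout-then-force-delete idea in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: `job_exists`, `list_pod_names` and `force_delete_pod` are hypothetical callables standing in for real API calls, and none of this reflects how JobSet actually behaves. As the comment notes, force deletion only removes the pod object and does not guarantee that the container process has released its GPU/TPU.

```python
import time
from typing import Callable, Iterable, List


def delete_job_with_timeout(
    job_exists: Callable[[], bool],               # hypothetical: True while the Job object still exists
    list_pod_names: Callable[[], Iterable[str]],  # hypothetical: names of the Job's remaining pods
    force_delete_pod: Callable[[str], None],      # hypothetical: force delete one pod object
    timeout_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Wait up to timeout_s for foreground deletion to finish.

    Returns the names of pods that had to be force deleted (empty if the
    Job disappeared on its own before the deadline).
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if not job_exists():
            return []
        sleep(poll_interval_s)

    # One last check so we do not force delete after a late, clean finish.
    if not job_exists():
        return []

    forced = []
    for name in list_pod_names():
        # Removes the API object only; the process may still hold a device.
        force_delete_pod(name)
        forced.append(name)
    return forced
```

The clock and sleep functions are injected so the loop can be exercised without real waiting. A caller would still need some separate way to confirm the accelerator is free before rescheduling onto the same node.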
Totally agree with you. I'm currently using a hand-crafted Argo workflow for launching multi-node training, which also requires force deleting pods stuck in Terminating state. That just deletes them from etcd and often leaves nodes silently misbehaving. I ended up tainting nodes before force deleting pods, which kind of works but is a really dirty hack. That was actually the main reason I wanted to find an alternative (like JobSet) for synchronous jobs, hoping this problem would already be solved :) One possible implementation that comes to mind (without needing to force delete workers) is to include the attempt number in the name of each Job created by JobSet. At least one important problem I see here is the headless service: since the pods for each attempt would be named differently, we would have to force users to handle this in user code. One possible approach would be to add an env var similar to rank, looking like this
and setting |
If I understand this correctly, it sounds like you want the Job to be failed as soon as a pod goes into Terminating. I see that we could implement recreation in JobSet, or we could allow a way to mark a Job as failed as soon as a pod goes into Terminating. @mimowo @alculquicondor any ideas here? JobSet only recreates Jobs once they have failed. |
I think a Pod stuck in Terminating is something we should eliminate in the first place. Or, at least, we need to understand the scenario in order to propose the best approach. Underneath JobSet the Pod is managed by the batch/Job controller, and there have been some fixes in recent k8s versions. For example, when the node is gone, the pod phase should be transitioned from Running to Failed by PodGC in k8s 1.26+. What is your k8s version? Also, can you share your JobSet YAML and the YAML for the stuck pod? |
Also, what does |
It would be better to understand the exact cases where the Pods are getting stuck and bring this up to kubernetes/kubernetes, instead of adding hacks in JobSet. |
Agreed. I want to prioritize this because it is particularly problematic for large-scale distributed ML training workloads: it can substantially increase end-to-end failure recovery latency. We use foreground deletion when deleting failed Jobs, to prevent exponential backoff of pod creation attempts while pods from the previous Job iteration still exist. So when pods stay in Terminating, the JobSet controller cannot create a replacement Job until all pods are finally cleaned up, and only then can rescheduling of the new pods begin. In the cases I've seen, I think it may be due to SIGTERM signal handlers in the training code that trigger auto-checkpointing logic on graceful shutdown, and I also wonder whether the container process is failing to release the accelerator chip cleanly or quickly for some reason. I will talk with some folks in SIG Node to get their take on this and try to drive a long-term solution. |
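To make the SIGTERM hypothesis above concrete, here is a minimal sketch of the kind of handler a training script might install. It is an assumption about what such code looks like, not the code from any workload discussed in this thread; `train`, `run_step` and `save_checkpoint` are hypothetical names. The handler only sets a flag, and the loop checkpoints once and returns, so shutdown time is bounded by one step plus one checkpoint write.

```python
import signal
import threading

# Set by the signal handler, polled by the training loop.
stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    # Keep the handler tiny: no I/O here, just record the request.
    stop_requested.set()


def install_handler():
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, handle_sigterm)


def train(total_steps, run_step, save_checkpoint):
    """Run steps until done or until SIGTERM was seen.

    run_step and save_checkpoint are hypothetical callables supplied by
    the training code. Returns the step at which the loop stopped.
    """
    for step in range(total_steps):
        if stop_requested.is_set():
            save_checkpoint(step)  # one checkpoint, then exit promptly
            return step
        run_step(step)
    return total_steps
```

If the checkpoint write takes longer than the pod's `terminationGracePeriodSeconds`, the kubelet is expected to follow up with SIGKILL, so a slow checkpoint by itself would not obviously explain a pod that stays in Terminating well past the grace period.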
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale This is an important problem to solve. I did some benchmarking and found that, for a 6k-pod JobSet being restarted, the majority of the e2e restart latency was due to waiting for pods in |
Please share repro cases. It’s really hard to follow this without those. |
Hi, I was wondering how to properly handle cases where a worker pod is stuck in Terminating state.
From my experience, this may happen in various cases:
From my quick experiments with JobSet, if a worker pod is stuck in Terminating state, JobSet will not trigger a restart, because it waits for the underlying pods to be terminated.
A quick workaround might be something like a CronJob that periodically force deletes JobSet-controlled pods that have been stuck in Terminating state for more than N minutes. This is suboptimal, though, because you can no longer manually investigate what actually happened to the pod and why it got stuck in Terminating state.
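The selection step of such a workaround could look roughly like the sketch below. It works on plain dicts shaped like Pod JSON (for example the `items` from `kubectl get pods -o json`) and only decides which pods qualify; the actual force deletion is left out. The function name is hypothetical, and the default label key `jobset.sigs.k8s.io/jobset-name` is an assumption about how JobSet labels its pods, so verify it against your cluster.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def stuck_terminating_pods(
    pods: List[Dict],
    now: datetime,
    max_age: timedelta,
    label_key: str = "jobset.sigs.k8s.io/jobset-name",  # assumed label key
) -> List[str]:
    """Return names of labeled pods whose deletionTimestamp is older than max_age."""
    stuck = []
    for pod in pods:
        meta = pod.get("metadata", {})
        if label_key not in meta.get("labels", {}):
            continue  # not a JobSet-controlled pod (under the label assumption)
        ts = meta.get("deletionTimestamp")
        if not ts:
            continue  # pod is not terminating
        # The API serializes timestamps as RFC 3339 in UTC, e.g. 2024-01-01T11:40:00Z.
        deleted_at = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        if now - deleted_at > max_age:
            stuck.append(meta["name"])
    return stuck
```

As far as I understand, a pod's `deletionTimestamp` already includes its grace period, so `max_age` here measures time past the point at which the pod should have been gone. Recording each pod's YAML and events before deleting it would keep at least some of the evidence that force deletion otherwise destroys.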
It would be great if I could specify something like "podTerminationTimeout", after which JobSet would create a new Job without waiting for the previous pods to be terminated.