Pods terminated without waiting #267
Only the chief (master) should terminate; workers shouldn't exit. The problem is that TensorFlow blocks unless all the replicas (master, ps, workers) are running a gRPC server. So when your workers are finished they should just block forever; only the master should exit.
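For context, a minimal sketch of that pattern using the TF 1.x distributed API (`run_training` is a hypothetical training loop, and the `TF_CONFIG` layout is an assumption about what the controller injects):

```python
import json
import os

import tensorflow as tf

# TF_CONFIG is assumed to be injected into each pod by the controller.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
cluster = tf.train.ClusterSpec(tf_config["cluster"])
job_name = tf_config["task"]["type"]   # e.g. "master", "worker", or "ps"
task_index = tf_config["task"]["index"]

# Every replica must run a gRPC server, otherwise the others block.
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()            # parameter servers never exit
elif job_name == "worker":
    run_training(server)     # hypothetical training loop
    server.join()            # finished workers keep their gRPC server up
else:                        # "master" / chief
    run_training(server)
    # only the chief returns here, letting its pod terminate
```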
@jlewi thanks for the reply. If the workers block forever, the pod cannot be terminated by Kubernetes, meaning that the GPU is always occupied by that pod, and other pending pods requiring GPU resources cannot get scheduled.
The expectation is that the TfJob controller will terminate the pods, thus releasing the resources, when the job is done. There's a current limitation in how the TfJob controller works in that pods are only terminated when the TfJob is deleted (#128), but we plan on changing that. The fact that all replicas in a TF cluster need to be available for any of them to do work is an outcome of the way TF works. There are some things you can do to avoid that (a sketch of one common pattern is below), but in practice I'm not sure how useful that is; I expect it would be rare for one worker to finish long before the others. So for now I think it's fine if we only clean up the workers all at once when a job finishes.
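One workaround of this general kind, commonly seen in TF 1.x distributed training (not necessarily the one meant above), is a shared "done" queue: workers signal completion and keep serving, and the chief waits for all of them before exiting. A hedged sketch, assuming `NUM_WORKERS` and a single ps task:

```python
import tensorflow as tf

NUM_WORKERS = 2  # hypothetical cluster size

# Pin the queue to the ps so every replica sees the same instance.
with tf.device("/job:ps/task:0"):
    done_queue = tf.FIFOQueue(NUM_WORKERS, tf.int32, shared_name="done_queue")

enqueue_done = done_queue.enqueue(1)   # each worker runs this when finished
dequeue_done = done_queue.dequeue()    # the chief runs this NUM_WORKERS times

# Worker side, after training:
#   sess.run(enqueue_done)
#   server.join()   # keep the gRPC server alive
# Chief side, before exiting:
#   for _ in range(NUM_WORKERS):
#       sess.run(dequeue_done)
```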
The master, worker, and PS pods start successfully and are able to train the model, but after that something unusual happens. For example, I have the following pods deployed:
After training finishes in one worker, that worker is terminated. Then the other pods, such as the master pod, get stuck at this point:
All these pods are running on different nodes.
Any suggestions?