Pods terminated without waiting #267

Closed
robinlu1984 opened this issue Jan 5, 2018 · 3 comments
@robinlu1984

The master, worker, and PS pods can start successfully and can train the model, but after that something unusual happens. For example, I have the following pods deployed:

NAMESPACE     NAME                                                                  READY     STATUS    RESTARTS   AGE
default       tensorflow-object-detection-master-e3ou-0-cx76c                       1/1       Running   0          1m
default       tensorflow-object-detection-ps-e3ou-0-tw5rv                           1/1       Running   0          1m
default       tensorflow-object-detection-ps-e3ou-1-sszw4                           1/1       Running   0          1m
default       tensorflow-object-detection-tensorboard-e3ou-f5855cc77-jl54x          1/1       Running   0          1m
default       tensorflow-object-detection-worker-e3ou-0-wdk2s                       1/1       Running   0          1m
default       tensorflow-object-detection-worker-e3ou-1-4vnvb                       1/1       Running   0      

After training finishes in one worker, that worker's pod is terminated. Then the other replicas, such as the master pod, get stuck at:

CreateSession still waiting for response from worker: /job:worker/replica:0/task:0

All these pods are running on different nodes.

Any suggestion?

@jlewi
Contributor

jlewi commented Jan 5, 2018

Only the chief (master) should terminate. Workers shouldn't exit.

The problem is that TensorFlow blocks unless all of the replicas (master, ps, workers) are running a gRPC server.

So when your workers are finished they should just block forever; only the master should exit.
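For reference, a minimal sketch of that pattern using the TF 1.x `tf.train.ClusterSpec` / `tf.train.Server` API. The cluster addresses, `job_name`, and `task_index` below are placeholders for illustration; in a TfJob they would come from the `TF_CONFIG` environment the operator sets up, and the training loop itself is elided:

```python
import tensorflow as tf  # TF 1.x API

# Placeholder values for illustration; a TfJob-launched process would read
# these from the TF_CONFIG environment variable or command-line flags.
job_name = "worker"
task_index = 0

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0:2222", "ps1:2222"],
    "master": ["master0:2222"],
    "worker": ["worker0:2222", "worker1:2222"],
})
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()      # parameter servers block forever
else:
    # ... build the graph and run the training loop here ...
    if job_name != "master":
        server.join()  # a finished worker keeps its gRPC server alive forever
    # only the master (chief) reaches this point and exits
```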

@robinlu1984
Author

@jlewi thanks for the reply. If the workers block forever, their pods cannot be terminated by Kubernetes, meaning that the GPU stays occupied by those pods and other pending pods requiring GPU resources cannot get scheduled.

@jlewi
Contributor

jlewi commented Jan 7, 2018

The expectation is that the TfJob controller will terminate the pods, thus releasing the resources when the job is done.

There's a current limitation in how the TfJob controller works in that pods are only terminated when the TfJob itself is deleted (#128), but we plan on changing that.

The fact that all replicas in a TF cluster need to be available for any of them to do work is an outcome of the way TF works. There are some things you can do to avoid that, but in practice I'm not sure how useful that is. I expect that in most cases it would be rare for one worker to finish long before the others. So for now I think it's fine if we only clean up the workers all at once when a job finishes.
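One such workaround (not something the TfJob controller does for you, and only a sketch) is to set gRPC device filters so that a worker's session only depends on the PS tasks and on itself rather than on every replica, which keeps CreateSession from waiting on workers that have already gone away. The `run_worker` helper and its arguments below are hypothetical names for objects built elsewhere in a between-graph replication script:

```python
import tensorflow as tf  # TF 1.x API


def run_worker(server, task_index, is_chief, train_op):
    """Training loop that does not block on other workers' gRPC servers.

    `server` (a tf.train.Server) and `train_op` are hypothetical names for
    objects built elsewhere in the training script.
    """
    # With device filters, this session only needs the PS tasks and this
    # worker's own server to be reachable, so CreateSession does not wait
    # on workers that have already terminated.
    config = tf.ConfigProto(device_filters=[
        "/job:ps",
        "/job:worker/task:%d" % task_index,
    ])
    with tf.train.MonitoredTrainingSession(
            master=server.target,
            is_chief=is_chief,
            config=config) as sess:
        while not sess.should_stop():
            sess.run(train_op)
```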
