Pods terminated without waiting #267
Only the chief (master) should terminate; workers shouldn't exit. The problem is that TensorFlow blocks unless all the replicas (master, ps, workers) are running a gRPC server. So when your workers are finished they should just block forever; only the master should exit.
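For context, a minimal sketch of that pattern using the TF 1.x distributed API (`run_training` is a hypothetical training loop, and the `TF_CONFIG` layout is an assumption about what the controller injects):

```python
import json
import os

import tensorflow as tf

# TF_CONFIG is assumed to be injected into each pod by the controller.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
cluster = tf.train.ClusterSpec(tf_config["cluster"])
job_name = tf_config["task"]["type"]   # e.g. "master", "worker", or "ps"
task_index = tf_config["task"]["index"]

# Every replica must run a gRPC server, otherwise the others block.
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()            # parameter servers never exit
elif job_name == "worker":
    run_training(server)     # hypothetical training loop
    server.join()            # finished workers keep their gRPC server up
else:                        # "master" / chief
    run_training(server)
    # only the chief returns here, letting its pod terminate
```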
@jlewi thanks for the reply. If the workers block forever, the pod cannot be terminated by Kubernetes, meaning that the GPU is always occupied by that pod, and other pending pods requiring GPU resources cannot get scheduled.
The expectation is that the TfJob controller will terminate the pods, thus releasing the resources, when the job is done. There's a current limitation in how the TfJob controller works in that pods are only terminated when the TfJob is deleted (#128), but we plan on changing that. The fact that all replicas in a TF cluster need to be available for any of them to do work is an outcome of the way TF works. There are some things you can do to avoid that (a sketch of one common pattern is below), but in practice I'm not sure how useful that is; I expect it would be rare for one worker to finish long before the others. So for now I think it's fine if we only clean up the workers all at once when a job finishes.
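One workaround of this general kind, commonly seen in TF 1.x distributed training (not necessarily the one meant above), is a shared "done" queue: workers signal completion and keep serving, and the chief waits for all of them before exiting. A hedged sketch, assuming `NUM_WORKERS` and a single ps task:

```python
import tensorflow as tf

NUM_WORKERS = 2  # hypothetical cluster size

# Pin the queue to the ps so every replica sees the same instance.
with tf.device("/job:ps/task:0"):
    done_queue = tf.FIFOQueue(NUM_WORKERS, tf.int32, shared_name="done_queue")

enqueue_done = done_queue.enqueue(1)   # each worker runs this when finished
dequeue_done = done_queue.dequeue()    # the chief runs this NUM_WORKERS times

# Worker side, after training:
#   sess.run(enqueue_done)
#   server.join()   # keep the gRPC server alive
# Chief side, before exiting:
#   for _ in range(NUM_WORKERS):
#       sess.run(dequeue_done)
```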
The master, worker, and PS pods start successfully and are able to train the model, but after that something unusual happens. For example, I have the following pods deployed:
After training finishes in one worker, that worker is terminated. Then the other pods, such as the master pod, get stuck at this point:
All these pods are running on different nodes.
Any suggestions?