
tf_smoke.py distributed computing doesn't work on minikube #238

Closed
EdwardZhang88 opened this issue Dec 21, 2017 · 4 comments

@EdwardZhang88

I just started learning K8S this month and found this CRD initiative to support TensorFlow on K8S really helpful. However, while I am able to run a local MNIST test on a single Pod in my minikube successfully, I am now stuck on the distributed task tf_smoke.py. Here are the details:
minikube version: v0.24.1
tensorflow:1.4.0
tf_operator:gcr.io/tf-on-k8s-dogfood/tf_operator:v20171215-eb0fd5f

Below is the tf_smoke.yaml I used to create this TfJob. Since I already had a local copy of the tensorflow 1.4 image, I specified imagePullPolicy as Never.
apiVersion: "tensorflow.org/v1alpha1"
kind: "TfJob"
metadata:
  name: "tf-smoke"
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - image: registry.cn-hangzhou.aliyuncs.com/denverdino/tensorflow:1.4.0
              imagePullPolicy: Never
              name: tensorflow
              command: ["python"]
              args: [/workdir/script/tf_smoke.py]
              volumeMounts:
                - name: workdir
                  mountPath: /workdir
          volumes:
            - name: workdir
              hostPath:
                path: /home/docker/tensorflow_training
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - image: registry.cn-hangzhou.aliyuncs.com/denverdino/tensorflow:1.4.0
              imagePullPolicy: Never
              name: tensorflow
              command: ["python"]
              args: [/workdir/script/tf_smoke.py]
              volumeMounts:
                - name: workdir
                  mountPath: /workdir
          volumes:
            - name: workdir
              hostPath:
                path: /home/docker/tensorflow_training
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: PS

Once I submit it with kubectl create -f tf_smoke.yaml, the master constantly displays

2017-12-21 10:03:28.949380: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2017-12-21 10:03:28.949441: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
2017-12-21 10:03:28.949450: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1

And the other 2 workers just hang at 'INFO:root:Running Worker code.'.

There are 2 things I don't understand.

  1. The logs show that all 3 pods print the same 'Started server with target: grpc://localhost:2222', even though the generated cluster spec gives each pod a unique hostname. Is this why the master fails to contact the other 2 workers, i.e. because 2222 is already occupied by the master?

{"cluster":{"master":["tf-smoke-master-jyab-0:2222"],"ps":["tf-smoke-ps-jyab-0:2222"],"worker":["tf-smoke-worker-jyab-0:2222","tf-smoke-worker-jyab-1:2222"]}

  2. Why do we need to put the graph and ops in the master rather than in the workers? (See the sketch after this list.)

     if job_type == "ps":
       logging.info("Running PS code.")
       server.join()
     elif job_type == "worker":
       logging.info("Running Worker code.")

       # The worker just blocks because we let the master assign all ops.
       server.join()
     elif job_type == "master" or not job_type:
       logging.info("Running master.")
       with tf.device(device_func):
         run(server=server, cluster_spec=cluster_spec)
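
To make question 2 concrete, this is roughly the pattern the master follows once the other replicas check in (a minimal sketch, not the actual tf_smoke.py run(); the device strings and ops are illustrative):

```python
import tensorflow as tf

def run(server, cluster_spec):
    # Only the master builds the graph; it pins variables to the PS and
    # compute to a worker, then drives everything through one session.
    with tf.device("/job:ps/task:0"):
        w = tf.Variable(tf.zeros([10]))
    with tf.device("/job:worker/task:0"):
        total = tf.reduce_sum(w + 1.0)

    # The session connects to the master's own in-process gRPC server
    # (server.target, i.e. grpc://localhost:2222), which dispatches the
    # placed ops to the PS and workers over gRPC.
    with tf.Session(server.target) as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(total))
```

The workers never build this graph themselves; they just host whatever ops the master's session sends them, which is why they sit in server.join().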

Initially, I thought it was because I hadn't created the ConfigMap that specifies grpcServerFilePath: /opt/mlkube/grpc_tensorflow_server/grpc_tensorflow_server.py. However, it still doesn't work even after I reinstalled the tf_operator with this tf-job-operator-config ConfigMap.
If anyone knows what is going wrong, please leave a comment. Thanks!

@jlewi (Contributor) commented Dec 21, 2017

My guess would be some sort of networking issue with minikube. Can you try pinging the master from the workers e.g.

kubectl exec -ti tf-smoke-worker-jyab-0 /bin/bash
ping tf-smoke-master-jyab-0

When you call tf.train.Server you are starting a TensorFlow server on port 2222 in another process. The Python code that is executing locally on that worker then connects to that gRPC server and uses it to talk to the other gRPC servers in the cluster. It uses "localhost" because the server is running on the same machine (or, in the case of a K8s pod, the same pod).
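
Concretely, each replica does something along these lines (a sketch against the TF 1.x API, not the exact tf_smoke.py code; the TF_CONFIG handling is simplified):

```python
import json
import os

import tensorflow as tf

# The operator injects TF_CONFIG with the cluster spec plus this replica's
# own job type and task index (this is the tf_config your logs print).
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
cluster = tf.train.ClusterSpec(tf_config["cluster"])
task = tf_config["task"]

# Every pod starts its own gRPC server on its own port 2222.
server = tf.train.Server(cluster,
                         job_name=task["type"],
                         task_index=task["index"],
                         protocol="grpc")

# The target is grpc://localhost:2222 on every pod, because "localhost"
# here means the server running inside that pod, not the other replicas.
print(server.target)
```

The important point is that server.target always refers to the local server; the remote replicas are reached through the GrpcChannelCache entries shown in your log, using the hostnames from the cluster spec.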

When you start a distributed training job, the master waits for all the other replicas in the cluster (parameter servers and workers) to check in before the job starts. That's what the log messages from the master indicate.

The log messages indicate the workers should be starting the server and waiting to be told what to do by the master.

So it looks like there is a problem with communication between the workers and master.

Can you provide a complete set of logs from the master and worker?

Are any of the containers getting restarted?

@EdwardZhang88 (Author)

@jlewi Thanks for the reply. First of all, all the Pods are running without any container restart.
[screenshot: pod status, all tf-smoke pods Running with no restarts]
I also tried to verify the network connectivity. Surprisingly, it turns out that ping is not installed in the original tensorflow Docker image, so I did the following to get ping installed first:

  1. echo "nameserver 61.132.163.68" > /etc/resolv.conf
  2. apt-get update
  3. apt-get install inetutils-ping

Below is the hosts info on each Pod.

From master to workers:

[screenshot: ping from the master pod to the worker pods]

From worker0 to master:

[screenshot: ping from worker0 to the master pod]

So it looks like the 'virtual' IP addresses are all pingable from one another. But since this is a minikube environment, localhost is used, so I don't know if we can tell anything from this. As for the logs, the worker logs are similar. Below is the worker0 log:

INFO:root:Tensorflow version: 1.4.0
INFO:root:Tensorflow git version: v1.4.0-rc1-11-g130a514
INFO:root:tf_config: {u'environment': u'cloud', u'cluster': {u'worker': [u'tf-smoke-worker-0uf4-0:2222', u'tf-smoke-worker-0uf4-1:2222'], u'ps': [u'tf-smoke-ps-0uf4-0:2222'], u'master': [u'tf-smoke-master-0uf4-0:2222']}, u'task': {u'index': 0, u'type': u'worker'}}
INFO:root:task: {u'index': 0, u'type': u'worker'}
INFO:root:cluster_spec: {u'worker': [u'tf-smoke-worker-0uf4-0:2222', u'tf-smoke-worker-0uf4-1:2222'], u'ps': [u'tf-smoke-ps-0uf4-0:2222'], u'master': [u'tf-smoke-master-0uf4-0:2222']}
INFO:root:server_def: cluster { job { name: "master" tasks { value: "tf-smoke-master-0uf4-0:2222" } } job { name: "ps" tasks { value: "tf-smoke-ps-0uf4-0:2222" } } job { name: "worker" tasks { value: "tf-smoke-worker-0uf4-0:2222" } tasks { key: 1 value: "tf-smoke-worker-0uf4-1:2222"

and for the master:

protocol: "grpc"
INFO:root:Building server.
2017-12-21 14:50:16.303408: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
E1221 14:50:16.303784801 1 ev_epoll1_linux.c:1051] grpc epoll fd: 3
2017-12-21 14:50:16.321366: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> localhost:2222}
2017-12-21 14:50:16.321455: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> tf-smoke-ps-0uf4-0:2222}
2017-12-21 14:50:16.321477: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> tf-smoke-worker-0uf4-0:2222, 1 -> tf-smoke-worker-0uf4-1:2222}
2017-12-21 14:50:16.321770: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
INFO:root:Finished building server.
INFO:root:Running master.
INFO:root:Server target: grpc://localhost:2222
2017-12-21 14:50:26.352030: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2017-12-21 14:50:26.352153: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
2017-12-21 14:50:26.352163: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
2017-12-21 16:08:16.663648: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2017-12-21 16:08:16.663843: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
2017-12-21 16:08:16.664279: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1

Let me know your thought about this. In the meanwhile, I am also setting up a real cluster env to try this out. Thanks!

@jlewi (Contributor) commented Dec 22, 2017

It looks like in your shells you were pinging by IP address and not by hostname, and I didn't see any entries in the hosts files for the remote replicas. So maybe there is an issue with DNS lookup?

We also want to use the hostnames of the services, not of the pods themselves. So in the example you gave, the service names would be

tf-smoke-ps-0uf4-0
tf-smoke-master-0uf4-0
...

Can you run:

nslookup ${HOSTNAME}

for the names of some of the services on one of the pods? And also try pinging that host.
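
If installing nslookup/ping into the TensorFlow image is a hassle, a quick check with the Python that is already in the container would also tell us whether DNS and the gRPC port are working (a sketch; swap in whichever service name you want to test):

```python
import socket

# Replace with one of the service names from your cluster spec.
host, port = "tf-smoke-ps-0uf4-0", 2222

# Does the service name resolve at all?
print(socket.getaddrinfo(host, port))

# Can we actually open a TCP connection to the gRPC port?
sock = socket.create_connection((host, port), timeout=5)
print("connected to %s:%d" % (host, port))
sock.close()
```

If getaddrinfo fails on the service name but works on the pod IP, that points at a DNS problem rather than at TensorFlow.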

@EdwardZhang88 (Author)

@jlewi So I tried the same YAML and tf_smoke.py on a single K8S node and the issue was gone. I guess it's just an issue with minikube then.
