
tf_smoke.py distributed computing doesn't work on minikube #238

Closed
EdwardZhang88 opened this issue Dec 21, 2017 · 4 comments

@EdwardZhang88

I just started learning K8S this month and found this CRD initiative to support TensorFlow on K8S really helpful. However, while I am able to run a local MNIST test on a single Pod in my minikube successfully, I am now stuck on the distributed task tf_smoke.py. Here are the details:
minikube version: v0.24.1
tensorflow:1.4.0
tf_operator:gcr.io/tf-on-k8s-dogfood/tf_operator:v20171215-eb0fd5f

Below is the tf_smoke.yaml I used to create this TfJob. Since I already had a local copy of the tensorflow 1.4 image, I specified imagePullPolicy as Never.
apiVersion: "tensorflow.org/v1alpha1"
kind: "TfJob"
metadata:
  name: "tf-smoke"
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - image: registry.cn-hangzhou.aliyuncs.com/denverdino/tensorflow:1.4.0
              imagePullPolicy: Never
              name: tensorflow
              command: ["python"]
              args: [/workdir/script/tf_smoke.py]
              volumeMounts:
                - name: workdir
                  mountPath: /workdir
          volumes:
            - name: workdir
              hostPath:
                path: /home/docker/tensorflow_training
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - image: registry.cn-hangzhou.aliyuncs.com/denverdino/tensorflow:1.4.0
              imagePullPolicy: Never
              name: tensorflow
              command: ["python"]
              args: [/workdir/script/tf_smoke.py]
              volumeMounts:
                - name: workdir
                  mountPath: /workdir
          volumes:
            - name: workdir
              hostPath:
                path: /home/docker/tensorflow_training
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: PS

Once I submit it with kubectl create -f tf_smoke.yaml, the master constantly displays

2017-12-21 10:03:28.949380: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2017-12-21 10:03:28.949441: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
2017-12-21 10:03:28.949450: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1

And the other 2 workers just hang at 'INFO:root:Running Worker code.'.

There are 2 things I don't understand.

  1. The logs show that all 3 pods print the same 'Started server with target: grpc://localhost:2222', even though the generated cluster spec gives each pod a unique hostname. Is this why the master fails to contact the other 2 workers, i.e. because 2222 is already occupied by the master?

{"cluster":{"master":["tf-smoke-master-jyab-0:2222"],"ps":["tf-smoke-ps-jyab-0:2222"],"worker":["tf-smoke-worker-jyab-0:2222","tf-smoke-worker-jyab-1:2222"]}

  2. Why do we need to put the graph and ops in the master rather than in the workers? (See the sketch after this list.)

     if job_type == "ps":
       logging.info("Running PS code.")
       server.join()
     elif job_type == "worker":
       logging.info("Running Worker code.")

       # The worker just blocks because we let the master assign all ops.
       server.join()
     elif job_type == "master" or not job_type:
       logging.info("Running master.")
       with tf.device(device_func):
         run(server=server, cluster_spec=cluster_spec)
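
To make question 2 concrete, this is roughly the pattern the master follows once the other replicas check in (a minimal sketch, not the actual tf_smoke.py run(); the device strings and ops are illustrative):

```python
import tensorflow as tf

def run(server, cluster_spec):
    # Only the master builds the graph; it pins variables to the PS and
    # compute to a worker, then drives everything through one session.
    with tf.device("/job:ps/task:0"):
        w = tf.Variable(tf.zeros([10]))
    with tf.device("/job:worker/task:0"):
        total = tf.reduce_sum(w + 1.0)

    # The session connects to the master's own in-process gRPC server
    # (server.target, i.e. grpc://localhost:2222), which dispatches the
    # placed ops to the PS and workers over gRPC.
    with tf.Session(server.target) as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(total))
```

The workers never build this graph themselves; they just host whatever ops the master's session sends them, which is why they sit in server.join().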

Initially, I thought it was because I hadn't created the ConfigMap that specifies grpcServerFilePath: /opt/mlkube/grpc_tensorflow_server/grpc_tensorflow_server.py. However, it still doesn't work even after I reinstalled the tf_operator with this tf-job-operator-config ConfigMap.
If anyone knows what is going wrong, please leave a comment. Thanks!

@jlewi (Contributor) commented Dec 21, 2017

My guess would be some sort of networking issue with minikube. Can you try pinging the master from the workers e.g.

kubectl exec -ti tf-smoke-worker-jyab-0 /bin/bash
ping tf-smoke-master-jyab-0

When you call tf.train.Server you are starting a TensorFlow server on port 2222 in another process. The Python code that is executing locally on that worker then connects to that gRPC server and uses it to talk to the other gRPC servers in the cluster. It uses "localhost" because the server is running on the same machine (or, in the case of a K8s pod, the same pod).
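
Concretely, each replica does something along these lines (a sketch against the TF 1.x API, not the exact tf_smoke.py code; the TF_CONFIG handling is simplified):

```python
import json
import os

import tensorflow as tf

# The operator injects TF_CONFIG with the cluster spec plus this replica's
# own job type and task index (this is the tf_config your logs print).
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
cluster = tf.train.ClusterSpec(tf_config["cluster"])
task = tf_config["task"]

# Every pod starts its own gRPC server on its own port 2222.
server = tf.train.Server(cluster,
                         job_name=task["type"],
                         task_index=task["index"],
                         protocol="grpc")

# The target is grpc://localhost:2222 on every pod, because "localhost"
# here means the server running inside that pod, not the other replicas.
print(server.target)
```

The important point is that server.target always refers to the local server; the remote replicas are reached through the GrpcChannelCache entries shown in your log, using the hostnames from the cluster spec.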

When you start a distributed training job, the master waits for all the other replicas in the cluster (parameter servers and workers) to check in before the job starts. That's what the log messages from the master indicate.

The log messages indicate the workers should be starting the server and waiting to be told what to do by the master.

So it looks like there is a problem with communication between the workers and master.

Can you provide a complete set of logs from the master and worker?

Are any of the containers getting restarted?

@EdwardZhang88 (Author)

@jlewi Thanks for the reply. First of all, all the Pods are running without any container restart.
[screenshot: pod status, all tf-smoke pods Running with no restarts]
I also tried to verify the network connectivity. Surprisingly, it turns out that ping is not installed in the original tensorflow Docker image, so I did the following to get ping installed first:

  1. echo "nameserver 61.132.163.68" > /etc/resolv.conf
  2. apt-get update
  3. apt-get install inetutils-ping

Below is the hosts info on each Pod.

From master to workers:

[screenshot: ping from the master pod to the worker pods]

From worker0 to master:

[screenshot: ping from worker0 to the master pod]

So it looks like the 'virtual' IP addresses are all pingable from one another. But since this is a minikube environment, localhost is used, so I don't know if we can tell anything from this. As for the logs, the worker logs are similar. Below is the worker0 log:

INFO:root:Tensorflow version: 1.4.0
INFO:root:Tensorflow git version: v1.4.0-rc1-11-g130a514
INFO:root:tf_config: {u'environment': u'cloud', u'cluster': {u'worker': [u'tf-smoke-worker-0uf4-0:2222', u'tf-smoke-worker-0uf4-1:2222'], u'ps': [u'tf-smoke-ps-0uf4-0:2222'], u'master': [u'tf-smoke-master-0uf4-0:2222']}, u'task': {u'index': 0, u'type': u'worker'}}
INFO:root:task: {u'index': 0, u'type': u'worker'}
INFO:root:cluster_spec: {u'worker': [u'tf-smoke-worker-0uf4-0:2222', u'tf-smoke-worker-0uf4-1:2222'], u'ps': [u'tf-smoke-ps-0uf4-0:2222'], u'master': [u'tf-smoke-master-0uf4-0:2222']}
INFO:root:server_def: cluster { job { name: "master" tasks { value: "tf-smoke-master-0uf4-0:2222" } } job { name: "ps" tasks { value: "tf-smoke-ps-0uf4-0:2222" } } job { name: "worker" tasks { value: "tf-smoke-worker-0uf4-0:2222" } tasks { key: 1 value: "tf-smoke-worker-0uf4-1:2222"

and for the master:

protocol: "grpc"
INFO:root:Building server.
2017-12-21 14:50:16.303408: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
E1221 14:50:16.303784801 1 ev_epoll1_linux.c:1051] grpc epoll fd: 3
2017-12-21 14:50:16.321366: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> localhost:2222}
2017-12-21 14:50:16.321455: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> tf-smoke-ps-0uf4-0:2222}
2017-12-21 14:50:16.321477: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> tf-smoke-worker-0uf4-0:2222, 1 -> tf-smoke-worker-0uf4-1:2222}
2017-12-21 14:50:16.321770: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
INFO:root:Finished building server.
INFO:root:Running master.
INFO:root:Server target: grpc://localhost:2222
2017-12-21 14:50:26.352030: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2017-12-21 14:50:26.352153: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
2017-12-21 14:50:26.352163: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
2017-12-21 16:08:16.663648: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2017-12-21 16:08:16.663843: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
2017-12-21 16:08:16.664279: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1

Let me know your thought about this. In the meanwhile, I am also setting up a real cluster env to try this out. Thanks!

@jlewi (Contributor) commented Dec 22, 2017

It looks like in your shells you were pinging by IP address and not by hostname, and I didn't see any entries in the hosts files for the remote replicas. So maybe there is an issue with DNS lookup?

We also want to use the hostnames of the services, not of the pods themselves. So in the example you gave, the service names would be

tf-smoke-ps-0uf4-0
tf-smoke-master-0uf4-0
...

Can you run:

nslookup ${HOSTNAME}

for the names of some of the services on one of the pods? And also try pinging that host.
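
If installing nslookup/ping into the TensorFlow image is a hassle, a quick check with the Python that is already in the container would also tell us whether DNS and the gRPC port are working (a sketch; swap in whichever service name you want to test):

```python
import socket

# Replace with one of the service names from your cluster spec.
host, port = "tf-smoke-ps-0uf4-0", 2222

# Does the service name resolve at all?
print(socket.getaddrinfo(host, port))

# Can we actually open a TCP connection to the gRPC port?
sock = socket.create_connection((host, port), timeout=5)
print("connected to %s:%d" % (host, port))
sock.close()
```

If getaddrinfo fails on the service name but works on the pod IP, that points at a DNS problem rather than at TensorFlow.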

@EdwardZhang88 (Author)

@jlewi So I tried the same YAML and tf_smoke.py on a single K8S node and the issue was gone. I guess it's just an issue with minikube then.
