tf_smoke.py distributed computing doesn't work on minikube #238
Comments
My guess would be some sort of networking issue with minikube. Can you try pinging the master from the workers?
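For readers following along, here is a minimal sketch of such a connectivity check that can be run from a Python shell inside one of the worker pods; the hostname and port are taken from the cluster spec quoted later in this thread and are only illustrative.

import socket

def can_reach(host, port=2222, timeout=5):
    # Returns True if a TCP connection to host:port succeeds.
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except socket.error as err:
        print("cannot reach %s:%d -> %s" % (host, port, err))
        return False

# Run from a worker pod against the master's replica service.
print(can_reach("tf-smoke-master-jyab-0"))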
When you call tf.train.Server you are starting a TensorFlow gRPC server on port 2222 in another process. The Python code executing locally on that worker then connects to that gRPC server and uses it to talk to the other gRPC servers in the cluster. It uses "localhost" because the server is running on the same machine (or virtual machine, in the case of a K8s pod).
When you start a distributed training job, the master waits for all the other replicas in the cluster (parameter servers and workers) to check in before the job starts. That's what the log messages from the master indicate. The log messages from the workers indicate they should be starting the server and waiting to be told what to do by the master. So it looks like there is a problem with communication between the workers and the master.
Can you provide a complete set of logs from the master and the workers? Are any of the containers getting restarted?
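As a rough illustration of the pattern described above (a hedged sketch, not the actual tf_smoke.py code; the hostnames are the ones from the cluster spec quoted later in this thread):

import tensorflow as tf

# The cluster spec mirrors the one the operator hands to each replica.
cluster = tf.train.ClusterSpec({
    "master": ["tf-smoke-master-jyab-0:2222"],
    "ps": ["tf-smoke-ps-jyab-0:2222"],
    "worker": ["tf-smoke-worker-jyab-0:2222",
               "tf-smoke-worker-jyab-1:2222"],
})

# Each replica starts an in-process gRPC server bound to its own port 2222.
# The Python code in the same pod talks to the cluster through this server,
# which is why the target refers to "localhost".
server = tf.train.Server(cluster, job_name="master", task_index=0)

# On the master, creating a session against the local server blocks until
# every worker and parameter server has checked in; that is what produces
# the "CreateSession still waiting for response from worker ..." messages.
with tf.Session(server.target) as sess:
    print(sess.run(tf.constant("cluster is up")))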
@jlewi Thanks for the reply. First of all, all the Pods are running without any container restart.
Let me know your thoughts about this. In the meantime, I am also setting up a real cluster environment to try this out. Thanks!
It looks like in your shells you were pinging by IP address and not by hostname, and I didn't see any entries in the hosts files for the remote replicas. So maybe there is an issue with DNS lookup? We also want to use the hostnames of the services, not of the pods themselves. So in the example you gave, the service names would be the ones listed in the cluster spec (e.g. tf-smoke-master-jyab-0).
Can you run nslookup ${HOSTNAME} for the names of some of the services on one of the pods? And also try pinging that host.
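An equivalent check from a Python shell inside one of the pods might look like the sketch below; the service names are taken from the cluster spec later in this thread and are only illustrative.

import socket

# Resolve the replica service names from inside a pod, roughly what
# `nslookup ${HOSTNAME}` would report.
for host in ["tf-smoke-master-jyab-0",
             "tf-smoke-worker-jyab-0",
             "tf-smoke-worker-jyab-1",
             "tf-smoke-ps-jyab-0"]:
    try:
        print("%s resolves to %s" % (host, socket.gethostbyname(host)))
    except socket.error as err:
        print("%s lookup failed: %s" % (host, err))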
@jlewi So I tried the same YAML and tf_smoke.py on a single K8S node and the issue was gone. I guess it's just an issue with Minikube then. |
I just started learning K8S this month and found this CRD initiative to support TensorFlow on K8S really helpful. However, while I am able to run a local MNIST test on a single Pod in my minikube successfully, I am now stuck on the distributed tf_smoke.py task. Here are the details:
minikube version: v0.24.1
tensorflow:1.4.0
tf_operator:gcr.io/tf-on-k8s-dogfood/tf_operator:v20171215-eb0fd5f
Below is the tf_smoke.yaml I used to create this tfjob. Since I already had a local copy of tensorflow 1.4, I just specified imagePullPolicy as Never.
apiVersion: "tensorflow.org/v1alpha1"
kind: "TfJob"
metadata:
  name: "tf-smoke"
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - image: registry.cn-hangzhou.aliyuncs.com/denverdino/tensorflow:1.4.0
              imagePullPolicy: Never
              name: tensorflow
              command: ["python"]
              args: [/workdir/script/tf_smoke.py]
              volumeMounts:
                - name: workdir
                  mountPath: /workdir
          volumes:
            - name: workdir
              hostPath:
                path: /home/docker/tensorflow_training
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - image: registry.cn-hangzhou.aliyuncs.com/denverdino/tensorflow:1.4.0
              imagePullPolicy: Never
              name: tensorflow
              command: ["python"]
              args: [/workdir/script/tf_smoke.py]
              volumeMounts:
                - name: workdir
                  mountPath: /workdir
          volumes:
            - name: workdir
              hostPath:
                path: /home/docker/tensorflow_training
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: PS
Once I submit it with kubectl create -f tf_smoke.yaml, the master constantly displays:
2017-12-21 10:03:28.949380: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2017-12-21 10:03:28.949441: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
2017-12-21 10:03:28.949450: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
And the other 2 workers are just hanging there at 'INFO:root:Running Worker code'.
There are 2 things I don't understand.
{"cluster":{"master":["tf-smoke-master-jyab-0:2222"],"ps":["tf-smoke-ps-jyab-0:2222"],"worker":["tf-smoke-worker-jyab-0:2222","tf-smoke-worker-jyab-1:2222"]}
if job_type == "ps":
logging.info("Running PS code.")
server.join()
elif job_type == "worker":
logging.info("Running Worker code.")
The worker just blocks because we let the master assign all ops.
server.join()elif job_type == "master" or not job_type:
logging.info("Running master.")
with tf.device(device_func):
run(server=server, cluster_spec=cluster_spec)
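For context, the variables in the snippet above (cluster_spec, job_type, server, device_func) are built from the TF_CONFIG environment variable that the operator sets on each replica. The following is a hedged sketch of that setup, not the verbatim tf_smoke.py code; the field names are assumptions based on the cluster spec shown above.

import json
import os

import tensorflow as tf

# TF_CONFIG carries the cluster spec (quoted above) plus this replica's
# own job type and index.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
cluster_spec = tf_config.get("cluster", {})
task = tf_config.get("task", {})
job_type = task.get("type", "")
task_index = task.get("index", 0)

# Start this replica's gRPC server; its address must match the entry for
# (job_type, task_index) in the cluster spec.
server = tf.train.Server(tf.train.ClusterSpec(cluster_spec),
                         job_name=job_type,
                         task_index=task_index)

# Place variables on the parameter servers and ops on the workers;
# tf.train.replica_device_setter is the standard way to build such a
# device function.
device_func = tf.train.replica_device_setter(
    cluster=tf.train.ClusterSpec(cluster_spec))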
Initially, I thought it was because I hadn't created the ConfigMap that specifies grpcServerFilePath: /opt/mlkube/grpc_tensorflow_server/grpc_tensorflow_server.py. However, it still doesn't work even after I reinstalled tf-job with this tf-job-operator-config ConfigMap.
If anyone knows what is going wrong, please leave a comment. Thanks!