Run tfjob failed with self-built image #840

Closed
Suozz opened this issue Oct 8, 2018 · 11 comments

Suozz commented Oct 8, 2018

I built an image from a Dockerfile identical to tf-operator/examples/tf_sample/Dockerfile. When I apply and run a TFJob that uses it, the container fails without producing any logs.

Suozz changed the title from "build tfjob image failed" to "run tfjob failed with self-built image" on Oct 8, 2018
gaocegege (Member) commented

Can you give us the details of the pod: kubectl describe pods <pod>

Suozz (Author) commented Oct 8, 2018

> Can you give us the details of the pod: kubectl describe pods <pod>

The description of the master pod:

Name:           example-job1-master-0
Namespace:      default
Node:           192.168.9.158/192.168.9.158
Start Time:     Mon, 08 Oct 2018 15:15:27 +0800
Labels:         group_name=kubeflow.org
                tf-replica-index=0
                tf-replica-type=master
                tf_job_name=example-job1
Annotations:
Status:         Failed
IP:             172.127.61.10
Controlled By:  TFJob/example-job1
Containers:
  tensorflow:
    Container ID:   docker://af68b1a2a1f46c27ed35d98c9280e52f13b54a75bfe9441402c7edfa38a65d0e
    Image:          192.168.9.155:5000/szz/tfjobtest:1.0
    Image ID:       docker-pullable://192.168.9.155:5000/szz/tfjobtest@sha256:cd56d14bd7a97ae3776aae696678fbac6b81298588fd50237205fe777d35f1fb
    Port:           2222/TCP
    Host Port:      0/TCP
    State:          Terminated
      Reason:       Error
      Exit Code:    132
      Started:      Mon, 08 Oct 2018 15:15:30 +0800
      Finished:     Mon, 08 Oct 2018 15:15:30 +0800
    Ready:          False
    Restart Count:  0
    Environment:
      TF_CONFIG:  {"cluster":{"master":["example-job1-master-0:2222"],"ps":["example-job1-ps-0:2222","example-job1-ps-1:2222"],"worker":["example-job1-worker-0:2222"]},"task":{"type":"master","index":0},"environment":"cloud"}
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-5rl6d (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-5rl6d:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-5rl6d
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:
Tolerations:
Events:
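
The TF_CONFIG value in the pod description is a JSON document that describes the cluster and this replica's role in it. As a minimal sketch (the helper name `describe_tf_config` is made up for illustration and is not part of tf-operator or TensorFlow), it can be decoded like this to confirm that the operator injected what the job expects:

```python
import json

# TF_CONFIG exactly as it appears in the pod description.
TF_CONFIG = (
    '{"cluster":{"master":["example-job1-master-0:2222"],'
    '"ps":["example-job1-ps-0:2222","example-job1-ps-1:2222"],'
    '"worker":["example-job1-worker-0:2222"]},'
    '"task":{"type":"master","index":0},"environment":"cloud"}'
)


def describe_tf_config(raw):
    """Hypothetical helper: summarize a TF_CONFIG JSON string."""
    config = json.loads(raw)
    task = config["task"]
    # Count the replicas listed under each role in the cluster spec.
    replicas = {role: len(hosts) for role, hosts in config["cluster"].items()}
    return {"role": task["type"], "index": task["index"], "replicas": replicas}


summary = describe_tf_config(TF_CONFIG)
print(summary)
```

Inside a running container the same string would normally be read from the TF_CONFIG environment variable rather than hard-coded.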

gaocegege (Member) commented

I think the error comes from your code, since the exit code is 132.
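
The exit code carries a hint of its own. Shells and container runtimes conventionally report 128 + N when a process is killed by signal N, so 132 would correspond to signal 4, which is SIGILL (illegal instruction) on Linux. That reading is an inference from the convention, not something stated in this thread. A small sketch of the decoding, with a hypothetical helper name:

```python
import signal


def decode_exit_code(code):
    """Hypothetical helper: map an exit code to a signal name, if any."""
    # By convention, codes above 128 mean "terminated by signal (code - 128)".
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            return None  # no signal with that number on this platform
    return None


print(decode_exit_code(132))
```

A process killed by SIGILL dies before any Python-level logging runs, which would be consistent with a container that fails without logs.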

gaocegege (Member) commented

Maybe you could print something inside the container to debug.
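
One way to act on this suggestion, sketched under the assumption that the image ships Python 3.5 or newer (the default tensorflow/tensorflow tags of that era may not), is to import the suspect module in a child process and print how the child ended. The `probe_import` helper is hypothetical. The example probes `json` so that it runs anywhere; inside the image the module of interest would be `tensorflow`.

```python
import signal
import subprocess
import sys


def probe_import(module_name):
    """Hypothetical helper: import a module in a child process and report the outcome."""
    result = subprocess.run(
        [sys.executable, "-c", "import " + module_name],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if result.returncode < 0:
        # On POSIX, a negative return code means the child was killed by that signal.
        try:
            name = signal.Signals(-result.returncode).name
        except ValueError:
            name = "signal %d" % -result.returncode
        return "%s: killed by %s" % (module_name, name)
    if result.returncode != 0:
        return "%s: failed with exit code %d" % (module_name, result.returncode)
    return "%s: imported cleanly" % module_name


# Flush so the line reaches the container log even if the process dies right after.
print(probe_import("json"), flush=True)
```

Because the import happens in a child process, the parent survives a crash and can still write a line that shows up in `kubectl logs`.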

Suozz (Author) commented Oct 8, 2018

> I think the error comes from your code, since the exit code is 132.

My code is the same as tf-operator/examples/tf_sample/tf_smoke.py.
It contains many logging statements, but no log output appears.

gaocegege (Member) commented

I think the error is from the image, then.

Please take a look at tensorflow/tensorflow#19584

Suozz (Author) commented Oct 8, 2018

> I think the error is from the image, then.
>
> Please take a look at tensorflow/tensorflow#19584

But when I use the official sample image, gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff, the code runs successfully. Do you know how this official image is built?

gaocegege (Member) commented

https://github.com/kubeflow/tf-operator/blob/master/examples/tf_sample/Dockerfile

Suozz (Author) commented Oct 8, 2018

> https://github.com/kubeflow/tf-operator/blob/master/examples/tf_sample/Dockerfile

I had already built an image from this Dockerfile before I opened this issue, and that image is what causes my error. So how do I build a working image for a TFJob?

gaocegege (Member) commented

I think you just need to make sure that TensorFlow is installed correctly in the image.

Suozz (Author) commented Oct 9, 2018

> I think you just need to make sure that TensorFlow is installed correctly in the image.

Thank you, I solved the problem. After I changed the base image in the Dockerfile from tensorflow/tensorflow:1.8.0 to tensorflow/tensorflow:1.5.0, my image ran successfully.
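
A plausible explanation for why the downgrade helps, offered as an assumption because the thread does not confirm it: starting with TensorFlow 1.6 the prebuilt binaries use AVX instructions, and on a CPU without AVX loading them can end in an illegal-instruction crash. The following sketch checks for the flag on an x86 Linux node by reading /proc/cpuinfo; both helper names are hypothetical.

```python
def has_cpu_flag(cpuinfo_text, flag):
    """Hypothetical helper: report whether a CPU flag appears in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # x86 Linux lists CPU features on lines whose key is "flags".
        if key.strip() == "flags" and flag in value.split():
            return True
    return False


def node_has_avx(path="/proc/cpuinfo"):
    """Read the real file; only meaningful on x86 Linux."""
    try:
        with open(path) as handle:
            return has_cpu_flag(handle.read(), "avx")
    except OSError:
        return False  # file missing, for example when not running on Linux


print("AVX available:", node_has_avx())
```

The flag is matched as a whole token, so a CPU that lists only `avx2` or `avx512f` without plain `avx` is not counted as a match.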

Suozz closed this as completed on Oct 9, 2018