no matches for tensorflow.org/, Kind=TfJob #173

Closed
DjangoPeng opened this issue Nov 25, 2017 · 11 comments

Comments

@DjangoPeng
Member

I set up a ConfigMap for a bare-metal environment and hit a problem when creating a TF job. I followed the instructions in README.md step by step. The error log is below:

error: unable to recognize "tf_job.yaml": no matches for tensorflow.org/, Kind=TfJob
@jlewi
Contributor

jlewi commented Nov 25, 2017

This error indicates that the Custom Resource for TfJob could not be created.

Please take a look at this comment for some initial steps to try to identify the particular cause.

There's a high likelihood that you are hitting the same issue with the ConfigMap not being found as in #149.
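
One way to check whether the custom resource type is registered is to list the cluster's CRDs and look for one in the expected API group. The sketch below only parses text of the kind produced by `kubectl get crd -o name`. The helper name `crds_in_group` and the sample CRD names are assumptions for illustration, not taken from this thread.

```python
def crds_in_group(kubectl_output, group):
    """Return CRD names from `kubectl get crd -o name` output that belong to `group`.

    Each output line looks like "<resource-type>/<plural>.<group>"; only the
    part after the last "/" is inspected.
    """
    suffix = "." + group
    names = []
    for line in kubectl_output.splitlines():
        name = line.strip().rsplit("/", 1)[-1]
        if name.endswith(suffix):
            names.append(name)
    return names


# Hypothetical output; the CRD names here are illustrative assumptions.
sample = """\
customresourcedefinition.apiextensions.k8s.io/tfjobs.tensorflow.org
customresourcedefinition.apiextensions.k8s.io/widgets.example.com
"""
print(crds_in_group(sample, "tensorflow.org"))
```

An empty result for the group named in the error would be consistent with the "no matches" message: the API server has no type registered for that group and kind.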

@DjangoPeng
Member Author

Thanks for replying. Let me have a look at the comment and refs.

@DjangoPeng
Member Author

DjangoPeng commented Nov 27, 2017

@jlewi I checked my setup following your comment in #149. The TF Job CRD is in fact not created, but the TF Job deployment and pod are running. Looking at the logs of the TF Job pod, it seems to be a namespace permission problem.

E1127 03:01:44.508964       1 election.go:226] error retrieving resource lock default/tf-operator: endpoints "tf-operator" is forbidden: User "system:serviceaccount:default:default" cannot get endpoints in the namespace "default"
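
A log line like the one above names the denied user, verb, resource, and namespace, which is enough to re-check the permission with `kubectl auth can-i`. The following is a minimal sketch, assuming the message keeps this exact wording. The function names `parse_forbidden` and `can_i_command` are hypothetical, and the command is only assembled here, not executed.

```python
import re

# Matches the "forbidden" part of a Kubernetes RBAC denial message.
FORBIDDEN_RE = re.compile(
    r'User "(?P<user>[^"]+)" cannot (?P<verb>\S+) (?P<resource>\S+) '
    r'in the namespace "(?P<namespace>[^"]+)"'
)


def parse_forbidden(log_line):
    """Extract user, verb, resource, and namespace, or return None if absent."""
    match = FORBIDDEN_RE.search(log_line)
    return match.groupdict() if match else None


def can_i_command(info):
    """Build the argv for a `kubectl auth can-i` check that impersonates the user."""
    return ["kubectl", "auth", "can-i", info["verb"], info["resource"],
            "--as", info["user"], "--namespace", info["namespace"]]


LOG_LINE = (
    'E1127 03:01:44.508964       1 election.go:226] error retrieving resource lock '
    'default/tf-operator: endpoints "tf-operator" is forbidden: User '
    '"system:serviceaccount:default:default" cannot get endpoints in the namespace "default"'
)

info = parse_forbidden(LOG_LINE)
print(" ".join(can_i_command(info)))
```

Running the printed command before and after changing the RBAC settings is one way to confirm whether the service account has gained the missing permission.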

@DjangoPeng
Member Author

DjangoPeng commented Nov 27, 2017

I forgot to set rbac.install=true when I ran helm install. With that set, the tf-operator runs fine. However, docker pull gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff fails. Could you have a look? @jlewi

Error response from daemon: Get https://gcr.io/v1/_ping: dial tcp 64.233.189.82:443: i/o timeout

@jlewi
Contributor

jlewi commented Nov 28, 2017

I confirmed the docker image exists. I also confirmed the bucket is public. Could this be an issue with gcr.io being blocked by the firewall?
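
A quick way to test the firewall hypothesis is to attempt a plain TCP connection to the registry host from the machine that fails to pull. This is a minimal sketch using only the standard library. The helper name `can_reach` is hypothetical, and a successful connection shows only that the port is reachable, not that the registry API itself works.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False


# Example (run on the node that fails to pull):
#   can_reach("gcr.io")
```

A `False` result from the affected node, combined with a `True` result from a machine on another network, would point to blocking rather than a missing image.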

In the meantime you should be able to build the Docker image from the repo using https://github.com/tensorflow/k8s/blob/master/examples/tf_sample/build_and_push.py

@DjangoPeng
Member Author

Ok, I'm going to build the Docker image. Thanks!

@DjangoPeng
Member Author

@jlewi I noticed that build_and_push.py calls subprocess directly to build and push the Docker image. I think replacing that with the Docker SDK would be better and more portable. What do you think?
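
For illustration, a build-and-push step written against the Docker SDK for Python might look roughly like the sketch below. This is not the repository's build_and_push.py. The function names `split_image_ref` and `build_and_push` are hypothetical, and the code assumes the third-party `docker` package is installed and a local Docker daemon is running.

```python
def split_image_ref(image):
    """Split "repo/name:tag" into (repository, tag); the tag defaults to "latest"."""
    repository, sep, tag = image.rpartition(":")
    # A ":" that belongs to a registry port (e.g. "host:5000/name") is not a tag.
    if not sep or "/" in tag:
        return image, "latest"
    return repository, tag


def build_and_push(context_dir, image):
    """Build the Dockerfile in `context_dir` and push the result as `image`."""
    import docker  # third-party: pip install docker

    repository, tag = split_image_ref(image)
    client = docker.from_env()
    client.images.build(path=context_dir, tag=f"{repository}:{tag}")
    # push() streams status dictionaries from the daemon; print them as they arrive.
    for line in client.images.push(repository, tag=tag, stream=True, decode=True):
        print(line)
```

Compared with shelling out, this avoids depending on the `docker` CLI being on the PATH, though it adds a Python dependency and still needs access to a daemon.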

@jimexist
Member

gcr.io is blocked; one needs a VPN to access it.

@DjangoPeng
Member Author

DjangoPeng commented Nov 28, 2017

Anyway, the no matches for tensorflow.org/, Kind=TfJob issue has been figured out, so I'm closing this issue. Others facing the same problem can follow #174, and I will create a PR adding a troubleshooting guide this week.

@wydwww

wydwww commented Jan 10, 2018

Hi @jlewi, is there a solution for the same issue when tf-job-operator is deployed by Kubeflow instead of helm?
The new issue is kubeflow/kubeflow#109.
Thanks

@alien-xt

I have the same problem. How can I fix it?

describe:
"
the provided version "kubeflow.org/v1alpha2" has no relevant versions: group kubeflow.org has not been registered
no matches for kubeflow.org/, Kind=TFJob
"
