tf-job-operator RBAC #929

Closed
llegolas opened this issue Feb 2, 2019 · 12 comments

Comments

@llegolas

llegolas commented Feb 2, 2019

The environment is OpenShift 3.11 with Kubeflow v0.4.1.
tf-job-operator and pytorch-operator are both set with deploymentScope: namespace.
According to the jsonnet logic, this setting produces Roles and RoleBindings instead of ClusterRoles and ClusterRoleBindings.
As a result, both pods fail with a similar error:

$  oc logs tf-job-operator-v1beta1-6485984df7-n95lg
{"filename":"app/server.go:71","level":"info","msg":"Scoping operator to namespace kubeflow41","time":"2019-02-02T14:20:15Z"}
{"filename":"app/server.go:75","level":"info","msg":"[API Version: v1beta1 Version: v0.1.0-alpha Git SHA: f9d5f1e Go Version: go1.9.2 Go OS/Arch: linux/amd64]","time":"2019-02-02T14:20:15Z"}
W0202 14:20:15.849199       1 client_config.go:533] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
{"filename":"app/server.go:195","level":"error","msg":"customresourcedefinitions.apiextensions.k8s.io \"tfjobs.kubeflow.org\" is forbidden: User \"system:serviceaccount:kubeflow41:tf-job-operator\" cannot get customresourcedefinitions.apiextensions.k8s.io at the cluster scope: no RBAC policy matched","time":"2019-02-02T14:20:15Z"}

If I reconfigure the roles (tf-job-operator/pytorch-operator) as ClusterRoles and create the corresponding ClusterRoleBindings, everything works as expected.
Any ideas?
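
The workaround described above converts the whole operator role to a ClusterRole. A narrower variant, given here as an untested sketch, would keep the namespaced Role and add only a cluster-scoped read on the one CRD named in the error. The Python below simply emits the two manifests as JSON. The name `tf-job-operator-crd-reader` is hypothetical; the namespace and service account are taken from the log output above. Whether this grant alone is enough depends on what else the operator touches at cluster scope.

```python
import json

NAMESPACE = "kubeflow41"                  # taken from the operator log above
SERVICE_ACCOUNT = "tf-job-operator"       # taken from the operator log above
ROLE_NAME = "tf-job-operator-crd-reader"  # hypothetical name, pick your own


def crd_reader_manifests(namespace, service_account, role_name):
    """Build a ClusterRole and ClusterRoleBinding that only allow `get` on one CRD."""
    cluster_role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": role_name},
        "rules": [
            {
                "apiGroups": ["apiextensions.k8s.io"],
                "resources": ["customresourcedefinitions"],
                # Restrict the grant to the single CRD the operator checks.
                "resourceNames": ["tfjobs.kubeflow.org"],
                "verbs": ["get"],
            }
        ],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": role_name},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": role_name,
        },
        "subjects": [
            {
                "kind": "ServiceAccount",
                "name": service_account,
                "namespace": namespace,
            }
        ],
    }
    # Wrap both objects in a List so they can be applied in one step.
    return {"apiVersion": "v1", "kind": "List", "items": [cluster_role, binding]}


if __name__ == "__main__":
    manifests = crd_reader_manifests(NAMESPACE, SERVICE_ACCOUNT, ROLE_NAME)
    print(json.dumps(manifests, indent=2))
```

The output can be piped to `oc apply -f -`. The pytorch-operator would need an equivalent pair for its own CRD and service account.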

@jlewi
Contributor

jlewi commented Feb 4, 2019

@johnugeorge @richardsliu

@johnugeorge
Member

Looks like a bug that was introduced with the CRDExists check.
https://github.com/kubeflow/tf-operator/blob/master/cmd/tf-operator.v1beta1/app/server.go#L193

@cndaimin
Contributor

cndaimin commented Mar 1, 2019

What RBAC rules were configured? @llegolas

@johnugeorge
Member

@gaocegege @richardsliu @Muzry

Currently, a namespace-scoped deployment fails because the CRD check in the operator's startup code requires cluster-scope access.

Issue: #710
PR: #820

@gaocegege
Member

It is an interesting problem. It seems that we cannot check the CRD if the controller does not have cluster scope.

I cannot think of a better option than deleting the check.

@richardsliu
Contributor

Can we handle that error and print a warning instead?
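
One way to read this suggestion, sketched in Python rather than the operator's Go: downgrade only the "forbidden" case to a warning, since that response does not say whether the CRD exists, and keep "not found" as a hard error. `ApiError` and the `get_crd` callable are hypothetical stand-ins for the real client call, not tf-operator code.

```python
import logging

log = logging.getLogger("crd-check-sketch")


class ApiError(Exception):
    """Hypothetical stand-in for an API error that carries an HTTP status."""

    def __init__(self, status, message):
        super().__init__(message)
        self.status = status


def check_crd(get_crd):
    """Return True if startup may continue, False if the CRD is known to be missing.

    `get_crd` is a hypothetical zero-argument callable that fetches the CRD
    and raises ApiError on failure.
    """
    try:
        get_crd()
    except ApiError as err:
        if err.status == 403:
            # Forbidden: the check itself is not permitted, so the CRD may
            # or may not exist. Warn and let startup continue.
            log.warning("cannot verify CRD (forbidden), continuing: %s", err)
            return True
        if err.status == 404:
            # Not found: the CRD is definitely not installed.
            log.error("CRD is not installed: %s", err)
            return False
        raise  # anything else is unexpected; surface it to the caller
    return True
```

With this shape a namespace-scoped operator would start, but a missing CRD would surface only later, when the first watch or list call fails.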

@gaocegege
Member

gaocegege commented Mar 19, 2019

A warning does not work, IMO. The controller will not work if the CRD is not registered.

Maybe we could do what a kubebuilder-generated controller does: report an error to the user and call os.Exit(1):

{"level":"error","ts":1552974825.1213772,"logger":"kubebuilder.source","msg":"if kind is a CRD, it should be installed before calling Start"}

@johnugeorge WDYT, is there any other suggestion?

@johnugeorge
Member

@gaocegege How does the controller detect this error without the check?

@gaocegege
Member

I think the informer will return an error if the kind is not registered. Before we added the check, I always ran into that error.

@richardsliu
Contributor

How did the namespace deploymentScope work before we introduced the CRD check?

@richardsliu
Contributor

@johnugeorge Any recommendations here? Should this still be a blocker for 0.5.0?

@johnugeorge
Member

Yes. Without a CRD client check, we can check the error from the TFJob List API instead. I will raise a PR soon.
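
A rough Python sketch of that idea, under stated assumptions and not the eventual PR: probe the namespaced TFJob list call at startup and interpret the HTTP status. The API server answers 404 for a resource type it does not know, which is what an uninstalled CRD looks like from a namespaced client. `ApiError`, `list_tfjobs`, and `start_or_exit` are hypothetical names; in Go the equivalent status tests would be `errors.IsNotFound` and `errors.IsForbidden` from `k8s.io/apimachinery/pkg/api/errors`.

```python
import sys


class ApiError(Exception):
    """Hypothetical stand-in for an API error that carries an HTTP status."""

    def __init__(self, status, message):
        super().__init__(message)
        self.status = status


def verify_tfjob_kind(list_tfjobs, namespace):
    """Return None if TFJobs can be listed, else a message describing the problem.

    `list_tfjobs` is a hypothetical callable taking a namespace; it needs only
    namespaced permissions, unlike a read of the CRD object itself.
    """
    try:
        list_tfjobs(namespace)
    except ApiError as err:
        if err.status == 404:
            return ("TFJob kind is not registered; "
                    "install the tfjobs.kubeflow.org CRD first")
        if err.status == 403:
            return f"not allowed to list TFJobs in namespace {namespace}"
        raise  # unexpected failure: let the caller see it
    return None


def start_or_exit(list_tfjobs, namespace):
    """Fail fast with exit code 1 when the probe reports a problem."""
    problem = verify_tfjob_kind(list_tfjobs, namespace)
    if problem is not None:
        print(problem, file=sys.stderr)
        sys.exit(1)
```

Because the probe is namespaced, it should work with a plain Role and RoleBinding, which is the point of the change.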

6 participants