Restrict privilege of Kubeflow service accounts such as tf-job-operator to namespace level #1213

Closed
technolynx opened this issue Jul 16, 2018 · 11 comments

@technolynx

We are trying to set up generic Kubernetes clusters on bare-metal machines. Our cluster serves multiple other teams that are separated by K8s namespaces. One of the teams wants to use Kubeflow for TF model training, but as we were installing Kubeflow on the cluster, we discovered that some service accounts such as tf-job-operator are requesting cluster-level access to most of the resources. For fear of compromising cluster security, we stalled the installation.

Can we limit the privilege of Kubeflow SAs to namespace level? We cluster admins can help the namespace owners set up Kubeflow once. Afterwards we will hand off to the namespace owners for their day-to-day operations. So long as Kubeflow doesn't affect other namespaces, we are good.
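
For illustration, namespace-scoping the operator would roughly mean swapping its ClusterRole/ClusterRoleBinding for a Role/RoleBinding in the tenant namespace. A minimal jsonnet sketch follows (not the actual Kubeflow manifests; the rule list, resource names, and the team-a namespace are illustrative assumptions):

local namespace = "team-a";  // hypothetical tenant namespace
{
  // Namespace-scoped Role instead of a ClusterRole
  role: {
    apiVersion: "rbac.authorization.k8s.io/v1",
    kind: "Role",
    metadata: { name: "tf-job-operator", namespace: namespace },
    rules: [
      { apiGroups: ["kubeflow.org"], resources: ["tfjobs"], verbs: ["*"] },
      { apiGroups: [""], resources: ["pods", "services", "events"], verbs: ["*"] },
    ],
  },
  // RoleBinding grants the role only within the tenant namespace
  roleBinding: {
    apiVersion: "rbac.authorization.k8s.io/v1",
    kind: "RoleBinding",
    metadata: { name: "tf-job-operator", namespace: namespace },
    roleRef: { apiGroup: "rbac.authorization.k8s.io", kind: "Role", name: "tf-job-operator" },
    subjects: [{ kind: "ServiceAccount", name: "tf-job-operator", namespace: namespace }],
  },
}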

@jlewi jlewi added area/tfjob Issues related to TFJobs. priority/p1 area/0.3.0 and removed area/tfjob Issues related to TFJobs. labels Jul 17, 2018
@jlewi
Contributor

jlewi commented Jul 17, 2018

We'll also need to restrict the operators to only claim resources in specific namespaces. Not sure whether that's possible today.

/cc @johnugeorge @gaocegege @ankushagarwal

@gaocegege
Member

Do we want one operator to handle all TFJobs and create pods in the specified namespace, or one operator per namespace?

@technolynx
Author

@gaocegege We are in a situation where (1) cluster admins don't own the Kubeflow deployment and (2) only one cluster tenant needs to run TF jobs. So there is no need for an omnipotent operator to own all TFJobs across namespaces.

In the future it may be desirable for cluster admins to take ownership of the TF operator, to consolidate Kubeflow into the cluster infrastructure. That would, however, require cluster admins to take on additional workload and training.

@jlewi
Contributor

jlewi commented Aug 24, 2018

After kubeflow/training-operator#789, tf-operator can handle two cases:

  1. operator handles jobs for all namespaces
  2. operator handles a single namespace

There is the remaining case where we want the TFJob operator to handle a subset of namespaces.
I think a reasonable solution is to just create 1 operator per namespace.

/cc @johnugeorge
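
A rough jsonnet sketch of the "one operator per namespace" idea, reusing the deploymentScope conditional that appears in core/tf-job-operator.libsonnet later in this thread; the helper function and namespace names here are illustrative assumptions:

// Select namespace-scoped RBAC kinds and stamp out one instance per tenant namespace.
local operatorFor(namespace) = {
  local deploymentScope = "namespace",
  local roleType = if deploymentScope == "cluster" then "ClusterRole" else "Role",
  local bindingType = if deploymentScope == "cluster" then "ClusterRoleBinding" else "RoleBinding",

  namespace: namespace,
  roleKind: roleType,        // "Role" when scoped to a single namespace
  bindingKind: bindingType,  // "RoleBinding" when scoped to a single namespace
};

// Instantiate once for each team that needs TFJobs:
[operatorFor(ns) for ns in ["team-a", "team-b"]]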

@jlewi
Contributor

jlewi commented Aug 24, 2018

@ankushagarwal Now that kubeflow/training-operator#789 is submitted, what are the next steps? Should we add options to our ksonnet prototypes to allow scoping of service accounts?

@ankushagarwal
Contributor

This is done.

I created a new image for the tf-operator and updated the ksonnet prototypes for tf-job to namespace-scope the operator.

#1403
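
For anyone following along, the resulting ksonnet params might look roughly like the excerpt below; deploymentScope matches the parameter visible in core/tf-job-operator.libsonnet further down, while deploymentNamespace, the component name, and the namespace value are assumptions that may differ by version:

// Hypothetical excerpt of a ksonnet app's params.libsonnet after namespace-scoping the operator
{
  components: {
    "tf-job-operator": {
      deploymentScope: "namespace",     // "cluster" or "namespace"
      deploymentNamespace: "team-a",    // assumed parameter: namespace the operator watches
    },
  },
}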

@jlewi jlewi closed this as completed Aug 24, 2018
@technolynx
Author

Thank you Jeremy and Ankush!

@johnugeorge
Member

I will add this to the pytorch operator after the initial structure is added.

@jlewi
Contributor

jlewi commented Aug 31, 2018

I used grep to search for all the ClusterRoles.

There's quite a few others defined. Here are the ones I think are most important (as opposed to optional components)

I don't think we will be able to get this fixed in 0.3; 0.3 is already oversubscribed, so I don't think there's any room for new items.

Additionally, I think we need a way to make it easy for users to scope Kubeflow to a particular namespace when it's installed. We could then use this to create an E2E test for Kubeflow scoped to a namespace.

We might also want to configure Kubeflow so that users work in a namespace that's different from the one where Kubeflow is installed.

Complete list

grep -r "ClusterRole"
weaveflux/all.libsonnet:      kind: "ClusterRole",
weaveflux/all.libsonnet:      kind: "ClusterRoleBinding",
weaveflux/all.libsonnet:        kind: "ClusterRole",
pachyderm/all.libsonnet:      kind: "ClusterRole",
pachyderm/all.libsonnet:      kind: "ClusterRoleBinding",
pachyderm/all.libsonnet:        kind: "ClusterRole",
openmpi/prototypes/openmpi.jsonnet:// @optionalParam serviceAccountName string null the service account name to run pods. The service account should have clusterRoleBinding for "view" ClusterRole.  If it was not set, service account and its role binding will be created automatically.
openmpi/serviceaccount.libsonnet:    kind: "ClusterRoleBinding",
openmpi/serviceaccount.libsonnet:      kind: "ClusterRole",
pytorch-job/pytorch-operator.libsonnet:      kind: "ClusterRole",
pytorch-job/pytorch-operator.libsonnet:      kind: "ClusterRoleBinding",
pytorch-job/pytorch-operator.libsonnet:        kind: "ClusterRole",
core/cloud-endpoints.libsonnet:      $.parts(namespace).metaClusterRole,
core/cloud-endpoints.libsonnet:      $.parts(namespace).metaClusterRoleBinding,
core/cloud-endpoints.libsonnet:      $.parts(namespace).endpointsClusterRole,
core/cloud-endpoints.libsonnet:      $.parts(namespace).endpointsClusterRoleBinding,
core/cloud-endpoints.libsonnet:    metaClusterRole:: {
core/cloud-endpoints.libsonnet:      kind: "ClusterRole",
core/cloud-endpoints.libsonnet:    },  // metaClusterRole
core/cloud-endpoints.libsonnet:    metaClusterRoleBinding:: {
core/cloud-endpoints.libsonnet:      kind: "ClusterRoleBinding",
core/cloud-endpoints.libsonnet:        kind: "ClusterRole",
core/cloud-endpoints.libsonnet:    },  // metaClusterRoleBinding
core/cloud-endpoints.libsonnet:    endpointsClusterRole:: {
core/cloud-endpoints.libsonnet:      kind: "ClusterRole",
core/cloud-endpoints.libsonnet:    },  // endpointsClusterRole
core/cloud-endpoints.libsonnet:    endpointsClusterRoleBinding:: {
core/cloud-endpoints.libsonnet:      kind: "ClusterRoleBinding",
core/cloud-endpoints.libsonnet:        kind: "ClusterRole",
core/cloud-endpoints.libsonnet:    },  // endpointsClusterRoleBinding
core/cert-manager.libsonnet:      kind: "ClusterRole",
core/cert-manager.libsonnet:      kind: "ClusterRoleBinding",
core/cert-manager.libsonnet:        kind: "ClusterRole",
core/tests/tf-job_test.jsonnet:    kind: "ClusterRole",
core/tests/tf-job_test.jsonnet:    kind: "ClusterRoleBinding",
core/tests/tf-job_test.jsonnet:      kind: "ClusterRole",
core/iap.libsonnet:      $.parts(namespace).initClusterRoleBinding,
core/iap.libsonnet:      $.parts(namespace).initClusterRole,
core/iap.libsonnet:    initClusterRoleBinding:: {
core/iap.libsonnet:      kind: "ClusterRoleBinding",
core/iap.libsonnet:        kind: "ClusterRole",
core/iap.libsonnet:    },  // initClusterRoleBinding
core/iap.libsonnet:    initClusterRole:: {
core/iap.libsonnet:      kind: "ClusterRole",
core/iap.libsonnet:    },  // initClusterRoleBinding
core/tf-job-operator.libsonnet:      local roleType = if deploymentScope == "cluster" then "ClusterRole" else "Role",
core/tf-job-operator.libsonnet:      local bindingType = if deploymentScope == "cluster" then "ClusterRoleBinding" else "RoleBinding",
core/tf-job-operator.libsonnet:      local roleType = if deploymentScope == "cluster" then "ClusterRole" else "Role",
core/tf-job-operator.libsonnet:      kind: "ClusterRole",
core/tf-job-operator.libsonnet:      kind: "ClusterRoleBinding",
core/tf-job-operator.libsonnet:        kind: "ClusterRole",
core/prometheus.libsonnet:      kind: "ClusterRole",
core/prometheus.libsonnet:      kind: "ClusterRoleBinding",
core/prometheus.libsonnet:        kind: "ClusterRole",
core/prototypes/centraldashboard.jsonnet:    kind: "ClusterRole",
core/prototypes/centraldashboard.jsonnet:    kind: "ClusterRoleBinding",
core/prototypes/centraldashboard.jsonnet:      kind: "ClusterRole",
core/prototypes/spartakus.jsonnet:    kind: "ClusterRole",
core/prototypes/spartakus.jsonnet:    kind: "ClusterRoleBinding",
core/prototypes/spartakus.jsonnet:      kind: "ClusterRole",
core/metric-collector.libsonnet:      kind: "ClusterRole",
core/metric-collector.libsonnet:      kind: "ClusterRoleBinding",
core/metric-collector.libsonnet:        kind: "ClusterRole",
argo/argo.libsonnet:      kind: "ClusterRole",
argo/argo.libsonnet:      kind: "ClusterRoleBinding",
argo/argo.libsonnet:        kind: "ClusterRole",
argo/argo.libsonnet:      kind: "ClusterRole",
argo/argo.libsonnet:      kind: "ClusterRoleBinding",
argo/argo.libsonnet:        kind: "ClusterRole",
katib/vizier.libsonnet:      kind: "ClusterRoleBinding",
katib/vizier.libsonnet:        kind: "ClusterRole",
katib/vizier.libsonnet:      kind: "ClusterRole",
seldon/json/template_0.1.json:                "kind": "ClusterRole",
seldon/json/template_0.2.json:            "kind": "ClusterRole",
seldon/json/template_0.2.json:            "kind": "ClusterRoleBinding",
seldon/json/template_0.2.json:                "kind": "ClusterRole",
seldon/prototypes/core.jsonnet:  core.parts(name, namespace, seldonVersion).rbacClusterRole(),
seldon/prototypes/core.jsonnet:  core.parts(name, namespace, seldonVersion).rbacClusterRoleBinding(),
seldon/core.libsonnet:local getClusterRole(x) = x.kind == "ClusterRole";
seldon/core.libsonnet:local getClusterRoleBinding(x) = x.kind == "ClusterRoleBinding";
seldon/core.libsonnet:      rbacClusterRole():
seldon/core.libsonnet:        local clusterRole = std.filter(getClusterRole, seldonTemplate.items)[0];
seldon/core.libsonnet:      rbacClusterRoleBinding():
seldon/core.libsonnet:        local rbacClusterRoleBinding = std.filter(getClusterRoleBinding, seldonTemplate.items)[0];
seldon/core.libsonnet:        local subject = rbacClusterRoleBinding.subjects[0]
seldon/core.libsonnet:        rbacClusterRoleBinding +
mpi-job/mpi-operator.libsonnet:      kind: "ClusterRole",
mpi-job/mpi-operator.libsonnet:      kind: "ClusterRoleBinding",
mpi-job/mpi-operator.libsonnet:        kind: "ClusterRole",
chainer-job/chainer-operator.libsonnet:      kind: "ClusterRole",
chainer-job/chainer-operator.libsonnet:      kind: "ClusterRoleBinding",
chainer-job/chainer-operator.libsonnet:        kind: "ClusterRole",
chainer-job/prototypes/chainer-operator.jsonnet:// @optionalParam createRbac string true If true (default), create ServiceAccount, ClusterRole and ClusterRoleBinding for the operator.  Otherwise, you have to create them manually.  Please see https://github.com/kubeflow/chainer-operator for required authorization for the operator.
mxnet-job/mxnet-operator.libsonnet:      kind: "ClusterRole",
mxnet-job/mxnet-operator.libsonnet:      kind: "ClusterRoleBinding",
mxnet-job/mxnet-operator.libsonnet:        kind: "ClusterRole",

@technolynx
Author

@jlewi Thank you, Jeremy, for the follow-up. I didn't realize re-scoping the operators could incur so much work. We also tried tweaking the configs to use RoleBinding instead of ClusterRoleBinding on our side, but arrived at a dead end. I think I may have raised a bad feature request that goes against Kubernetes' operator design pattern.

Our project requirements have recently changed. We are now exploring the single-tenant option as requested by the client, so our need to narrow the privilege scope of Kubeflow operators has diminished. Since this issue is no longer blocking, I think we can re-prioritize or close it if there's no other party requesting this feature.

Really sorry for the trouble caused.

@ooverandout

Hello guys!
Has this issue been resolved? I basically have the same issue as @kunpengprod did: I need to install Kubeflow in a single Kubernetes namespace due to a ClusterRoleBinding error on the central dashboard.
