[Proposal] - Dynamic Cache for Gracefully Handling RBAC Changes #2176
Comments
cc @vincepri
Could you define what exactly "gracefully handling RBAC changes" means here?
IMO this means that when changes are made to the RBAC on a ServiceAccount used by a controller, that controller won't crash or enter a blocking loop and will continue to operate as expected. Maybe there's a better way to phrase that, though.
TLDR: the controller should keep running and adjust what it watches when RBAC changes, rather than crashing or blocking.

A more detailed answer: I'm going to try answering this question with a scenario that will hopefully show what should happen when adding/removing RBAC. Let's consider a controller that reconciles the […] resource. When installing this controller, a cluster admin creates the RBAC that allows the controller to CRUD a […]. So, let's see what should happen when a […]

Now, what should happen if a […]?

Now let's consider what happens if a cluster admin says "Oops, I meant for it to be allowed to operate in the […] namespace."

For just an added bonus scenario, what happens if the cluster admin wants to no longer allow the controller to have cluster-wide permissions for […]?
One thing I just realized I omitted in the original proposal comment is that this also introduces the idea of a controller having a set of minimal required permissions (i.e., CRUD on its own CRs) and a set of optional or scopeable permissions. This is all open to discussion and subject to change based on said discussion, but these are the concepts that my proof of concept and example operator currently use. @alvaroaleman I hope this answers your questions, but I'm happy to answer any follow-up questions or make clarifications if needed.
We can't update the status of something we can't access. IMHO this is still a bit too vague. In general, there is some ongoing work to allow removing watches at runtime (#2159), but I am not sure it is a good idea to do this automatically underneath a controller by making assumptions about what the new RBAC might look like. To me personally it seems more correct to have a component that manages the RBAC and that updates the configuration of the affected controllers when it changes RBAC.
Correct, but in this particular scenario the controller would still have access to its own custom resource, so it could update its status.
Fair enough, I may just not be wording this well. My aim is to make it so that operator/controller authors can configure their controllers to use this caching layer to prevent the operator/controller from crashing, panicking, or entering some unrecoverable state when RBAC has changed.
This might be a misunderstanding due to me not providing enough context. With how my PoC is implemented, the watches/informers will be killed automatically (although we could make this configurable), but using this cache would introduce a bit of a new pattern for operator/controller authors to adopt when reconciling a CR: the author is responsible for checking whether the necessary watches/informers exist in the cache and, if they don't, recreating them. The cache won't recreate any watches/informers on its own.
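To make that pattern concrete, here is a minimal Go sketch of what a reconciler could look like under this model. The `DynamicCache` interface and its `HasInformerFor`/`AddInformerFor` methods are hypothetical stand-ins (they are not the telescopia or controller-runtime API); the point is only that the reconciler, not the cache, re-establishes missing watches and treats a Forbidden error as a signal to degrade gracefully:

```go
package controllers

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
)

// DynamicCache is a hypothetical stand-in for the proposed cache; the method
// names are illustrative only.
type DynamicCache interface {
	// HasInformerFor reports whether an informer for the given kind is
	// currently running in the cache.
	HasInformerFor(gvk schema.GroupVersionKind) bool
	// AddInformerFor starts an informer for the given kind, returning an
	// error (e.g. Forbidden) if the watch cannot be established.
	AddInformerFor(ctx context.Context, gvk schema.GroupVersionKind) error
}

type MyReconciler struct {
	Cache DynamicCache
}

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	secondary := schema.GroupVersionKind{Group: "apps", Version: "v1", Kind: "Deployment"}

	// If the informer was torn down (for example because its watch died with
	// a permissions error), the reconciler is responsible for re-adding it;
	// the cache will not recreate it on its own.
	if !r.Cache.HasInformerFor(secondary) {
		if err := r.Cache.AddInformerFor(ctx, secondary); err != nil {
			if apierrors.IsForbidden(err) {
				// RBAC currently forbids watching this type: skip the optional
				// behavior instead of crashing or requeueing forever.
				return ctrl.Result{}, nil
			}
			return ctrl.Result{}, err
		}
	}

	// ... reconcile the primary CR, using the secondary informer only if it
	// was successfully (re)established ...
	return ctrl.Result{}, nil
}
```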
Thanks for sharing this PR, I'll definitely take a look at it. IMO it would be OK for the cache to automatically remove watches that fail due to a permissions error, as long as that behavior is well documented and operator/controller authors are informed of it when they decide to use this cache implementation. That being said, we could also discuss making that behavior configurable via options.
So essentially restart the operator/controller in its entirety if the configuration is modified? I'm not sure this would resolve the core problem of preventing the operator from entering an unrecoverable state when permissions are changed, but I am open to discussing this more if it is agreed that this is the better route for functionality like this.
I just wanted to update this issue with a link to a brief design document that may help outline the goals and non-goals of this a bit better. I also attempted to highlight how the changes in #2159 could potentially be built upon to accomplish the goals of this dynamic cache. cc'ing @varshaprasad96 @joelanford @vincepri to try and get some more eyes on this. Thanks!
Looks like #2231 was recently created, asking for the ability to set the WatchErrorHandler, which is a proposed step in my design document.
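For context, the hook #2231 asks controller-runtime to expose already exists in client-go: a SharedIndexInformer accepts a custom WatchErrorHandler before it starts. Below is a rough sketch of how a dynamic cache might use it to spot permission errors; the reaction (logging and eventually removing the informer) is an assumption about the proposed behavior, not something controller-runtime does today:

```go
package main

import (
	"log"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	cfg := ctrl.GetConfigOrDie()
	clientset := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(clientset, 0)
	inf := factory.Apps().V1().Deployments().Informer()

	// SetWatchErrorHandler must be called before the informer is started.
	// A dynamic cache could use this hook to detect Forbidden errors and
	// tear the informer down instead of letting the reflector retry forever.
	if err := inf.SetWatchErrorHandler(func(r *cache.Reflector, err error) {
		if apierrors.IsForbidden(err) {
			log.Printf("watch forbidden by RBAC; a dynamic cache could remove this informer: %v", err)
			return
		}
		// Fall back to client-go's default handling for every other error.
		cache.DefaultWatchErrorHandler(r, err)
	}); err != nil {
		log.Fatal(err)
	}

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	// ... run the rest of the program ...
}
```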
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard lifecycle rules.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard lifecycle rules.

/lifecycle rotten
@everettraven It sounds like the same mechanism can be used to only watch when a CRD is installed?
I would assume so, but this is a use case I have not tested. As long as you check in your reconciliation loop that the CRD is installed and, if it is, create the watch, it should work.
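For illustration, one way to do that check is to ask a RESTMapper whether the kind is currently served; a NoMatch error is the usual signal that the CRD is not installed (or not yet discovered). The `crdInstalled` helper below is made up for this sketch and is not part of controller-runtime:

```go
package controllers

import (
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// crdInstalled reports whether the given kind is currently served by the API
// server. The mapper can be any meta.RESTMapper, e.g. manager.GetRESTMapper().
func crdInstalled(mapper meta.RESTMapper, gvk schema.GroupVersionKind) (bool, error) {
	_, err := mapper.RESTMapping(gvk.GroupKind(), gvk.Version)
	if meta.IsNoMatchError(err) {
		// The kind is not known to the API server yet; skip creating the watch.
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```

Whether this notices CRDs installed after startup depends on the RESTMapper in use, which is exactly the discovery-vs-dynamic mapper point raised in the next comment.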
Interesting! There is another issue potentially in the same basket. EDIT: My bad, that one is due to using the discovery mapper instead of the dynamic one.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to its standard lifecycle rules.

/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/reopen
@vincepri: Reopened this issue.
/lifecycle frozen
Problem Statement: As an operator author, I want to develop an operator that can handle changing permissions so that cluster admins can use Role-Based Access Control (RBAC) to scope the permissions given to my operator.
In order to grant cluster admins the ability to scope the permissions given to operators, we need to provide operator authors an easy way to handle dynamically changing permissions.
Background/Context
Questions have recently been posed in various avenues regarding the ability to scope operator permissions. There is currently a limitation with the existing cache implementations in that they are not able to reliably handle changes in RBAC and require specific permissions to be granted to their associated ServiceAccount at all times. If the permissions are removed, the controller will either crash or enter an infinite loop that blocks reconciliation.
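To illustrate the limitation with a conventional setup: every type wired into a controller through the builder gets an informer started by the shared cache when the manager starts, so the ServiceAccount needs list/watch permissions on all of those types for the lifetime of the process. The Pod/Deployment types below are purely illustrative:

```go
package controllers

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
)

type PodReconciler struct{}

func (r *PodReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

// SetupWithManager wires the controller using the standard builder. Informers
// for both the For() and Owns() types are started by the shared cache when the
// manager starts, so RBAC must allow list/watch on Pods and Deployments the
// whole time; if those permissions are revoked later, the informers start
// failing and cannot currently be removed.
func (r *PodReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}).
		Owns(&appsv1.Deployment{}).
		Complete(r)
}
```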
This proposal introduces the concept of a cache implementation that would allow dynamically handling changes in RBAC. It covers the advantages and disadvantages of the approach and introduces an existing proof of concept.
Advantages
For Operator/Controller Authors
For Cluster Admins
For the Operator/Controller Ecosystem
There are likely more advantages that are not listed here.
Disadvantages
For Operator/Controller Authors
There are likely more disadvantages that are not listed here.
Proof of Concept
I have worked on a few iterations of a PoC for this, but the latest and most promising one can be found here: https://github.com/everettraven/telescopia
There is also a sample operator that uses the telescopia library here: https://github.com/everettraven/scoped-operator-poc/tree/poc/telescopia (specifically the `poc/telescopia` branch). There is a demo in the README that I highly recommend taking a look at to get a better idea of how an operator/controller may behave with this concept implemented.

I would love to get feedback and thoughts on the advantages and disadvantages of this, as well as whether implementing it would be of value to the controller-runtime project and the community.
I look forward to discussing this further!