virt-handler: Restrict write access to nodes #246
Conversation
Signed-off-by: Luboslav Pivarc <[email protected]>
Skipping CI for Draft Pull Request.
design-proposals/shadow-node.md
Outdated
Kubevirt/Kubevirt
# Design
One alternative design that was proposed in the sig-scale call, and that may solve the same problem, is as follows:
- Introduce a new controller, let's call it token-distributor-controller.
- The token-distributor-controller has an RBAC Template CR that creates a unique ClusterRole, ClusterRoleBinding and ServiceAccount for each virt-handler pod. Example of the template from the kubevirt-handler clusterrole:
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - list
  - watch
  - get
- apiGroups:
  - ""
  resources:
  - nodes
  resourceNames:
  - <node-name>
  verbs:
  - patch
- The token-distributor-controller will then create a token for the pod via the TokenRequest API [1][2] and deliver the token to the pod on its node.
- The virt-handler pod will use a local volume mount to receive the token. An init container could be used to make sure the file is delivered by the token-distributor-controller.
- Virt-handler will start with restricted permissions that could be carefully modified so that it will work on all the objects that are related to the node.
This could be a generic solution that could help all daemonsets that are deployed with privileged RBAC, and could be a separate project altogether.
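For illustration only, a minimal client-go sketch of the kind of TokenRequest call such a token-distributor-controller could make; the helper name, namespace, audience and expiry are assumptions, not part of the proposal:

```go
package example

import (
	"context"
	"fmt"

	authenticationv1 "k8s.io/api/authentication/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// requestHandlerToken is a hypothetical helper: it asks the TokenRequest API for a
// short-lived token bound to one virt-handler pod, so the token expires with the pod.
func requestHandlerToken(ctx context.Context, c kubernetes.Interface, ns, sa string, pod *corev1.Pod) (string, error) {
	expiry := int64(3600) // illustrative value: rotate hourly
	tr := &authenticationv1.TokenRequest{
		Spec: authenticationv1.TokenRequestSpec{
			Audiences:         []string{"https://kubernetes.default.svc"},
			ExpirationSeconds: &expiry,
			// Bind the token to the specific pod so it cannot outlive it.
			BoundObjectRef: &authenticationv1.BoundObjectReference{
				Kind: "Pod",
				Name: pod.Name,
				UID:  pod.UID,
			},
		},
	}
	resp, err := c.CoreV1().ServiceAccounts(ns).CreateToken(ctx, sa, tr, metav1.CreateOptions{})
	if err != nil {
		return "", fmt.Errorf("token request for %s/%s: %w", ns, sa, err)
	}
	return resp.Status.Token, nil
}
```

The controller would then write the returned token to the per-node delivery path the proposal describes (e.g. a local volume the virt-handler pod mounts).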
I don't think that this is addressing the problem fully. There are two things which need to be fixed:
- being able to modify other nodes (that can be addressed like above)
- not being able to modify non-kubevirt-owned fields on the node (this includes not being able to change the spec or any label/annotation which is not strictly kubevirt-owned)
(2) still allows me to easily lure in privileged CP components and then just steal their credentials.
We could definitely create unique roles for each node, and in addition deploy a node webhook which would ensure that a service account can only modify its own node (that is also what the apiserver does internally for each kubelet), but that has the severe disadvantage that we would now need a required webhook on a core k8s resource. If our node webhook is ever unavailable, no kubelet, or anything else, can actually continue to modify the node. That's a no-go.
There are other positive side-effects if we fix it with the "shadow nodes":
- heartbeat can be moved to the shadow node, meaning we create much less load on all the controllers watching nodes.
- we finally have a clear API to express our node-needs and don't need to press our API into annotations and labels on nodes.
One may now think that virt-controller still can modify all nodes, and that is true. However, now we only have one component which can do that instead of n; furthermore, a token-distributor-controller which would create those RBAC roles has to have the exact same permissions as well, otherwise it would not be able to create the roles in the first place ...
alaypatel07 thanks for bringing this up.
Added the alternative to the proposal body.
Added to the goals.
@rmohr we have to address a third element:
- Prevent that the daemonset can modify other KubeVirt-owned CRs related to other nodes
IOW 3. Restrict the daemonset to only modify KubeVirt-owned CRs related to its own node
Yes, that third element is true, but I would qualify it as a follow-up step, not needed to address the CVE. Now virt-handler cannot blindly modify the node topology anymore. It can only modify hardware capability availability. It cannot lure in privileged or sensitive workloads anymore by just applying labels which they are waiting for, or by making a node a control-plane node.
The follow-up would be either cert-based auth for virt-handler, or maybe we can still use tokens and get the IP in the webhooks for verification. Something like this ...
Let's add it to the doc as a follow-up.
@fabiand AFAIK the only kubevirt CR virt-handler can update is virtualmachineinstances. Is this what you are referring to?
@rmohr I will add a Follow-up Steps section at the end of the doc and include:
- (the one just suggested) Restrict the virt-handler daemonset to only modify KubeVirt-owned CRs related to its own node
- Restrict the virt-handler daemonset from being able to update/patch other nodes
Is that alright with you?
@RamLavi yes - if this is the only one, then yes. And, generally speaking: handler should only be able to touch things which a) it should be permitted to touch and b) only if they are on its "own node".
ACK. DONE.
@xpivarc since it is clear from all the discussions that the ShadowNode API is the preferred design alternative, some nits/questions. PTAL, thanks
design-proposals/shadow-node.md
Outdated
## Goals
Mitigate the advisory by default.
Can we please expand this to add explicit goals? I think @rmohr summarized the goals clearly in the comment.
+1 please expand
DONE
design-proposals/shadow-node.md
Outdated
1. [run](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/node-labeller/node_labeller.go#L195) - irrelevant as this is read access
2. [patchNode](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/node-labeller/node_labeller.go#L254) - node labeller patching the node with labels for various information which is mostly static in nature: cpu model, hyperv features, cpu timer, sev/-es, RT...
This is triggered both by interval (3 min) and on changes of cluster config.
I wonder if the consumers for each of the above node labels and annotations should be mentioned alongside?
For example, the heartbeat in point 2. is consumed by the NodeController in virt-controller-app to check if node is healthy and raise an event.
I think adding the consumers will not necessarily add clarity here. I did rephrase to make it more high level.
design-proposals/shadow-node.md
Outdated
## Update/Rollback Compatibility
Virt-handler is going to continue writing to the Node object, therefore the update should be without compatibility issues.
When the upgrade rolls out, virt-controllers are upgraded first, then the daemonsets, followed by virt-api. There will be a time period where new virt-controllers (with ShadowNode code) will work with old virt-handlers and an old apiserver without the ShadowNode API. Can we mention here clear steps how:
- new virt-controllers will work with older virt-handlers and old virt-api
- new virt-controllers and newer virt-handlers will work with older virt-api
Hey @alaypatel07, I'll be taking over addressing the issue this proposal is trying to mitigate. I plan to take the same course this proposal is choosing. I will answer your question in the next reply, but if you think I should re-submit the proposal with the answers addressed in it - please tell me and I will do so.
In regard to your question, I think the upgrade order is a bit different. Please keep me honest, but according to this, the virt-operator upgrade order is:
CRDs --> RBACs --> Daemonsets (=virt-handler) --> Controller Deployments (=virt-controller) --> API Deployments (=virt-API)
Also note that:
- As mentioned above, the virt-handler will still keep trying to patch the node alongside the shadowNode, and at some point during the upgrade it would start being blocked due to the removal of the patch ability. In order to mitigate this, the node patch will start ignoring the RBAC "forbidden" error.
- Normally, new CRDs are blocked until the upgrade rollout is over. The shadowNode CRD is going to be excluded in order to avoid upgrade issues that could cause heartbeat timeout.
Having said the above, here is a specific mention of each upgrade scenario - during and after the upgrade itself:
During upgrade rollout:
- New CRD; old RBAC; old virt-handler; old virt-controller; old virt-API:
  - Old RBAC: no upgrade issue
  - Old virt-handler: no upgrade issue
  - Old virt-controllers: no upgrade issue
  - Old Virt-API: no upgrade issue
- New CRD; new RBAC; old virt-handler; old virt-controller; old virt-API:
  - Old virt-handler: during upgrade, the old RBACs are kept as backup until the upgrade is done. Hence, patching the node should not fail, so there is no upgrade issue.
  - Old virt-controllers: no upgrade issue, virt-controller RBAC is not changed due to this solution.
  - Old Virt-API: no upgrade issue.
- New CRD; new RBAC; new virt-handler; old virt-controller; old virt-API:
  - New virt-handler: shadowNode is written to but it is not synced to the node yet. However, virt-handler keeps patching the node directly, so there is no upgrade issue.
  - [same as before] Old controllers: no upgrade issue, virt-controller RBAC is not changed.
  - [same as before] Old Virt-API: no upgrade issue.
- New CRD; new RBAC; new virt-handler; new virt-controller; old virt-API:
  - New virt-handler: no upgrade issue. virt-handler patches both node and shadow node.
  - New controllers: will start getting shadowNode requests once they are created. No upgrade issue.
  - [same as before] Old Virt-API: no upgrade issue.
After upgrade rollout:
virt-handler will keep trying (and failing) to patch the node since the backup RBACs are now removed. Virt-handler will ignore these cases by ignoring RBAC "forbidden" errors. This behavior should be kept until we no longer support upgrades from the current release.
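To make the "ignore RBAC forbidden" behavior concrete, here is a minimal sketch (assumed helper name, not the actual virt-handler code) of how the node patch could tolerate the revoked permission while still surfacing other errors:

```go
package example

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// patchNodeBestEffort keeps patching the node during the transition period, but
// treats "forbidden" as the expected signal that the backup RBAC was removed and
// the ShadowNode is now the only write path.
func patchNodeBestEffort(ctx context.Context, c kubernetes.Interface, nodeName string, patch []byte) error {
	_, err := c.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	if apierrors.IsForbidden(err) {
		// Expected after upgrade: the permission was intentionally revoked, nothing to do.
		return nil
	}
	return err // nil on success, real error otherwise
}
```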
My suggestion, if possible, would be the following:
- Introduce a new status field in the KubeVirt CR
- Let virt-handlers continue patching the node as well
- After virt-handlers and virt-controller are rolled over, set in the kubevirt CR a status field indicating that shadow nodes are now the way to go
- virt-handler would pick it up right away over our config logic
- finally remove the RBAC permissions.
There may be some negligible overlap between virt-handlers picking up the config change and the RBAC removal, but that should in the worst case just result in a few RBAC permission errors, which we should then not mask.
@rmohr, I respectfully disagree. Using SelfSubjectAccessReview isn't just masking errors; it's a direct and intentional way of asking the RBAC server, "Can I patch the node?" rather than just hoping things are right. It's cheaper than patching the actual node and then failing, and it ensures actions are only taken when permitted, adding a layer of security.
I believe relying on the upgrade mechanism to signal a status field once ready could prove to be more costly in the long run, considering the potential corner scenarios mentioned earlier.
How about if I stop checking with the RBAC server after the first time I see that the node patch is forbidden (i.e. SelfSubjectAccessReview returns Allowed=false)? --> This way each virt-handler will stop re-sending SelfSubjectAccessReview requests soon after node-patching is disabled.
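For clarity, this is roughly the check being debated - a sketch only, with an assumed helper name; whether to do this at all is exactly what is under discussion here:

```go
package example

import (
	"context"

	authorizationv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// canPatchNode asks the API server whether the current service account is still
// allowed to patch the given node; once this returns false, virt-handler would
// stop issuing node patches (and stop re-checking).
func canPatchNode(ctx context.Context, c kubernetes.Interface, nodeName string) (bool, error) {
	ssar := &authorizationv1.SelfSubjectAccessReview{
		Spec: authorizationv1.SelfSubjectAccessReviewSpec{
			ResourceAttributes: &authorizationv1.ResourceAttributes{
				Verb:     "patch",
				Resource: "nodes",
				Name:     nodeName,
			},
		},
	}
	resp, err := c.AuthorizationV1().SelfSubjectAccessReviews().Create(ctx, ssar, metav1.CreateOptions{})
	if err != nil {
		return false, err
	}
	return resp.Status.Allowed, nil
}
```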
@rmohr, I respectfully disagree. Using SelfSubjectAccessReview isn't just masking errors; it's a direct and intentional way of asking the RBAC server, "Can I patch the node?" rather than just hoping things are right.
I am not sure. I tried to express a knowledge gap which we try to overcome by checking for the forbidden error or by the SelfSubjectAccessReview. In both cases we don't know if it is intentional and correct that we don't have the permission. With a clear config switch we know: now I should do both, and from now on I should do one. Having a very brief period of eventual consistency is fine, where maybe one or two update requests would fail (like a low amount of conflicting REST calls is fine too).
The information gap is what I consider "masking". Therefore the reviews add up to quite a big front of additional uncertainties without making anything better (actually imo much more complex).
It's cheaper than patching the actual node and then failing, and it ensures actions are only taken when permitted, adding a layer of security.
I am not sure what "cheap" means here, and giving additional permissions which may leak info which we did not have access to before opens security questions which I don't even want to have to think about.
The information gap is what I consider "masking". Therefore the reviews add up quite a big front of additional uncertainties without making anything better (actually imo much more complex).
ah you and I interpret "masking errors" differently. I get what you mean now. "masking errors" in my book is just the fact that we are ignoring a specific error, which is in general no good practice.
I am not sure what "cheap" means here, and giving additional permissions which may leak info which we did not have access to before opens security questions which I don't even want to have to think about.
I take back the "cheap" part - we are not sending the entire node object on the SET action, so the difference should be negligible.
I get your concern in adding a new permission that may leak info in the future.
@rmohr I am still against using the upgrade as a sign to know if we can stop patching the node. The upgrade process is complex and stateful, and relying on it may introduce corner cases. In a way, Lubo's "try and ignore if it fails on a specific error" is much more stateless and clean.
Can we agree to continue with his approach?
@alaypatel07 I updated the upgrade scenarios on the proposal's body as you suggested.
@rmohr I am still against using the upgrade as a sign to know if we can stop patching the node. The upgrade process is complex and stateful, and relying on it may introduce corner cases. In a way, Lubo's "try and ignore if it fails on a specific error" is much more stateless and clean.
Can we agree to continue with his approach?
It should basically be a check like this:
if freshInstall && supportsShadowNodes() {
    kubevirtCR.Status.UseShadowNode = true
}
if fullyInstalled && !switchedToShadowNode && supportsShadowNodes() {
    kubevirtCR.Status.UseShadowNode = true
} else {
    kubevirtCR.Status.UseShadowNode = false
}
where supportsShadowNodes() can basically just check if the install strategy of the current version includes the shadow node CR, to ensure that the version we are upgrading/downgrading to has them.
Hey @xpivarc, I have a question.
If we create a shadow node that the virt-handler writes to, then what prevents a bad person that took control over virt-handler to write to shadow nodes of other Nodes? Couldn't this bad person still take control of the cluster by manipulating it?
To a degree yes, but it can only influence kubevirt.io/ scheduling constraints, which solves the most important problem: that one can lure in CP workloads. VMs can still be tricked. If (and I am not sure if we get that info on an AdmissionReview) we also get the IP address in userinfo when eventually switching to cert-based auth (which would contain the client IP like kubelet certs), we can also enforce by webhook that it can only write to its own node in a next step. That would now be possible because the webhook would only block our shadow nodes and not a core k8s entity.
@RamLavi
/hold I think there is no clear agreement if this approach is the right one. I'm putting a hold on it to signal that this requires more discussion and an agreed design. Important as @RamLavi is working on an implementation.
Did we explore what other projects did in order to tackle this problem?
What was the feedback from K8s SIG API (I think this idea got discussed there)?
Not that I am proposing it, but I wonder if tools like OPA would be able to protect us from issues like this.
design-proposals/shadow-node.md
Outdated
## Goals
Mitigate the advisory by default.
+1 please expand
design-proposals/shadow-node.md
Outdated
Kubevirt/Kubevirt
# Design
design-proposals/shadow-node.md
Outdated
2. [patchNode](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/node-labeller/node_labeller.go#L254) - node labeller patching the node with labels for various information which is mostly static in nature: cpu model, hyperv features, cpu timer, sev/-es, RT...
This is triggered both by interval (3 min) and on changes of cluster config.
## Shadow node
@fabiand I was not present during the discussions on how other projects solved the issue. For obvious reasons, this is not something a project will advertise, so perhaps @xpivarc and @rmohr could shed more light on this. However, I did come across what I think is cilium's fix on the matter, and it looks like they went with the CRD approach, same as us.
FWIW I'd expect any author of such a feature to do some field research. Thus, yes, maybe you joined later, but that should not stop you from doing your own research :)
design-proposals/shadow-node.md
Outdated
2. [patchNode](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/node-labeller/node_labeller.go#L254) - node labeller patching the node with labels for various information which is mostly static in nature: cpu model, hyperv features, cpu timer, sev/-es, RT...
This is triggered both by interval (3 min) and on changes of cluster config.
## Shadow node
@xpivarc IIRC this idea was taken to kube sig-api - has it? What was the outcome of their discussion?
design-proposals/shadow-node.md
Outdated
Kubevirt/Kubevirt
# Design
(This should be brief and concise. We want just enough to get the point across)
To me a high level design is missing here.
It could be something like: The security issue is being addressed by introducing ShadowNodes for every Kubernetes node. virt-handlers are only interacting with ShadowNodes - which allows us to drop any node access privileges from the handler's SA - a controller will then map these ShadowNode requests to the real Kubernetes nodes, concentrating the required permissions on the controller.
thanks! DONE.
Change: Addressing reviewers' comments.
design-proposals/shadow-node.md
Outdated
* [same as before] Old Virt-API: no upgrade issue.
### After upgrade rollout:
Virt-handler will keep trying (and failing) to patch the node since the backup RBACs are now removed. Virt-handler will ignore these cases by ignoring RBAC "forbidden" errors. This behavior should be kept until we no longer support upgrades from the current release.
This reads like you would never stop trying to patch. In case I did not misread this, let's suppose we have 100 nodes, we update the node every 10s (can't remember the default interval), and we may be able to remove this code path in 12 months, that means we would now patch and fail for 12 months with 10 calls/second. So that is definitely not an option. Just think about metrics collection for instance which will record the failed call on virt-handler as well as on the API server.
The only way would be to stop patching when you get the "forbidden" error, but then we have again the issue, that we can't differentiate a config error from a real desire to no longer patch it.
IIRC interval of heartbeat is 1m, and nodeLabeller is 3m
Indeed stopping after first "forbidden" error is possible, but you are right that we can't differentiate a config error from a real desire to no longer patch it.
kind: ShadowNode
metadata:
  name: <Equivalent to existing name>
  # Labels and Annotations only
I think this is a great opportunity to create a real spec for all that info. The reason we, for instance, use metadata for the heartbeat is that we don't have a place on the k8s node object to properly express that information. Now we can.
Added to the pros list of the CRD approach.
Yes, having a typed spec with a well defined list of things that we want on the node, looks to be an opportunity here.
@rmohr how about if we make it a bit wider? Instead of setting up a CRD for this very specific node-sync purpose, define a virt-handler CRD that would hold this status field nodeInfo (such as the labels and annotations) and also future stuff that could be used in future virt-handler communication with other components...
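To illustrate the direction, a rough sketch of what a typed status could look like instead of free-form labels/annotations; none of these field names are agreed API, they are purely illustrative:

```go
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ShadowNode is a sketch of a typed per-node object written by virt-handler.
type ShadowNode struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Status ShadowNodeStatus `json:"status,omitempty"`
}

// ShadowNodeStatus shows how today's node labels/annotations could become real fields.
type ShadowNodeStatus struct {
	// LastHeartbeatTime would replace the kubevirt.io/heartbeat annotation.
	LastHeartbeatTime metav1.Time `json:"lastHeartbeatTime,omitempty"`
	// Schedulable would replace the kubevirt.io/schedulable label.
	Schedulable bool `json:"schedulable,omitempty"`
	// NodeLabels/NodeAnnotations carry whatever still has to be synced verbatim
	// to the real node by the controller.
	NodeLabels      map[string]string `json:"nodeLabels,omitempty"`
	NodeAnnotations map[string]string `json:"nodeAnnotations,omitempty"`
}
```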
I am fine either way.
@RamLavi will you be updating this spec according to Roman's comment about a real spec?
Hey, I wasn't online this past week (was sick, now better), so I'm not sure what spec update is needed. Will take a closer look later today.
design-proposals/shadow-node.md
Outdated
Virt-handler will keep read access to nodes but will lose write access. Virt-handler will instead write to the ShadowNodes object, and this will subsequently be selectively copied to the node by a new controller. This controller will be the only entity who writes to the node object.
Pros:
- Since the new controller runs virt-controller context which only runs on the control-place nodes, the risk of taking over the cluster is greatly mitigated.
Suggested change:
- Since the new controller runs virt-controller context which only runs on the control-place nodes, the risk of taking over the cluster is greatly mitigated.
+ Since the new controller runs virt-controller context which only runs on the control-plane nodes, the risk of taking over the cluster is reduced.
Changed.
design-proposals/shadow-node.md
Outdated
Pros:
- Avoid introducing a new CRD to the project.
- Allows for future expansions.
Do you have specific expansions in mind? This statement seems generic.
I would suggest to add here a security benefit: virt-handler does not need to access kube api at all, and cannot modify unrelated nodes or shadow nodes. virt-controller does all the risky work.
Do you have specific expansions in mind? This statement seems generic.
it is. Will remove.
I would suggest to add here a security benefit: virt-handler does not need to access kube api at all, and cannot modify unrelated nodes or shadow nodes. virt-controller does all the risky work.
Added.
The research done as part of this proposal is impressive.
Thank you for sharing all the options and details explored.
I learned a lot from reviewing it.
## Non Goals
Redesign/optimize existing virt-handler features to not update/patch the node in the first place.
IMO, this is the proper solution for the long run, and the goal KubeVirt should strive for in order to be secure.
Several of the existing solutions that require patching the node should be reevaluated and optimized to not require access to nodes (or other cluster-level core components).
That said, I understand this may take more resources and time to reach, therefore the solutions provided in this proposal are still required. I just hope to see a continuation of this effort.
I agree that some of the solutions made here should occur anyway in parallel, as part of the goal of keeping kubevirt working properly.
However, these suggestions deserve a separate proposal, and for this specific CVE they are not strictly required.
I agree with both of you :) it should be explored but a bit out-of-scope.
One thing we should consider is whether the CPU information on the nodes is essential (I think the vast majority of our labels are dedicated to CPU info). After all, this information is fixed by nature and stays the same for each node, maybe except for very rare cases. Currently we use these labels as selectors, but we can save this information elsewhere (e.g. a configmap?), then virt-controller would target nodes by name (or the hostname label?) before creating the launcher pod.
But as you said it's out-of-scope for this PR, just something to keep in mind.
3. Virt-handler/node labeller [labels](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/node-labeller/node_labeller.go#L254) various information which is mostly static in nature: cpu model, hyperv features, cpu timer, sev/-es, RT...
* This is triggered both by interval (3 min) and on changes of cluster config. However, as these labels usually don't change, this intervalled reconciliation is usually a no-op.
## Solution option #1: Use Validating Admission Policy
+1
This seems the most suitable solution out of the list presented here.
It will be great to see a PoC for it.
- (same set of parameters for annotations)
- [matchConditions](https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/#matching-requests-matchconditions) rules (only if all are "true" then the validation will be evaluated):
  - `isPartitionedServiceAccount` == true
- [validation](https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/#creating-a-validatingadmissionpolicy) rules (written in pseudocode to keep things readable):
Is it possible to express something like: "reject-all-changes except these"?
I mean, without the need to compare other labels/annotation we do not care about.
Well, it would be nice if we only got the "changes" we made, but it doesn't work like that AFAIK: when sending a POST request (this is true for both webhooks and validatingAdmissionPolicies) you can add validation rules to validate these entities (such as object, oldObject, request - see more details here to learn how they look).
So it's logical that you first have to gather the kubevirt-labels on both {object, oldObject} (or the non-kubevirt-owned, as shown), and then make sure that we "reject-all-changes except these".
Please review this CEL example that models the validations that the validatingAdmissionPolicy will have here.
You can play with it and if you find a better way to do it I'm all ears of course :)
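To make this more tangible, here is a rough sketch of such a policy using the admissionregistration/v1 Go types (GA in k8s 1.30). The policy name, the service-account match and the CEL expressions are illustrative approximations of the rules above, not the final ones, and the label rule assumes labels are set on both old and new objects:

```go
package example

import (
	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Illustrative CEL: every non-kubevirt label must keep its value, and no
// non-kubevirt label may be added or removed.
const nonKubevirtLabelsUnchanged = `
oldObject.metadata.labels.all(k, k.contains("kubevirt.io") ||
  (k in object.metadata.labels && object.metadata.labels[k] == oldObject.metadata.labels[k])) &&
object.metadata.labels.all(k, k.contains("kubevirt.io") || k in oldObject.metadata.labels)`

func nodeRestrictionPolicy() *admissionregistrationv1.ValidatingAdmissionPolicy {
	fail := admissionregistrationv1.Fail
	return &admissionregistrationv1.ValidatingAdmissionPolicy{
		ObjectMeta: metav1.ObjectMeta{Name: "kubevirt-node-restriction"},
		Spec: admissionregistrationv1.ValidatingAdmissionPolicySpec{
			FailurePolicy: &fail,
			MatchConstraints: &admissionregistrationv1.MatchResources{
				ResourceRules: []admissionregistrationv1.NamedRuleWithOperations{{
					RuleWithOperations: admissionregistrationv1.RuleWithOperations{
						Operations: []admissionregistrationv1.OperationType{admissionregistrationv1.Update},
						Rule: admissionregistrationv1.Rule{
							APIGroups:   []string{""},
							APIVersions: []string{"v1"},
							Resources:   []string{"nodes"},
						},
					},
				}},
			},
			MatchConditions: []admissionregistrationv1.MatchCondition{{
				// Stands in for isPartitionedServiceAccount; the username is an assumption.
				Name:       "is-virt-handler",
				Expression: `request.userInfo.username == "system:serviceaccount:kubevirt:kubevirt-handler"`,
			}},
			Validations: []admissionregistrationv1.Validation{
				{Expression: "object.spec == oldObject.spec", Message: "virt-handler may not modify node spec"},
				{Expression: nonKubevirtLabelsUnchanged, Message: "virt-handler may only change kubevirt-owned labels"},
			},
		},
	}
}
```

A ValidatingAdmissionPolicyBinding would still be needed to put the policy into effect, and real rules would also have to guard against unset labels (e.g. with has()) and cover annotations the same way.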
design-proposals/shadow-node.md
Outdated
validationAdmissionPolicy [going GA](https://kubernetes.io/blog/2024/03/12/kubernetes-1-30-upcoming-changes/) on k8s 1.30, so it should be OK for deployment on Kubevirt main and next minor release branches.
However, since kubevirt needs to be supported up to 2 k8s versions back, then the deployment of these new objects needs to be conditioned by a new feature gate "node restriction". HCO main and v1.3 branches will enable this feature gate by default.
Can we program the component that deploys Kubevirt (virt-operator?) to condition what and how to setup the cluster based on its version/capability?
Clusters that can support the new policy (e.g. in v1.30 or in v1.29 where the K8S FG is enabled), will have Kubevirt deployed in one way and in other clusters in the existing manner.
The condition will be dropped once we drop support for K8S 1.29.
Having a FG protecting this change (or any other) will drag this for at least 2 version (6 months), requiring users to act on making their cluster more secure. I hope we can take that burden on ourself by just giving them the best possible option given the setup.
If this is not possible, could you please elaborate on why not?
I'm not sure it's possible, will need to check.
But theoretically speaking, isn't it problematic to introduce such a dependency on a k8s version? It sounds unusual..
After thinking it through, I agree with you.
A FG is not a good fit for this, since this is not something we want a user (=cluster-admin) to need to switch on, as this is not a user-facing feature.
We have 2 alternatives then:
- add logic at runtime to check the k8s version and fail only if the version is v1.30.0 or above
- always deploy and ignore on fail.
I prefer the former. Will fix the proposal accordingly.
```
!`requestHasNodeName` OR (object.Name == request.nodeName)
```
However, this field is [beta](https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#additional-metadata-in-pod-bound-tokens) on k8s 1.30 and is currently protected by the [ServiceAccountTokenPodNodeInfo](https://github.com/kubernetes/kubernetes/blob/c4bce63d9886e5f1fc00f8c3b5a13ea0d2bdf772/pkg/features/kube_features.go#L753) k8s feature gate.
I guess this is considered a con. Can you add it as such?
Added.
Thank you @RamLavi and @xpivarc for this proposal! It's indeed impressive!
I tend to vote +1 for the Validating Admission Policy approach. I think it would be the most straightforward and elegant solution to this CVE.
However, I deeply agree with @EdDev that as a follow-up to this effort it would be great to see a discussion on how to reduce our dependency on the node object in general. As most of our node info is static anyhow, I think we can find better ways to store most of it (e.g. a config map). The heartbeat issue, as @RamLavi said before, can be solved otherwise with a readiness/liveness probe.
design-proposals/shadow-node.md
Outdated
2. virt-handler/heartbeat:
* [labels](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/heartbeat/heartbeat.go#L96) the node with `kubevirt.io/schedulable=false` when the stop channel closes (i.e. virt-handler exits).
* [labels and annotates](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/heartbeat/heartbeat.go#L139) the node, updating NodeSchedulable, CPUManager, KSMEnabledLabel labels and VirtHandlerHeartbeat KSMHandlerManagedAnnotation annotations once per minute.
nit: typo
Suggested change:
- * [labels and annotates](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/heartbeat/heartbeat.go#L139) the node, updating NodeSchedulable, CPUManager, KSMEnabledLabel labels and VirtHandlerHeartbeat KSMHandlerManagedAnnotation annotations once per minute.
+ * [labels and annotations](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/heartbeat/heartbeat.go#L139) the node, updating NodeSchedulable, CPUManager, KSMEnabledLabel labels and VirtHandlerHeartbeat KSMHandlerManagedAnnotation annotations once per minute.
Thanks! DONE
|---|---|---|
| object.spec == oldObject.spec | virt-handler user cannot modify spec of the nodes | |
| object.status == oldObject.status | virt-handler user cannot modify status of the nodes | |
| object.(not labels/annotations/resourceVersion/managedFields) == oldObject.(not labels/annotations/resourceVersion/managedFields) | virt-handler user can only change allowed sub-metadata fields. | |
is `managedFields` necessary?
IIUC, managed fields is a legitimate way to configure the object, therefore it shouldn't be restricted..
So virt-handler can change the fields managed by a node object? When would this be valuable?
hmmm I'm not so sure anymore it is needed. @xpivarc - can you re-explain why this is needed?
design-proposals/shadow-node.md
Outdated
### Backwards compatibility:
validationAdmissionPolicy [going GA](https://kubernetes.io/blog/2024/03/12/kubernetes-1-30-upcoming-changes/) on k8s 1.30, so it should be OK for deployment on Kubevirt main and next minor release branches.
However, since kubevirt needs to be supported up to 2 k8s versions back, then the deployment of these new objects needs to be conditioned by the k8s version deployed.
The virt-operator deployment logic should be:
```
if k8s-ver >= v1.30
then
  deploy the objects
else
  deploy, but only log on failure
```
Once k8s v1.29 and below is not supported on kubevirt, this logic should be removed.
What if Kubevirt is being installed on k8s 1.29 (where validationAdmissionPolicy is beta), but the feature gate is enabled? I think it's reasonable to deploy the objects in such a scenario, and it might even be an acceptable workaround for Kubevirt users until they're able to upgrade their k8s version.
Perhaps we can take a different approach of using dry-run to determine if the object is known or not.
That is:
- Create the objects with a dry-run option. Check if an error is returned.
- If a nil error is returned - re-create without the dry-run option.
- If an "object unknown" error is returned (this one?), log and continue.
- If a different error is returned - fail.
We can obviously drop the second step once we drop support for k8s 1.29.
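A rough sketch of that probe under stated assumptions: the typed client call is the client-go 1.30+ shape, while mapping the "object unknown" case to a NotFound error is my assumption and would need verifying:

```go
package example

import (
	"context"

	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deployPolicyIfSupported probes the API with a server-side dry-run first, so we
// only hard-fail when the ValidatingAdmissionPolicy API is actually served.
func deployPolicyIfSupported(ctx context.Context, c kubernetes.Interface, p *admissionregistrationv1.ValidatingAdmissionPolicy) error {
	dryRun := metav1.CreateOptions{DryRun: []string{metav1.DryRunAll}}
	_, err := c.AdmissionregistrationV1().ValidatingAdmissionPolicies().Create(ctx, p, dryRun)
	switch {
	case err == nil:
		// API is served and the object validates: create it for real.
		_, err = c.AdmissionregistrationV1().ValidatingAdmissionPolicies().Create(ctx, p, metav1.CreateOptions{})
		return err
	case apierrors.IsNotFound(err):
		// Assumed mapping of the "object unknown" case (pre-1.30 / FG disabled): log and continue.
		return nil
	default:
		return err
	}
}
```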
Hey, I'm not sure why we need to dry run - if the object is not installed it'll not deploy anyway, right?
But we do want to distinguish two behaviors:
For 1.29-, we want to deploy (or at least try); if we get an error, log it.
For 1.30+, we do want to fail on errors, and not only log.
Ignoring an "object not found" error in the general case is wrong imo.
Ignoring an "object not found" error in the general case is wrong imo
I agree, but relying on the k8s version is (as someone also said here 😉) also "unusual" :) The question is what's worse.
But TBH I initially thought that for 1.29- we were going to not even deploy the objects, which sounded problematic. If you're going to simply try to create the objects on a best-effort basis and log on error, it sounds better to me.
Thanks!
```
!`requestHasNodeName` OR (object.Name == request.nodeName)
```
However, this field is [beta](https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#additional-metadata-in-pod-bound-tokens) on k8s 1.30 and is currently protected by the [ServiceAccountTokenPodNodeInfo](https://github.com/kubernetes/kubernetes/blob/c4bce63d9886e5f1fc00f8c3b5a13ea0d2bdf772/pkg/features/kube_features.go#L753) k8s feature gate.
As I suggest in one of the other comments, WDYT about supporting it on a best-effort basis? Admins that run older k8s versions might want to consider using the FG as a workaround.
I answered in that comment, but note that's a minor implementation detail; we can continue this important discussion on the future PR.
## Solution option #2: ShadowNode CRD
The security issue is being addressed by introducing a new lightweight CRD: "ShadowNode" for each Kubernetes node.
Virt-handler will keep read access to nodes but will lose write access. Virt-handler will instead write to the ShadowNodes object, and this will subsequently be selectively copied to the node by a new controller. This controller will be the only entity who writes to the node object.
Actually, I'm considering whether this solution could be viable, particularly if it's used alongside the scheduling readiness feature of k8s, which is expected to be GAed in k8s 1.30.
The flow I've been thinking about goes like this:
- virt-handler is deployed and sets the ShadowNode objects with the proper annotations and labels, just like you mentioned.
- Whenever a VMI is created, virt-api will mutate the virt-launcher pod with a dedicated SchedulingGate.
- virt-controller will detect virt-launcher pods with the SchedulingGate and do the following for each launcher pod (see the sketch after this comment):
  - Calculate suitable nodes for scheduling from the kubevirt perspective, let's mark it as <nodeList>.
  - Remove the SchedulingGate from the launcher pod and add node affinity to allow scheduling only to nodes in <nodeList>.
By leveraging scheduling readiness, we might be able to avoid node modifications entirely within KubeVirt, as suggested in this comment, and it will also allow us to remove all the boilerplate and implementation details of KubeVirt from the nodes. Additionally, it might address concerns about users with node access gaining visibility into hardware information.
Any thoughts?
FYI @vladikr
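A small sketch of the gate/affinity mechanics described above; the gate name is made up for illustration, and the affinity uses one term per node since matchFields accepts a single value per requirement (which matches the generated manifest further down the thread):

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
)

// Hypothetical gate name; not an existing KubeVirt constant.
const nodePlacementGate = "kubevirt.io/node-placement"

// gateLauncherPod: what virt-api would add while mutating the launcher pod.
func gateLauncherPod(pod *corev1.Pod) {
	pod.Spec.SchedulingGates = append(pod.Spec.SchedulingGates,
		corev1.PodSchedulingGate{Name: nodePlacementGate})
}

// releaseLauncherPod: what virt-controller would do once <nodeList> is computed -
// pin the pod to those nodes and drop the gate so the scheduler can proceed.
func releaseLauncherPod(pod *corev1.Pod, nodeList []string) {
	terms := make([]corev1.NodeSelectorTerm, 0, len(nodeList))
	for _, name := range nodeList {
		terms = append(terms, corev1.NodeSelectorTerm{
			MatchFields: []corev1.NodeSelectorRequirement{{
				Key:      "metadata.name",
				Operator: corev1.NodeSelectorOpIn,
				Values:   []string{name}, // matchFields allows one value per requirement
			}},
		})
	}
	pod.Spec.Affinity = &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{NodeSelectorTerms: terms},
		},
	}

	kept := pod.Spec.SchedulingGates[:0]
	for _, g := range pod.Spec.SchedulingGates {
		if g.Name != nodePlacementGate {
			kept = append(kept, g)
		}
	}
	pod.Spec.SchedulingGates = kept
}
```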
@Barakmor1 I think this is a very interesting idea! I'm not 100% sure we need a gate.
Virt-controller is the one that creates the pods, so it can calculate the list of suitable nodes before posting the Pod - and do the same during migration.
One possible issue is (as @xpivarc pointed out offline) that this list may become extensive.
@rmohr FYI
@Barakmor1 I think this is a very interesting idea! I'm not 100% sure we need a gate.
Virt-controller is the one that creates the pods, so it can calculate the list of suitable nodes before posting the Pod - and do the same during migration.
But if no node matches a launcher, it will remain pending forever, as the <nodeList> will remain empty. Alternatively, it may be possible to sync the affinity when a launcher is in a scheduling state.
One possible issue is (as @xpivarc pointed out offline) that this list may become extensive.
I think that the maximum number of nodes is 5000. I wonder if affinity can handle this. Also, we could set a maximum number of nodes for the affinity and choose random nodes when we reach the maximum.
My 2 cents: +1 to all efforts to use alternative ways to avoid the need to patch the node.
If we could find good alternatives to all the places virt-handler is patching the node - then we could remove the patch permission entirely.
But even if we remove the node-labeller component we currently still have other places where we patch the node (for example heartbeat) and until we remove them all - the need to restrict the node patches is still relevant.
But even if we remove the node-labeller component we currently still have other places where we patch the node (for example heartbeat) and until we remove them all - the need to restrict the node patches is still relevant.
Can't we patch the shadowNode with heartbeat labels as well?
@vladikr @xpivarc I tested it and it seems like Kubernetes can handle a pod with affinity for 5000 nodes. This is how I generated the manifest:
#!/bin/bash
cat <<EOF > pod_with_node_affinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod-with-resources
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
EOF
for i in {1..5000}; do
  echo "        - matchFields:" >> pod_with_node_affinity.yaml
  echo "          - key: metadata.name" >> pod_with_node_affinity.yaml
  echo "            operator: In" >> pod_with_node_affinity.yaml
  echo "            values:" >> pod_with_node_affinity.yaml
  echo "            - node${i}" >> pod_with_node_affinity.yaml
done
echo "  containers:" >> pod_with_node_affinity.yaml
echo "  - name: my-container" >> pod_with_node_affinity.yaml
echo "    image: busybox" >> pod_with_node_affinity.yaml
echo "    command: [\"sleep\", \"3600\"]" >> pod_with_node_affinity.yaml
echo "    resources:" >> pod_with_node_affinity.yaml
echo "      requests:" >> pod_with_node_affinity.yaml
echo "        memory: \"256Mi\"" >> pod_with_node_affinity.yaml
echo "        cpu: \"200m\"" >> pod_with_node_affinity.yaml
Edit: I found that nodeAffinity cannot be updated in the Pending state, so it would probably be too complex to support, as we might have to delete pending launchers if the nodes change :\
But even if we remove the node-labeller component we currently still have other places where we patch the node (for example heartbeat) and until we remove them all - the need to restrict the node patches is still relevant.
Can't we patch the shadowNode with heartbeat labels as well?
We can, but I believe we can solve it in a simpler, more k8s approach. Will issue a proposal about it in the near future.
Note: This proposal is now mature enough and is waiting for approval.
Thanks for the update Ram. @RamLavi please resolve all open conversations in this document in order to move forward.
Technically speaking, I don't think I can resolve conversations I didn't open, but I made sure I answered them.
Signed-off-by: Ram Lavi <[email protected]>
Thanks a lot @RamLavi for this proposal! /lgtm
In general, the… However, I'm not blocking this - especially if @iholder101 and @jean-edouard are happy to support this.
Hey @vladikr!
@Barakmor1 and I were discussing this approach offline. The steps @Barakmor1 was mentioning above are:
Unfortunately, the last bullet under (3) is not possible, since after a pod is created, we simply cannot set new affinities. As you said rightfully, we can skip the step of adding a scheduling gate. But even then we're facing corner cases, such as:
In this (and similar) cases we would have to delete the pod and re-create it. Bottom line is: to support such an approach we'd have to introduce a mechanism that monitors the scheduling pods and re-creates them when necessary.
That's a fair point.
I 100% agree. This would definitely be ideal. The sad truth is that the vast vast majority of our ~200 labels are completely static in nature, and it's very tempting to save them elsewhere (either the shadow node CRD or even a configmap). What concerns me around this is that we essentially have two options AFAICT:
@vladikr what are your thoughts regarding that? Again, if you think it's a reasonable path forward, let's discuss it!
I want to stress again that I am all for any change that will reduce patches to the node. IMO virt-handler should not patch the node at all. But the fact is it does patch it (even if we remove node-labeller), and in order to solve the CVE, the most direct way to solve it is to add a VAP.
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: vladikr
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/hold cancel
Awesome, 10x 🚀 @xpivarc we still need to follow up to limit a handler's powers to the workloads on its own node. Right? Can VAP help here as well?
Yes, the VAP could be used here, but it is not a strict requirement as we still have webhooks for the resources. Therefore it is a matter of extending them to check that the right handler is making the request.
This PR is addressing a security issue [0][1] that exists because virt-handler is a daemonset that can patch the node. This proposal presents possible approaches to reduce virt-handler's node-patching abilities in order to mitigate this issue.
[0] GHSA-cp96-jpmq-xrr2
[1] kubevirt/kubevirt#9109