
Commit

Addressing reviewers' comments.
Signed-off-by: Ram Lavi <[email protected]>
RamLavi committed Mar 4, 2024
1 parent c1b779a commit d356627
146 changes: 120 additions & 26 deletions design-proposals/shadow-node.md

## Goals

Mitigate the advisory by default.
- Prevent a malicious user who has taken over a Kubernetes node where the virt-handler daemonset is deployed from exploiting the daemonset's RBAC permissions to elevate privileges beyond that node, potentially up to full privileged access to the whole cluster.
- Prevent virt-handler from modifying non-kubevirt-owned fields on the node. This includes not being able to change the spec or any label/annotation that is not strictly kubevirt-owned.

## Non Goals

Redesign/optimize existing virt-handler features to not update/patch the node in the first place.

## Definition of Users

Not user-facing change
Kubevirt/Kubevirt

# Design
(This should be brief and concise. We want just enough to get the point across)

## High level design

The security issue is being addressed by introducing a new light weight CRD: ShadowNodes for each Kubernetes node.
Virt-handler will keep read access to nodes but will lose write access. Virt-handler will instead write to the ShadowNode object, and this will subsequently be selectively copied to the node by a new controller. This controller will be the only entity that writes to the node object (an illustrative RBAC split is sketched after the pros and cons below).

Pros:
- Since the new controller runs in the virt-controller context, which only runs on the control-plane nodes, the risk of taking over the cluster is greatly mitigated.
- A malicious user can't exploit virt-handler's RBAC permissions to manipulate non-kubevirt params/labels.
- The new controller will be able to filter exactly which information it will proxy to the node and which it will not.
- Adding a new CRD is a good opportunity to create a real spec for all that info. For instance, the reason we use heartbeat in metadata is that we don't have a place on the k8s node object to properly express information. Now we can.
- Option to reduce overall writes to the k8s node (by keeping the heartbeat on the ShadowNode). See more in Solution details.

Cons:
- Resource consumption: albeit a simple and lightweight CRD (it will only contain metadata), adding ShadowNodes creates a 1:1 ShadowNode-to-node resource ratio, thereby increasing:
  - the number of API calls kubevirt makes.
  - the storage kubevirt takes.
  This concern is addressed in the Scalability section.
- By adding virt-controller as a "proxy" agent, we now depend on its availability. This could cause an increase in heartbeat timeouts in corner cases.
- To a degree, it is still possible to manipulate pods by causing VMs to migrate from other nodes.
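
The following sketch illustrates the permission split described above, expressed as RBAC policy rules in Go (KubeVirt's RBAC objects are generated from Go code by virt-operator). The API group `kubevirt.io` and the resource name `shadownodes` are assumptions for illustration; the real rules would live in virt-operator's RBAC generators.

```go
// Illustrative only: not the actual generated rules.
package rbac

import rbacv1 "k8s.io/api/rbac/v1"

// virtHandlerNodeRules sketches the intended permission split for virt-handler:
// read-only access to nodes, write access only to the new ShadowNode resource.
var virtHandlerNodeRules = []rbacv1.PolicyRule{
	{
		APIGroups: []string{""},
		Resources: []string{"nodes"},
		Verbs:     []string{"get", "list", "watch"}, // no update/patch anymore
	},
	{
		APIGroups: []string{"kubevirt.io"}, // assumed group for the new CRD
		Resources: []string{"shadownodes"},
		Verbs:     []string{"get", "list", "watch", "update", "patch"},
	},
}
```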

## Solution details
In order to ease review and understanding of what needs to be changed, this first section describes the different usages of the Node object in virt-handler.

### Node writers on virt-handler
Kubevirt sets ~170 labels on each node. These are set by the following entities:
1. virt-handler [labels](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/cmd/virt-handler/virt-handler.go#L185) the node unschedulable for VMs at the start and the termination (by SIGTERM only) of the virt-handler process to indicate (best-effort) that the node is not ready to handle VMs.

2. virt-handler/heartbeat:
   * [labels](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/heartbeat/heartbeat.go#L96) the node unschedulable for VMs when the stop channel closes (i.e. virt-handler exits).
   * [labels and annotates](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/heartbeat/heartbeat.go#L139) the node, updating the NodeSchedulable, CPUManager and KSMEnabledLabel labels and the VirtHandlerHeartbeat and KSMHandlerManagedAnnotation annotations once per minute (an illustrative payload is sketched after this list).

3. virt-handler/node labeller [labels](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/node-labeller/node_labeller.go#L254) the node with various information that is mostly static in nature: cpu model, hyperv features, cpu timer, sev/-es, RT...
   * This is triggered both by interval (3 min) and on changes of cluster config.
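
For illustration, a minimal sketch of the kind of payload the heartbeat writes once per minute is shown below. The label/annotation keys and the merge-patch shape are assumptions derived from the description above, not the exact virt-handler implementation; under this proposal the same payload would target the ShadowNode instead of the Node.

```go
package heartbeat

import (
	"encoding/json"
	"strconv"
	"time"
)

// heartbeatPatch builds a merge-patch-style payload carrying the schedulability
// label and the heartbeat timestamp annotation. Key names are illustrative.
func heartbeatPatch(schedulable bool, now time.Time) ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"labels": map[string]string{
				"kubevirt.io/schedulable": strconv.FormatBool(schedulable),
			},
			"annotations": map[string]string{
				"kubevirt.io/heartbeat": now.Format(time.RFC3339),
			},
		},
	}
	return json.Marshal(patch)
}
```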

### Node writers on virt-controller
1. virt-controller/node-controller [labels](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-controller/watch/node.go#L318) the node unschedulable for VMs on heartbeat timeout.
2. virt-controller/nodetopologyupdater [labels](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-controller/watch/topology/nodetopologyupdater.go#L63) the node with supported topology frequencies.

## Shadow node

The proposed solution is to introduce a new CRD, "ShadowNode", that will shadow each node. Every entity that currently writes to the node will write to the ShadowNode instead, and the new controller will have the responsibility to sync allowed information to the Node.

## API Examples
```yaml
apiVersion: <apiversion>
kind: ShadowNode
metadata:
  name: <Equivalent to existing name>
  # Labels and Annotations only
spec: {} # Empty, allows further extension
status: {} # Empty, allows further extension
```
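
A minimal sketch of the sync step the new controller could perform is shown below. It assumes kubevirt-owned keys can be recognized by the `kubevirt.io` domain and that the heartbeat annotation is deliberately kept on the ShadowNode (see the Scalability section); the key names and the filtering rule are illustrative, not a committed API.

```go
package shadownode

import "strings"

// Assumed key; kept on the ShadowNode only, so the node controller would watch it there.
const heartbeatAnnotation = "kubevirt.io/heartbeat"

func isKubevirtOwned(key string) bool {
	return strings.Contains(key, "kubevirt.io")
}

// filterForNode returns the labels/annotations the controller would copy from the
// ShadowNode to the Node: only kubevirt-owned keys, and never the heartbeat.
func filterForNode(shadowLabels, shadowAnnotations map[string]string) (labels, annotations map[string]string) {
	labels, annotations = map[string]string{}, map[string]string{}
	for k, v := range shadowLabels {
		if isKubevirtOwned(k) {
			labels[k] = v
		}
	}
	for k, v := range shadowAnnotations {
		if isKubevirtOwned(k) && k != heartbeatAnnotation {
			annotations[k] = v
		}
	}
	return labels, annotations
}
```
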
## Scalability
There are a few aspects to consider.
1. The number of ShadowNodes will equal the number of nodes, so the space overhead is negligible.
2. The number of writes could double. Here it is important to look at the usage and sort each case into two categories: low- and high-frequency writes. All but the heartbeat writes are clearly low frequency as long as virt-handler operates as designed, since the labeled data is mostly static.
The heartbeat is high frequency, but it does not have to be synced onto the Node (this requires re-working the node controller to watch for the heartbeat annotation on the ShadowNode instead).
With all the above, doubling the number of writes is unlikely.
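For illustration (the numbers are assumptions, not measurements): in a 100-node cluster the one-minute heartbeat produces roughly 100 writes per minute. Today those land on Node objects; with this proposal they land on ShadowNode objects instead, and if the heartbeat annotation is not propagated, the Node write rate actually drops, while the remaining label updates (node labeller, every 3 minutes and mostly static) stay comparatively rare.
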
## Update/Rollback Compatibility
virt-operator upgrade [order](https://github.com/kubevirt/kubevirt/blob/2e3a2a32d88a2e08c136c051c30754d1b4af423b/pkg/virt-operator/resource/apply/reconcile.go#L526-L641) is:
```
CRDs ==> RBACs ==> Daemonsets (=virt-handler) ==> Controller Deployments (=virt-controller) ==> API Deployments (=virt-API)
```

In order to handle backwards compatibility issues and minimize heartbeat timeouts, virt-handler is going to continue writing to the Node object.
At some point during the upgrade, it would start being blocked due to the removal of the patch permission. In order to mitigate this and minimize heartbeat timeouts during the upgrade, the node patch will start ignoring RBAC "forbidden" errors.
Normally, new CRDs are blocked until the upgrade rollout is over. The ShadowNode CRD is going to be excluded from this in order to avoid upgrade issues that could cause heartbeat timeouts.

Having said the above, here is a specific breakdown of each upgrade scenario, during and after the upgrade rollout itself:

### During upgrade rollout transient case scenarios:
* New CRD; old RBAC; old virt-handler; old virt-controller; old virt-API:
  * Old RBAC: no upgrade issue.
  * Old virt-handler: no upgrade issue.
  * Old virt-controllers: no upgrade issue.
  * Old virt-API: no upgrade issue.
* New CRD; new RBAC; old virt-handler; old virt-controller; old virt-API:
  * Old virt-handler: during the upgrade, the old RBACs are kept as a backup until the upgrade is done. Hence, patching the node should not fail, so there is no upgrade issue.
  * Old virt-controllers: no upgrade issue, the virt-controller RBAC is not changed by this solution.
  * Old virt-API: no upgrade issue.
* New CRD; new RBAC; new virt-handler; old virt-controller; old virt-API:
  * New virt-handler: the ShadowNode is written to but it is not synced to the node yet. However, virt-handler keeps patching the node directly, so there is no upgrade issue.
  * [same as before] Old virt-controllers: no upgrade issue, the virt-controller RBAC is not changed.
  * [same as before] Old virt-API: no upgrade issue.
* New CRD; new RBAC; new virt-handler; new virt-controller; old virt-API:
  * New virt-handler: no upgrade issue. virt-handler patches both the node and the shadow node.
  * New virt-controllers: will start getting ShadowNode requests once they are created. No upgrade issue.
  * [same as before] Old virt-API: no upgrade issue.

### After upgrade rollout:
Virt-handler will keep trying (and failing) to patch the node, since the backup RBACs are now removed. Virt-handler will tolerate these failures by ignoring RBAC "forbidden" errors. This behavior should be kept until we no longer support upgrades from the current release.
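
A minimal sketch of such a tolerant node patch is shown below, using the standard apimachinery error helpers. The function name and call site are illustrative; the real change would live in virt-handler's existing heartbeat and node-labeller patch paths.

```go
package nodepatch

import (
	"context"
	"fmt"

	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// patchNodeIgnoringForbidden keeps patching the Node for backward compatibility
// during upgrade, but treats an RBAC "forbidden" response as success once the
// legacy node patch permission has been removed.
func patchNodeIgnoringForbidden(ctx context.Context, client kubernetes.Interface, nodeName string, patch []byte) error {
	_, err := client.CoreV1().Nodes().Patch(ctx, nodeName, types.MergePatchType, patch, metav1.PatchOptions{})
	if k8serrors.IsForbidden(err) {
		// Expected after the rollout finishes and the backup RBAC is removed.
		return nil
	}
	if err != nil {
		return fmt.Errorf("failed to patch node %s: %w", nodeName, err)
	}
	return nil
}
```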

## Alternative solutions
### Using status section
Same proxy solution as the primary one, where virt-handler updates the labels/annotations somewhere and they get copied to the node by virt-controller, but instead of using a new CRD as the medium - writing the labels/annotations to the status section of an existing CRD like the kubevirt CR.

Pros:
- Avoid introducing a new CRD to the project.

Cons:
- The status section is considered a user facing API section. Dynamically adding/updating 150+ technical labels there may only confuse the user and thus is not advisable.

### Using the virt-handler pod as a medium to pass the labels/annotations
Same proxy solution as the primary one, where virt-handler updates the labels/annotations somewhere and they get copied to the node by virt-controller, but instead of using a new CRD as the medium - writing the labels/annotations to the virt-handler pod's labels/annotations.
Labels/Annotations that need to pass to the node should get a special prefix (like "nodesync/"), so that regular labels don't get passed to the node by mistake (see the sketch after the pros and cons below).

Pros:
- Avoid introducing a new CRD to the project.
- A simple and clean approach.

Cons:
- It requires giving the virt-handler RBAC permission to update pods, which is generally considered risky.
- It's only relevant for passing labels/annotations. CRD approach allows for future expansions.
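
A hypothetical sketch of that prefix convention is shown below; the `nodesync/` prefix, the trimming behaviour, and the function name are assumptions for illustration only.

```go
package podsync

import "strings"

const nodeSyncPrefix = "nodesync/"

// labelsForNode picks only the pod labels explicitly marked for node propagation
// and strips the marker prefix before they would be applied to the node.
func labelsForNode(podLabels map[string]string) map[string]string {
	out := map[string]string{}
	for k, v := range podLabels {
		if strings.HasPrefix(k, nodeSyncPrefix) {
			out[strings.TrimPrefix(k, nodeSyncPrefix)] = v
		}
	}
	return out
}
```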

### Creating a GRPC communication between virt-handler and virt-controller
Pass the labels/annotations through a grpc channel to the virt-controller.

Pros:
- Avoid introducing a new CRD to the project.
- Allows for future expansions.

Cons:
- More complex. It's easier and more appealing to use the k8s API, i.e. patching labels/annotations, than to implement and maintain gRPC logic internally.

### Fine-tune each virt-handler RBAC to only patch its own node.
Introducing a new controller called token-distributor-controller. It will create a token for each virt-handler pod and deliver it to the node.

Pros:
- Avoid introducing a new CRD to the project.
- A malicious user can't manipulate pods from other nodes to migrate to their node.

Cons:
- A malicious user can still manipulate their own node with non-kubevirt-controlled params/labels, and then possibly lure control-plane components to that node in order to steal their credentials.

## Functional Testing Approach

Existing functional tests should be enough to cover this change.

# Implementation Phases
(How/if this design will get broken up into multiple phases)

This change will be implemented in one phase.

