diff --git a/design-proposals/shadow-node.md b/design-proposals/shadow-node.md
index f2c5be86..f5839070 100644
--- a/design-proposals/shadow-node.md
+++ b/design-proposals/shadow-node.md
@@ -10,10 +10,13 @@ Kubevirt should be secure by default. While mitigation is available, it is not p
 ## Goals
 
-Mitigate the advisory by default.
+- Prevent a malicious user who has taken over a Kubernetes node where the virt-handler daemonset is deployed from exploiting the daemonset's RBAC permissions to elevate privileges beyond that node, up to potentially gaining full privileged access to the whole cluster.
+- Prevent virt-handler from modifying non-kubevirt-owned fields on the node. This includes not being able to change the spec or any label/annotation that is not strictly kubevirt-owned.
 
 ## Non Goals
 
+Redesign/optimize existing virt-handler features so that they no longer update/patch the node in the first place.
+
 ## Definition of Users
 
 Not user-facing change
 
@@ -26,61 +29,315 @@ Kubevirt/Kubevirt
-# Design
-(This should be brief and concise. We want just enough to get the point across)
+# Design Options
+
+## Background
+To ease review and understanding of the different options, this section describes the different ways virt-handler writes to the Node object.
+
+### Node writers in virt-handler
+Kubevirt sets ~170 labels on each node. These are set by the following entities:
+1. virt-handler [labels](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/cmd/virt-handler/virt-handler.go#L185) the node with `kubevirt.io/schedulable=false` at the start and the termination (by SIGTERM only) of the virt-handler process to indicate (best-effort) that the node is not ready to handle VMs. This is meant to improve the user experience with VM scheduling.
+
+2. virt-handler/heartbeat:
+   * [labels](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/heartbeat/heartbeat.go#L96) the node with `kubevirt.io/schedulable=false` when the stop channel closes (i.e. virt-handler exits).
+   * [labels and annotates](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/heartbeat/heartbeat.go#L139) the node, updating the NodeSchedulable, CPUManager and KSMEnabledLabel labels and the VirtHandlerHeartbeat and KSMHandlerManagedAnnotation annotations once per minute.
+   * On the other side, virt-controller's node-controller monitors the heartbeat annotation.
+
+3. virt-handler/node-labeller [labels](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/node-labeller/node_labeller.go#L254) the node with various information that is mostly static in nature: cpu model, hyperv features, cpu timer, sev/-es, RT...
+   * This is triggered both on an interval (3 min) and on changes of the cluster config. However, as these labels usually don't change, this periodic reconciliation is usually a no-op.
+
+## Solution option #1: Use Validating Admission Policy
+Add new [ValidatingAdmissionPolicy and ValidatingAdmissionPolicyBinding](https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/) objects to prevent virt-handler from adding/deleting/updating non-kubevirt-owned labels/annotations.
+Kubevirt-owned labels/annotations are defined as follows:
+- labels/annotations that contain the string `kubevirt.io`.
+- the `cpu-manager` label, which is an exception to the above rule.
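+
+For illustration, the "kubevirt-owned" test above could be expressed through CEL [variable composition](https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/#variable-composition) roughly as follows. This is a minimal sketch only: the expressions and the service-account name are illustrative assumptions, and guards for objects without labels are omitted.
+```yaml
+  variables:
+    # Assumes the virt-handler ServiceAccount is named kubevirt-handler in the kubevirt namespace.
+    - name: isPartitionedServiceAccount
+      expression: "request.userInfo.username == 'system:serviceaccount:kubevirt:kubevirt-handler'"
+    # Keys of the old/new label maps that are NOT kubevirt-owned.
+    - name: oldNonKubevirtLabels
+      expression: "oldObject.metadata.labels.filter(k, !k.contains('kubevirt.io') && k != 'cpu-manager')"
+    - name: newNonKubevirtLabels
+      expression: "object.metadata.labels.filter(k, !k.contains('kubevirt.io') && k != 'cpu-manager')"
+```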
+
+The ValidatingAdmissionPolicy and ValidatingAdmissionPolicyBinding objects will be configured with the following parameters:
+- [matchConstraints](https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/#getting-started-with-validating-admission-policy): apply only to Nodes; Operations: UPDATE
+- predefined [variables](https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/#variable-composition):
+  - `isPartitionedServiceAccount`: true if virt-handler is the user requesting the update.
+  - `oldLabels`: the node's labels before the update was issued
+  - `oldNonKubevirtLabels`: the subset of the old labels that are not kubevirt-owned
+  - `newLabels`: the node's labels after the update was issued
+  - `newNonKubevirtLabels`: the subset of the new labels that are not kubevirt-owned
+  - (the same set of variables for annotations)
+- [matchConditions](https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/#matching-requests-matchconditions) rules (the validations are evaluated only if all of these are true):
+  - `isPartitionedServiceAccount` == true
+- [validation](https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/#creating-a-validatingadmissionpolicy) rules (written in pseudocode to keep things readable):
+
+  | validation (all rules need to be true in order to pass the policy)                                                                | error message in case of validation failure                   |
+  |------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------|
+  | object.spec == oldObject.spec                                                                                                        | virt-handler user cannot modify the spec of nodes                |
+  | object.status == oldObject.status                                                                                                    | virt-handler user cannot modify the status of nodes              |
+  | object.(not labels/annotations/resourceVersion/managedFields) == oldObject.(not labels/annotations/resourceVersion/managedFields)    | virt-handler user can only change allowed sub-metadata fields    |
+  | size(`newNonKubevirtLabels`) == size(`oldNonKubevirtLabels`)                                                                         | virt-handler user cannot add/delete non kubevirt-owned labels    |
+  | for label in `newNonKubevirtLabels`: {(label in `oldNonKubevirtLabels`) AND `newLabels`[label] == `oldLabels`[label]}                | virt-handler user cannot update non kubevirt-owned labels        |
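+
+To make the table concrete, a few of these rules could be rendered as actual `validations` entries roughly as follows. This is a sketch only: it reuses the illustrative variable definitions from the previous snippet and omits the guards a production policy would need (e.g. for absent label maps).
+```yaml
+  validations:
+    - expression: "object.spec == oldObject.spec"
+      message: "virt-handler user cannot modify the spec of nodes"
+    - expression: "variables.newNonKubevirtLabels.size() == variables.oldNonKubevirtLabels.size()"
+      message: "virt-handler user cannot add/delete non kubevirt-owned labels"
+    - expression: "variables.newNonKubevirtLabels.all(k, k in variables.oldNonKubevirtLabels && object.metadata.labels[k] == oldObject.metadata.labels[k])"
+      message: "virt-handler user cannot update non kubevirt-owned labels"
+```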
+
+### Backwards compatibility
+ValidatingAdmissionPolicy is [going GA](https://kubernetes.io/blog/2024/03/12/kubernetes-1-30-upcoming-changes/) in k8s 1.30, so it should be fine to deploy on the Kubevirt main and next minor release branches.
+However, since Kubevirt needs to support up to 2 k8s versions back, the deployment of these new objects needs to be gated behind a new "node restriction" feature gate. The HCO main and v1.3 branches will enable this feature gate by default.
+
+### Next step: Prevent patches from other nodes' virt-handlers
+It is possible to expand the rules of the ValidatingAdmissionPolicy to also prohibit a virt-handler running on one node from patching other nodes.
+This could be done by defining a new variable `requestHasNodeName`, which is true if the request has the `userInfo.extra.authentication.kubernetes.io/node-name` field, and then adding this validation:
+```
+!`requestHasNodeName` OR (object.Name == request.nodeName)
+```
+However, this field is [beta](https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#additional-metadata-in-pod-bound-tokens) in k8s 1.30 and is currently protected by the [ServiceAccountTokenPodNodeInfo](https://github.com/kubernetes/kubernetes/blob/c4bce63d9886e5f1fc00f8c3b5a13ea0d2bdf772/pkg/features/kube_features.go#L753) k8s feature gate.
+This validation can be added once the field is GA.
+
+Pros:
+* Avoids adding CRDs or changing the current virt-handler code to fix the CVE issue.
+* A k8s built-in feature that is straightforward to maintain and configure.
+* ValidatingAdmissionPolicy is [going GA](https://kubernetes.io/blog/2024/03/12/kubernetes-1-30-upcoming-changes/) in k8s 1.30, so it can be used for the next Kubevirt minor release.
+
+Cons:
+* Needs a feature gate because it is only applicable from k8s 1.30 onwards.
+
+## Solution option #2: ShadowNode CRD
+The security issue is addressed by introducing a new lightweight CRD, "ShadowNode", with one instance per Kubernetes node.
+Virt-handler will keep read access to nodes but will lose write access. It will instead write to the ShadowNode object, and the allowed information will subsequently be selectively copied to the node by a new controller. This controller will be the only entity that writes to the node object.
+
+Pros:
+- virt-controller runs only on a subset of nodes, so the exposure is not as wide as virt-handler's. Moreover, as virt-controller is advised to run on control-plane nodes (see [node-placement](https://kubevirt.io/user-guide/operations/installation/#restricting-kubevirt-components-node-placement)), the risk can be reduced further.
+- A malicious user can't exploit virt-handler's RBAC permissions to manipulate non-kubevirt params/labels.
+- Adding a new CRD is a good opportunity to create a real spec for all that information. For instance, the reason we put the heartbeat in metadata is that we don't have a place on the k8s Node object to properly express that information. Now we would.
+- Scalability:
+  - The number of ShadowNodes will equal the number of nodes, a negligible space overhead.
+  - If the heartbeat annotation is handled only on the ShadowNode and not patched directly to the node as is done today, the nodes will receive far fewer API calls. Ergo, node-watching components will get "bothered" less frequently.
+
+Cons:
+- Scalability:
+  - Albeit a simple and lightweight CRD (it will only contain metadata), ShadowNodes add a 1:1 shadowNode-to-node resource ratio, therefore increasing:
+    - the number of API calls kubevirt makes: at first the number of writes could double because the CRD is written to as a proxy. However, this should subside quickly, as the labels kubevirt patches don't tend to change often (except the heartbeat, which as mentioned above can be eliminated). With the heartbeat annotation removed from the node patch, this con can be flagged as negligible.
+    - the storage kubevirt takes.
+- By adding virt-controller as a "proxy" agent, we now depend on its availability. This could increase the heartbeat timeout in corner cases such as upgrades.
+- To a degree, it is still possible to manipulate pods by causing VMs to migrate from other nodes. For example, a malicious user can add the control-plane label to their own node and then add other labels to lure control-plane pods to that node.
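+
+To illustrate the intended permission change, virt-handler's ClusterRole would roughly shift from write access on nodes to write access on ShadowNodes. The snippet below is a sketch only; the role name, API group and resource name are assumptions for illustration:
+```yaml
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: kubevirt-handler          # illustrative name
+rules:
+  - apiGroups: [""]
+    resources: ["nodes"]
+    verbs: ["get", "list", "watch"]   # read-only; patch/update on nodes is removed
+  - apiGroups: ["kubevirt.io"]         # assumed group for the new CRD
+    resources: ["shadownodes"]
+    verbs: ["get", "list", "watch", "update", "patch"]
+```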
+
+### API Examples
+```yaml
+apiVersion:
+kind: ShadowNode
+metadata:
+  name:
+  # Labels and Annotations only
+spec: {}   # Empty, allows further extension
+status: {} # Empty, allows further extension
+```
-In order to easy reviews and understand what needs to be change the first section will describe different usage of Node object in the virt-handler.
+## Solution option #3: Creating a REST/GRPC communication between virt-handler and virt-controller
+Pass the labels/annotations through a REST/GRPC channel to virt-controller. The REST option is preferred because the server side is already implemented.
+
+Virt-handler (server): instead of directly patching labels/annotations onto the node, virt-handler will maintain a "cache" (NodeInfoMgr) where all the components will "set" their data.
+Virt-handler can use its existing [web-service](https://github.com/kubevirt/kubevirt/blob/b102c56f0fcd52feff3ff7a6296737b8e8b99131/cmd/virt-handler/virt-handler.go#L556) to expose one more endpoint, GetNodeInfo, which returns the node data when called.
+
+Virt-controller (client): add new web-service client infrastructure that will:
+1. run as a goroutine and poll the servers for updates once per minute. This frequency is chosen because it matches the highest of the current update frequencies (the heartbeat).
+2. query every virt-handler server separately for its NodeInfo (need to make sure it's possible to address each one uniquely).
+3. filter out non-kubevirt-owned annotations/labels.
+4. patch the node if necessary.
+
+If virt-handler gets a SIGTERM, it will no longer be able to set the node as unschedulable. This is replaced with readiness and liveness probes and a new controller that watches the virt-handler daemonset's readiness and sets the node unschedulable if necessary (see the probe sketch below).
+
+Notes:
+- The heartbeat annotation and virt-controller's [node-controller](https://github.com/kubevirt/kubevirt/blob/b102c56f0fcd52feff3ff7a6296737b8e8b99131/pkg/virt-controller/watch/node.go) can be removed in favor of the web-service timeout mechanism (on timeout, set the node as unreachable).
+
+Pros:
+- Avoids introducing a new CRD to the project.
+- Uses the existing web-server infrastructure on the server side.
+- Security:
+  - the web-service is secured using the internal cert-manager API.
+  - virt-handler does not need to access the kube-API at all, and cannot modify unrelated nodes or shadow nodes. Virt-controller does all the risky work.
+- Scalability:
+  - Reduces the number of API requests (i.e. node patches) to a minimum.
+
+Cons:
+- More complex. It is easier and more appealing to use the k8s API, i.e. patching labels/annotations, than to implement and maintain GRPC/REST logic internally.
+- Virt-handler will no longer be able to immediately set the node as unschedulable on SIGTERM. However, the probes and the new controller should provide a sufficient substitute.
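+
+For illustration, the SIGTERM replacement above could rely on standard probes on the virt-handler DaemonSet container, which the new controller (and the kubelet) would observe. This is a sketch only; the endpoint path, port and timings are placeholders rather than the actual virt-handler values:
+```yaml
+# Probes on the virt-handler container (placeholder path/port/timings).
+readinessProbe:
+  httpGet:
+    path: /healthz
+    port: 8443
+    scheme: HTTPS
+  periodSeconds: 10
+livenessProbe:
+  httpGet:
+    path: /healthz
+    port: 8443
+    scheme: HTTPS
+  periodSeconds: 30
+```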
+
+## Solution option #4: Moving to a third-party entity
+The idea is to eliminate the reasons for virt-handler to patch the node in the first place.
+
+1. Use tools like [Node Feature Discovery](https://kubernetes-sigs.github.io/node-feature-discovery/v0.15/get-started/introduction.html) (personally preferred) OR utilize virt-handler's [device-manager](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/) to publish node features.
+   - How these tools solve the CVE issue:
+     - Node Feature Discovery's master Daemonset (which patches the node) only runs on control-plane nodes, so there is less security concern.
+     - device-manager is a kubelet plugin, so the patching is actually done by kubelet.
+2. `"kubevirt.io/schedulable"` label:
+   - Stop writing this label when virt-handler crashes. It is racy and has no real value.
+   - Move setting it entirely to virt-controller (right now virt-controller only sets it to the value "false").
+3. Heartbeat: instead of adding the heartbeat timestamp annotation in order to later monitor its "freshness" from virt-controller's [node-controller](https://github.com/kubevirt/kubevirt/blob/b102c56f0fcd52feff3ff7a6296737b8e8b99131/pkg/virt-controller/watch/node.go), simply monitor the readiness of the virt-handler Daemonset/pods.
+
+### Utilize device-manager option
+Instead of publishing the node's cpu-feature, supported_features, hostCPUModels, etc. as labels on the node, add them as devices on the node.
+Example:
+From this:
+```yaml
+apiVersion: v1
+kind: Node
+metadata:
+  labels:
+    cpu-feature.node.kubevirt.io/abm: "true"
+    cpu-feature.node.kubevirt.io/adx: "true"
+#...
+```
+To this:
+```yaml
+apiVersion: v1
+kind: Node
+#...
+status:
+  allocatable:
+    devices.kubevirt.io/kvm: 1k
+    cpu-feature.node.kubevirt.io/abm: 1k
+    cpu-feature.node.kubevirt.io/adx: 1k
+    #...
+  capacity:
+    devices.kubevirt.io/kvm: 1k
+    cpu-feature.node.kubevirt.io/abm: 1k
+    cpu-feature.node.kubevirt.io/adx: 1k
+    #...
+```
+When a VMI requests a specific CPU feature, the virt-launcher pod would be set with the desired resource:
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: virt-launcher-pod
+spec:
+  containers:
+    - name: demo-container-1
+      resources:
+        limits:
+          cpu-feature.node.kubevirt.io/adx: 1
+```
-## Node usage
-1. [markNodeAsUnschedulable](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/cmd/virt-handler/virt-handler.go#L185) is used at the start and the termination (by SIGTERM only) of virt-handler to indicate (best-effort) that the node is not ready to handle VMs.
+### Utilize Node Feature Discovery (NFD) option
+NFD can do the same thing that kubevirt's node-labeller is doing.
+When deployed, NFD:
+- allows for non-standard, user-specific feature labels created with the [NodeFeature CRD](https://kubernetes-sigs.github.io/node-feature-discovery/v0.15/usage/custom-resources.html#nodefeature).
+  - virt-handler could use this to relay its labels instead of patching them itself.
+- patches built-in labels with the `feature.node.kubernetes.io/` prefix. See the list [here](https://kubernetes-sigs.github.io/node-feature-discovery/v0.15/usage/features.html#built-in-labels).
+  - These labels are similar to, but do not exactly match, the ones Kubevirt deploys right now (see a [comparison](https://www.diffchecker.com/qA2EvefO/) from a local dev node deployed using KubevirtCI).
+  - This NFD feature will not be used for this solution.
+
+Example:
+Instead of directly patching labels such as `cpu-feature.node.kubevirt.io/abm: "true"` and `cpu-feature.node.kubevirt.io/adx: "true"` onto the node, virt-handler would deploy the following CR:
+```yaml
+apiVersion: nfd.k8s-sigs.io/v1alpha1
+kind: NodeFeature
+metadata:
+  labels:
+    nfd.node.kubernetes.io/node-name: node-1
+  name: node-1-labeller-sync
+spec:
+  labels:
+    cpu-feature.node.kubevirt.io/abm: "true"
+    cpu-feature.node.kubevirt.io/adx: "true"
+#...
+```
+See the full CR [here](https://paste.centos.org/view/ef218827)
-2. Parts of Heartbeat
-   1. [labelNodeUnschedulable](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/heartbeat/heartbeat.go#L96) is similarly as markNodeAsUnschedulable used only when virt-handler is stopping.
+Pros:
+- No need to reinvent the wheel.
+- Re-examining whether current methods are still needed is good practice. In this case we may find out that the patching is no longer needed, leaving less code to maintain in the long run.
+- These steps can be broken into multiple, seemingly unrelated PRs that are easier to review.
-   2. As part of [do](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/heartbeat/heartbeat.go#L119) we do 2 calls:
-      - [Get](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/heartbeat/heartbeat.go#L139) - irrelevant as this is read access.
-      - [Patch](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/heartbeat/heartbeat.go#L139) - updating NodeSchedulable, CPUManager, KSMEnabledLabel labels and VirtHandlerHeartbeat KSMHandlerManagedAnnotation annotations once per minute.
+Cons:
+- Relies on third-party tools.
+- Third-party tools cannot filter only the kubevirt.io-related labels.
+- Annotations are not addressed (for example the `kubevirt.io/ksm-handler-managed` annotation).
+- With NFD one cannot filter only kubevirt-owned labels, making it a bit less safe.
-3. As part of node labeller
-   1. [run](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/node-labeller/node_labeller.go#L195) - irrelevant as this is read access
+
+## Solution option #5: [Ruled-out] Using the status section
+The same proxy approach as the primary solution, where virt-handler writes the labels/annotations somewhere and virt-controller copies them to the node, but instead of using a new CRD as the medium, the labels/annotations are written to the status section of an existing CRD such as the kubevirt CR.
-   2. [patchNode](https://github.com/kubevirt/kubevirt/blob/5b61a18f4c3b6c549d50fd201664c21720e3dbe2/pkg/virt-handler/node-labeller/node_labeller.go#L254) - node labeller patching node with labels for various informations which are mostly static in nature: cpu model, hyperv features, cpu timer, sev/-es, RT...
-This is triggered both by interval (3 min) and on changes of cluster config.
+
+Pros:
+- Avoids introducing a new CRD to the project.
-## Shadow node
+
+Cons:
+- [Deal-breaker] The status section is considered a user-facing API section. Dynamically adding/updating 150+ technical labels there would only confuse the user and is therefore not advisable.
-The proposed solution is to introduce new CRD that will be shadowing a node "ShadowNode". The virt-handler will be writting to ShadowNode and virt-controller will have a responsibility to sync allowed information to Node.
+
+## Solution option #6: [Ruled-out] Using the virt-handler pod as a medium to pass the labels/annotations
+The same proxy approach as the primary solution, where virt-handler writes the labels/annotations somewhere and virt-controller copies them to the node, but instead of using a new CRD as the medium, the labels/annotations are written to the virt-handler pod's own labels/annotations.
+Labels/annotations that need to be passed to the node should get a special prefix (like "nodesync/"), so that regular pod labels don't get propagated to the node by mistake.
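+
+For illustration, a virt-handler pod under this option might carry prefixed labels like the following, which virt-controller would strip and copy to the node. This is a sketch only; the prefix and the key encoding are illustrative assumptions:
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: virt-handler-xxxxx                        # illustrative pod name
+  labels:
+    nodesync/kubevirt.io_schedulable: "true"      # would be copied as kubevirt.io/schedulable on the node
+    nodesync/cpu-manager: "false"                 # would be copied as cpu-manager on the node
+```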
-## API Examples
-```yaml
-apiVersion:
-kind: ShadowNode
-metadata:
-  name:
-  // Labels and Annotations only
-spec: {} //Empty, allows futher extension
-status: {} //Empty, allows futher extension
-```
+Pros:
+- Avoids introducing a new CRD to the project.
+- A simple and clean approach.
+
+Cons:
+- [Deal-breaker] It requires giving virt-handler RBAC permission to update pods, which is generally considered risky.
+- It is only relevant for passing labels/annotations. The CRD approach allows for future expansion.
-## Scalability
+
+## Solution option #7: [Ruled-out] Fine-tune each virt-handler's RBAC to only patch its own node
+Introduce a new controller, token-distributor-controller, which will create a token for each virt-handler pod and deliver it to the node.
-There are few aspects to consider.
+
+Pros:
+- Avoids introducing a new CRD to the project.
+- A malicious user can't cause pods from other nodes to migrate to their node.
-1. #shadowNodes will be equivalent #nodes, negligible space overhead
+
+Cons:
+- A malicious user can still manipulate their own node's non-kubevirt-controlled params/labels and possibly lure control-plane components to the node in order to steal their credentials. For example, they can add the control-plane label to their own node, then add other labels to attract components that seek specific functionality, and take them over.
-2. #writes could double. Here it is importent to look at the usage and sort each case to 2 categories (low and high frequent write). The first usage is clearly low frequent as long as virt-handler operates as designed.
-The second usage consist of two cases which might seem different but in fact these are same most of the time because NodeSchedulable, CPUManager, KSMEnabledLabel labels and KSMHandlerManagedAnnotation annotation is mostly static. What is left is VirtHandlerHeartbeat that is not necessary to sync on a Node (requires re-working node controller).
-The third case is propagating labels that are also mostly static.
-With all the above doubling #writes is unlikely.
+
+This solution is not a bad one, but it is more important to first address not being able to affect non-kubevirt labels/annotations (see the Follow-up Steps section).
+
+# Chosen solution
-## Update/Rollback Compatibility
+
+TBD until further discussion.
-Virt-handler is going to continue writting to Node object, therefore update should be without compatibility issues.
+
+# Update/Rollback Compatibility
+
+Note: TBD per chosen solution.
+
+The following reasoning is relevant for all the proposed solutions.
+virt-operator's upgrade [order](https://github.com/kubevirt/kubevirt/blob/2e3a2a32d88a2e08c136c051c30754d1b4af423b/pkg/virt-operator/resource/apply/reconcile.go#L526-L641) is:
+```
+CRDs ==> RBACs ==> Daemonsets (=virt-handler) ==> Controller Deployments (=virt-controller) ==> API Deployments (=virt-API)
+```
+
+## Option 1: Ignore/mask node patches when an RBAC "forbidden" error is received
+In order to handle backwards-compatibility issues and minimize heartbeat timeouts, virt-handler is going to continue writing to the Node object.
+At some point during the upgrade it will start being blocked because the patch permission is removed. To mitigate this and minimize heartbeat timeouts during the upgrade, the node patch will start ignoring the RBAC "forbidden" error.
+That said, here is a breakdown of each upgrade scenario, during and after the upgrade rollout itself:
+
+### Transient scenarios during upgrade rollout:
+* New CRD; old RBAC; old virt-handler; old virt-controller; old virt-API:
+  * Old RBAC: no upgrade issue
+  * Old virt-handler: no upgrade issue
+  * Old virt-controllers: no upgrade issue
+  * Old virt-API: no upgrade issue
+* New CRD; new RBAC; old virt-handler; old virt-controller; old virt-API:
+  * Old virt-handler: during the upgrade, the old RBACs are kept as a backup until the upgrade is done. Hence, patching the node should not fail, so there is no upgrade issue.
+  * Old virt-controllers: no upgrade issue; the virt-controller RBAC is not changed by this solution.
+  * Old virt-API: no upgrade issue.
+* New CRD; new RBAC; new virt-handler; old virt-controller; old virt-API:
+  * New virt-handler: the shadowNode is written to but is not synced to the node yet. However, virt-handler keeps patching the node directly, so there is no upgrade issue.
+  * [same as before] Old virt-controllers: no upgrade issue; the virt-controller RBAC is not changed.
+  * [same as before] Old virt-API: no upgrade issue.
+* New CRD; new RBAC; new virt-handler; new virt-controller; old virt-API:
+  * New virt-handler: no upgrade issue. virt-handler patches both the node and the shadowNode.
+  * New virt-controllers: will start getting shadowNode requests once the ShadowNodes are created. No upgrade issue.
+  * [same as before] Old virt-API: no upgrade issue.
+
+### After upgrade rollout:
+Virt-handler will keep trying (and failing) to patch the node, since the backup RBACs are now removed. Virt-handler will tolerate this by ignoring the RBAC "forbidden" errors. This behavior should be kept until we no longer support upgrades from the current release.
+
+## Option 2: Use a new flag in the kubevirt CR status to signal that the upgrade is done
+
+Until the upgrade is done, patches to the node will still occur. Once the upgrade is finished (after the temporary validating webhooks are cleared), virt-operator sets a new field in the kubevirt CR status to signal that the new mechanism can be used.
+Doing so ensures that all new components are installed and ready.
 
 ## Functional Testing Approach
 
 Existing functional tests should be enough to cover this change.
 
 # Implementation Phases
-(How/if this design will get broken up into multiple phases)
+
+This change will be implemented in one phase.
+
+# Follow-up Steps
+
+These steps could be presented in future design proposals:
+- Restrict the virt-handler Daemonset to only modify KubeVirt-owned CRs (such as virtualmachineinstances) related to its own node.
+- Restrict the virt-handler Daemonset from being able to update/patch other nodes.