From 5f5c038d375dfc4fe9dc31993e1a4bdb7bb04149 Mon Sep 17 00:00:00 2001
From: Arvinder Wander
Date: Thu, 3 Mar 2022 14:13:03 -0800
Subject: [PATCH] Proposal: Label Sync Between MachineDeployment and underlying
 Kubernetes Nodes

---
 docs/proposals/20220210-md-node-label-sync.md | 159 ++++++++++++++++++
 1 file changed, 159 insertions(+)
 create mode 100644 docs/proposals/20220210-md-node-label-sync.md

diff --git a/docs/proposals/20220210-md-node-label-sync.md b/docs/proposals/20220210-md-node-label-sync.md
new file mode 100644
index 000000000000..7d58bb3460d7
--- /dev/null
+++ b/docs/proposals/20220210-md-node-label-sync.md
@@ -0,0 +1,159 @@
---

title: Label Sync Between MachineDeployment and underlying Kubernetes Nodes
authors:
  - "@arvinderpal"
  - "@enxebre"
reviewers:
  - "@fabriziopandini"
  - "@neolit123"
  - "@sbueringer"
  - "@vincepri"
  - "@randomvariable"
creation-date: 2022-02-10
last-updated: 2022-02-26
status: implementable
replaces:
superseded-by:

---

# Label Sync Between MachineDeployment and underlying Kubernetes Nodes

## Glossary

Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html).

## Summary

Labels on worker nodes are commonly used for node role assignment and workload placement. This document discusses a means by which labels placed on MachineDeployments can be kept in sync with labels on the underlying Kubernetes Nodes.

## Motivation

Kubelet supports self-labeling of nodes via the `--node-labels` flag. CAPI users can specify these labels via `kubeletExtraArgs.node-labels` in the KubeadmConfigTemplate. There are a few shortcomings:
- Security concerns have restricted the allowed set of labels; in particular, the kubelet's ability to self-label in the `*.kubernetes.io/` and `*.k8s.io/` namespaces is mostly prohibited. Using prefixes outside these namespaces (i.e. unrestricted prefixes) is discouraged.
- Labels can only be applied at creation time. Adding or removing a label would require a MachineDeployment rollout, which is undesirable, especially in bare-metal environments where provisioning can be time consuming.

The documented approach for specifying restricted labels in CAPI is to utilize kubectl. This introduces the potential for human error and also goes against the general declarative model adopted by CAPI.

Users could also implement their own label synchronizer in their tooling, but this may not be practical for most.

### Goals

- Support synchronizing labels that fall under the following restricted prefixes: `node-role.kubernetes.io/*` and `node-restriction.kubernetes.io/*`
- Support synchronizing labels that fall under a CAPI-specific prefix: `node.cluster.k8s.io/*`

### Non-Goals/Future Work

- Label sync for control-plane nodes.
- Support for arbitrary/user-specified label prefixes.

## Proposal

CAPI will define its own label prefix, which falls within the [restricted namespace *.k8s.io/](https://github.com/kubernetes/enhancements/tree/master/keps/sig-auth/279-limit-node-access):

```
node.cluster.k8s.io/*
```

Labels placed on the Machine that match the above prefix will be authoritative -- Cluster API will act as the ultimate authority for anything under the prefix. Labels with this prefix should be set only within Cluster API CRDs.
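
To make the authoritative semantics concrete, here is a minimal Go sketch; the helper name `desiredNodeLabels` and its placement are illustrative assumptions, not part of the proposed API. A key under the CAPI prefix ends up on the Node if and only if it is present on the Machine, while all other Node labels are left untouched.

```go
package labelsync

import "strings"

// capiNodeLabelPrefix is the proposed CAPI-owned label prefix.
const capiNodeLabelPrefix = "node.cluster.k8s.io/"

// desiredNodeLabels is a hypothetical helper illustrating the authoritative
// behavior: keys under the CAPI prefix mirror the Machine exactly; everything
// else on the Node is preserved as-is.
func desiredNodeLabels(machineLabels, nodeLabels map[string]string) map[string]string {
	desired := map[string]string{}
	// Keep every Node label that Cluster API does not own.
	for k, v := range nodeLabels {
		if !strings.HasPrefix(k, capiNodeLabelPrefix) {
			desired[k] = v
		}
	}
	// The Machine is the source of truth for anything under the CAPI prefix,
	// so prefixed keys missing from the Machine are implicitly removed.
	for k, v := range machineLabels {
		if strings.HasPrefix(k, capiNodeLabelPrefix) {
			desired[k] = v
		}
	}
	return desired
}
```

Computing the full desired set, rather than patching individual keys, keeps the sync idempotent and naturally covers label removal.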

In addition to the above, CAPI will support one-time application of the following labels during Machine creation:

```
node-role.kubernetes.io/*
node-restriction.kubernetes.io/*
```

The `node-role.kubernetes.io` prefix has been used widely in the past to identify the role of a node (e.g. `node-role.kubernetes.io/worker=''`). For example, `kubectl get node` looks for this specific label when displaying the role to the user.

The use of the `node-restriction.kubernetes.io/*` prefix is recommended in the [Kubernetes docs](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#noderestriction) for things such as workload placement.

It's important to note that these specific labels are *not* kept in sync, but rather applied only once, after the Machine and its corresponding Node are first created. Taking authoritative ownership of these prefixes would preclude other entities from adding and removing such labels on the Node.

### User Stories

#### Story 1

As a cluster admin/user, I would like a declarative and secure means by which to assign roles to my nodes. For example, when I do `kubectl get nodes`, I want the ROLES column to list my assigned roles for the nodes.

#### Story 2

As a cluster admin/user, for the purpose of workload placement, I would like a declarative and secure means by which I can add/remove labels on groups of nodes. For example, I want to run my outward-facing nginx pods on a specific set of nodes (e.g. md-secure-zone) to reduce exposure to the company's publicly trusted server certificates.

### Implementation Details/Notes/Constraints

#### Label propagation from MachineDeployment to Machine

Option #1: CAPI metadata propagation

CAPI supports [metadata propagation](https://cluster-api.sigs.k8s.io/developer/architecture/controllers/metadata-propagation.html?highlight=propagation#metadata-propagation) across core API resources. For the MachineDeployment type, template labels propagate to the MachineSet's top-level metadata and to the MachineSet's template metadata:
```
MachineDeployment.spec.template.metadata.labels => MachineSet.labels, MachineSet.spec.template.metadata.labels
```
Similarly, on a MachineSet, template labels propagate to Machines:
```
MachineSet.spec.template.metadata.labels => Machine.labels
```

As of this writing, changing `MachineDeployment.spec.template.metadata.labels` will trigger a rollout. This is undesirable.

An alternative is to use the labels in the top-level metadata. However, these are not currently propagated. One option is to propagate only the top-level labels matching the aforementioned prefixes. The propagation would follow a path similar to the one described above:
```
MachineDeployment.labels => MachineSet.labels => Machine.labels
```

Option #2: New field Spec.NodeLifecycle.Labels

Instead of using the top-level metadata, a new field would be introduced:
```
MachineDeployment.Spec.NodeLifecycle.Labels => MachineSet.Spec.NodeLifecycle.Labels => Machine.Spec.NodeLifecycle.Labels
```

The benefit is that this gives the user a clear indication that these labels will be synced to the Kubernetes Node(s). It would, however, require API changes.
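
To illustrate the selective propagation described under Option #1, the sketch below shows how only top-level labels under the synced prefixes would be copied down the `MachineDeployment.labels => MachineSet.labels => Machine.labels` chain. The helper name `filterSyncedLabels` and the prefix list are illustrative assumptions; the prefixes simply mirror the ones named earlier in this proposal.

```go
package labelsync

import "strings"

// syncedPrefixes mirrors the prefixes discussed in this proposal; it is not an
// API definition.
var syncedPrefixes = []string{
	"node.cluster.k8s.io/",
	"node-role.kubernetes.io/",
	"node-restriction.kubernetes.io/",
}

// filterSyncedLabels is a hypothetical helper selecting only the labels that
// fall under a synced prefix; all other top-level labels are ignored and never
// propagated.
func filterSyncedLabels(labels map[string]string) map[string]string {
	out := map[string]string{}
	for k, v := range labels {
		for _, prefix := range syncedPrefixes {
			if strings.HasPrefix(k, prefix) {
				out[k] = v
				break
			}
		}
	}
	return out
}
```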

#### Synchronization of CAPI Labels

Labels that fall under the CAPI-specific prefix `node.cluster.k8s.io/*` will need to be kept in sync between the Machine and Node objects. Synchronization must handle two scenarios:
(A) A label is added/removed on a Machine and must be added/removed on the corresponding Node.
(B) A label is added/removed directly on the Node and must be removed/re-added on the Node to bring it back in sync with the labels on the Machine.

#### Machine Controller Changes

On each Machine reconcile event, the machine controller will fetch the corresponding Node (i.e. `Machine.Status.NodeRef`). If the Machine has labels specified, the controller will ensure those labels are applied to the Node.

To watch for changes on the Node object, the Tracker pattern will be used. ...

#### One-time Apply of Standard Kubernetes Labels

The machine controller will apply the standard Kubernetes labels, if specified, to the Node immediately after the Node enters the Ready state (ProviderID is set...).

#### Delay between Node Create and Label Sync

The Node object is first created when kubeadm joins a node to the workload cluster (i.e. the kubelet is up and running). There may be a delay (potentially several seconds) before the label sync controller (LSC) kicks in to apply the labels on the Node.

Kubernetes supports both equality and inequality requirements in label selection. In an equality-based selection, the user wants to place a workload on node(s) matching a specific label (e.g. Node.Labels contains `my.prefix/foo=bar`). The delay in applying the label on the node may cause a subsequent delay in the placement of the workload, but this is likely acceptable.

In an inequality-based selection, the user wants to place a workload on node(s) that do not contain a specific label (e.g. Node.Labels does not contain `my.prefix/foo=bar`). This case is potentially problematic: the selection relies on the absence of a label, so an unintended placement can occur if the pod scheduler runs during the delay interval.

One way to address this is to use kubelet's `--register-with-taints` flag. We could use it to taint a newly minted Node with, say, `cluster.x-k8s.io=label-sync-pending:NoSchedule`. Assuming workloads don't have this specific toleration, nothing should be scheduled. When the LSC synchronizes labels, it removes this specific taint.

KubeadmConfigTemplate provides the means to set taints on nodes (see JoinConfiguration.NodeRegistration.Taints). We have some options as to how best to “insert” this specific taint:
(1) Leave it to the user to specify this exact taint in their KubeadmConfigTemplate manifests.
(2) CABPK inserts it by default in all join configurations.

Option (1) is the simplest and could be documented as follows:

```
If you utilize inequality-based selection for workload placement, to prevent unintended scheduling of pods during the initial node startup phase, it is recommended that you specify the following taint in your KubeadmConfigTemplate:
`cluster.x-k8s.io=label-sync-pending:NoSchedule`
```

## Alternatives

Users can utilize kubectl in an imperative fashion to add labels to the worker nodes.

## Implementation History

- [ ] MM/DD/YYYY: Proposed idea in an issue or [community meeting]