📖 Label Sync Between MachineDeployment and underlying Kubernetes Nodes #6255

File added: `docs/proposals/20220210-md-node-label-sync.md` (165 additions, 0 deletions)

---

title: Label Sync Between MachineDeployment and underlying Kubernetes Nodes
authors:
- "@arvinderpal"
- "@enxebre"
reviewers:
- "@fabriziopandini"
- "@neolit123"
- "@sbueringer"
- "@vincepri"
- "@randomvariable"
creation-date: 2022-02-10
last-updated: 2022-02-26
status: implementable
replaces:
superseded-by:

---

# Label Sync Between MachineDeployment and underlying Kubernetes Nodes

## Glossary

Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html).

## Summary

Labels on worker nodes are commonly used for node role assignment and workload placement. This document discusses a means by which labels placed on MachineDeployments can be kept in sync with labels on the underlying Kubernetes Nodes.

## Motivation

Kubelet supports self-labeling of nodes via the `--node-labels` flag. CAPI users can specify these labels via `kubeletExtraArgs.node-labels` in the KubeadmConfigTemplate (sketched below). There are a few shortcomings:
- [Security concerns](https://github.com/kubernetes/enhancements/tree/master/keps/sig-auth/279-limit-node-access) have restricted the allowed set of labels; in particular, kubelet's ability to self-label in the `*.kubernetes.io/` and `*.k8s.io/` namespaces is mostly prohibited. Using prefixes outside these namespaces (i.e. unrestricted prefixes) is discouraged.
- Labels are only applicable at creation time. Adding/removing a label would require a MachineDeployment rollout. This is undesirable, especially in bare-metal environments where provisioning can be time-consuming.
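
For illustration, a minimal sketch of the current `kubeletExtraArgs.node-labels` approach (the object name and label values here are hypothetical):

```yaml
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata:
  name: md-0-config            # hypothetical name
spec:
  template:
    spec:
      joinConfiguration:
        nodeRegistration:
          kubeletExtraArgs:
            # Passed to kubelet as --node-labels; restricted prefixes such as
            # node-role.kubernetes.io/* are mostly rejected, and changing this
            # value requires a MachineDeployment rollout.
            node-labels: "my.prefix/foo=bar,my.prefix/zone=z1"
```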

The documented approach for specifying restricted labels in CAPI is to utilize [kubectl](https://cluster-api.sigs.k8s.io/user/troubleshooting.html#labeling-nodes-with-reserved-labels-such-as-node-rolekubernetesio-fails-with-kubeadm-error-during-bootstrap). This introduces the potential for human error and also goes against the general declarative model adopted by CAPI.

Users could also implement their own label synchronizer in their tooling, but this may not be practical for most.

### Goals

- Support label synchronization for MachineDeployments as well as KubeadmControlPlane (KCP).
- Support a one-time application of labels that fall under the following restricted prefixes: `node-role.kubernetes.io/*` and `node-restriction.kubernetes.io/*`.
- Support synchronizing labels that fall under a CAPI-specific prefix: `node.cluster.x-k8s.io/*`.

### Future Goals

- Label sync for MachinePool Nodes.

### Non-Goals

- Support for arbitrary/user-specified label prefixes.

## Proposal

CAPI will define its own label prefix:

```
node.cluster.x-k8s.io/*
```

Labels placed on the Machine that match the above prefix will be authoritative: Cluster API will act as the sole authority for anything under the prefix. Labels with this prefix should be set only within Cluster API CRDs.

In addition to the above, CAPI will support one-time application of the following labels during Machine creation:

```
node-role.kubernetes.io/*
node-restriction.kubernetes.io/*
```

The `node-role.kubernetes.io` prefix has been widely used in the past to identify the role of a node (e.g. `node-role.kubernetes.io/worker=''`). For example, `kubectl get nodes` derives the ROLES column from labels with this prefix.

The use of the `node-restriction.kubernetes.io/*` prefix is recommended in the [Kubernetes docs](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#noderestriction) for things such as workload placement.

It's important to note that these specific labels are *not* kept in sync, but rather applied only once, after the Machine and its corresponding Node are first created. Taking authoritative ownership of these prefixes would preclude other entities from adding and removing such labels on the Node.

### User Stories

#### Story 1

As a cluster admin/user, I would like a declarative and secure means by which to assign roles to my nodes. For example, when I do `kubectl get nodes`, I want the ROLES column to list my assigned roles for the nodes.

#### Story 2

As a cluster admin/user, for the purpose of workload placement, I would like a declarative and secure means by which I can add/remove labels on groups of nodes. For example, I want to run my outward-facing nginx pods on a specific set of nodes (e.g. md-secure-zone) to reduce exposure to the company's publicly trusted server certificates.

### Implementation Details/Notes/Constraints

#### Label propagation from MachineDeployment to Machine

We will utilize CAPI's existing [metadata propagation](https://cluster-api.sigs.k8s.io/developer/architecture/controllers/metadata-propagation.html?highlight=propagation#metadata-propagation) across core API resources. For MachineDeployments, template labels propagate to the MachineSet's top-level metadata and to the MachineSet's template metadata:
```
MachineDeployment.spec.template.metadata.labels => MachineSet.labels, MachineSet.spec.template.metadata.labels
```
Similarly, on MachineSet, template labels propagate to Machines:
```
MachineSet.spec.template.metadata.labels => Machine.labels
```
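
As a sketch of where labels might be specified under this proposal (the names below are hypothetical), a label placed in the MachineDeployment's template metadata would propagate down to the Machine and, from there, to the Node:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: md-secure-zone              # hypothetical name
spec:
  clusterName: my-cluster           # hypothetical cluster
  template:
    metadata:
      labels:
        # Propagated to MachineSet and Machine via the existing metadata
        # propagation; per this proposal it is then kept in sync on the Node.
        node.cluster.x-k8s.io/secure-zone: "true"
        # Applied to the Node only once, when the Machine/Node are created.
        node-role.kubernetes.io/worker: ""
    # (bootstrap and infrastructure template references omitted for brevity)
```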

As of this writing, changing `MachineDeployment.spec.template.metadata.labels` will trigger a rollout. This is undesirable. CAPI will be updated to instead ignore updates of labels that fall within the CAPI prefix. There is precedent for this in the handling of `MachineDeploymentUniqueLabel`.

**Review thread on the paragraph above (resolved):**

- **Reviewer (Member), suggested change:** "As of this writing, changing the `MachineDeployment.spec.template.metadata.labels` will trigger a rollout. This is undesirable. CAPI will be updated to instead propagate updates of labels in-place without a rollout. Similar changes will be made to KCP." I would suggest propagating all labels in-place; otherwise we will end up in a weird state where changes to some labels trigger a MD/MS rollout while others are updated in-place (it should be simpler to implement as well).
- **Reviewer (Member):** +1. That said, this is a breaking behavioral change which needs proper release documentation and a new minor release; in addition, the changes to the MachineDeployment logic might be extensive.
- **@sbueringer (Jun 9, 2022):** Yeah, which is why I favored a separate field, but that decision has been made now :) Just to confirm: are we fine with breaking behavior changes to our API fields as long as we do them in new minor releases? (My impression was that this requires a new apiVersion.) P.S. regarding the MD changes: the real fun part is probably to stop considering labels for the hash we're using while not triggering an MD rollout when the new CAPI version with this behavior is rolled out.
- **Reviewer (Member):** The API fields wouldn't need to change, but the propagation behavior of those fields would change, which is something that needs to be documented. The same problem applies if we add a new field to the Machine template; we'd need a way to make sure the generated hash is backward compatible. If that's easier, though, let's just add a new field?
- **@sbueringer (Jun 9, 2022):** Okay, good to know that we don't consider this a breaking change :) I think if it's a separate field we just never have to include it in the hash, and we don't have to figure out how to remove something from the hash calculation without rollouts. Probably?
- **Reviewer (Member):** Between the two solutions proposed above, the second one seems the most sane from a user perspective. If we also think that in the fullness of time we'd want to expand the in-place node fields, this might be a good start. That said, the biggest issue I see with the above is that we're relying on a hashing mechanism that's flaky and fragile. Have we considered breaking the hashing first, before introducing new changes that might affect future rollouts and extensibility points?
- **@sbueringer (Jun 10, 2022):** Good point. I don't think we went that far in thinking about how to implement this proposal, but we should try to reach a consensus relatively soon; we've been going back and forth on the field discussion for a while. Maybe the following is a path forward: (1) pick the option that labels live in a separate `node.labels` field (it was basically a 50:50 decision in #6255 (comment), but it sounds like we now prefer a separate field); (2) implement the sync from Machine to Nodes; (3) figure out how to update `node.labels` in-place in MD; (4) figure out how to update `node.labels` in-place in KCP. I think the feature would already have a big benefit after (2), even without in-place updates, as it at least allows propagating labels to Nodes at all, which isn't possible right now. Then we can continue with (3), do POCs, and figure out the ideal way to implement in-place updates.
- **Reviewer (Member):** Damn :/ I forgot that adding the field will already lead to a rollout on CAPI upgrade if we don't change the hashing, so we definitely have to figure that out for (2) already. But if the general consensus is that we prefer the new field anyway (and we have the hash problem in both cases), we can at least move the proposal forward and then experiment during implementation on how to actually do it.
- **@Arvinderpal (author):** @sbueringer Thanks for pointing out the issue! Introducing a new field is fine by me! I don't believe we have any objections, besides maybe @fabriziopandini :) Should we also consider replacing the DeepCopy with something that includes only specific fields (i.e. excludes our new field)? Also, if I'm not mistaken, wouldn't introducing a new field always cause a different hash to be generated, even if that new field is nil? So when people upgrade there will always be a rollout; is this behavior considered okay for a breaking change?
- **@sbueringer (Jun 14, 2022):** I think as long as we are using a changed struct the hash will change. We have a few options to solve this: (1) "freeze" the struct by creating another one which produces the same output; (2) hack the way we write the struct into the hasher (e.g. by just dropping parts of the string); (3) replace Spew with something which writes a compatible output but allows ignoring fields; (4) replace the entire hash mechanism with something else. There might be other alternatives as well, but personally I would be fine with experimenting during implementation. Making it a separate field has a few advantages: the hash issue is probably simpler to solve (if we keep the hash mechanism, we don't have to figure out how to include only pre-existing labels in the hash and ignore new ones), and it is no longer a behavioral change of the label field. Regarding "when people upgrade, there will always be a rollout?": up until now we have tried to avoid that at all costs (we even have an e2e test which validates it).

#### Synchronization of CAPI Labels

Labels that fall under the CAPI-specific prefix `node.cluster.x-k8s.io/*` will need to be kept in sync between the Machine and Node objects. Synchronization must handle these three scenarios:

- (A) A label is added/removed on a Machine and must be added/removed on the corresponding Node.
- (B) A label is added/removed directly on the Node and must be removed/re-added on the Node to bring it back in sync with the labels on the Machine.
- (C) A label value is changed on the Machine or the Node, and the Node must be brought in sync with the value on the Machine.


#### Machine Controller Changes

On each Machine reconcile, the machine controller will fetch the corresponding Node (i.e. the Node referenced by `Machine.Status.NodeRef`). If the Machine has labels specified, the controller will ensure those labels are applied to the Node.

Similarly, the machine controller today watches Nodes within the workload cluster, and changes to a Node trigger a reconcile of its corresponding Machine object. This should allow us to capture changes made directly to the labels on the Node (e.g. a user accidentally deletes a label that falls under CAPI's managed prefix).

#### One-time Apply of Standard Kubernetes Labels

The machine controller will apply the standard Kubernetes labels, if specified, to the Node immediately after the Node enters the Ready state (i.e. once the ProviderID is set). We will enforce this one-time application by making the labels immutable via a validating webhook.

#### Delay between Node Create and Label Sync

The Node object is first created when kubeadm joins a node to the workload cluster (i.e. kubelet is up and running). There may be a delay (potentially several seconds) before the machine controller kicks in to apply the labels on the Node.

Kubernetes supports both equality and inequality requirements in label selection. In an equality-based selection, the user wants to place a workload on node(s) matching a specific label (e.g. Node.Labels contains `my.prefix/foo=bar`). The delay in applying the label on the node may cause a subsequent delay in the placement of the workload, but this is likely acceptable.

In an inequality-based selection, the user wants to place a workload on node(s) that do not contain a specific label (e.g. Node.Labels does not contain `my.prefix/foo=bar`). This case is potentially problematic because it relies on the absence of a label, which the pod scheduler may observe incorrectly if it runs during the delay interval (see the sketch below).
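
As a sketch of the problematic case (pod and label names are hypothetical), a placement rule that depends on the absence of a label would wrongly match a node whose labels have not been synced yet:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sensitive-workload          # hypothetical pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              # Only schedule on nodes that do NOT carry the label; a freshly
              # joined node whose labels are still pending would qualify.
              - key: my.prefix/foo
                operator: DoesNotExist
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9   # placeholder image
```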

One way to address this is to use kubelet's `--register-with-taints` flag. Newly minted nodes can be tainted with `cluster.x-k8s.io=label-sync-pending:NoSchedule`. Assuming workloads don't have a matching toleration, nothing will be scheduled on the node until the taint is removed. KubeadmConfigTemplate provides the means to set taints on nodes (see `JoinConfiguration.NodeRegistration.Taints`).
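
A sketch of how a user could set that taint via `JoinConfiguration.NodeRegistration.Taints` (the template name is hypothetical):

```yaml
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata:
  name: md-secure-zone-config       # hypothetical name
spec:
  template:
    spec:
      joinConfiguration:
        nodeRegistration:
          taints:
            # Keeps workloads without a matching toleration off the node until
            # the machine controller has applied the labels and removed the taint.
            - key: cluster.x-k8s.io
              value: label-sync-pending
              effect: NoSchedule
```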

The process of tainting the nodes can be carried out by the user and documented as follows:

```
If you utilize inequality-based selection for workload placement, to prevent unintended scheduling of pods during the initial node startup phase, it is recommended that you specify the following taint in your KubeadmConfigTemplate:
`cluster.x-k8s.io=label-sync-pending:NoSchedule`
```

**Review comment (@sbueringer, Jun 9, 2022), suggested change:** `cluster.x-k8s.io/label-sync-pending:NoSchedule`, assuming `key:effect` (same for other occurrences).

After the node has come up and the machine controller has applied the labels, the machine controller will also remove this specific taint if it's present.

In the future, we may consider automating the insertion of the taint via CABPK.

## Alternatives

### Label propagation from MachineDeployment to Machine

Option #1: Introduce a new field `Spec.Node.Labels`

Instead of using the top-level metadata, a new field would be introduced:

```
MachineDeployment.Spec.Node.Labels => MachineSet.Spec.Node.Labels => Machine.Spec.Node.Labels
```

The benefit is that it provides a clear indication to the user that these labels will be synced to the Kubernetes Node(s). This will, however, require API changes.
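
A hypothetical sketch of what Option #1 might look like on a MachineDeployment (the field shape shown here is illustrative only and not part of the current API):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: md-secure-zone          # hypothetical name
spec:
  clusterName: my-cluster       # hypothetical cluster
  node:
    labels:
      # Synced to the Kubernetes Node; per the review discussion, this field
      # could be excluded from the rollout-triggering hash.
      node.cluster.x-k8s.io/secure-zone: "true"
  # (template and other MachineDeployment fields omitted for brevity)
```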

Option #2: Propagate labels in the top-level metadata


We could propagate labels in the top-level metadata. These are, however, not currently propagated. One option is to propagate only top-level labels matching the aforementioned prefixes. The propagation would follow a similar path to that described earlier:

```
MachineDeployment.labels => MachineSet.labels => Machine.labels
```

## Implementation History

- [ ] MM/DD/YYYY: Proposed idea in an issue or [community meeting]