📖 Add In-place updates proposal #11029

Draft: wants to merge 10 commits into main

@g-gaston (Contributor) commented Aug 7, 2024

What this PR does / why we need it:
Proposal doc for In-place updates written by the In-place updates feature group.

Starting this as a draft to collect early feedback on the main ideas and high level flow. APIs and some other lower level details are left purposefully as TODOs to focus the conversation on the rest of the doc, speed up consensus and avoid rework.

Fixes #9489

/area documentation

@k8s-ci-robot (Contributor)

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. area/documentation Issues or PRs related to documentation cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 7, 2024
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign enxebre for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 7, 2024
@g-gaston g-gaston force-pushed the in-place-updates-proposal branch from c77a225 to be97dc6 Compare August 7, 2024 16:55
@neolit123 (Member) left a comment

thanks for the write up.
i left some comments, but i did not go in detailed review on the controller interaction (diagrams) part.


An External Update Extension implementing custom update strategies will report the subset of changes it knows how to perform. Cluster API will orchestrate the different extensions, polling the update progress from them.

If the totality of the required changes cannot be covered by the defined extensions, Cluster API will allow falling back to the current behavior (rolling update).
Member

i think this might be only what some users want. IMO, if an in-place update fails, it should fail and give the signal for it. there could be a "fallback" option with default value "false", but it also opens up some questions: what if the external update tampered with objects in a way that the fallback is no longer possible? i think that in-place upgrades should be a "hard toggle", i.e. it's either replace or in-place. no fallbacks from CAPI's perspective.

Member

The logic can also use fallback scenario in case of timeout or some general condition. It might not scale well with multiple upgraders, but having options here would seem beneficial.

Since the changes are constrained on the single machine, machine replace should still work?

@mogliang mogliang Aug 30, 2024

what if the external update tampered with objects in a way that the fallback is no longer possible

Do you mean there is (or will be) a case that an external update can handle but a rollout update can't? If that happens, we can introduce some verification logic to determine whether it can fall back.

Contributor

I'd be in favor of having a possibility to disable fallback to rollout updates. In some cases, users would want only certain fields to be handled in-place, for example, instance tags; if any other fields were changed, it should be ok to do a rollout update.

Contributor Author

@neolit123 A couple of clarifications:

  • The fallback strategy is not meant for the scenario where the in-place update starts and fails. In this case, the update will remain in "failed" state until either the user manually intervenes or remediation (if configured) kicks in and deletes the failed machine. The fallback strategy is meant for when the external updaters cannot handle the desired update. In other words, when capi detects the need for an update, it queries the external updaters and decides to either start an in-place update or a rolling update (fallback strategy). But once it makes that decision and the update starts, it doesn't switch strategies.
  • We were thinking that the fallback strategy would be optional. TBD if opt-in or opt-out, pending the discussion on the API changes.
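
To make the two points above concrete, here is a minimal Go sketch, for illustration only. The API is still a TODO in the proposal, so every name here (`Strategy`, `chooseStrategy`, `update`, the `Blocked` outcome) is hypothetical. It only models the idea that the strategy is picked once, before the update starts, and that a later failure does not change it.

```go
package main

import "fmt"

// Strategy is a hypothetical name for the outcome of the one-time decision.
type Strategy string

const (
	InPlace       Strategy = "InPlace"       // external updaters cover every change
	RollingUpdate Strategy = "RollingUpdate" // optional fallback: replace the machines
	Blocked       Strategy = "Blocked"       // no fallback configured; surfaced in status
)

// chooseStrategy runs once, before any update starts.
// uncoveredChanges is the number of desired changes no external updater can handle.
func chooseStrategy(uncoveredChanges int, fallbackEnabled bool) Strategy {
	switch {
	case uncoveredChanges == 0:
		return InPlace
	case fallbackEnabled:
		return RollingUpdate
	default:
		return Blocked
	}
}

// update records the chosen strategy and the current phase of one machine update.
type update struct {
	strategy Strategy
	phase    string
}

// fail marks the update as failed without revisiting the strategy:
// a failed in-place update waits for the user or for remediation.
func (u *update) fail() { u.phase = "Failed" }

func main() {
	u := &update{strategy: chooseStrategy(0, true), phase: "InProgress"}
	u.fail()
	fmt.Println(u.strategy, u.phase)
	fmt.Println(chooseStrategy(2, true), chooseStrategy(2, false))
}
```

Note that `fail` never touches `strategy`, even when the fallback is enabled; that is the "doesn't switch strategies" behavior described above.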

docs/proposals/20240807-in-place-updates.md

As this proposal is an output of the In-place updates Feature Group, ensuring that the rollout extension allows the implementation of in-place rollout strategies is considered a non-negotiable goal of this effort.

Please note that the practical consequence of focusing on in-place rollout strategies is that the possibility to implement different types of custom rollout strategies, even if technically possible, won’t be validated in this first iteration (future goal).
Member

by 'validated', do you mean something CAPI will maintain e2e tests for?
i would think there could be some community owned e2e tests for this.

Member

@neolit123 can you take a look at "Test Plan" section at the end of the proposal? The initial plan was to have it in CAPI CI

Contributor Author

What this paragraph tries to say is that although the concept of "external updater" theoretically allows to implement different types of update strategies (other than in-place), our focus here is to ensure that it can be used to implement in-place updates and that's what we will validate.


### Non-Goals

- To provide rollbacks in case of an in-place update failure. Failed updates need to be fixed manually by the user on the machine or by replacing the machine.
Member

to the earlier points, if in-place fails, how would the controllers know to leave it to the user for a manual fix vs rollout the machine?

Contributor Author

@g-gaston g-gaston Sep 10, 2024

Controllers will never rollout the machine in case of in-place update failure. At most, MHC might mark the machine for remediation. But that's a separate process.

Member

To @neolit123's point, this should be configurable; not everyone will want to fall back.

Contributor Author

Yes, the idea is for both the update fallback strategy and MHC remediation (already is) to be optional

end
mach->>apiserver: Mark Machine as updated
end
```
Member

i think the diagram is missing the feedback signal from external updater to CAPI controllers whether the update has passed and what is the follow up for them?

Contributor Author

Yeah, that's correct. This is a high level flow that simplifies certain things. The idea is to help get a high level understanding of the flow with subsequent sections digging into the details of each part of the flow.


If this set is reduced to zero, then CAPI will determine that the update can be performed using the external strategy. CAPI will define the update plan as a list of sequential external updaters in a particular order and proceed to execute it. The update plan will be stored in the Machine object as an array of strings (the names of the selected external updaters).

If, after iterating over all external updaters, the remaining set still contains uncovered changes, CAPI will determine the desired state cannot be reached through external updaters. If a fallback rolling update strategy has been configured (this is optional), CAPI will replace the machines. If no fallback strategy is configured, we will surface the issue in the resource status. Machines will remain unchanged and the desired state won't be reached unless remediated by the user. Depending on the scenario, users can amend the desired state to something that the registered updaters can cover, register additional updaters capable of handling the desired changes, or simply enable the fallback strategy.
Member

How can the order of external upgraders be defined? There will be implicit requirements that make them dependent on each other.

Since the idea is to iterate over an array of upgraders, this should support multiple iterations and a more clever mechanism than subtraction. One iteration will not be enough to mark the desired state unreachable.

Contributor Author

@g-gaston g-gaston Aug 13, 2024

With the current proposal, updaters need to be independent in order to be scheduled in the same upgrade plan. Updaters just look at the set of required changes and tell capi what is the subset of changes they can take care of. And they need to be capable of updating those fields regardless of how many other updaters are scheduled and no matter if they run before or after.

If for some reason an updater needs certain fields to be updated first before being able to execute its update, then two update plans will be needed, hence the change would need to be performed by the user in two phases.

We could (probably in future iterations) add a "priority" property to the updaters that would help order updaters when they have overlapping functions. However this would be a global priority and not relative between updaters.

Now, all that said, this is what we are proposing, which might not cover all use cases. Do you have a particular use case where order matters and updaters must be dependent on each other?


Both `KCP` and `MachineDeployment` controllers follow a similar pattern around updates: they first detect if an update is required and then, based on the configured strategy, follow the appropriate update logic (note that today there is only one valid strategy, `RollingUpdate`).

With `ExternalUpdate` strategy, CAPI controllers will compute the set of desired changes and iterate over the registered external updaters, requesting through the Runtime Hook the set of changes each updater can handle. The changes supported by an updater can be the complete set of desired changes, a subset of them or an empty set, signaling it cannot handle any of the desired changes.
Member

If we're falling back to rolling update, to @neolit123's point, it doesn't make sense to me that ExternalUpdate is a rollout strategy on its own, but rather it should be a field, or set of fields within rolling update that control its behavior?

Note that technically, a rolling update doesn't have to be a replace operation; it can be done in place, so imo it can be expanded.

Contributor Author

That's an interesting point. I'm not against representing external updates as a subtype of the rolling update strategy. You are right that, in what we are proposing here, CAPI is following a rolling update process, except it delegates the machine update instead of replacing the machine by itself. But capi orchestrates the rolling process.

As long as we can represent the fallback as optional, I'm ok with this if folks think it makes more sense.


CAPI expects the `/UpdateMachine` endpoint of an updater to be idempotent: for the same Machine with the same spec, the endpoint can be called any number of times (before and after it completes), and the end result should be the same. CAPI guarantees that once an `/UpdateMachine` endpoint has been called once, it won't change the Machine spec until the update reaches a terminal state.

Once the update completes, the Machine controller will remove the name of the updater that has finished from the list of updaters and will start the next one. If the update fails, this will be reflected in the Machine status.
Member

It sounds like we're tracking state and keeping this state in the Machine controller itself. This is usually a common source of issues given that the state can drift from reality. Have we considered having the set of hooks only ever present on the MachineDeployment object, with the Machine object only containing its status, so that every updater has to be 1) re-entrant and 2) track where it "left off"?

This way, the status can be calculated from scratch at every iteration, rather than rely on sync calls and other means of strict operations.

Contributor Author

I don't think I follow. What state are you referring to? The list of updaters to be run?

Answering your other question, yeah we opted to have the set of hooks at the Machine level because that allows to reuse the same mechanism for both KCP and MD machines.

Regarding re-entrance for updaters: yeah, that is the idea here (it might need more clarification in the doc). CAPI will continue to call the /UpdateMachine endpoint of an updater until it returns either success or failure. It's up to the updater to track the "update progress". Or maybe I didn't understand your comment correctly?
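
To illustrate what "re-entrant" could look like, here is a rough Go sketch, not a proposed implementation. It assumes an in-memory updater that keys its progress by machine name plus a spec hash; `updater`, `Status`, `pollUntilTerminal` and the step counter are all made up for the example, and a real updater would be reached over the Runtime Hook rather than by a direct method call.

```go
package main

import "fmt"

// Status is a hypothetical result of one /UpdateMachine call.
type Status string

const (
	InProgress Status = "InProgress"
	Done       Status = "Done"
	Failed     Status = "Failed"
)

// updater tracks its own progress per machine and spec, so the caller
// does not need to remember where the update left off.
type updater struct {
	stepsNeeded int            // amount of work one update takes (illustrative)
	progress    map[string]int // "machine/specHash" -> steps completed
}

// UpdateMachine advances the update by at most one step per call.
// Calling it again after completion returns Done without redoing work.
func (u *updater) UpdateMachine(machine, specHash string) Status {
	key := machine + "/" + specHash
	if u.progress[key] >= u.stepsNeeded {
		return Done
	}
	u.progress[key]++
	if u.progress[key] >= u.stepsNeeded {
		return Done
	}
	return InProgress
}

// pollUntilTerminal plays the role of the caller: it keeps calling the
// endpoint until a terminal status comes back or maxCalls is exhausted.
func pollUntilTerminal(u *updater, machine, specHash string, maxCalls int) (Status, int) {
	for calls := 1; calls <= maxCalls; calls++ {
		if s := u.UpdateMachine(machine, specHash); s != InProgress {
			return s, calls
		}
	}
	return InProgress, maxCalls
}

func main() {
	u := &updater{stepsNeeded: 3, progress: map[string]int{}}
	status, calls := pollUntilTerminal(u, "machine-a", "spec-v2", 10)
	fmt.Println(status, calls)
	// A late, repeated call is harmless: same answer, no extra work.
	fmt.Println(u.UpdateMachine("machine-a", "spec-v2"))
}
```

Because the progress lives in the updater, the caller can be restarted at any point and simply resume polling.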

Member

From my understanding reading the proposal, it sounds like we're building a plan and tracking it in the Machine spec itself, which can be error prone; I'd suggest instead finding an approach that's ultimately declarative: declare the plan somewhere else and reflect the status of that plan in the Machine status.

Contributor Author

Ah I understand now, thanks for the clarification. Yeah this sounds very reasonable, let me give it a thought and I'll come back to it.

* A way to define different rules for Machines undergoing an update. This might involve new fields in the MHC object. We will decouple these API changes from this proposal. For the first implementation of in-place updates, we might decide to just disable remediation for Machines that are undergoing an update.


### API Changes
Member

It'd be great to start the proposal with an example of how we envision the end state to look, from defining the state to an in-depth example with KCP, the kubeadm bootstrap provider, and an example infra provider (like AWS or similar).

Contributor Author

Are you suggesting we do this before we define the API changes? or as part of that work?

We purposefully left the API design for later so we can focus the conversation on the core ideas and high level flow and make sure we are aligned there first.

Member

Without seeing the API changes we're proposing it's generally hard to grasp the high level concept. I would like to see it from a user/operator perspective:

  • How will we setup this feature in yaml?
  • What are the required pieces that we need to install?
  • Are there any assumptions we're making?

Contributor Author

Added a section with examples with what we believe are 3 of the most common scenarios.


We propose a pluggable update strategy architecture that allows External Update Extensions to handle the update process. The design decouples core CAPI controllers from the specific extension implementation responsible for updating a machine. The External Update Strategy will be configured by reusing the existing strategy field in KCP and MD resources and introducing a new type of strategy called `ExternalUpdate`. This allows us to provide a consistent user experience: the interaction with the CAPI resources is the same as in rolling updates.

This proposal introduces a Lifecycle Hook named `ExternalUpdate` for communication between CAPI and external update implementers. Multiple external updaters can be registered, each of them only covering a subset of machine changes. The CAPI controllers will ask the external updaters what kind of changes they can handle and, based on the response, compose and orchestrate them to achieve the desired state.
Member

The proposal is missing details on how the external updater logic would work, and how the "kind of changes they can handle" is determined. How is that going to work?

Member

I think it'd be good for the proposal to include a reference external updater implementation and shape around one common/trivial driving use case, e.g. performing an in-place rolling update of the kubernetes version for a pool of Nodes. Then we can grasp and discuss design implications for RBAC, drain...

Contributor

@enxebre In the 'test plan' section we mention a "CAPD Kubeadm Updater", which will be a reference implementation and also used for testing.

Contributor Author

@vincepri

What do you mean by "how is that going to work?"? Are you referring to how the external updater knows what the desired changes are? Or how the external updater computes what changes it can perform and what changes it can't?

Trying to give a generic answer here, the external updater will receive something like "current state" and "desired state" for a particular machine (including machine, infra machine and bootstrap) in the CanUpdateRequest. Then it will respond with something like an array of fields for those objects (kubeadmconfig -> ["spec.files", "spec.mounts", "spec.files"]), which would signal the subset of fields that it can update.
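
As a rough illustration of that exchange, here is a Go sketch. Only the name `CanUpdateRequest` comes from this thread; `CanUpdateResponse`, `ObjectState`, the flattened field map, the JSON tags and the `canUpdate` helper are assumptions made up for the example, not the proposed API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ObjectState is a hypothetical flattened view of one object (Machine,
// infra machine or bootstrap config): field path -> serialized value.
type ObjectState struct {
	Kind   string            `json:"kind"`
	Fields map[string]string `json:"fields"`
}

// CanUpdateRequest carries the current and desired state for one machine.
// Only the type name comes from the discussion; the fields are assumptions.
type CanUpdateRequest struct {
	Current []ObjectState `json:"current"`
	Desired []ObjectState `json:"desired"`
}

// CanUpdateResponse lists, per object kind, the changed fields the updater can apply.
type CanUpdateResponse struct {
	Supported map[string][]string `json:"supported"`
}

// canUpdate reports which changed fields fall inside what this updater handles.
// handled maps an object kind to the field paths the updater knows how to change.
func canUpdate(req CanUpdateRequest, handled map[string][]string) CanUpdateResponse {
	current := map[string]map[string]string{}
	for _, obj := range req.Current {
		current[obj.Kind] = obj.Fields
	}
	resp := CanUpdateResponse{Supported: map[string][]string{}}
	for _, want := range req.Desired {
		for _, path := range handled[want.Kind] {
			// A field counts as changed when current and desired values differ.
			if current[want.Kind][path] != want.Fields[path] {
				resp.Supported[want.Kind] = append(resp.Supported[want.Kind], path)
			}
		}
	}
	return resp
}

func main() {
	req := CanUpdateRequest{
		Current: []ObjectState{{Kind: "KubeadmConfig", Fields: map[string]string{"spec.files": "a", "spec.mounts": "m"}}},
		Desired: []ObjectState{{Kind: "KubeadmConfig", Fields: map[string]string{"spec.files": "b", "spec.mounts": "m"}}},
	}
	resp := canUpdate(req, map[string][]string{"KubeadmConfig": {"spec.files", "spec.mounts"}})
	out, _ := json.Marshal(resp)
	fmt.Println(string(out))
}
```

In this sketch the updater only claims fields that both changed and are in its own list, so an unchanged `spec.mounts` is not reported.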

Contributor Author

@enxebre
The idea of opening the draft at this stage for review is to get feedback on the core ideas and high level flow before we invest more time on this direction. Unless you think that a reference implementation is necessary to have these discussions, I would prefer to avoid that work.

That said, I totally get that it's possible that the lack of detail in certain areas is making it difficult to have the high level discussion. If that's the case, we are happy to add that detail wherever needed.

Member

Trying to give a generic answer here, the external updater will receive something like "current state" and "desired state" for a particular machine (including machine, infra machine and bootstrap) in the CanUpdateRequest. Then it will respond with something like an array of fields for those objects (kubeadmconfig -> ["spec.files", "spec.mounts", "spec.files"]), which would signal the subset of fields that it can update.

These details must be part of the proposal. The details on how the entire flow goes, from MachineDeployment, to the external request, back to the Machine, and reflecting status, are not present, which makes it hard to understand how the technical flow will go and/or propose alternative solutions.


* More efficient updates (multiple instances) that don't require re-bootstrap. Re-bootstrapping a bare metal machine takes ~10-15 mins on average. Speed matters when you have 100s - 1000s of nodes to upgrade. For a common telco RAN use case, users can have 30000-ish nodes. Depending on the parallelism, that could take days / weeks to upgrade because of the re-bootstrap time.
* Single node cluster without extra hardware available.
* `TODO: looking for more real life usecases here`
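
A back-of-the-envelope check of the "days / weeks" claim in the first bullet, as a small Go sketch. The batch model (machines replaced in fixed-size batches, each batch taking one re-bootstrap time) and the parallelism values are assumptions for illustration; only the 30000 nodes and the 10-15 minute range come from the text.

```go
package main

import (
	"fmt"
	"time"
)

// rolloutDuration estimates the total time when machines are replaced in
// batches of `parallelism`, each batch taking one re-bootstrap time.
func rolloutDuration(nodes, parallelism int, perNode time.Duration) time.Duration {
	batches := (nodes + parallelism - 1) / parallelism // ceiling division
	return time.Duration(batches) * perNode
}

func main() {
	perNode := 12*time.Minute + 30*time.Second // midpoint of the quoted 10-15 minutes
	for _, p := range []int{10, 20, 100} {     // assumed parallelism levels
		d := rolloutDuration(30000, p, perNode)
		fmt.Printf("parallelism %d: %.1f days\n", p, d.Hours()/24)
	}
}
```

With a parallelism of 20, for example, that is 1500 batches of 12.5 minutes, or roughly 13 days.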
Contributor

can we include certificate rotation in the use case?

Contributor Author

@g-gaston g-gaston Sep 10, 2024

That's a great usecase. However, I'm not sure if we should add it because what we have in this doc doesn't really solve that problem.

The abstractions/ideas we present can totally be used for cert rotation. However, what we have only covers changes triggered by updates to the KCP/MD specs. If I'm not mistaken, in-place cert rotation would be a separate process, similar to what capi does today, where the expiration date of certs is tracked in the background and handled separately from machine rollouts.

Opinions?

@g-gaston g-gaston force-pushed the in-place-updates-proposal branch from 5eb6664 to 472a336 Compare September 10, 2024 20:48
@t-lo (Contributor) commented Sep 12, 2024

Hey folks 👋

@g-gaston Dropping by from the Flatcar Container Linux project - we're a container optimised Linux distro; we joined the CNCF a few weeks ago (incubating).

We've been driving implementation spikes of in-place OS and Kubernetes updates in ClusterAPI for some time - at the OS level. Your proposal looks great from our point of view.

While progress has been slower in the recent months due to project resource constraints, Flatcar has working proof-of-concept implementations for both in-place updating the OS and Kubernetes - independently. Our implementation is near production ready on the OS level, update activation can be coordinated via kured, and the worker cluster control plane picks up the correct versions. We do lack any signalling to the management cluster as well as more advanced features like coordinated roll-backs (though this would be easy to implement on the OS level).

In theory, our approach of in-place Kubernetes updates is distro agnostic (given the "mutable sysext" changes in recent versions of systemd starting with release 256).

We presented our work in a CAPZ office hours call earlier this year: https://youtu.be/Fpn-E9832UQ?feature=shared&t=164 (slide deck: https://drive.google.com/file/d/1MfBQcRvGHsb-xNU3g_MqvY4haNJl-WY2/view).

We hope our work can provide some insights that help to further flesh out this proposal. Happy to chat if folks are interested.

(CC: @tormath1 for visibility)

EDIT after initial feedback from @neolit123 : in-place updates of Kubernetes in CAPI are in "proof of concept" stage. Just using sysexts to ship Kubernetes (with and without CAPI) has been in production on (at least) Flatcar for quite some time. Several CAPI providers (OpenStack, Linode) use sysexts as preferred mechanism for Flatcar worker nodes.

@neolit123 (Member) commented Sep 12, 2024

systemd-sysext

i don't think i've seen usage of sysext with k8s. its provisioning of image extensions seems like something users can do, but they might as well stick to the vanilla way of using the k8s package registries and employing update scripts for e.g. containerd.

the kubeadm upgrade docs just leverage the package manager upgrade path:
https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/

one concern that i have with systemd-sysext is that you still have an intermediate build process for the extension, while the k8s package build process is already done by the k8s release folks.

@t-lo (Contributor) commented Sep 12, 2024

On Flatcar, sysexts are the preferred way to run Kubernetes. "Packaging" is straightforward - create a filesystem from a subdirectory - and does not require any distro specific information. The resulting sysext can be used across many distros.

I'd argue that the overhead is negligible: download release binaries into a sub-directory and run mksquashfs. We might even evangelise sysext releases with k8s upstream if this is a continued concern.

Drawbacks of the package-manager-based process are:

  • intermediate state: no atomic updates, recovery required if update process fails
  • distro specific: needs to be re-implemented for every distro
  • no easy roll-back: going back to a previous version (e.g. because a new release causes issues with user workloads) is complicated and risky (again, intermediate state)

Sysexts are already used by the ClusterAPI OpenStack and the Linode providers with Flatcar (though without in-place updates).


If this set is reduced to zero, then CAPI will determine that the update can be performed using the external strategy. CAPI will define the update plan as a list of sequential external updaters in a particular order and proceed to execute it. The update plan will be stored in the Machine object as an array of strings (the names of the selected external updaters).

If, after iterating over all external updaters, the remaining set still contains uncovered changes, CAPI will determine the desired state cannot be reached through external updaters. If a fallback rolling update strategy has been configured (this is optional), CAPI will replace the machines. If no fallback strategy is configured, we will surface the issue in the resource status. Machines will remain unchanged and the desired state won't be reached unless remediated by the user. Depending on the scenario, users can amend the desired state to something that the registered updaters can cover, register additional updaters capable of handling the desired changes, or simply enable the fallback strategy.
Member

If after iterating over all external updaters the remaining set still contains uncovered changes

How do we envision this to take place? Diffing each field?

Contributor Author

I envision something like this: capi would generate a set with all the fields that are changing for an object by diffing the current state with the desired state. Then, as it iterates over the updaters, it would remove fields from the set. If it finishes iterating over the updaters and there are still fields left in the set, then the update can't be performed in-place.
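
A minimal Go sketch of that set-subtraction idea, for illustration only. The function name `planUpdate`, the updater names and the static `supports` map are hypothetical; in the proposal each updater would be asked through the Runtime Hook rather than looked up in a map.

```go
package main

import (
	"fmt"
	"sort"
)

// planUpdate subtracts the fields each updater can handle from the set of
// changed fields. It returns the updaters that claimed at least one field
// (the update plan) and whatever is left uncovered.
func planUpdate(changed, updaters []string, supports map[string][]string) (plan, uncovered []string) {
	remaining := map[string]bool{}
	for _, field := range changed {
		remaining[field] = true
	}
	for _, name := range updaters {
		if len(remaining) == 0 {
			break // everything is covered; no need to ask further updaters
		}
		claimed := false
		for _, field := range supports[name] {
			if remaining[field] {
				delete(remaining, field)
				claimed = true
			}
		}
		if claimed {
			plan = append(plan, name)
		}
	}
	for field := range remaining {
		uncovered = append(uncovered, field)
	}
	sort.Strings(uncovered) // map iteration order is random; sort for stable output
	return plan, uncovered
}

func main() {
	changed := []string{"spec.version", "spec.files", "spec.instanceType"}
	updaters := []string{"kubeadm-updater", "tags-updater"}
	supports := map[string][]string{
		"kubeadm-updater": {"spec.version", "spec.files"},
		"tags-updater":    {"spec.tags"},
	}
	plan, uncovered := planUpdate(changed, updaters, supports)
	fmt.Println("plan:", plan)
	fmt.Println("uncovered:", uncovered)
}
```

An empty `uncovered` slice means the update can be performed in-place with the returned plan; a non-empty one means it can't, as described above.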

@neolit123 (Member)

On Flatcar, sysexts are the preferred way to run Kubernetes. "Packaging" is straightforward - create a filesystem from a subdirectory - and does not require any distro specific information. The resulting sysext can be used across many distros.

the kubeadm and kubelet systemd drop-in files (in the official k8s packages) have some distro specific nuances like Debian vs RedHat paths. are sysexts capable of managing different drop-in files if the target distro is different, perhaps even detecting that automatically?

@t-lo (Contributor) commented Sep 12, 2024

the kubeadm and kubelet systemd drop-in files (in the official k8s packages) have some distro specific nuances like Debian vs RedHat paths. are sysexts capable of managing different drop-in files if the target distro is different, perhaps even detecting that automatically?

Sysexts focus on shipping application bits (Kubernetes in the case at hand); configuration is usually supplied by separate means. That said, a complementary image-based configuration mechanism ("confext") exists for /etc. Both approaches have their pros and cons, I'd say it depends on the specifics (I'm not very familiar with kubeadm on Debian vs. Red Hat, I'm more of an OS person :) ). But this should by no means be a blocker.

(Sorry for the sysext nerd sniping. I think we should stick to the topic of this PR - I merely wanted to raise that we have a working PoC of in-place Kubernetes updates. Happy to discuss Kubernetes sysexts elsewhere)

@neolit123
Copy link
Member

Sysexts focus on shipping application bits (Kubernetes in the case at hand); configuration is usually supplied by separate means. That said, a complementary image-based configuration mechanism ("confext") exists for etc. Both approaches have their pros and cons, I'd say it depends on the specifics (I'm not very familiar with kubeadm on Debian vs. Red Hat, I'm more of an OS person :) ). But this should by no means be a blocker.

while the nuances between distros are subtle in the k8s packages, the drop-in files are critical. i won't argue if they are config or not, but if kubeadm and systemd are used, e.g. without /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf the kubelet / kubeadm integration breaks:
https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/kubelet-integration/#the-kubelet-drop-in-file-for-systemd

(Sorry for the sysext nerd sniping. I think we should stick to the topic of this PR - I merely wanted to raise that we have a working PoC of in-place Kubernetes updates. Happy to discuss Kubernetes sysexts elsewhere)

i think it's a useful POV. perhaps @g-gaston has comments on the sysext topic. although, this proposal is more about the CAPI integration of the in-place upgrade concept.

@t-lo
Contributor

t-lo commented Sep 12, 2024

while the nuances between distros are subtle in the k8s packages, the drop-in files are critical. i won't argue if they are config or not, but if kubeadm and systemd are used, e.g. without /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf the kubelet / kubeadm integration breaks: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/kubelet-integration/#the-kubelet-drop-in-file-for-systemd

Shipping this file in a sysext is straightforward. In fact, the kubernetes sysexts we publish in our "sysext bakery" include it.

i think it's a useful POV. perhaps @g-gaston has comments on the sysext topic. although, this proposal is more about the CAPI integration of the in-place upgrade concept.

That's what originally motivated me to speak up: the proposal appears to discuss the control plane "upper half" that our proof-of-concept implementation lacks. As stated, we're OS folks :) And we're very happy to see this get some traction.

@fabriziopandini
Member

@t-lo thanks for reaching out! really appreciated
I agree with your comment that this proposal is tackling one layer of the problem and your work another.

+1 from me to keep discussion on this PR focused on the first layer

But great to see things are moving for the Flatcar Container Linux project; let's make sure the design work that is happening here does not prevent using Flatcar's in-place upgrade capabilities (but at the same time, we should make sure it could work with other OSes as well, even the ones less "cloud native")

@daper

daper commented Sep 24, 2024

It would also be nice to ensure the process is compatible, or at least gears well, with talos.dev, which is managed completely by a set of controllers that expose just an API. This would be useful for single-node, long-lived clusters. As far as I have read, I see no complications for it yet.

@t-lo
Contributor

t-lo commented Sep 26, 2024

Hello folks,

We've briefly discussed systemd-sysext and its potential uses for ClusterAPI in the September 25, 2024 ClusterAPI meeting (https://docs.google.com/document/d/1GgFbaYs-H6J5HSQ6a7n4aKpk0nDLE2hgG2NSOM9YIRw/edit#heading=h.s6d5g3hqxxzt).

Summarising the points made here so you don't need to watch the recording 😉 . Let's wrap up the sysext discussion in this PR so we can get the focus back to in-place updates. If there's more interest in this technology from ClusterAPI folks I'm happy to have a separate discussion (here: #11227).

  1. systemd-sysext images are a distro-independent and vendor-independent way of shipping Kubernetes for ClusterAPI. While sysexts don't have much traction with CAPI providers at this time, they are supported by a wide range of distros and, with recent changes, have become feasible for general-purpose distros like Ubuntu (systemd 256 and above). Sysexts allow using stock distro images on vendor clouds, reducing CAPI operators' maintenance load (no custom-built, self-hosted images required).
    1. Sysexts are easy to adapt to non-systemd distros as they use basic Linux mechanisms ("glorified overlayfs mounts").
  2. systemd-sysupdate is a complementary service that allows integration of atomic in-place updates of Kubernetes. It is supported on a wide range of distros and also uses basic mechanisms like HTTPS endpoints, index files, and semver matching. It uses symlinks for staging / applying updates; rollback is possible by simply symlinking the previous release. Sysupdate is very easy to integrate with Kubernetes reboot managers like kured.

@mkjpryor

mkjpryor commented Nov 28, 2024

Would this mechanism as proposed allow me to do a node rebuild on clouds that support that, instead of a create/delete? I think from reading the proposal that the answer is yes, but I am not 100% certain...

Mainly, I am thinking about nodes in a bare-metal cloud using OpenStack Ironic (via Nova). We don't want to keep "spare" bare-metal nodes hanging around in order to be able to do an upgrade, and even if we did have a spare node the create/delete cycle would involve "cleaning" each node which can take a while - O(30m) - before it can be reprovisioned into the cluster. Cleaning is intended to make the node suitable for use with another tenant, so can include operations such as secure erase that are totally unnecessary when the node is being recycled back into the same tenant.

OpenStack supports a REBUILD operation on these hosts that basically re-images the node without having to do a delete/create, and I am hoping to use that in the future for these clusters potentially. The plan in this case would not necessarily be to update the Kubernetes components in place, but to trigger a rebuild of the node using a new image with updated Kubernetes components, and having the node rejoin the cluster without having to go through a cleaning cycle.

@g-gaston
Contributor Author

g-gaston commented Dec 2, 2024

Would this mechanism as proposed allow me to do a node rebuild on clouds that support that, instead of a create/delete? I think from reading the proposal that the answer is yes, but I am not 100% certain...

Mainly, I am thinking about nodes in a bare-metal cloud using OpenStack Ironic (via Nova). We don't want to keep "spare" bare-metal nodes hanging around in order to be able to do an upgrade, and even if we did have a spare node the create/delete cycle would involve "cleaning" each node which can take a while - O(30m) - before it can be reprovisioned into the cluster. Cleaning is intended to make the node suitable for use with another tenant, so can include operations such as secure erase that are totally unnecessary when the node is being recycled back into the same tenant.

OpenStack supports a REBUILD operation on these hosts that basically re-images the node without having to do a delete/create, and I am hoping to use that in the future for these clusters potentially. The plan in this case would not necessarily be to update the Kubernetes components in place, but to trigger a rebuild of the node using a new image with updated Kubernetes components, and having the node rejoin the cluster without having to go through a cleaning cycle.

Yes, that should be doable. That said, although I'm not familiar with the rebuild functionality, that sounds like something the infra provider could implement today without the in-place update functionality.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 5, 2024

## Motivation

Cluster API by default performs rollouts by deleting a machine and creating a new one.


Suggested change
- Cluster API by default performs rollouts by deleting a machine and creating a new one.
+ Cluster API by default performs rollouts by creating a new machine and deleting the old one.

Isn't the flow the other way around? This has implications, for example on bare metal, where you don't have a +1 spare machine to start your rollouts with, which is the reason why in-place updates would be needed.

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  name: md-1-2


The name should probably change here to indicate this is a new resource, not editing an existing one.

Labels
area/documentation Issues or PRs related to documentation cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Supporting an Inplace Update Rollout Strategy for upgrading Workload Clusters