📖 CAEP: machine deletion phase hooks #3132
Conversation
This seems like a reasonable solution to me. I like the opt-in nature and the flexibility it gives to users.
I do wonder about the extended interactions between a possibly large number of owning controllers and the ability of a user to reasonably debug issues. I realize some of this will be highly dependent on the complexity that people create, but it concerns me with respect to users inspecting CAPI artifacts for answers.
I left a couple of comments and nits inline.
/milestone v0.3.x
I think this is looking good, thanks for the updates.
> (pre-drain) As an operator, when I am deleting a control plane machine for
> maintenance reasons, I want to ensure my existing control plane machine is not
> drained until after my replacement comes online. This will prevent protracted
> periods of a missing control plane member if the replacement machine cannot be
The references here to a control plane member are confusing, especially since it is at odds with the existing work on integrating MachineHealthCheck remediation with KubeadmControlPlane. I think it would be better to genericize this user story to avoid specific mention of the control plane.
I'm fine with changing the wording, but I don't see how this is at odds with MHC. MHC can delete a machine, as normal. We just prevent the instance from being drained/terminated until the replacement comes online. To the MHC, this just looks like it's taking a long time to drain the machine; it doesn't need to do anything different.
I think when we were discussing the modifications to MHC for the KCP previously, those bits might be a temporary fix until something like this merges.
My concern is that the work that is in-flight (#3185) will block creation of the new Machine until the old Machine has been fully deleted. This is actually needed to avoid cases where scaling up the etcd cluster prior to fully removing the machine being remediated could lead to a loss of quorum.
For example, given a 3-replica Control Plane:
- 1 machine fails MHC in a way that also impacts the etcd member; we are now at 2/3 healthy etcd members and still have quorum
- If we attempt to scale up the Control Plane, then partway through the bootstrapping of the 4th replica the new member is added to etcd; at this point quorum is lost (2/4 members healthy)
- At this point one of two things can happen:
  - The new etcd member finishes syncing, quorum is restored, and the 4th replica finishes bootstrapping. Cluster-wide impact is relatively minimal.
  - The etcd member fails to sync and quorum loss persists. Cluster-wide services are impacted due to the read-only datastore.
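For reference, the quorum arithmetic behind this scenario is just etcd's standard majority rule; the formula below is background, not something defined in this proposal:

```latex
\operatorname{quorum}(n) = \left\lfloor \tfrac{n}{2} \right\rfloor + 1,
\qquad \operatorname{quorum}(3) = 2, \qquad \operatorname{quorum}(4) = 3
```

So with only 2 healthy members, a 3-member cluster still has quorum, but a 4-member cluster does not, which is why adding the 4th member before remediation completes is risky.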
> If we attempt to scale up the Control Plane, then partway through the bootstrapping of the 4th replica the new member is added to etcd; at this point quorum is lost (2/4 members healthy)
I have thought about this at great length in the context of what we're doing in OpenShift.
The answer I came up with was divorcing etcd from automatically being enrolled as a member of the cluster upon control plane scale up. We can install the binaries, we can configure it, but just don't add the member to the cluster until some smarter thing elects to do that. So, you could have 6 control plane hosts, and only 3 active etcd members. The others could be 'warm' so you can quickly remove/add the hosts you want. The etcd controller would need to have some idea about fault domains and machine/node statuses, but that is the best way IMO.
This also lets us scale etcd independently of scaling control plane hosts. Want to have a single control plane host with a single etcd member? Easy, scale etcd down to one, then scale down your control plane hosts. etcd controller will ensure that etcd lands in the right spot. Want to scale up to 5 etcd hosts? Fine, scale up the control plane, scale up etcd.
It solves so many problems. I'm not sure exactly how it fits into the existing architecture, but this is the only reasonable setup I can see.
In any case, what we're defining in this CAEP is optional behavior of 3rd party components. Since cluster-api bits are supposed to be reusable (that is still the case, right?), this use case makes sense IMO. If it's creating confusion as you say, I can remove it.
There are probably more nuanced stories we can tease out specifically around control plane / etcd membership and deletion hooks. To move forward with this proposal, though, let's remove this story as you suggested.
> Hooks defined at this point will prevent the machine-controller from draining a node after the machine-object has been marked for deletion until the hooks are removed.
>
> ##### pre-term
>
> `pre-term.delete.hook.machine.cluster.x-k8s.io`
Would it make sense to be explicit here and call this `pre-termination.delete.hook.machine.cluster.x-k8s.io` instead, to avoid any confusion?
Yeah, I think that should be acceptable. The first portion of the key, before the '/' character, has a 253-character limit, so the longer name should be okay.
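For illustration only, a Machine carrying such a hook might look like the sketch below; it assumes the v1alpha3 API group, uses the longer `pre-termination` spelling suggested above, and the hook name, owner value, and elided spec fields are hypothetical:

```yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: Machine
metadata:
  name: example-machine
  annotations:
    # The prefix before the '/' is well under the 253-character limit.
    # "wait-for-replacement" and its owner value are illustrative only.
    pre-termination.delete.hook.machine.cluster.x-k8s.io/wait-for-replacement: replacement-controller
spec:
  # ... the usual Machine spec fields (clusterName, bootstrap, infrastructureRef, ...) omitted ...
```

The annotation value names the hook's owning controller, matching the examples quoted below.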
> `pre-term.hook.machine.cluster-api.x-k8s.io/preserve-instance: my-instance-replacement-controller`
>
> `pre-term.hook.machine.cluster-api.x-k8s.io/never-delete: cluster-admin`
Might be better to rename this example a bit. I think I get the intention (that Cluster API should not attempt to delete the underlying infrastructure), but this example could make some believe it gives more assurances than it actually does, since there is nothing preventing an external actor from deleting the InfraMachine resource out of band, and there is no termination protection on the infrastructure side preventing removal of the underlying infra directly.
> using CRDs external to the machine-api, gRPC communications, or
> additional annotations on the machine or other objects.
>
> #### Owning Controller Design
I think it might be good to try to find a more descriptive name than Owning Controller here; I think there is a high likelihood of confusion for a user between the owning controller of the resource (the Machine controller in this case) and the owning controller of an annotation on the resource.
"Hook Owning Controller" ? Then we can just use HOC all over the doc?
> * Watch machine objects and determine when an appropriate action must be taken.
> * After completing the desired hook action, remove the hook annotation.
>
> ##### Owning Controllers may
>
> * Watch machine objects and add a hook annotation as desired by the operator
What do you mean by operator here? It's not clear if it's referring to a controller or an end-user. It's also not clear when automatically adding an annotation would be considered appropriate.
I meant 'cluster administrator', so I updated the notes. I'll add a new section about when adding a hook is appropriate.
> * After completing the desired hook action, remove the hook annotation.
>
> ##### Owning Controllers may
>
> * Watch machine objects and add a hook annotation as desired by the operator
> * After completing a hook, set an additional annotation that indicates this
This seems a bit out of scope for this proposal and a bit confusing, since the required deletion of the annotation should also be a sufficient signal that this controller has finished its work.
Yeah, I see what you mean, but it's intentional. This describes an abstraction so that one HOC doesn't necessarily have to know about another.
Say HOC "A" is a pre-drain hook that takes some action, like stopping a specific workload safely; when done, it sets the "critical-pods-stopped-safely" annotation and removes its hook. HOC "B" is also a pre-drain hook. It doesn't know anything about HOC "A", but it does know not to start its work, say detaching some special storage, until it sees the "critical-pods-stopped-safely" annotation. This could also be used to signal some other controller that doesn't know or care about lifecycle points. The possibilities are numerous. Ideally, each HOC is configurable as to waiting for a certain annotation to be present or absent, which makes this abstraction redundant, but there's nothing stopping someone from doing this, so I figured it's worth documenting the approach.
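A rough sketch of that sequencing is shown below; it assumes a `pre-drain.delete.hook.machine.cluster.x-k8s.io` key format parallel to the `pre-term` key quoted earlier, and the hook names, owner values, and the `"true"` signal value are illustrative only:

```yaml
# While HOC "A" is still working: both pre-drain hooks are present, so drain is blocked.
metadata:
  annotations:
    pre-drain.delete.hook.machine.cluster.x-k8s.io/stop-critical-pods: hoc-a
    pre-drain.delete.hook.machine.cluster.x-k8s.io/detach-storage: hoc-b
---
# After HOC "A" finishes: it removes its own hook and leaves a signal
# annotation that HOC "B" waits for before starting its own work.
metadata:
  annotations:
    critical-pods-stopped-safely: "true"
    pre-drain.delete.hook.machine.cluster.x-k8s.io/detach-storage: hoc-b
```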
I would probably remove this entire "may" section, but I'm also ok leaving it in. It does not affect what we're going to code; it just describes some possibilities created by adding hooks.
I added some of the details based on questions people previously had in the gdoc. I could remove it, but then those questions are just unanswered. People want to know how to order these things.
Ok, thanks for the clarification. Does something like the following wording help at all?
* Coordinate with other hook controllers through any means possible, such as using common annotations, CRDs, etc. For example, one hook controller could set an annotation indicating it has finished its work, and another hook controller could wait for the presence of the annotation before proceeding.
I also wonder if the "Hook Ordering" section above is sufficient and if this item is repetitive?
I am generally 👍 to proceed. My only real question remaining is about the value of each annotation (i.e. the "owner").
Do we envision that down the line we'd keep annotations as the main approach to adding hooks to the deletion workflow, or move to a CRD-based or webhook-based one? I'm asking this mostly because once this proposal and the related implementation merge, it becomes an API we'll have to maintain and respect in future versions. As we think about v1alpha4 and getting the core types to v1beta1, I'd like to make sure this group has thought about the possible implications and future support for these annotations and hooks. Is this the best place for the hooks to live, short and long term? Or would it be better to have an experimental registration CRD? That said, +1 on moving the proposal along given that it's pretty simple and straightforward.
I guess you're thinking about what it would take to convert an annotation to a CR if we decide to use CRDs for hooks in v1alpha4?
Personally, I think annotation-based approaches are fine for proof of concept and experimental support for features, but not much beyond that, since an annotation is basically an un-versioned addition to the API that creates many challenges around backward and forward compatibility. It then becomes difficult or impossible to change, deprecate, and/or remove support for once it is consumed by loosely coupled external integrations.

It might be good to define a policy of when/where annotations should be used and possibly provide suggested alternatives where they shouldn't be. For example, annotations do not present much of an issue when used for coordinating actions between core in-tree components that are updated and released on the same release cadence. However, using annotations between in-tree and out-of-tree providers presents a bit more of a challenge. Sure, we could encode it as part of the contracts we are publishing, but that complicates the transition to a new version of the contract and makes the transition more error prone. Whereas if the behavior is built into the versioned part of the API, then backward/forward compatibility can easily be built into the conversions.

For any other integrations, I believe we really should avoid annotations altogether for a few different reasons:
+1
@michaelgugino I think we're mostly good to go here. Could you please review the open comments and check if everything has been addressed in the proposal? Thanks!
Yes, that's correct, and more about what happens beyond v1alpha4 as we get to beta at some point.
I guess, given that our types are still alpha, it seems we're ok proceeding with annotations for now and revisiting later. The only caveat I see is around migration from annotations to custom CRDs: it won't be a conversion webhook; it could be an external utility, or we could expect users to make the necessary changes when ready. Does this change our approach within Cluster API to adopting these hooks? For example, we talked in the past about getting KCP to register hooks; should we avoid doing so, or are we ok proceeding with that plan if/when it comes down to it?
💯
If the "external" integration support we are defining here is considered "alpha" or "experimental", then I don't think we need to offer any type of conversion between implementations as the feature matures and could externalize that (either to the external implementation(s) or users).
I don't think the current implementation details vs future implementation details matter too much for in-tree use, since we know exactly how those components are interacting and can provide the conversion steps needed automatically for those specific use cases.
I'd like to proceed with this proposal so people can start trying out the hooks. We can file issues for the remaining questions. I don't think we need to hold off from using them in CAPI controllers.
Sounds good to me 👍
I know there are some perceived shortcomings with annotations compared to CRDs. Sometimes I think there is too much focus on API 'correctness' instead of UX. With annotations, everything is simple. If I want to add this annotation to every machine in my cluster at creation time, I can add it to my machine-deployments / machine-sets. They're all in one place; I don't have to hunt for them across multiple objects.

There is one CRD implementation I could live with, I think. A 1-to-1 CR to Machine mapping like this:

```yaml
kind: MachineLifecycleHook
metadata:
  name: my-machine
spec:
  pre-drain-delete:
    - my-hook: some-owner
    - other-hook: owner2
  pre-terminate-delete:
    - wait-for-replacement: custom-controller
  ...
```

That still lacks a way to create a particular hook for every machine at creation time. Then you have to decide who's creating and deleting the object (I suppose it could be a child object of the machine; we'd need to set a finalizer on it and remove the finalizer later). We'd need to do something hacky like 'initialHooks' as part of the machineSpec. So now we need lots more controller code and multiple webhooks (we need to validate that the initial set is valid on the machine, and also validate each create/update on the CR). Now everyone's controller needs to set up watches on two resources. If you give someone a list, they're going to want to order it, and that's not how this was designed, so that's confusing. Anyway, there are so many other fish to fry, so I hope we never move to CRDs or some other worse abstraction. Just embrace the annotations :)
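To illustrate the "add it to my machine-deployments" point above, here is an abbreviated sketch of a v1alpha3 MachineDeployment whose template carries a hook annotation; the hook name, owner value, and elided fields are placeholders, and it assumes template annotations are propagated to the Machines it creates:

```yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineDeployment
metadata:
  name: example-md
spec:
  clusterName: example
  template:
    metadata:
      annotations:
        # Every Machine created from this template carries the hook from creation time.
        pre-drain.delete.hook.machine.cluster.x-k8s.io/migrate-workload: my-controller
    spec:
      clusterName: example
      # ... bootstrap, infrastructureRef, version, etc. omitted ...
```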
> node.
>
> #### Story 3
>
> (pre-drain) As an operator, when I am deleting a control plane machine for
How is this different from story 1?
What will be the immediate consumers of this proposal? Is there any particular story the core CAPI components would implement right away out of this proposal that could be included here, or is this just to enable others?
For me, UX is a big part of why I'm not a fan of annotations. They are an unversioned part of the API and would require coordination across all components that use them in order to make any changes (including name changes, additions, deprecation, and removal). As such, anywhere compatibility is broken in a non-obvious way, the complications related to upgrade are deferred to the user to fix.
I feel like this is a best-case scenario, though. In the event that something goes wrong, annotations feel much more opaque. It is on the user to figure out what annotations are applied, then figure out what behaviors are modified by those annotations, and finally determine what external controller is acting on those annotations, before they can figure out what the problem is.
/lgtm
> ## Motivation
>
> Allow custom and 3rd party components to easily interact with a machine or
Is this mechanism meant to be an alternative to infrastructure provider mechanisms for this where they don't exist? E.g., Azure has scheduled events: https://docs.microsoft.com/en-us/azure/virtual-machines/linux/scheduled-events
No. This is meant to prevent the machine-controller from performing certain actions at certain times, such as draining the node or instructing the infrastructure provider to remove the VM.
/milestone v0.3.9
/retest
/approve
/hold
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: vincepri. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/hold cancel
What this PR does / why we need it:
Defines a set of annotations that can be applied to a machine and that affect the linear progress of the machine's lifecycle after it has been marked for deletion. These annotations are optional and may be applied during machine creation, some time after machine creation by a user, or some time after machine creation by another controller or application.