Allow providers to react when Clusters are paused and indicate in the Cluster's status when those actions are finished
#8473
Comments
/triage accepted

As of today, clusterctl move has a preflight check that tries to detect whether provisioning is completed from proxy information. Should this, or an improved version of it, be enough for this use case? If not, we should start thinking about introducing an optional synchronization mechanism between clusterctl move and providers, or, even better, revamp the discussion about plugins in clusterctl, which could provide more powerful extension points. Looping in @yastij and @srm09 as well, because I assume a similar issue has been dealt with in CAPV.
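For readers unfamiliar with that check, here is a rough sketch of the kind of status-based heuristic being described. This is not clusterctl's actual implementation, and the function name is made up:

```go
package sketch

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// clusterLooksProvisioned approximates "provisioning is complete" from the
// Cluster's status. Because it only reads status, it can race with providers
// that are still reacting to a pause -- the gap discussed below.
func clusterLooksProvisioned(cluster *clusterv1.Cluster) bool {
	return cluster.Status.InfrastructureReady &&
		conditions.IsTrue(cluster, clusterv1.ControlPlaneInitializedCondition)
}
```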
Thanks @fabriziopandini, I think something like that might work, but I see a couple of gaps for our use case. I noticed that check happens before the clusters are paused. Then if CAPZ can detect a move is happening, I think there would be a race between clusterctl's check that the cluster is provisioned and CAPZ marking a provisioned cluster as unprovisioned in reaction to the move. If CAPZ doesn't react fast enough, then clusterctl might go ahead and move the cluster even if it was just about to be marked as unprovisioned. It seems to me that clusterctl would need to wait explicitly for the cluster to be paused to avoid that issue.
Might be related to this: #8322
Adding what I said during office hours here: would a finalizer work for this use case? We could add a specific finalizer as part of the CAPI move contract (with an exported const as part of CAPI utils) that providers can add to resources to indicate "I need to do some cleanup on this resource before you can move it". Then, CAPI move could treat it as best effort.

Unless we're looking to block create as well; then it might be more tricky…
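A minimal sketch of what that contract could look like, assuming a hypothetical finalizer name and a best-effort wait on the clusterctl side (both made up for illustration):

```go
package sketch

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	ctrlclient "sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// MoveBlockedFinalizer is a hypothetical exported constant that providers
// would set on resources needing cleanup before clusterctl move may proceed.
const MoveBlockedFinalizer = "clusterctl.cluster.x-k8s.io/move-blocked"

// waitForMoveFinalizer polls until the finalizer is removed from the object
// or the timeout expires. Best effort: on timeout the caller would log and
// proceed with the move anyway rather than fail.
func waitForMoveFinalizer(ctx context.Context, c ctrlclient.Client, obj ctrlclient.Object) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 30*time.Second, true,
		func(ctx context.Context) (bool, error) {
			if err := c.Get(ctx, ctrlclient.ObjectKeyFromObject(obj), obj); err != nil {
				return false, err
			}
			return !controllerutil.ContainsFinalizer(obj, MoveBlockedFinalizer), nil
		})
}
```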
I think ideally this is the case, to ensure we never have two ASO instances reconciling the same resource. In practice, I think the risk is fairly low, since the time between create and delete is short. In general, pausing the ASO resources is something CAPZ should do whenever a Cluster is paused, not just during clusterctl move.
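For concreteness, a sketch of the per-resource pause CAPZ could apply, assuming ASO's reconcile-policy annotation (verify the annotation name against the ASO version in use):

```go
package sketch

import (
	"context"

	ctrlclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// asoReconcilePolicyAnnotation is ASO's per-resource switch for skipping
// reconciliation; a value of "skip" tells ASO to leave the Azure resource
// untouched.
const asoReconcilePolicyAnnotation = "serviceoperator.azure.com/reconcile-policy"

// pauseASOResource sets the skip policy on one ASO object, the kind of thing
// CAPZ would do for every ASO resource it owns when the Cluster is paused.
func pauseASOResource(ctx context.Context, c ctrlclient.Client, obj ctrlclient.Object) error {
	patch := ctrlclient.MergeFrom(obj.DeepCopyObject().(ctrlclient.Object))
	annotations := obj.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[asoReconcilePolicyAnnotation] = "skip"
	obj.SetAnnotations(annotations)
	return c.Patch(ctx, obj, patch)
}
```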
edited May 1, 2023: s/finalizer/annotation

@CecileRobertMichon @nojnhuh What about a slight variation of the idea in #8473 (comment) that hopefully can address the potential issue of the delay between create and delete discussed in #8473 (comment)? Given that the current move workflow is: preflights - pause old - create new - delete old - unpause new.

I'm using an annotation instead of a finalizer because it blocks not only deletion but also creation (it blocks move). We can consider defining this as an annotation on single objects or on CRDs (thus defining this setting once for all the objects of a kind), like we are already doing for force-move and other things; eventually, we can make it even more sophisticated by allowing folks to override the duration clusterctl move should wait, using a similar mechanism.

Caveat: the CRD must be discoverable by clusterctl.
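A sketch of how an opt-in annotation honored at both the object and CRD level might be evaluated; the annotation name is hypothetical:

```go
package sketch

import (
	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	ctrlclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// blockMoveAnnotation is a made-up name for illustration, mirroring how
// force-move is already expressed as an annotation.
const blockMoveAnnotation = "clusterctl.cluster.x-k8s.io/block-move"

// shouldBlockMove honors the annotation both on the individual object and on
// its CRD, so a provider can opt in once for every object of a kind.
func shouldBlockMove(obj ctrlclient.Object, crd *apiextensionsv1.CustomResourceDefinition) bool {
	if _, ok := obj.GetAnnotations()[blockMoveAnnotation]; ok {
		return true
	}
	if crd != nil {
		if _, ok := crd.GetAnnotations()[blockMoveAnnotation]; ok {
			return true
		}
	}
	return false
}
```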
Would this finalizer be added by a controller? If so, then it seems like that would be susceptible to a race condition where the check that the finalizer is gone might happen before the controller has a chance to add it in the first place. And if the finalizer is added by …

Overall, I still think some explicit indication in the cluster's status of whether or not it is paused would be the most reliable, something like:

```go
if cond := conditions.Get(cluster, "Paused"); cond != nil {
	return cond.Status == corev1.ConditionTrue
}
return cluster.Spec.Paused
```
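To make that concrete, a self-contained version of that check plus the explicit wait clusterctl move would need; the helper names and the "Paused" condition itself are hypothetical:

```go
package sketch

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
	ctrlclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// clusterIsPaused prefers an explicit "Paused" condition in status and falls
// back to spec.paused for providers that never report one.
func clusterIsPaused(cluster *clusterv1.Cluster) bool {
	if cond := conditions.Get(cluster, "Paused"); cond != nil {
		return cond.Status == corev1.ConditionTrue
	}
	return cluster.Spec.Paused
}

// waitForPause is the explicit wait clusterctl move would perform after
// setting spec.paused and before acting on the cluster's resources.
func waitForPause(ctx context.Context, c ctrlclient.Client, key ctrlclient.ObjectKey) error {
	cluster := &clusterv1.Cluster{}
	return wait.PollUntilContextTimeout(ctx, time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			if err := c.Get(ctx, key, cluster); err != nil {
				return false, err
			}
			return clusterIsPaused(cluster), nil
		})
}
```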
Using a finalizer to block move sounds wrong to me. As far as I can tell, finalizers have a clear semantic in Kubernetes, and our use case doesn't match it. Just wanted to add: I think this problem is not specific to infra providers. Because of the way the controller-runtime cache works, the Cluster CR is probably the only one where the pause is in effect instantly. I think there are similar race conditions after a velero restore + unpause, just for unpause instead of pause.
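For context, this is roughly the pause guard providers already run at the top of Reconcile, using cluster-api's annotations helper; the cache staleness just described is why each controller observes the pause at a different moment:

```go
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/annotations"
)

// shouldReconcile is the usual early-return check: skip work when the Cluster
// has spec.paused set or the object carries the paused annotation. Both
// inputs come from the controller-runtime cache, so different controllers
// see the pause at slightly different times -- the window for these races.
func shouldReconcile(cluster *clusterv1.Cluster, obj metav1.Object) bool {
	return !annotations.IsPaused(cluster, obj)
}
```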
@fabriziopandini @CecileRobertMichon @sbueringer Thoughts on introducing a `Paused` condition?
It is up to the controllers to set this annotation as soon as possible in the object's lifecycle, as we do for finalizers, ownerRefs, core labels, and other annotations as well.

I expressed similar concerns in my proposal above, and I suggested using an annotation instead; I have edited the comment to make it more explicit.

My main concern with this solution is that it will introduce a change that prevents new versions of clusterctl from performing move when working with older versions of CAPI/providers (versions without the condition), which is a use case that currently works.
@fabriziopandini What if, instead of checking whether the annotation is present, clusterctl waits for the annotation either to not be defined (for backward compatibility) or to have a particular value? That way, controllers can also use the annotation to indicate that a cluster is not paused, which @sbueringer brought up would be useful in other places as well. That also seems to make the potential race condition a bit more airtight, since controllers can indicate the cluster's paused status proactively instead of only after a move is invoked. Overall, I think a status condition could work similarly, where when no `Paused` condition exists, consumers fall back to `spec.paused`.
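A sketch of that backward-compatible rule; the annotation name and value are made up:

```go
package sketch

// pausedAnnotation is a hypothetical annotation name for illustration.
const pausedAnnotation = "cluster.x-k8s.io/provider-paused"

// annotationIndicatesPaused implements the proposed rule: an absent
// annotation counts as paused, so older providers that never set it cannot
// block move forever; otherwise the value must be exactly "true".
func annotationIndicatesPaused(annotations map[string]string) bool {
	value, ok := annotations[pausedAnnotation]
	if !ok {
		return true // backward compat: provider predates the contract
	}
	return value == "true"
}
```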
@fabriziopandini Do you still think an annotation is the best path forward at the moment?
@nojnhuh I'm getting a little bit confused by the direction this discussion is taking.

In the issue description, we are discussing the problem of propagating pause to ASO resources and how to make clusterctl move wait for everything to be actually paused. @CecileRobertMichon made a proposal to introduce a finalizer on the CAPZ resources responsible for the corresponding ASO objects; since then, I have assumed that having this finalizer at the level of each CAPZ resource could make things easier, because each resource can simply tell "I'm actually paused / I'm still in the process of pausing", and there was no need to surface a global "everything is paused" on the Cluster object, which requires a lot of coordination across controllers. Then, during the discussion, the finalizer became an annotation, but as far as I understood, the idea was still to support this annotation on potentially any resource in the scope of the move operation; the annotation was optional, allowing each resource/controller to opt in to controlling the move operation.

Now, reading your last comment and looking at the linked PR, it seems that we are shifting to a different approach, because we are now discussing only one annotation at the cluster level; moreover, we are also trying to make this "everything is paused" a generic thing, not strictly related to clusterctl move. If I got all this right, my feedback about this shift of perspective is the following: …

I hope this helps in making progress; happy to jump in a meeting to discuss this in person if this can help.
@fabriziopandini I think that's a good recap of my understanding as well. I agree that annotating each individual resource would be less disruptive overall. I could see that being more difficult for use cases other than clusterctl move. I also still think it's acceptable to have a way to express "everything is paused" on the Cluster object, since that's where spec.paused is already defined. Overall I have no issue with an annotation-based approach.
Now that I've thought about this more, I think I'm coming around to the annotating-individual-resources approach. I think I had it stuck in my head that the ASO resources would need to be annotated this way, but maybe that's not the case if we can instead selectively annotate only the CAPZ infra resources, which would block the move for them and all the resources under their respective ownership hierarchies (which would include ASO resources). Maybe that's already what you had in mind. At that point, it seems like the only real difference from the annotating-the-cluster approach is whether or not to propagate the annotation from the InfraCluster to the Cluster. I need to take another look at the linked PR. @fabriziopandini Does an approach like that align with what you had in mind?
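If it helps, a sketch of the ownership-based extension of the block: assume move first collects the UIDs of the annotated infra resources, then treats anything they own as blocked too. This checks one level only; a real implementation would walk the hierarchy recursively, and all names here are hypothetical:

```go
package sketch

import (
	"k8s.io/apimachinery/pkg/types"
	ctrlclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// blockedByOwner reports whether obj is directly owned by one of the
// move-blocking resources, extending a single annotation on an InfraCluster
// to everything under it (including ASO resources).
func blockedByOwner(obj ctrlclient.Object, blockingOwnerUIDs map[types.UID]bool) bool {
	for _, ref := range obj.GetOwnerReferences() {
		if blockingOwnerUIDs[ref.UID] {
			return true
		}
	}
	return false
}
```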
/assign
What would you like to be added (User Story)?
As an infrastructure provider developer, I would like to allow my provider to perform and synchronize tasks in reaction to a Cluster being paused instead of assuming that pausing a Cluster happens instantaneously.
Detailed Description
In CAPZ, we are planning to adopt Azure Service Operator (ASO) to manage individual resources in Azure as Kubernetes objects in place of the Azure SDK from within CAPZ. Since Azure resources are reconciled out-of-band with Cluster API resources in this scenario, CAPZ cannot immediately block ASO's reconciliation loop when a Cluster is `spec.paused`.

ASO reads a separate annotation on each resource it controls to optionally block reconciliation. In response to a Cluster being paused, CAPZ intends to set this annotation on each relevant ASO resource so no reconciliation occurs when a Cluster is paused.

Since tooling like `clusterctl move` assumes pausing is an instantaneous operation, there is a risk in this case that actions may be taken when the Cluster is not yet fully paused, without waiting for some indication that it is in fact paused.

An example of how I envision this might work for CAPZ with `clusterctl move`:

- `clusterctl move` starts
- Each Cluster has `spec.paused` set to `true`
- `clusterctl move` waits for each Cluster to have `status.paused=true`
- CAPZ annotates the relevant ASO resources, then sets `status.paused=true`
- `clusterctl move` continues as it does today

Anything else you would like to add?
cc @dtzar @CecileRobertMichon
Label(s) to be applied
/kind feature
/area api
/area clusterctl