
Test clusterctl move with ASO #3795

Closed
Tracked by #3527
nojnhuh opened this issue Aug 2, 2023 · 6 comments
nojnhuh commented Aug 2, 2023

This issue aims to assess how necessary it is to address kubernetes-sigs/cluster-api#8473 for CAPZ's ASO migration. The following criteria are required for clusterctl move to work and are the most at risk without a solution to the linked CAPI issue:

  • ASO resources on the destination management cluster eventually become steadily Ready
  • ASO resources on the source cluster never enter a Deleting state

There is a chance the above criteria can be met without a solution to that CAPI issue, in which case we can afford to be more patient in addressing it.

See also:
#3525

@nojnhuh nojnhuh moved this to In Progress in CAPZ Planning Aug 2, 2023
nojnhuh commented Aug 2, 2023

/assign

nojnhuh commented Aug 3, 2023

I think as long as the reconcile-policy: skip annotation gets applied before the resource is created or deleted by clusterctl move, neither of the above items is at risk.

  • ASO resources on the source cluster never enter a Deleting state

To ensure this, the ASO resource needs to be annotated before being deleted from the source management cluster. In my first test, the ASO resource was annotated about 2s before it was deleted. That seems like a reasonably comfortable buffer to me if that's consistent.
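
For illustration, here is a minimal sketch (not CAPZ's actual pause code) of applying that annotation before the source copy is deleted. The shorthand reconcile-policy: skip used in these comments is ASO's serviceoperator.azure.com/reconcile-policy annotation; skipReconcile and its arguments are hypothetical names:

```go
package main

import (
	"context"

	resources "github.com/Azure/azure-service-operator/v2/api/resources/v1api20200601"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// skipReconcile patches an ASO ResourceGroup with reconcile-policy: skip so
// the operator stops reconciling it and will not delete the underlying Azure
// resource when the Kubernetes object is deleted during clusterctl move.
func skipReconcile(ctx context.Context, c client.Client, key client.ObjectKey) error {
	rg := &resources.ResourceGroup{}
	if err := c.Get(ctx, key, rg); err != nil {
		return err
	}
	patch := client.MergeFrom(rg.DeepCopy())
	if rg.Annotations == nil {
		rg.Annotations = map[string]string{}
	}
	rg.Annotations["serviceoperator.azure.com/reconcile-policy"] = "skip"
	return c.Patch(ctx, rg, patch)
}
```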

  • ASO resources on the destination management cluster eventually become steadily Ready

This criterion exists to prevent the ASO instances on the source and destination clusters from actively reconciling the same Azure resource, which could pose problems if the resource is modified during the move. It is ensured when the reconcile-policy: skip annotation is applied before the ASO resource is created on the destination cluster. In my first test, this happened about 0.8s before the ASO resource was created. That window seems tight enough that I'm not sure we can guarantee the annotation will consistently beat the new resource being created.

I'm not sure that's a problem worth worrying about, though. Even if both ASO instances end up reconciling two definitions of the same resource, as long as those definitions are equivalent during the move, the ASO control planes in each cluster will be doing redundant work but will not actively conflict with each other. So I doubt this would cause any problems.

I'll do more testing to get a better sample size of the timings here and try with more clusters being moved at once. These timings should also be taken with a grain of salt, since I'm not sure whether the capz-controller-manager Pod and the clusterctl environment use the same clock, or how well they're synced if not.

tl;dr: Things look OK without the CAPI change so far; more testing needed.

nojnhuh commented Aug 4, 2023

It looks like moving more clusters (10) gives us more buffer: about 6.5s between annotating and the first moved resource being created, and another ~13s before any moved resource gets deleted.

Overall, I'm reasonably confident users won't run into issues even without the CAPI fix, at least for this first iteration of ASO in CAPZ that manages only resource groups.

cc @dtzar

I'll look to see what we might be able to add to the tests to catch when the annotation doesn't get applied in time, but I'm not optimistic we can do anything meaningful without tweaking clusterctl.

nojnhuh commented Aug 4, 2023

cc @CecileRobertMichon

nojnhuh commented Aug 14, 2023

  • ASO resources on the destination management cluster eventually become steadily Ready

I think this point is probably already tested well enough for now, since the test ensures that the Cluster is Provisioned, which won't be the case when the ResourceGroup is not Ready.
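
As a sketch of the kind of assertion involved (Gomega-style, with illustrative names; not the actual CAPZ e2e code), something like the following already gates on the ResourceGroup becoming Ready indirectly:

```go
package e2e

import (
	"context"
	"time"

	. "github.com/onsi/gomega"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForProvisioned blocks until the moved Cluster reports the Provisioned
// phase, which in turn requires the ASO ResourceGroup to have become Ready.
func waitForProvisioned(ctx context.Context, c client.Client, key client.ObjectKey) {
	Eventually(func(g Gomega) {
		cluster := &clusterv1.Cluster{}
		g.Expect(c.Get(ctx, key, cluster)).To(Succeed())
		g.Expect(cluster.Status.Phase).To(Equal(string(clusterv1.ClusterPhaseProvisioned)))
	}, 30*time.Minute, 10*time.Second).Should(Succeed())
}
```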

  • ASO resources on the source cluster never enter a Deleting state

I suppose we could put a watch on the ResourceGroup while the move is taking place to check for this explicitly (sketched below). I have a feeling that if resources really were being deleted, that would probably at least make the test time out while it recreates the cluster from scratch, if it doesn't completely blow up the test some other way.
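
A rough sketch of what that check could look like (illustrative only; it assumes ASO surfaces an in-progress deletion as a Deleting reason on the Ready condition, and all names are hypothetical):

```go
package e2e

import (
	"context"
	"fmt"
	"time"

	resources "github.com/Azure/azure-service-operator/v2/api/resources/v1api20200601"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// failOnDeleting polls ResourceGroups on the source cluster during the move
// and returns an error if any reports that ASO is deleting the Azure
// resource. Cancel ctx once the move completes to stop polling.
func failOnDeleting(ctx context.Context, c client.Client) error {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return nil // move finished without observing any deletion
		case <-ticker.C:
			list := &resources.ResourceGroupList{}
			if err := c.List(ctx, list); err != nil {
				return err
			}
			for _, rg := range list.Items {
				for _, cond := range rg.Status.Conditions {
					// Assumption: ASO reports an in-progress deletion this way.
					if string(cond.Type) == "Ready" && cond.Reason == "Deleting" {
						return fmt.Errorf("ResourceGroup %s entered Deleting state during move", rg.Name)
					}
				}
			}
		}
	}
}
```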

I'll do this same testing again once there are a few more ASO resources in the mix, but I'll close this for now. I don't think this is critical enough at the moment, and it doesn't seem like there's a simple, high-value check we can add to the tests.

/close

k8s-ci-robot commented Aug 14, 2023
@nojnhuh: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-project-automation github-project-automation bot moved this from In Progress to Done in CAPZ Planning Aug 14, 2023