
Test clusterctl move with ASO #3795

Closed
Tracked by #3527
nojnhuh opened this issue Aug 2, 2023 · 6 comments
nojnhuh commented Aug 2, 2023

This issue aims to assess how necessary it is to address kubernetes-sigs/cluster-api#8473 for CAPZ's ASO migration. The following criteria are required for clusterctl move to work and are the most at risk without a solution to the linked CAPI issue:

  • ASO resources on the destination management cluster eventually become steadily Ready
  • ASO resources on the source cluster never enter a Deleting state

There is a chance the above criteria can be met without a solution to that CAPI issue, in which case we can afford to be more patient in addressing it.

See also:
#3525

@nojnhuh nojnhuh moved this to In Progress in CAPZ Planning Aug 2, 2023
nojnhuh commented Aug 2, 2023

/assign

nojnhuh commented Aug 3, 2023

I think as long as the reconcile-policy: skip annotation gets applied before the resource is created or deleted by clusterctl move, neither of the above items is at risk.

  • ASO resources on the source cluster never enter a Deleting state

To ensure this, the ASO resource needs to be annotated before being deleted from the source management cluster. In my first test, the ASO resource was annotated about 2s before it was deleted. That seems like a reasonably comfortable buffer to me if that's consistent.
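
For illustration, here is a minimal sketch (not CAPZ's actual pause code) of applying that annotation before the source copy is deleted. The shorthand reconcile-policy: skip used in these comments is ASO's serviceoperator.azure.com/reconcile-policy annotation; skipReconcile and its arguments are hypothetical names:

```go
package main

import (
	"context"

	resources "github.com/Azure/azure-service-operator/v2/api/resources/v1api20200601"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// skipReconcile patches an ASO ResourceGroup with reconcile-policy: skip so
// the operator stops reconciling it and will not delete the underlying Azure
// resource when the Kubernetes object is deleted during clusterctl move.
func skipReconcile(ctx context.Context, c client.Client, key client.ObjectKey) error {
	rg := &resources.ResourceGroup{}
	if err := c.Get(ctx, key, rg); err != nil {
		return err
	}
	patch := client.MergeFrom(rg.DeepCopy())
	if rg.Annotations == nil {
		rg.Annotations = map[string]string{}
	}
	rg.Annotations["serviceoperator.azure.com/reconcile-policy"] = "skip"
	return c.Patch(ctx, rg, patch)
}
```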

  • ASO resources on the destination management cluster eventually become steadily Ready

This criterion exists to prevent the ASO instances on the source and destination clusters from actively reconciling the same Azure resource, which could pose problems if the resource is modified during the move. It is ensured when the reconcile-policy: skip annotation is applied before the ASO resource is created on the destination cluster. In my first test, this happened about 0.8s before the ASO resource was created. That window seems tight enough that I'm not sure we can guarantee the annotation will consistently beat the new resource being created.

I'm not sure that's a problem worth worrying about, though. Even if both ASO instances end up reconciling two definitions of the same resource, as long as those definitions are equivalent during the move, the ASO control planes in each cluster will be doing redundant work but will not actively conflict with each other. So I doubt this would cause any problems.

I'll do more testing to get a better sample size of the timings here and try with more clusters being moved at once. These timings should also be taken with a grain of salt, since I'm not sure whether the capz-controller-manager Pod and the clusterctl environment use the same clock, or how well they're synced if not.

tl;dr: Things look OK without the CAPI change so far; more testing needed.

nojnhuh commented Aug 4, 2023

It looks like moving more clusters (10) gives us more buffer: about 6.5s between annotating and the first moved resource being created, and another ~13s before any moved resource gets deleted.

Overall, I'm reasonably confident users won't run into issues even without the CAPI fix, at least for this first iteration of ASO in CAPZ that manages only resource groups.

cc @dtzar

I'll look to see what we might be able to add to the tests to catch when the annotation doesn't get applied in time, but I'm not optimistic we can do anything meaningful without tweaking clusterctl.

nojnhuh commented Aug 4, 2023

cc @CecileRobertMichon

nojnhuh commented Aug 14, 2023

  • ASO resources on the destination management cluster eventually become steadily Ready

I think this point is probably already tested well enough for now, since the test ensures that the Cluster is Provisioned, which won't be the case when the ResourceGroup is not Ready.
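
As a sketch of the kind of assertion involved (Gomega-style, with illustrative names; not the actual CAPZ e2e code), something like the following already gates on the ResourceGroup becoming Ready indirectly:

```go
package e2e

import (
	"context"
	"time"

	. "github.com/onsi/gomega"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForProvisioned blocks until the moved Cluster reports the Provisioned
// phase, which in turn requires the ASO ResourceGroup to have become Ready.
func waitForProvisioned(ctx context.Context, c client.Client, key client.ObjectKey) {
	Eventually(func(g Gomega) {
		cluster := &clusterv1.Cluster{}
		g.Expect(c.Get(ctx, key, cluster)).To(Succeed())
		g.Expect(cluster.Status.Phase).To(Equal(string(clusterv1.ClusterPhaseProvisioned)))
	}, 30*time.Minute, 10*time.Second).Should(Succeed())
}
```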

  • ASO resources on the source cluster never enter a Deleting state

I suppose we could put a watch on the ResourceGroup while the move is taking place to check for this explicitly (sketched below). I have a feeling that if resources really were being deleted, that would probably at least make the test time out while it recreates the cluster from scratch, if it doesn't completely blow up the test some other way.
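
A rough sketch of what that check could look like (illustrative only; it assumes ASO surfaces an in-progress deletion as a Deleting reason on the Ready condition, and all names are hypothetical):

```go
package e2e

import (
	"context"
	"fmt"
	"time"

	resources "github.com/Azure/azure-service-operator/v2/api/resources/v1api20200601"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// failOnDeleting polls ResourceGroups on the source cluster during the move
// and returns an error if any reports that ASO is deleting the Azure
// resource. Cancel ctx once the move completes to stop polling.
func failOnDeleting(ctx context.Context, c client.Client) error {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return nil // move finished without observing any deletion
		case <-ticker.C:
			list := &resources.ResourceGroupList{}
			if err := c.List(ctx, list); err != nil {
				return err
			}
			for _, rg := range list.Items {
				for _, cond := range rg.Status.Conditions {
					// Assumption: ASO reports an in-progress deletion this way.
					if string(cond.Type) == "Ready" && cond.Reason == "Deleting" {
						return fmt.Errorf("ResourceGroup %s entered Deleting state during move", rg.Name)
					}
				}
			}
		}
	}
}
```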

I'll do this same testing again once there are a few more ASO resources in the mix, but I'll close this for now. I don't think this is critical enough at the moment, and it doesn't seem like there's a simple, high-value check we can add to the tests.

/close

k8s-ci-robot commented Aug 14, 2023
@nojnhuh: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-project-automation github-project-automation bot moved this from In Progress to Done in CAPZ Planning Aug 14, 2023