✨ [e2e framework] Add ability to run pre and post actions during clusterctl upgrade spec #5093
Conversation
```go
	}, input.E2EConfig.GetIntervals(specName, "wait-delete-cluster")...)
} else {
	log.Logf("Management Cluster does not appear to support CAPI resources.")
}
```
This cleanup step would previously fail when the spec failed prior to upgrading the management cluster; this change allows cleanup to proceed if a failure occurs before `clusterctl upgrade`, or even before `clusterctl init`, has been run.
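For context, the kind of capability check described above can be done with client-go's discovery helpers. The snippet below is only an illustrative sketch of that guard, not the exact code in this PR; the function name and wiring are assumptions.

```go
package e2e

import (
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// managementClusterSupportsCAPI (hypothetical helper) reports whether the
// management cluster serves the v1alpha3 CAPI API group, so cleanup can be
// skipped when the spec failed before `clusterctl init` ever ran.
func managementClusterSupportsCAPI(restConfig *rest.Config) (bool, error) {
	dc, err := discovery.NewDiscoveryClientForConfig(restConfig)
	if err != nil {
		return false, err
	}
	// ServerSupportsVersion returns an error when the group/version is not served.
	if err := discovery.ServerSupportsVersion(dc, schema.GroupVersion{
		Group:   "cluster.x-k8s.io",
		Version: "v1alpha3",
	}); err != nil {
		return false, nil
	}
	return true, nil
}
```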
```go
PreUpgrade  func(managementClusterProxy framework.ClusterProxy)
PostUpgrade func(managementClusterProxy framework.ClusterProxy)
```
This allows pre- and post-upgrade steps to be run by providers when running this spec.
For example, in cluster-api-provider-packet we require users to execute both pre- and post-upgrade actions when moving from v1alpha3 to v1alpha4: a pre-upgrade step to switch to a newer version of our cloud provider implementation, which needs to happen before the v1alpha4 upgrade, and a post-upgrade step to update the providerIDs on existing Nodes after the v1alpha4 upgrade is finished.
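To make the new fields concrete, a provider's e2e suite could wire them up roughly as follows. This is a hedged sketch: only the `PreUpgrade`/`PostUpgrade` fields and their signatures come from this PR; the suite-level variables (`ctx`, `e2eConfig`, etc.), the two migration helpers, and the remaining input fields follow the usual spec-input shape and may differ per provider.

```go
var _ = Describe("When testing clusterctl upgrades [clusterctl-upgrade]", func() {
	capi_e2e.ClusterctlUpgradeSpec(ctx, func() capi_e2e.ClusterctlUpgradeSpecInput {
		return capi_e2e.ClusterctlUpgradeSpecInput{
			E2EConfig:             e2eConfig,
			ClusterctlConfigPath:  clusterctlConfigPath,
			BootstrapClusterProxy: bootstrapClusterProxy,
			ArtifactFolder:        artifactFolder,
			SkipCleanup:           skipCleanup,
			// Provider-specific hooks added by this PR, e.g. migrating the
			// cloud provider before the upgrade and fixing up providerIDs after.
			PreUpgrade: func(proxy framework.ClusterProxy) {
				migrateCloudProvider(ctx, proxy) // hypothetical provider helper
			},
			PostUpgrade: func(proxy framework.ClusterProxy) {
				updateNodeProviderIDs(ctx, proxy) // hypothetical provider helper
			},
		}
	})
})
```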
Thanks for this @detiber! (also cc @CecileRobertMichon who might be interested in this from the CAPZ perspective)

> and an additional step to update the providerIDs on existing nodes after the v1alpha4 upgrade is finished

Curious about this one, do you have more details?
@vincepri the main issue we are facing is a result of the rebranding from Packet to Equinix Metal. We've already fully completed the rebranding effort for our cloud provider (https://github.com/equinix/cloud-provider-equinix-metal), formerly known as the "Packet CCM"; strictly speaking, that only requires renaming a config map.
However, the newer version of the cloud provider uses `equinixmetal://` as the prefix for Node providerIDs, while the old version used `packet://`. The cloud provider still works with both the new and old prefixes, but any new Nodes will get the `equinixmetal://` prefix. It does not, however, support updating the providerID of Nodes that still carry the old `packet://` prefix, for a couple of reasons:
- The functionality is not exposed in the upstream https://github.com/kubernetes/cloud-provider library that is used by all cloud provider implementations.
- The k8s apiserver considers the providerID field on the Node as immutable and will reject any changes.
As a result of this, we need to actually delete and re-create the Node resource in order to update the providerID for existing Nodes.
The migration is complicated further because cluster-api expects Node and Machine providerIDs to match, and that comparison is done by the Machine controller. After upgrading the cloud provider, CAPI will fail to match Machines/Nodes for any newly created Machines when using the updated cloud provider deployment.
As a result, we need to introduce a change in cluster-api-provider-packet (in the upcoming v0.4.0 release) that uses the new `equinixmetal://` providerID prefix instead of the older `packet://` prefix. Since we don't enforce immutability of this field (in CAPP or upstream CAPI), we can actually update the field for both existing and new Machines, which makes that part of the migration a lot simpler, thankfully.
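Because the Machine field is mutable, that part of the migration can be as simple as patching the prefix in place. The following is a rough sketch under that assumption (not CAPP's actual migration tooling; the v1alpha4 import path and function name are illustrative):

```go
package main

import (
	"context"
	"strings"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha4"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// rewriteMachineProviderID swaps a Machine's packet:// providerID prefix for
// equinixmetal://, which is possible because neither CAPP nor core CAPI treat
// the field as immutable.
func rewriteMachineProviderID(ctx context.Context, c client.Client, namespace, name string) error {
	machine := &clusterv1.Machine{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, machine); err != nil {
		return err
	}
	if machine.Spec.ProviderID == nil || !strings.HasPrefix(*machine.Spec.ProviderID, "packet://") {
		return nil // nothing to migrate
	}
	patchBase := client.MergeFrom(machine.DeepCopy())
	newID := "equinixmetal://" + strings.TrimPrefix(*machine.Spec.ProviderID, "packet://")
	machine.Spec.ProviderID = &newID
	return c.Patch(ctx, machine, patchBase)
}
```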
To ease the migration and provide an upgrade with the least amount of disruption we are working on tooling and documentation to guide users through the upgrade in the following order:
- Upgrade the cloud provider for all existing workload clusters (rename the config map, update the deployment)
- Upgrade CAPP to the soon-to-be-released v0.4.0 (which includes upgrading core CAPI to v0.4.x as well, to avoid multiple disruptive changes close together)
- Delete and recreate any pre-existing Nodes that still have `packet://`-prefixed providerIDs (a rough sketch of this step follows below)
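As referenced in the last step above, here is a rough sketch of what deleting and recreating a Node with the new prefix could look like (illustrative only, not the actual CAPP tooling; simply deleting the Node and letting the kubelet/CCM re-register it is another option):

```go
package main

import (
	"context"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// recreateNodeWithNewProviderID works around the apiserver treating
// spec.providerID as immutable: the Node object is deleted and recreated with
// the equinixmetal:// prefix, preserving its labels, taints, and annotations.
func recreateNodeWithNewProviderID(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if !strings.HasPrefix(node.Spec.ProviderID, "packet://") {
		return nil // already migrated
	}

	newNode := node.DeepCopy()
	newNode.Spec.ProviderID = "equinixmetal://" + strings.TrimPrefix(node.Spec.ProviderID, "packet://")
	// Strip server-populated fields so the object can be created again.
	newNode.ResourceVersion = ""
	newNode.UID = ""
	newNode.CreationTimestamp = metav1.Time{}

	if err := cs.CoreV1().Nodes().Delete(ctx, nodeName, metav1.DeleteOptions{}); err != nil {
		return err
	}
	_, err = cs.CoreV1().Nodes().Create(ctx, newNode, metav1.CreateOptions{})
	return err
}
```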
@vincepri I don't think this applies to CAPZ, but no objections from my side to enabling it for other providers as long as it's optional and doesn't break existing tests
Looks great, thx :) A few nits from my side
Force-pushed from cfc61e8 to b075c51.
@sbueringer thanks for the review, I believe I've addressed all the comments/suggestions as well as the golangci-lint issue. As an aside, I actually thought about pushing the old/current API support into the framework properly, but it would have required much more invasive changes with a bigger impact on the API (either breaking changes to existing method signatures or a more in-depth migration and deprecation strategy). I suspect it could be something valuable that we might want to think about, but it would warrant a much bigger discussion :)
This is a nice addition!
lgtm pending a question on the additional Cluster helpers
test/e2e/clusterctl_upgrade.go (outdated)
```
@@ -299,3 +330,73 @@ func downloadToTmpFile(url string) string {
	return tmpFile.Name()
}

// deleteAllClustersAndWait deletes all v1alpha3 cluster resources in the given namespace and waits for them to be gone.
func deleteAllClustersAndWait(ctx context.Context, input framework.DeleteAllClustersAndWaitInput, intervals ...interface{}) {
```
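For readers following the thread, a spec-local, v1alpha3-typed helper of that shape might look roughly like this simplified sketch (the real helper takes `framework.DeleteAllClustersAndWaitInput`; here a plain client and namespace are used instead, and the v1alpha3 import path is assumed):

```go
package e2e

import (
	"context"

	. "github.com/onsi/gomega"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	clusterv1old "sigs.k8s.io/cluster-api/api/v1alpha3"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteAllClustersAndWaitSketch lists all v1alpha3 Clusters in a namespace,
// deletes them, and waits until they are gone.
func deleteAllClustersAndWaitSketch(ctx context.Context, c client.Client, namespace string, intervals ...interface{}) {
	clusters := &clusterv1old.ClusterList{}
	Expect(c.List(ctx, clusters, client.InNamespace(namespace))).To(Succeed())

	for i := range clusters.Items {
		Expect(c.Delete(ctx, &clusters.Items[i])).To(Succeed())
	}

	for i := range clusters.Items {
		key := client.ObjectKey{Namespace: clusters.Items[i].Namespace, Name: clusters.Items[i].Name}
		Eventually(func() bool {
			return apierrors.IsNotFound(c.Get(ctx, key, &clusterv1old.Cluster{}))
		}, intervals...).Should(BeTrue())
	}
}
```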
Q: should this be implemented as a helper in https://github.com/kubernetes-sigs/cluster-api/blob/master/test/framework/cluster_helpers.go (so we can benefit from `GetAllClustersByNamespace`, `DeleteCluster`, and `WaitForClusterDeleted` already existing there)?
Can we use those? The new funcs all use v1alpha3 types, which we kind of only need here.
@fabriziopandini more than happy to move it to the framework. My initial thought was that we could add the `discovery.ServerSupportsVersion()` check into the main helper functions; that way we could even use most of those helpers in the pre-upgrade steps of the clusterctl_upgrade test, but the existing method signatures complicated that a bit.
I'm more than happy to explore either that route or just add parallel methods where needed for the "old" API version. Just let me know which approach you prefer.
This is a good point.
What about adding a v1alpha3 suffix to those funcs so it is more evident why we are using custom helpers here?
@fabriziopandini what about using `OldAPI` as the suffix? That should avoid complicating future API changes (we should be able to stick with just updating the imports).
Updated with the `OldAPI` suffixing. I'm going to try playing around with supporting a fallback to the old API in the framework helpers, and if that pans out, I'll open a separate PR to discuss it.
Just another idea: what about implementing version-independent deletion helpers? It looks like we only access a few fields during deletion, so I guess it could be implemented reasonably easily with unstructured.
(Feel free to ignore this if it would be too hacky / too much effort :))
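To illustrate the suggestion, a version-independent deletion helper could work on unstructured objects keyed only by group/version/kind, along these lines (a sketch of the idea, not code from this PR; waiting for deletion to finish is omitted for brevity):

```go
package e2e

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteAllClustersUnstructured deletes every Cluster in a namespace without
// depending on a specific typed API version of the cluster.x-k8s.io group.
func deleteAllClustersUnstructured(ctx context.Context, c client.Client, namespace, version string) error {
	list := &unstructured.UnstructuredList{}
	list.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "cluster.x-k8s.io",
		Version: version, // e.g. "v1alpha3" or "v1alpha4"
		Kind:    "ClusterList",
	})
	if err := c.List(ctx, list, client.InNamespace(namespace)); err != nil {
		return err
	}
	for i := range list.Items {
		if err := c.Delete(ctx, &list.Items[i]); err != nil {
			return err
		}
	}
	return nil
}
```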
@sbueringer I'm going to experiment with making the framework helpers version-independent, but will separate that work out into its own PR if it works out.
Sounds good to me. I didn't see your last answer before sending mine :)
Force-pushed from b075c51 to 4bcbca1.
…rctl upgrade spec Also enable better cleanup in clusterctl upgrade spec when management cluster hasn't been upgraded yet.
Force-pushed from 4bcbca1 to f8cc6f5.
/lgtm
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: fabriziopandini
The full list of commands accepted by this bot can be found here. The pull request process is described here.
What this PR does / why we need it:
Allows a provider to specify pre- and post-upgrade actions to take when running `ClusterctlUpgradeSpec`.
Also enables better cleanup in the clusterctl upgrade spec when the management cluster hasn't been upgraded yet.