
✨ [e2e framework] Add ability to run pre and post actions during clusterctl upgrade spec #5093

Merged · 1 commit into kubernetes-sigs:master on Aug 17, 2021

Conversation

@detiber (Member) commented Aug 13, 2021:

What this PR does / why we need it:

Allows a provider to specify pre- and post-upgrade actions to take when running ClusterctlUpgradeSpec.

Also enables better cleanup in the clusterctl upgrade spec when the management cluster hasn't been upgraded yet.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 13, 2021
}, input.E2EConfig.GetIntervals(specName, "wait-delete-cluster")...)
} else {
log.Logf("Management Cluster does not appear to support CAPI resources.")
}
@detiber (Member, Author) commented:

This cleanup step would previously fail when the spec failed prior to upgrading the management cluster. This change allows cleanup to proceed if a failure occurs before running clusterctl upgrade, or even before clusterctl init has been run.
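
For context, a minimal sketch of the kind of guard this cleanup relies on, assuming access to the management cluster's *rest.Config; the helper name is hypothetical and this is not the exact code from the PR:

import (
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// managementClusterSupportsCAPI (hypothetical helper) reports whether the
// management cluster serves the given CAPI GroupVersion, e.g.
// cluster.x-k8s.io/v1alpha3. Deleting Cluster resources is only attempted
// when this returns true; otherwise the spec just logs and moves on.
func managementClusterSupportsCAPI(restConfig *rest.Config, gv schema.GroupVersion) bool {
	dc, err := discovery.NewDiscoveryClientForConfig(restConfig)
	if err != nil {
		return false
	}
	// ServerSupportsVersion returns nil only when the GroupVersion is served.
	return discovery.ServerSupportsVersion(dc, gv) == nil
}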

Comment on lines +57 to +58
PreUpgrade func(managementClusterProxy framework.ClusterProxy)
PostUpgrade func(managementClusterProxy framework.ClusterProxy)
@detiber (Member, Author) commented:

This allows for pre- and post-upgrade steps to be run by providers when running this spec.

For example, in cluster-api-provider-packet, we require users to execute both pre- and post-upgrade actions when moving from v1alpha3 to v1alpha4: a pre-upgrade step to enable the use of a newer version of our cloud provider implementation, which needs to be done prior to the v1alpha4 upgrade, and a post-upgrade step to update the providerIDs on existing nodes after the v1alpha4 upgrade is finished.
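
To make the intended usage concrete, here is a hedged sketch of how a provider e2e suite might wire these hooks up. Only the PreUpgrade/PostUpgrade fields come from this PR; the other input fields and the package-level suite variables (ctx, e2eConfig, clusterctlConfigPath, bootstrapClusterProxy, artifactFolder, skipCleanup) are assumed to follow the usual provider e2e setup, and upgradeCloudProvider / fixNodeProviderIDs are hypothetical provider-specific helpers:

import (
	. "github.com/onsi/ginkgo"

	capi_e2e "sigs.k8s.io/cluster-api/test/e2e"
	"sigs.k8s.io/cluster-api/test/framework"
)

var _ = Describe("When testing clusterctl upgrades [clusterctl-upgrade]", func() {
	capi_e2e.ClusterctlUpgradeSpec(ctx, func() capi_e2e.ClusterctlUpgradeSpecInput {
		return capi_e2e.ClusterctlUpgradeSpecInput{
			E2EConfig:             e2eConfig,
			ClusterctlConfigPath:  clusterctlConfigPath,
			BootstrapClusterProxy: bootstrapClusterProxy,
			ArtifactFolder:        artifactFolder,
			SkipCleanup:           skipCleanup,
			// Runs against the management cluster before clusterctl upgrade.
			PreUpgrade: func(managementClusterProxy framework.ClusterProxy) {
				upgradeCloudProvider(ctx, managementClusterProxy)
			},
			// Runs after the management cluster has been upgraded to v1alpha4.
			PostUpgrade: func(managementClusterProxy framework.ClusterProxy) {
				fixNodeProviderIDs(ctx, managementClusterProxy)
			},
		}
	})
})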

@vincepri (Member) commented:

Thanks for this @detiber! (also cc @CecileRobertMichon, who might be interested in this from a CAPZ perspective)

"and an additional step to update the providerIDs on existing nodes after the v1alpha4 upgrade is finished"

Curious about this one, do you have more details?

@detiber (Member, Author) replied:

@vincepri the main issue we are facing is a result of the rebranding from Packet to Equinix Metal. We've already fully completed the rebranding effort for our cloud provider (https://github.com/equinix/cloud-provider-equinix-metal), formerly known as the "Packet CCM"; that part only strictly requires renaming a config map.

However, the newer version of the cloud provider uses equinixmetal:// as the prefix for node providerIDs, while the old version used packet://. The cloud provider will still work with both the new and old prefixes, but any new Nodes will get the equinixmetal:// prefix. It doesn't, however, support updating the providerID for any Nodes that already exist with the old packet:// prefix, and the providerID cannot simply be updated in place for a couple of reasons.

As a result, we need to actually delete and re-create the Node resource in order to update the providerID for existing Nodes.

The migration is complicated further because cluster-api expects Node and Machine provider IDs to match, and that comparison is done by the Machine controller. After upgrading the cloud provider, CAPI will fail to match Machines/Nodes for any newly created Machines when using the updated cloud provider deployment.

As a result, we need to introduce a change in cluster-api-provider-packet (in the upcoming v0.4.0 release) that uses the new equinixmetal:// provider ID prefix instead of the older packet:// prefix. Since we don't enforce immutability of this field (in CAPP or upstream CAPI), we can actually update the field for both existing and new Machines, which makes that part of the migration a lot simpler, thankfully.

To ease the migration and provide an upgrade with the least amount of disruption, we are working on tooling and documentation to guide users through the upgrade in the following order:

  • Upgrade the cloud provider for all existing workload clusters (rename the config map, update the deployment)
  • Upgrade CAPP to the soon-to-be-released v0.4.0 (which includes upgrading core CAPI to v0.4.x as well, to avoid multiple disruptive changes close together)
  • Delete and recreate any pre-existing Nodes with packet:// prefixed provider IDs (a rough sketch of this step follows below)
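
For illustration only, a hedged sketch of that last step (this is not the official CAPP migration tooling); it assumes a client-go clientset pointed at the workload cluster and that deleting/recreating the Node is acceptable for the workloads running on it:

import (
	"context"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// recreateNodeWithNewProviderID (hypothetical helper) swaps the packet://
// prefix for equinixmetal:// on a single Node. Because spec.providerID is
// immutable once set, the Node has to be deleted and re-created.
func recreateNodeWithNewProviderID(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	// Build a fresh object; clear fields the API server must assign on creation.
	newNode := node.DeepCopy()
	newNode.ResourceVersion = ""
	newNode.UID = ""
	newNode.Spec.ProviderID = strings.Replace(node.Spec.ProviderID, "packet://", "equinixmetal://", 1)

	if err := cs.CoreV1().Nodes().Delete(ctx, nodeName, metav1.DeleteOptions{}); err != nil {
		return err
	}
	_, err = cs.CoreV1().Nodes().Create(ctx, newNode, metav1.CreateOptions{})
	return err
}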

@CecileRobertMichon (Contributor) commented:

@vincepri I don't think this applies to CAPZ, but no objections from my side to enabling it for other providers as long as it's optional and doesn't break existing tests

@sbueringer (Member) left a comment:

Looks great, thx :) A few nits from my side

@detiber force-pushed the clusterctlUpgrade branch from cfc61e8 to b075c51 (August 16, 2021 13:56)
@detiber (Member, Author) commented Aug 16, 2021:

@sbueringer thanks for the review, I believe I've addressed all the comments/suggestions as well as the golangci-lint issue.

As an aside, I actually thought about pushing the old/current API support into the framework properly, but it would have required much more invasive changes with a bigger impact on the API (either breaking changes to existing method signatures or a more in-depth migration and deprecation strategy). I suspect it could be something valuable that we might want to think about, but it would warrant a much bigger discussion :)

@fabriziopandini (Member) left a comment:

This is a nice addition!
lgtm pending a question on the additional Cluster helpers

@@ -299,3 +330,73 @@ func downloadToTmpFile(url string) string {

return tmpFile.Name()
}

// deleteAllClustersAndWait deletes all v1alpha3 cluster resources in the given namespace and waits for them to be gone.
func deleteAllClustersAndWait(ctx context.Context, input framework.DeleteAllClustersAndWaitInput, intervals ...interface{}) {
@fabriziopandini (Member) commented:

Q: should this be implemented as a helper in https://github.com/kubernetes-sigs/cluster-api/blob/master/test/framework/cluster_helpers.go (so we can benefit from GetAllClustersByNamespace, DeleteCluster, WaitForClusterDeleted already existing there)?

Member replied:

Can we use those? The new funcs are all using v1alpha3 types which we kind of only need here.

@detiber (Member, Author) replied:

@fabriziopandini more than happy to move it to the framework. My initial thought was that we could add the discovery.ServerSupportsVersion() check into the main helper functions; that way we could even use most of those helpers in the pre-upgrade steps of the clusterctl_upgrade test, but the existing method signatures complicated that a bit.

I'm more than happy to explore either that route or just add parallel methods where needed for the "old" API version. Just let me know which approach you prefer.

@fabriziopandini (Member) replied:

This is a good point.
What about adding a v1alpha3 suffix to those funcs so it is more evident why we are using custom helpers here?

@detiber (Member, Author) replied:

@fabriziopandini what about using OldAPI as the suffix? That should avoid complicating future API changes (we should be able to stick with just updating the imports).

@detiber (Member, Author) commented:

Updated with the OldAPI suffixing. I'm going to play around with supporting falling back to the old API in the framework helpers, and if that pans out, open a separate PR to discuss it.

@sbueringer (Member) commented Aug 16, 2021:

@detiber @fabriziopandini

Just another idea: what about implementing version-independent deletion helpers? It looks like we only access a few fields during deletion, so I guess it can be implemented reasonably easily with unstructured.

(Feel free to ignore if that would be too hacky / too much effort :))
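
To make the idea a bit more concrete, a hedged sketch of what such a version-independent deletion helper could look like, using unstructured objects plus the RESTMapper so no v1alpha3/v1alpha4 types are hard-coded. This is an illustration of the suggestion, not code from this PR; the helper name is hypothetical, and the caller would still poll until the list comes back empty:

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteAllClustersUnstructured deletes every Cluster in the namespace without
// depending on a specific API version of the Cluster type.
func deleteAllClustersUnstructured(ctx context.Context, c client.Client, mapper meta.RESTMapper, namespace string) error {
	// Ask the RESTMapper which version of cluster.x-k8s.io/Cluster is served.
	mapping, err := mapper.RESTMapping(schema.GroupKind{Group: "cluster.x-k8s.io", Kind: "Cluster"})
	if err != nil {
		return err
	}

	gvk := mapping.GroupVersionKind
	gvk.Kind += "List"
	list := &unstructured.UnstructuredList{}
	list.SetGroupVersionKind(gvk)
	if err := c.List(ctx, list, client.InNamespace(namespace)); err != nil {
		return err
	}

	// Issue deletes; waiting for the Clusters to be gone is left to the caller.
	for i := range list.Items {
		if err := c.Delete(ctx, &list.Items[i]); err != nil && !apierrors.IsNotFound(err) {
			return err
		}
	}
	return nil
}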

@detiber (Member, Author) replied:

@sbueringer I'm going to experiment with doing version independence in the framework helpers, but will separate that work out into its own PR if it works out.

@sbueringer (Member) replied:

Sounds good to me. I didn't see your last answer before sending mine :)

@detiber force-pushed the clusterctlUpgrade branch from b075c51 to 4bcbca1 (August 16, 2021 15:19)
Commit: …rctl upgrade spec

Also enable better cleanup in clusterctl upgrade spec when management
cluster hasn't been upgraded yet.
@detiber force-pushed the clusterctlUpgrade branch from 4bcbca1 to f8cc6f5 (August 16, 2021 15:36)
@sbueringer (Member):

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 16, 2021
@fabriziopandini (Member) left a comment:

/lgtm
/approve

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

  • Approvers can indicate their approval by writing /approve in a comment
  • Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 17, 2021
@k8s-ci-robot k8s-ci-robot merged commit f23f263 into kubernetes-sigs:master Aug 17, 2021
@k8s-ci-robot k8s-ci-robot added this to the v0.4 milestone Aug 17, 2021