✨ [e2e framework] Add ability to run pre and post actions during clusterctl upgrade spec #5093
Conversation
```go
	}, input.E2EConfig.GetIntervals(specName, "wait-delete-cluster")...)
} else {
	log.Logf("Management Cluster does not appear to support CAPI resources.")
}
```
This cleanup step would previously fail when the spec failed prior to upgrading the management cluster; this change allows cleanup to proceed if a failure occurs before `clusterctl upgrade`, or even before `clusterctl init`, has been run.
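For context, the kind of capability check described above can be done with client-go's discovery helpers. The snippet below is only an illustrative sketch of that guard, not the exact code in this PR; the function name and wiring are assumptions.

```go
package e2e

import (
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// managementClusterSupportsCAPI (hypothetical helper) reports whether the
// management cluster serves the v1alpha3 CAPI API group, so cleanup can be
// skipped when the spec failed before `clusterctl init` ever ran.
func managementClusterSupportsCAPI(restConfig *rest.Config) (bool, error) {
	dc, err := discovery.NewDiscoveryClientForConfig(restConfig)
	if err != nil {
		return false, err
	}
	// ServerSupportsVersion returns an error when the group/version is not served.
	if err := discovery.ServerSupportsVersion(dc, schema.GroupVersion{
		Group:   "cluster.x-k8s.io",
		Version: "v1alpha3",
	}); err != nil {
		return false, nil
	}
	return true, nil
}
```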
```go
PreUpgrade  func(managementClusterProxy framework.ClusterProxy)
PostUpgrade func(managementClusterProxy framework.ClusterProxy)
```
This allows pre- and post-upgrade steps to be run by providers when running this spec.
For example, in cluster-api-provider-packet we require users to execute both pre- and post-upgrade actions when moving from v1alpha3 to v1alpha4: a pre-upgrade step to switch to a newer version of our cloud provider implementation, which needs to happen before the v1alpha4 upgrade, and a post-upgrade step to update the providerIDs on existing Nodes after the v1alpha4 upgrade is finished.
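To make the new fields concrete, a provider's e2e suite could wire them up roughly as follows. This is a hedged sketch: only the `PreUpgrade`/`PostUpgrade` fields and their signatures come from this PR; the suite-level variables (`ctx`, `e2eConfig`, etc.), the two migration helpers, and the remaining input fields follow the usual spec-input shape and may differ per provider.

```go
var _ = Describe("When testing clusterctl upgrades [clusterctl-upgrade]", func() {
	capi_e2e.ClusterctlUpgradeSpec(ctx, func() capi_e2e.ClusterctlUpgradeSpecInput {
		return capi_e2e.ClusterctlUpgradeSpecInput{
			E2EConfig:             e2eConfig,
			ClusterctlConfigPath:  clusterctlConfigPath,
			BootstrapClusterProxy: bootstrapClusterProxy,
			ArtifactFolder:        artifactFolder,
			SkipCleanup:           skipCleanup,
			// Provider-specific hooks added by this PR, e.g. migrating the
			// cloud provider before the upgrade and fixing up providerIDs after.
			PreUpgrade: func(proxy framework.ClusterProxy) {
				migrateCloudProvider(ctx, proxy) // hypothetical provider helper
			},
			PostUpgrade: func(proxy framework.ClusterProxy) {
				updateNodeProviderIDs(ctx, proxy) // hypothetical provider helper
			},
		}
	})
})
```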
Thanks for this @detiber! (also cc @CecileRobertMichon who might be interested in this from the CAPZ perspective)

> and an additional step to update the providerIDs on existing nodes after the v1alpha4 upgrade is finished

Curious about this one, do you have more details?
@vincepri the main issue we are facing is a result of the rebranding from Packet to Equinix Metal. We've already fully completed the rebranding effort for our cloud provider (https://github.com/equinix/cloud-provider-equinix-metal), formerly known as the "Packet CCM"; strictly speaking, that only requires renaming a config map.
However, the newer version of the cloud provider uses `equinixmetal://` as the prefix for Node providerIDs, while the old version used `packet://`. The cloud provider still works with both the new and old prefixes, but any new Nodes will get the `equinixmetal://` prefix. It does not, however, support updating the providerID of Nodes that still carry the old `packet://` prefix, for a couple of reasons:
- The functionality is not exposed in the upstream https://github.com/kubernetes/cloud-provider library that is used by all cloud provider implementations.
- The k8s apiserver considers the providerID field on the Node as immutable and will reject any changes.
As a result of this, we need to actually delete and re-create the Node resource in order to update the providerID for existing Nodes.
The migration is complicated further because cluster-api expects Node and Machine providerIDs to match, and that comparison is done by the Machine controller. After upgrading the cloud provider, CAPI will fail to match Machines/Nodes for any newly created Machines when using the updated cloud provider deployment.
As a result, we need to introduce a change in cluster-api-provider-packet (in the upcoming v0.4.0 release) that uses the new `equinixmetal://` providerID prefix instead of the older `packet://` prefix. Since we don't enforce immutability of this field (in CAPP or upstream CAPI), we can actually update the field for both existing and new Machines, which makes that part of the migration a lot simpler, thankfully.
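Because the Machine field is mutable, that part of the migration can be as simple as patching the prefix in place. The following is a rough sketch under that assumption (not CAPP's actual migration tooling; the v1alpha4 import path and function name are illustrative):

```go
package main

import (
	"context"
	"strings"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha4"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// rewriteMachineProviderID swaps a Machine's packet:// providerID prefix for
// equinixmetal://, which is possible because neither CAPP nor core CAPI treat
// the field as immutable.
func rewriteMachineProviderID(ctx context.Context, c client.Client, namespace, name string) error {
	machine := &clusterv1.Machine{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, machine); err != nil {
		return err
	}
	if machine.Spec.ProviderID == nil || !strings.HasPrefix(*machine.Spec.ProviderID, "packet://") {
		return nil // nothing to migrate
	}
	patchBase := client.MergeFrom(machine.DeepCopy())
	newID := "equinixmetal://" + strings.TrimPrefix(*machine.Spec.ProviderID, "packet://")
	machine.Spec.ProviderID = &newID
	return c.Patch(ctx, machine, patchBase)
}
```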
To ease the migration and provide an upgrade with the least amount of disruption we are working on tooling and documentation to guide users through the upgrade in the following order:
- Upgrade the cloud provider for all existing workload clusters (rename the config map, update the deployment)
- Upgrade CAPP to the soon-to-be-released v0.4.0 (which includes upgrading core CAPI to v0.4.x as well, to avoid multiple disruptive changes close together)
- Delete and recreate any pre-existing Nodes that still have `packet://`-prefixed providerIDs (a rough sketch of this step follows below)
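As referenced in the last step above, here is a rough sketch of what deleting and recreating a Node with the new prefix could look like (illustrative only, not the actual CAPP tooling; simply deleting the Node and letting the kubelet/CCM re-register it is another option):

```go
package main

import (
	"context"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// recreateNodeWithNewProviderID works around the apiserver treating
// spec.providerID as immutable: the Node object is deleted and recreated with
// the equinixmetal:// prefix, preserving its labels, taints, and annotations.
func recreateNodeWithNewProviderID(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if !strings.HasPrefix(node.Spec.ProviderID, "packet://") {
		return nil // already migrated
	}

	newNode := node.DeepCopy()
	newNode.Spec.ProviderID = "equinixmetal://" + strings.TrimPrefix(node.Spec.ProviderID, "packet://")
	// Strip server-populated fields so the object can be created again.
	newNode.ResourceVersion = ""
	newNode.UID = ""
	newNode.CreationTimestamp = metav1.Time{}

	if err := cs.CoreV1().Nodes().Delete(ctx, nodeName, metav1.DeleteOptions{}); err != nil {
		return err
	}
	_, err = cs.CoreV1().Nodes().Create(ctx, newNode, metav1.CreateOptions{})
	return err
}
```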
@vincepri I don't think this applies to CAPZ, but no objections from my side to enabling it for other providers as long as it's optional and doesn't break existing tests
Looks great, thx :) A few nits from my side
Force-pushed from cfc61e8 to b075c51.
@sbueringer thanks for the review, I believe I've addressed all the comments/suggestions as well as the golangci-lint issue. As an aside, I actually thought about pushing the old/current API support into the framework properly, but it would have required much more invasive changes with a bigger impact on the API (either breaking changes to existing method signatures or a more in-depth migration and deprecation strategy). I suspect it could be something valuable that we might want to think about, but it would warrant a much bigger discussion :)
This is a nice addition!
lgtm pending a question on the additional Cluster helpers
test/e2e/clusterctl_upgrade.go (outdated)
```
@@ -299,3 +330,73 @@ func downloadToTmpFile(url string) string {
	return tmpFile.Name()
}

// deleteAllClustersAndWait deletes all v1alpha3 cluster resources in the given namespace and waits for them to be gone.
func deleteAllClustersAndWait(ctx context.Context, input framework.DeleteAllClustersAndWaitInput, intervals ...interface{}) {
```
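For readers following the thread, a spec-local, v1alpha3-typed helper of that shape might look roughly like this simplified sketch (the real helper takes `framework.DeleteAllClustersAndWaitInput`; here a plain client and namespace are used instead, and the v1alpha3 import path is assumed):

```go
package e2e

import (
	"context"

	. "github.com/onsi/gomega"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	clusterv1old "sigs.k8s.io/cluster-api/api/v1alpha3"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteAllClustersAndWaitSketch lists all v1alpha3 Clusters in a namespace,
// deletes them, and waits until they are gone.
func deleteAllClustersAndWaitSketch(ctx context.Context, c client.Client, namespace string, intervals ...interface{}) {
	clusters := &clusterv1old.ClusterList{}
	Expect(c.List(ctx, clusters, client.InNamespace(namespace))).To(Succeed())

	for i := range clusters.Items {
		Expect(c.Delete(ctx, &clusters.Items[i])).To(Succeed())
	}

	for i := range clusters.Items {
		key := client.ObjectKey{Namespace: clusters.Items[i].Namespace, Name: clusters.Items[i].Name}
		Eventually(func() bool {
			return apierrors.IsNotFound(c.Get(ctx, key, &clusterv1old.Cluster{}))
		}, intervals...).Should(BeTrue())
	}
}
```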
Q: should this be implemented as a helper in https://github.com/kubernetes-sigs/cluster-api/blob/master/test/framework/cluster_helpers.go (so we can benefit from `GetAllClustersByNamespace`, `DeleteCluster`, and `WaitForClusterDeleted` already existing there)?
Can we use those? The new funcs all use v1alpha3 types, which we kind of only need here.
@fabriziopandini more than happy to move it to the framework. My initial thought was that we could add the `discovery.ServerSupportsVersion()` check into the main helper functions; that way we could even use most of those helpers in the pre-upgrade steps of the clusterctl_upgrade test, but the existing method signatures complicated that a bit.
I'm more than happy to explore either that route or just add parallel methods where needed for the "old" API version. Just let me know which approach you prefer.
This is a good point.
What about adding a v1alpha3 suffix to those funcs so it is more evident why we are using custom helpers here?
@fabriziopandini what about using `OldAPI` as the suffix? That should avoid complicating future API changes (we should be able to stick with just updating the imports).
Updated with the `OldAPI` suffixing. I'm going to try playing around with supporting a fallback to the old API in the framework helpers, and if that pans out, I'll open a separate PR to discuss it.
Just another idea: what about implementing version-independent deletion helpers? It looks like we only access a few fields during deletion, so I guess it could be implemented reasonably easily with unstructured.
(Feel free to ignore this if it would be too hacky / too much effort :))
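To illustrate the suggestion, a version-independent deletion helper could work on unstructured objects keyed only by group/version/kind, along these lines (a sketch of the idea, not code from this PR; waiting for deletion to finish is omitted for brevity):

```go
package e2e

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteAllClustersUnstructured deletes every Cluster in a namespace without
// depending on a specific typed API version of the cluster.x-k8s.io group.
func deleteAllClustersUnstructured(ctx context.Context, c client.Client, namespace, version string) error {
	list := &unstructured.UnstructuredList{}
	list.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "cluster.x-k8s.io",
		Version: version, // e.g. "v1alpha3" or "v1alpha4"
		Kind:    "ClusterList",
	})
	if err := c.List(ctx, list, client.InNamespace(namespace)); err != nil {
		return err
	}
	for i := range list.Items {
		if err := c.Delete(ctx, &list.Items[i]); err != nil {
			return err
		}
	}
	return nil
}
```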
@sbueringer I'm going to experiment with making the framework helpers version-independent, but will separate that work out into its own PR if it works out.
Sounds good to me. I didn't see your last answer before sending mine :)
Force-pushed from b075c51 to 4bcbca1.
…rctl upgrade spec Also enable better cleanup in clusterctl upgrade spec when management cluster hasn't been upgraded yet.
Force-pushed from 4bcbca1 to f8cc6f5.
/lgtm
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: fabriziopandini
The full list of commands accepted by this bot can be found here. The pull request process is described here.
What this PR does / why we need it:
Allows a provider to specify pre- and post-upgrade actions to take when running `ClusterctlUpgradeSpec`.
Also enables better cleanup in the clusterctl upgrade spec when the management cluster hasn't been upgraded yet.