⚠️ Machine ProviderID equality is now strictly enforced #6412
@enxebre @alexeldeib is this PR getting close to addressing #4526? Obviously simply changing the capi equality implementation will require per-cloud-provider changes in their cluster-api providers in order to prevent breakage. Is that what we want to advocate? |
yeah this solves it. I actually don't know that it requires infra provider changes -- CAPI should not have been the one setting these values ideally, CCM or similar would be, so this PR just aligns CAPI to CCM's behavior. put differently: in AWS one node may be represented with multiple providerIDs, but a concrete node will only ever have one form of that provider ID, so I feel this should be safe. probably good to get some CAPA eyes on this and maybe something less cloud-y like CAPD? |
@sedefsavas can you PTAL at this change and confirm what @alexeldeib is suggesting? Thank you! @CecileRobertMichon are you able to speak authoritatively on capd ramifications? |
CAPD runs the PR e2e tests so if tests are passing I feel good about this change. Let's run the full suite to be sure. /test ls |
/test pull-cluster-api-e2e-full-main |
CCM sets the providerID on Nodes, but capi providers set the providerID on Machines. Therefore, although not necessarily, this might break providers, as it changes the equality contract. We should proceed with that in mind. We should also update IndexKey() consistently. |
@enxebre I'm not sure how |
@jackfrancis |
I think there's more background here that I'm not aware of, but from a naive point of view |
indexKey is just an abstraction used to index (and fetch) machines/nodes by providerID in the controllers' indexers. It must be unique, so it should use whatever we consider to be the equality contract; otherwise we could end up with one index mapping to multiple machines/nodes. If the impl of indexKey happens to be a thin wrapper, that's fine. |
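To make the uniqueness concern concrete, here is a minimal sketch, not the CAPI implementation. The helpers `looseKey`, `strictKey` and `buildIndex` are hypothetical names, and the provider ID strings are made up. It assumes a "loose" key built from the scheme plus the last path segment, and shows how such a key can map one index entry to two machines, while a key built from the whole string cannot.

```go
package main

import (
	"fmt"
	"strings"
)

// looseKey is a hypothetical index key built from the scheme and the last
// path segment only (a "cloud provider plus ID" notion of identity).
func looseKey(providerID string) string {
	parts := strings.SplitN(providerID, "://", 2)
	if len(parts) != 2 {
		return providerID
	}
	segs := strings.Split(parts[1], "/")
	return parts[0] + "/" + segs[len(segs)-1]
}

// strictKey is a hypothetical index key that uses the whole string.
func strictKey(providerID string) string { return providerID }

// buildIndex groups machine names by the key derived from their provider ID.
func buildIndex(machines map[string]string, key func(string) string) map[string][]string {
	idx := make(map[string][]string)
	for name, pid := range machines {
		k := key(pid)
		idx[k] = append(idx[k], name)
	}
	return idx
}

func main() {
	// Made-up provider IDs for illustration only.
	machines := map[string]string{
		"machine-a": "example:///pool-1/0",
		"machine-b": "example:///pool-2/0",
	}
	// With the loose key both machines share one entry; with the strict key
	// each machine gets its own entry.
	fmt.Println("loose keys:", len(buildIndex(machines, looseKey)))
	fmt.Println("strict keys:", len(buildIndex(machines, strictKey)))
}
```

Whichever definition of equality is chosen, deriving the index key from the same fields keeps lookups and comparisons in agreement.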
2e88f2e to 5b099d4
@enxebre got it, updated |
Hm. That would require CCM and CAPI to disagree on what the providerID is for a given node, though? I suppose it could be something like -- CAPA sets provider ID using AZ, CCM doesn't, current logic evaluates those as the same node even though the strings are different? This is sort of why I wanted a sanity check on other providers. That would seem to require CAPI providers to be mutating the provider ID they get back from cloud providers (or perhaps CCM doing the same), which would be a world of hurt regardless of the contract. At a glance, seems like CAPA just reads whatever EC2 returns, which is roughly the same as Azure. Not sure how CCM for AWS generates provider IDs, but I'd hope it's the same. https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/7a545cd1aa5d05e5a0364a49b3c700341737ca61/controllers/awsmachine_controller.go#L510-L511 Definitely want to make sure we get this lined up right either way. |
/test pull-cluster-api-e2e-full-main |
@alexeldeib That's exactly right, so since we are changing the equality criteria this could technically break any provider out there. So at the very minimum this would require a note for provider implementers. Also I agree with @killianmuldoon #6412 (comment): we need to deprecate the existing behaviour first, as this is a public func. |
adf03e8 to 364a7bf
/test pull-cluster-api-e2e-full-main |
Correction to my comment above, the That change has been realized in the PR now, tests look good. I think at this point we need to evaluate the potential for provider breakage, and if we can convince ourselves that this is actually fixing a capi bug based on the way that providerID URLs are set from their real-world authoritative sources, then we just want to make a communication plan so that any providers who need to update their code have time to do so. |
@@ -89,7 +88,7 @@ func (p *ProviderID) ID() string {
// Equals returns true if both the CloudProvider and ID match.
comment should be updated to match new function (or we should make a new func as @killianmuldoon suggests)
How might we deprecate this in favor of another func and not affect users? AFAICT this func is only used here:
If we created a new func and updated the reference in (r *Reconciler) getNode
we would still be making a change for all providers, no?
I think we need to deprecate it and replace as it's a part of our public Go API, so we don't have control over who is using it and for what.
As for the behavioural change inside the CAPI method, I think the impact of that needs to be assessed separately. If there is an impact from changing the behaviour here that should be called out in the deprecation notice and/or on the migration guide.
@jackfrancis here's an example of deprecating a public func in CAPI https://github.com/kubernetes-sigs/cluster-api/pull/5545/files#diff-e25fb86695a8022fa1611100bd244f9b4185f28629d19830e83d18b4fa2ce710R41
And an example of adding to the migration guide for providers f73a277#diff-531b360cba021aa4e0cf2df8f2ce8ead9601db1601bf516cab3c916a83622436R85
364a7bf to 20f3fd5
/hold I'm quite sure we had this behavior back in the v1alpha1 days and we had to change for something more structured given that some Cluster API providers can only set (guess?) a partial provider ID. Looking through the Cluster API AWS provider code, we're still doing that today https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/760e4e7651f46c29bf309607a0b6307570228988/pkg/cloud/scope/machine.go#L144-L148 Unless we have a contract requirement for every cloud provider to match their equivalent CCM/CPI codebase, we should keep this behavior and introduce a new one over time. I might have missed this while going through the above messages, what is the problem we're trying to solve with this PR? |
We would need another PR to enable this in CAPA. It should be ok till we bump to the CAPI version which contains this change. |
@fabriziopandini @sbueringer over to you for approval |
I'll take a look early next week after the public holiday on Monday |
I'm not sure if this detail has been raised yet, but when using the Azure provider this bug can actually cause service disruptions. If a MachinePool is scaled down, CAPI may delete node resources associated with running VMs in random, unrelated MachinePools. I believe this occurs because the nodeRefs in the status fields of the MachinePool are incorrect, combined with the logic in this function: cluster-api/exp/internal/controllers/machinepool_controller_noderef.go Lines 132 to 135 in 1917d52 |
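As a rough illustration of how loose matching could produce the symptom described above, here is a sketch under stated assumptions. It is not the code from machinepool_controller_noderef.go. It assumes, hypothetically, that provider IDs in different pools can end in the same trailing instance segment; the thread does not state this, and the IDs, node names and the helpers `scheme`, `lastSegment` and `matchNodes` are all made up.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// scheme returns the part of a provider ID before "://".
func scheme(providerID string) string {
	return strings.SplitN(providerID, "://", 2)[0]
}

// lastSegment returns the final path segment of a provider ID string.
func lastSegment(providerID string) string {
	return providerID[strings.LastIndex(providerID, "/")+1:]
}

// matchNodes returns the sorted names of nodes whose provider ID matches any
// entry in poolIDs according to eq. nodes maps node name to provider ID.
func matchNodes(nodes map[string]string, poolIDs []string, eq func(a, b string) bool) []string {
	var matched []string
	for name, nodeID := range nodes {
		for _, want := range poolIDs {
			if eq(nodeID, want) {
				matched = append(matched, name)
				break
			}
		}
	}
	sort.Strings(matched)
	return matched
}

func main() {
	// Hypothetical IDs: instance "0" exists in two different pools.
	nodes := map[string]string{
		"node-pool1-0": "example:///scalesets/pool-1/vms/0",
		"node-pool2-0": "example:///scalesets/pool-2/vms/0",
	}
	pool1 := []string{"example:///scalesets/pool-1/vms/0"}

	loose := func(a, b string) bool {
		return scheme(a) == scheme(b) && lastSegment(a) == lastSegment(b)
	}
	strict := func(a, b string) bool { return a == b }

	fmt.Println("loose: ", matchNodes(nodes, pool1, loose))
	fmt.Println("strict:", matchNodes(nodes, pool1, strict))
}
```

Under these assumptions the loose comparison attributes a node from pool-2 to pool-1, which is the kind of mismatch that would make a scale-down touch the wrong node; the strict comparison matches only the node whose full string is identical.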
As far as I understand the situation, the PR sounds fine to me. @jackfrancis Let's please add a note to |
1a2e324 to c5fd6f4
@sbueringer added a note to the 1.2-1.3 doc PTAL and I think we're finally ready here |
Changes LGTM, @jackfrancis can you rebase? |
c5fd6f4 to 626b4f4
@vincepri done |
Thank you very much! /lgtm |
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: CecileRobertMichon
What this PR does / why we need it:
This PR updates the capi ProviderID type equality enforcement (via the Equals object pointer receiver method) to validate against the entire provider ID string rather than a concatenation of the "cloud provider" plus "ID" substrings.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4526
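The difference between the two equality rules can be sketched as follows. This is a simplified stand-in, not the capi type: `providerID`, `parseProviderID`, `looseEquals` and `strictEquals` are hypothetical names, the parsing is deliberately naive, and the AWS-style strings are illustrative values loosely based on the "with AZ vs. without AZ" case raised in the discussion.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// providerID is a hypothetical stand-in for a parsed provider ID.
type providerID struct {
	raw           string // the full string as supplied
	cloudProvider string // the part before "://"
	id            string // the last path segment
}

// parseProviderID splits a string of the form <cloud>://<path> into its parts.
func parseProviderID(raw string) (*providerID, error) {
	parts := strings.SplitN(raw, "://", 2)
	if len(parts) != 2 || parts[0] == "" {
		return nil, errors.New("provider ID must look like <cloud>://<path>")
	}
	segs := strings.Split(parts[1], "/")
	id := segs[len(segs)-1]
	if id == "" {
		return nil, errors.New("provider ID has an empty trailing segment")
	}
	return &providerID{raw: raw, cloudProvider: parts[0], id: id}, nil
}

// looseEquals compares only the cloud provider and the trailing ID.
func (p *providerID) looseEquals(o *providerID) bool {
	return p.cloudProvider == o.cloudProvider && p.id == o.id
}

// strictEquals compares the entire provider ID string.
func (p *providerID) strictEquals(o *providerID) bool {
	return p.raw == o.raw
}

func main() {
	a, err := parseProviderID("aws:///us-west-2a/i-0abc")
	if err != nil {
		panic(err)
	}
	b, err := parseProviderID("aws:///i-0abc")
	if err != nil {
		panic(err)
	}
	fmt.Println(a.looseEquals(b), a.strictEquals(b)) // prints: true false
}
```

A provider that writes a different string form on the Machine than the one that ends up on the Node would match under the loose rule and stop matching under the strict one, which is why the thread asks for a note to provider implementers.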