⚠️ Machine ProviderID equality is now strictly enforced #6412

jackfrancis · 2022-04-12T20:36:57Z

What this PR does / why we need it:

This PR updates the capi ProviderID type equality enforcement (via the Equals object pointer receiver method) to validate against the entire provider ID string rather than a concatenation of the "cloud provider" plus "ID" substrings.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4526

jackfrancis · 2022-04-12T20:38:36Z

@enxebre @alexeldeib is this PR getting close to addressing #4526? Obviously simply changing the capi equality implementation will require per-cloud-provider changes in their cluster-api providers in order to prevent breakage.

Is that what we want to advocate?

alexeldeib · 2022-04-12T21:11:10Z

yeah this solves it. I actually don't know that it requires infra provider changes -- CAPI should not have been the one setting these values ideally, CCM or similar would be, so this PR just aligns CAPI to CCM's behavior.

put differently: in AWS one node may be represented with multiple providerIDs, but a concrete node will only ever have one form of that provider ID, so I feel this should be safe.

probably good to get some CAPA eyes on this and maybe something less cloud-y like CAPD?

jackfrancis · 2022-04-12T21:19:02Z

@sedefsavas can you PTAL at this change and confirm what @alexeldeib is suggesting

thank you!

@CecileRobertMichon are you able to speak authoritiatively on capd ramifications?

CecileRobertMichon · 2022-04-12T21:29:19Z

maybe something less cloud-y like CAPD?

CAPD runs the PR e2e tests so if tests are passing I feel good about this change. Let's run the full suite to be sure.

/test ls

k8s-ci-robot · 2022-04-12T21:29:20Z

@CecileRobertMichon: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test pull-cluster-api-build-main
/test pull-cluster-api-e2e-main
/test pull-cluster-api-test-main
/test pull-cluster-api-test-mink8s-main
/test pull-cluster-api-verify-main

The following commands are available to trigger optional jobs:

/test pull-cluster-api-apidiff-main
/test pull-cluster-api-e2e-full-main
/test pull-cluster-api-e2e-informing-ipv6-main
/test pull-cluster-api-e2e-informing-main
/test pull-cluster-api-e2e-workload-upgrade-1-23-latest-main

Use /test all to run the following jobs that were automatically triggered:

pull-cluster-api-apidiff-main
pull-cluster-api-build-main
pull-cluster-api-e2e-informing-ipv6-main
pull-cluster-api-e2e-informing-main
pull-cluster-api-e2e-main
pull-cluster-api-test-main
pull-cluster-api-test-mink8s-main
pull-cluster-api-verify-main

In response to this:

maybe something less cloud-y like CAPD?

CAPD runs the PR e2e tests so if tests are passing I feel good about this change. Let's run the full suite to be sure.

/test ls

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

CecileRobertMichon · 2022-04-12T21:29:51Z

/test pull-cluster-api-e2e-full-main

controllers/noderefutil/providerid.go

enxebre · 2022-04-13T15:26:25Z

yeah this solves it. I actually don't know that it requires infra provider changes -- CAPI should not have been the one setting these values ideally, CCM or similar would be, so this PR just aligns CAPI to CCM's behavior.

CCM set on Nodes but capi providers set providerID on Machines. Therefore although not necessarily this might definitely break providers as it's changing the equality contract. So we should proceed taking that into consideration.

Also we should consistently update IndexKey().

jackfrancis · 2022-04-13T16:32:34Z

@enxebre I'm not sure how IndexKey() is affected by this change

enxebre · 2022-04-13T16:38:27Z

@enxebre I'm not sure how IndexKey() is affected by this change

// This is useful to use the providerID as a reliable index between nodes and machines // as it guarantees the infra Providers contract.

@jackfrancis IndexKey() was intentionally sticking to the equality contract, which we are changing now so it should be updated with it for consistency. Need to think about impact on running clusters though, I guess it should be fine as the controller is restarted and will just reindex using the new criteria.

jackfrancis · 2022-04-13T17:16:01Z

@enxebre I'm not sure how IndexKey() is affected by this change

// This is useful to use the providerID as a reliable index between nodes and machines // as it guarantees the infra Providers contract.

@jackfrancis IndexKey() was intentionally sticking to the equality contract, which we are changing now so it should be updated with it for consistency. Need to think about impact on running clusters though, I guess it should be fine as the controller is restarted and will just reindex using the new criteria.

I think there's more background here that I'm not aware of, but from a naive point of view IndexKey() doesn't have any value if it simply returns the entire string (it will then be equivalent to the CloudProvider() method which just thinly wraps the cloudProvider string property)

enxebre · 2022-04-13T17:57:21Z

I think there's more background here that I'm not aware of, but from a naive point of view IndexKey() doesn't have any value if it simply returns the entire string (it will then be equivalent to the CloudProvider() method which just thinly wraps the cloudProvider string property)

indexKey is just an abstraction to be used to index (and fetch) machines/nodes by providerID in the controllers indexers. It must be unique and so It should use whatever we consider to be the equality contract, otherwise we could end up with one index mapping to multiple machines/nodes. If the impl of indexKey happens to be a thing wrapper that's fine.

jackfrancis · 2022-04-13T18:08:24Z

@enxebre got it, updated

alexeldeib · 2022-04-13T18:11:02Z

CCM set on Nodes but capi providers set providerID on Machines. Therefore although not necessarily this might definitely break providers as it's changing the equality contract. So we should proceed taking that into consideration

Hm. That would require CCM and CAPI to disagree on what the providerID is for a given node, though? I suppose it could be something like -- CAPA sets provider ID using AZ, CCM doesn't, current logic evaluates those as the same node even though the strings are different? This is sort of why I wanted a sanity check on other providers.

That would seem to require CAPI providers to be mutating the provider ID they get back from cloud providers (or perhaps CCM doing the same), which would be a world of hurt regardless of the contract.

At a glance, seems like CAPA just reads whatever EC2 returns, which is roughly the same as Azure. Not sure how CCM for AWS generates provider IDs, but I'd hope it's the same. https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/7a545cd1aa5d05e5a0364a49b3c700341737ca61/controllers/awsmachine_controller.go#L510-L511

Definitely want to make sure we get this lined up right either way.

jackfrancis · 2022-04-13T18:13:37Z

/test pull-cluster-api-e2e-full-main

enxebre · 2022-04-13T18:29:35Z

Hm. That would require CCM and CAPI to disagree on what the providerID is for a given node, though? I suppose it could be something like -- CAPA sets provider ID using AZ, CCM doesn't, current logic evaluates those as the same node even though the strings are different? This is sort of why I wanted a sanity check on other providers.

@alexeldeib That's exactly right, so since we are changing the equality criteria this could technically break any provider out there. So at the very minimum this would require a note for providers implementers.

Also I agree with @killianmuldoon #6412 (comment) we need to deprecate existing behaviour first as this is a public func.

jackfrancis · 2022-04-13T21:34:54Z

/test pull-cluster-api-e2e-full-main

jackfrancis · 2022-04-13T23:02:26Z

Correction to my comment above, the IndexKey() has now been made equivalent to String() (not CloudProvider()).

That change has been realized in the PR now, tests look good.

I think at this point we need to evaluate the potential for provider breakage, and if we can convince ourselves that this is actually fixing a capi bug based on the way that providerID URLs are set from their real-world authoritative sources, then we just want to make a communication plan so that any providers who need to update their code have time to do so.

CecileRobertMichon · 2022-04-13T23:46:12Z

controllers/noderefutil/providerid.go

@@ -89,7 +88,7 @@ func (p *ProviderID) ID() string {

 // Equals returns true if both the CloudProvider and ID match.


comment should be updated to match new function (or we should make a new func as @killianmuldoon suggests)

How might we deprecate this in favor of another func and not affect users? AFAICT this func is only used here:

https://github.com/kubernetes-sigs/cluster-api/blob/release-1.1/internal/controllers/machine/machine_controller_noderef.go#L196

If we created a new func and updated the reference in (r *Reconciler) getNode we would still be making a change for all providers, no?

I think we need to deprecate it and replace as it's a part of our public Go API, so we don't have control over who is using it and for what.

As for the behavioural change inside the CAPI method, I think the impact of that needs to be assessed separately. If there is an impact from changing the behaviour here that should be called out in the deprecation notice and/or on the migration guide.

@jackfrancis here's an example of deprecating a public func in CAPI https://github.com/kubernetes-sigs/cluster-api/pull/5545/files#diff-e25fb86695a8022fa1611100bd244f9b4185f28629d19830e83d18b4fa2ce710R41

And an example of adding to the migration guide for providers f73a277#diff-531b360cba021aa4e0cf2df8f2ce8ead9601db1601bf516cab3c916a83622436R85

vincepri · 2022-05-02T14:35:36Z

/hold

I'm quite sure we had this behavior back in the v1alpha1 days and we had to change for something more structured given that some Cluster API providers can only set (guess?) a partial provider ID. Looking through the Cluster API AWS provider code, we're still doing that today https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/760e4e7651f46c29bf309607a0b6307570228988/pkg/cloud/scope/machine.go#L144-L148

Unless we have a contract requirement for every cloud provider to match their equivalent CCM/CPI codebase, we should keep this behavior and introduce a new one over time.

I might have missed this while going through the above messages, what is the problem we're trying to solve with this PR?

Ankitasw · 2022-09-30T15:15:32Z

To make sure CAPA is all set to accept this change

We would need another PR to enable this in CAPA. It should be ok till we bump to the CAPI version which contains this change.
cc @sedefsavas

vincepri · 2022-09-30T16:50:28Z

@fabriziopandini @sbueringer over to you for approval

sbueringer · 2022-09-30T17:57:54Z

@fabriziopandini @sbueringer over to you for approval

I'll take a look early next week after the public holiday on Monday

dthorsen · 2022-10-03T13:42:21Z

I'm not sure if this detail has yet been raised, but when using the Azure provider this bug can actually cause service disruptions. If a MachinePool is scaled down, CAPI may delete node resources associated with running VM's in random unrelated MachinePools. I believe this occurs due to the nodeRefs being incorrect the status fields of the MachinePool combined with the logic in this function:

cluster-api/exp/internal/controllers/machinepool_controller_noderef.go

Lines 132 to 135 in 1917d52

    
           // deleteRetiredNodes deletes nodes that don't have a corresponding ProviderID in Spec.ProviderIDList. 
        
           // A MachinePool infrastructure provider indicates an instance in the set has been deleted by 
        
           // removing its ProviderID from the slice. 
        
           func (r *MachinePoolReconciler) deleteRetiredNodes(ctx context.Context, c client.Client, nodeRefs []corev1.ObjectReference, providerIDList []string) error {

CecileRobertMichon · 2022-10-03T17:33:23Z

@dthorsen the specific bug you're referring to should already be fixed in #6971

controllers/noderefutil/providerid_test.go

sbueringer · 2022-10-06T07:06:12Z

As far as I understand the situtation the PR sounds fine to me.

@jackfrancis Let's please add a note to v1.2-to-v1.3.md to increase the chances infra providers will notice the change when they start adopting v1.3 alpha|beta|RC|GA

jackfrancis · 2022-10-14T23:13:03Z

@sbueringer added a note to the 1.2-1.3 doc

PTAL and I think we're finally ready here

vincepri · 2022-10-18T14:46:56Z

Changes LGTM, @jackfrancis can you rebase?

jackfrancis · 2022-10-18T16:17:43Z

@vincepri done

sbueringer · 2022-10-18T16:29:00Z

Thank you very much!

/lgtm

CecileRobertMichon

/lgtm
/approve

k8s-ci-robot · 2022-10-18T16:33:54Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [CecileRobertMichon]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 12, 2022

k8s-ci-robot requested review from CecileRobertMichon and vincepri April 12, 2022 20:37

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 12, 2022

killianmuldoon reviewed Apr 13, 2022

View reviewed changes

controllers/noderefutil/providerid.go Show resolved Hide resolved

jackfrancis force-pushed the providerID-equality-match-entire-string branch from 2e88f2e to 5b099d4 Compare April 13, 2022 18:08

jackfrancis force-pushed the providerID-equality-match-entire-string branch 4 times, most recently from adf03e8 to 364a7bf Compare April 13, 2022 21:13

CecileRobertMichon reviewed Apr 13, 2022

View reviewed changes

jackfrancis force-pushed the providerID-equality-match-entire-string branch from 364a7bf to 20f3fd5 Compare April 14, 2022 00:24

k8s-ci-robot requested review from richardcase and sedefsavas September 29, 2022 19:09

k8s-ci-robot assigned CecileRobertMichon Sep 29, 2022

sedefsavas mentioned this pull request Oct 3, 2022

v1beta2 API Changes kubernetes-sigs/cluster-api-provider-aws#2355

Closed

4 tasks

sbueringer reviewed Oct 6, 2022

View reviewed changes

controllers/noderefutil/providerid_test.go Show resolved Hide resolved

jackfrancis force-pushed the providerID-equality-match-entire-string branch from 1a2e324 to c5fd6f4 Compare October 14, 2022 23:11

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 14, 2022

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2022

validate ProviderID equality by comparing entire string

626b4f4

jackfrancis force-pushed the providerID-equality-match-entire-string branch from c5fd6f4 to 626b4f4 Compare October 18, 2022 16:17

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2022

k8s-ci-robot assigned sbueringer Oct 18, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 18, 2022

CecileRobertMichon reviewed Oct 18, 2022

View reviewed changes

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 18, 2022

k8s-ci-robot merged commit 4b6f3c3 into kubernetes-sigs:main Oct 18, 2022

k8s-ci-robot added this to the v1.3 milestone Oct 18, 2022

jackfrancis deleted the providerID-equality-match-entire-string branch October 18, 2022 21:34

enxebre mentioned this pull request Apr 10, 2023

Allow provider id regex matching to support more provider formats #8485

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

⚠️ Machine ProviderID equality is now strictly enforced #6412

⚠️ Machine ProviderID equality is now strictly enforced #6412

jackfrancis commented Apr 12, 2022

jackfrancis commented Apr 12, 2022

alexeldeib commented Apr 12, 2022

jackfrancis commented Apr 12, 2022

CecileRobertMichon commented Apr 12, 2022

k8s-ci-robot commented Apr 12, 2022

CecileRobertMichon commented Apr 12, 2022

enxebre commented Apr 13, 2022 •

edited

Loading

jackfrancis commented Apr 13, 2022

enxebre commented Apr 13, 2022 •

edited

Loading

jackfrancis commented Apr 13, 2022

enxebre commented Apr 13, 2022

jackfrancis commented Apr 13, 2022

alexeldeib commented Apr 13, 2022 •

edited

Loading

jackfrancis commented Apr 13, 2022

enxebre commented Apr 13, 2022

jackfrancis commented Apr 13, 2022

jackfrancis commented Apr 13, 2022

CecileRobertMichon Apr 13, 2022

jackfrancis Apr 14, 2022

killianmuldoon Apr 15, 2022

CecileRobertMichon May 2, 2022

vincepri commented May 2, 2022

Ankitasw commented Sep 30, 2022

vincepri commented Sep 30, 2022

sbueringer commented Sep 30, 2022

dthorsen commented Oct 3, 2022

CecileRobertMichon commented Oct 3, 2022

sbueringer commented Oct 6, 2022

jackfrancis commented Oct 14, 2022

vincepri commented Oct 18, 2022

jackfrancis commented Oct 18, 2022

sbueringer commented Oct 18, 2022

CecileRobertMichon left a comment

k8s-ci-robot commented Oct 18, 2022

		@@ -89,7 +88,7 @@ func (p *ProviderID) ID() string {

		// Equals returns true if both the CloudProvider and ID match.

⚠️ Machine ProviderID equality is now strictly enforced #6412

⚠️ Machine ProviderID equality is now strictly enforced #6412

Conversation

jackfrancis commented Apr 12, 2022

jackfrancis commented Apr 12, 2022

alexeldeib commented Apr 12, 2022

jackfrancis commented Apr 12, 2022

CecileRobertMichon commented Apr 12, 2022

k8s-ci-robot commented Apr 12, 2022

CecileRobertMichon commented Apr 12, 2022

enxebre commented Apr 13, 2022 • edited Loading

jackfrancis commented Apr 13, 2022

enxebre commented Apr 13, 2022 • edited Loading

jackfrancis commented Apr 13, 2022

enxebre commented Apr 13, 2022

jackfrancis commented Apr 13, 2022

alexeldeib commented Apr 13, 2022 • edited Loading

jackfrancis commented Apr 13, 2022

enxebre commented Apr 13, 2022

jackfrancis commented Apr 13, 2022

jackfrancis commented Apr 13, 2022

CecileRobertMichon Apr 13, 2022

Choose a reason for hiding this comment

jackfrancis Apr 14, 2022

Choose a reason for hiding this comment

killianmuldoon Apr 15, 2022

Choose a reason for hiding this comment

CecileRobertMichon May 2, 2022

Choose a reason for hiding this comment

vincepri commented May 2, 2022

Ankitasw commented Sep 30, 2022

vincepri commented Sep 30, 2022

sbueringer commented Sep 30, 2022

dthorsen commented Oct 3, 2022

CecileRobertMichon commented Oct 3, 2022

sbueringer commented Oct 6, 2022

jackfrancis commented Oct 14, 2022

vincepri commented Oct 18, 2022

jackfrancis commented Oct 18, 2022

sbueringer commented Oct 18, 2022

CecileRobertMichon left a comment

Choose a reason for hiding this comment

k8s-ci-robot commented Oct 18, 2022

enxebre commented Apr 13, 2022 •

edited

Loading

enxebre commented Apr 13, 2022 •

edited

Loading

alexeldeib commented Apr 13, 2022 •

edited

Loading