Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

⚠️ Machine ProviderID equality is now strictly enforced #6412

Conversation

jackfrancis
Copy link
Contributor

What this PR does / why we need it:

This PR updates the capi ProviderID type equality enforcement (via the Equals object pointer receiver method) to validate against the entire provider ID string rather than a concatenation of the "cloud provider" plus "ID" substrings.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4526

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 12, 2022
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 12, 2022
@jackfrancis
Copy link
Contributor Author

@enxebre @alexeldeib is this PR getting close to addressing #4526? Obviously simply changing the capi equality implementation will require per-cloud-provider changes in their cluster-api providers in order to prevent breakage.

Is that what we want to advocate?

@alexeldeib
Copy link
Contributor

yeah this solves it. I actually don't know that it requires infra provider changes -- CAPI should not have been the one setting these values ideally, CCM or similar would be, so this PR just aligns CAPI to CCM's behavior.

put differently: in AWS one node may be represented with multiple providerIDs, but a concrete node will only ever have one form of that provider ID, so I feel this should be safe.

probably good to get some CAPA eyes on this and maybe something less cloud-y like CAPD?

@jackfrancis
Copy link
Contributor Author

@sedefsavas can you PTAL at this change and confirm what @alexeldeib is suggesting

thank you!

@CecileRobertMichon are you able to speak authoritiatively on capd ramifications?

@CecileRobertMichon
Copy link
Contributor

maybe something less cloud-y like CAPD?

CAPD runs the PR e2e tests so if tests are passing I feel good about this change. Let's run the full suite to be sure.

/test ls

@k8s-ci-robot
Copy link
Contributor

@CecileRobertMichon: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test pull-cluster-api-build-main
  • /test pull-cluster-api-e2e-main
  • /test pull-cluster-api-test-main
  • /test pull-cluster-api-test-mink8s-main
  • /test pull-cluster-api-verify-main

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-apidiff-main
  • /test pull-cluster-api-e2e-full-main
  • /test pull-cluster-api-e2e-informing-ipv6-main
  • /test pull-cluster-api-e2e-informing-main
  • /test pull-cluster-api-e2e-workload-upgrade-1-23-latest-main

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-apidiff-main
  • pull-cluster-api-build-main
  • pull-cluster-api-e2e-informing-ipv6-main
  • pull-cluster-api-e2e-informing-main
  • pull-cluster-api-e2e-main
  • pull-cluster-api-test-main
  • pull-cluster-api-test-mink8s-main
  • pull-cluster-api-verify-main

In response to this:

maybe something less cloud-y like CAPD?

CAPD runs the PR e2e tests so if tests are passing I feel good about this change. Let's run the full suite to be sure.

/test ls

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@CecileRobertMichon
Copy link
Contributor

/test pull-cluster-api-e2e-full-main

@enxebre
Copy link
Member

enxebre commented Apr 13, 2022

yeah this solves it. I actually don't know that it requires infra provider changes -- CAPI should not have been the one setting these values ideally, CCM or similar would be, so this PR just aligns CAPI to CCM's behavior.

CCM set on Nodes but capi providers set providerID on Machines. Therefore although not necessarily this might definitely break providers as it's changing the equality contract. So we should proceed taking that into consideration.

Also we should consistently update IndexKey().

@jackfrancis
Copy link
Contributor Author

@enxebre I'm not sure how IndexKey() is affected by this change

@enxebre
Copy link
Member

enxebre commented Apr 13, 2022

@enxebre I'm not sure how IndexKey() is affected by this change

// This is useful to use the providerID as a reliable index between nodes and machines // as it guarantees the infra Providers contract.

@jackfrancis IndexKey() was intentionally sticking to the equality contract, which we are changing now so it should be updated with it for consistency. Need to think about impact on running clusters though, I guess it should be fine as the controller is restarted and will just reindex using the new criteria.

@jackfrancis
Copy link
Contributor Author

@enxebre I'm not sure how IndexKey() is affected by this change

// This is useful to use the providerID as a reliable index between nodes and machines // as it guarantees the infra Providers contract.

@jackfrancis IndexKey() was intentionally sticking to the equality contract, which we are changing now so it should be updated with it for consistency. Need to think about impact on running clusters though, I guess it should be fine as the controller is restarted and will just reindex using the new criteria.

I think there's more background here that I'm not aware of, but from a naive point of view IndexKey() doesn't have any value if it simply returns the entire string (it will then be equivalent to the CloudProvider() method which just thinly wraps the cloudProvider string property)

@enxebre
Copy link
Member

enxebre commented Apr 13, 2022

I think there's more background here that I'm not aware of, but from a naive point of view IndexKey() doesn't have any value if it simply returns the entire string (it will then be equivalent to the CloudProvider() method which just thinly wraps the cloudProvider string property)

indexKey is just an abstraction to be used to index (and fetch) machines/nodes by providerID in the controllers indexers. It must be unique and so It should use whatever we consider to be the equality contract, otherwise we could end up with one index mapping to multiple machines/nodes. If the impl of indexKey happens to be a thing wrapper that's fine.

@jackfrancis jackfrancis force-pushed the providerID-equality-match-entire-string branch from 2e88f2e to 5b099d4 Compare April 13, 2022 18:08
@jackfrancis
Copy link
Contributor Author

@enxebre got it, updated

@alexeldeib
Copy link
Contributor

alexeldeib commented Apr 13, 2022

CCM set on Nodes but capi providers set providerID on Machines. Therefore although not necessarily this might definitely break providers as it's changing the equality contract. So we should proceed taking that into consideration

Hm. That would require CCM and CAPI to disagree on what the providerID is for a given node, though? I suppose it could be something like -- CAPA sets provider ID using AZ, CCM doesn't, current logic evaluates those as the same node even though the strings are different? This is sort of why I wanted a sanity check on other providers.

That would seem to require CAPI providers to be mutating the provider ID they get back from cloud providers (or perhaps CCM doing the same), which would be a world of hurt regardless of the contract.

At a glance, seems like CAPA just reads whatever EC2 returns, which is roughly the same as Azure. Not sure how CCM for AWS generates provider IDs, but I'd hope it's the same. https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/7a545cd1aa5d05e5a0364a49b3c700341737ca61/controllers/awsmachine_controller.go#L510-L511

Definitely want to make sure we get this lined up right either way.

@jackfrancis
Copy link
Contributor Author

/test pull-cluster-api-e2e-full-main

@enxebre
Copy link
Member

enxebre commented Apr 13, 2022

Hm. That would require CCM and CAPI to disagree on what the providerID is for a given node, though? I suppose it could be something like -- CAPA sets provider ID using AZ, CCM doesn't, current logic evaluates those as the same node even though the strings are different? This is sort of why I wanted a sanity check on other providers.

@alexeldeib That's exactly right, so since we are changing the equality criteria this could technically break any provider out there. So at the very minimum this would require a note for providers implementers.

Also I agree with @killianmuldoon #6412 (comment) we need to deprecate existing behaviour first as this is a public func.

@jackfrancis jackfrancis force-pushed the providerID-equality-match-entire-string branch 4 times, most recently from adf03e8 to 364a7bf Compare April 13, 2022 21:13
@jackfrancis
Copy link
Contributor Author

/test pull-cluster-api-e2e-full-main

@jackfrancis
Copy link
Contributor Author

Correction to my comment above, the IndexKey() has now been made equivalent to String() (not CloudProvider()).

That change has been realized in the PR now, tests look good.

I think at this point we need to evaluate the potential for provider breakage, and if we can convince ourselves that this is actually fixing a capi bug based on the way that providerID URLs are set from their real-world authoritative sources, then we just want to make a communication plan so that any providers who need to update their code have time to do so.

@@ -89,7 +88,7 @@ func (p *ProviderID) ID() string {

// Equals returns true if both the CloudProvider and ID match.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment should be updated to match new function (or we should make a new func as @killianmuldoon suggests)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How might we deprecate this in favor of another func and not affect users? AFAICT this func is only used here:

https://github.com/kubernetes-sigs/cluster-api/blob/release-1.1/internal/controllers/machine/machine_controller_noderef.go#L196

If we created a new func and updated the reference in (r *Reconciler) getNode we would still be making a change for all providers, no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to deprecate it and replace as it's a part of our public Go API, so we don't have control over who is using it and for what.

As for the behavioural change inside the CAPI method, I think the impact of that needs to be assessed separately. If there is an impact from changing the behaviour here that should be called out in the deprecation notice and/or on the migration guide.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackfrancis jackfrancis force-pushed the providerID-equality-match-entire-string branch from 364a7bf to 20f3fd5 Compare April 14, 2022 00:24
@vincepri
Copy link
Member

vincepri commented May 2, 2022

/hold

I'm quite sure we had this behavior back in the v1alpha1 days and we had to change for something more structured given that some Cluster API providers can only set (guess?) a partial provider ID. Looking through the Cluster API AWS provider code, we're still doing that today https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/760e4e7651f46c29bf309607a0b6307570228988/pkg/cloud/scope/machine.go#L144-L148

Unless we have a contract requirement for every cloud provider to match their equivalent CCM/CPI codebase, we should keep this behavior and introduce a new one over time.

I might have missed this while going through the above messages, what is the problem we're trying to solve with this PR?

@Ankitasw
Copy link
Member

To make sure CAPA is all set to accept this change

We would need another PR to enable this in CAPA. It should be ok till we bump to the CAPI version which contains this change.
cc @sedefsavas

@vincepri
Copy link
Member

@fabriziopandini @sbueringer over to you for approval

@sbueringer
Copy link
Member

@fabriziopandini @sbueringer over to you for approval

I'll take a look early next week after the public holiday on Monday

@dthorsen
Copy link
Contributor

dthorsen commented Oct 3, 2022

I'm not sure if this detail has yet been raised, but when using the Azure provider this bug can actually cause service disruptions. If a MachinePool is scaled down, CAPI may delete node resources associated with running VM's in random unrelated MachinePools. I believe this occurs due to the nodeRefs being incorrect the status fields of the MachinePool combined with the logic in this function:

// deleteRetiredNodes deletes nodes that don't have a corresponding ProviderID in Spec.ProviderIDList.
// A MachinePool infrastructure provider indicates an instance in the set has been deleted by
// removing its ProviderID from the slice.
func (r *MachinePoolReconciler) deleteRetiredNodes(ctx context.Context, c client.Client, nodeRefs []corev1.ObjectReference, providerIDList []string) error {

@CecileRobertMichon
Copy link
Contributor

@dthorsen the specific bug you're referring to should already be fixed in #6971

@sbueringer
Copy link
Member

As far as I understand the situtation the PR sounds fine to me.

@jackfrancis Let's please add a note to v1.2-to-v1.3.md to increase the chances infra providers will notice the change when they start adopting v1.3 alpha|beta|RC|GA

@jackfrancis jackfrancis force-pushed the providerID-equality-match-entire-string branch from 1a2e324 to c5fd6f4 Compare October 14, 2022 23:11
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 14, 2022
@jackfrancis
Copy link
Contributor Author

@sbueringer added a note to the 1.2-1.3 doc

PTAL and I think we're finally ready here

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2022
@vincepri
Copy link
Member

Changes LGTM, @jackfrancis can you rebase?

@jackfrancis jackfrancis force-pushed the providerID-equality-match-entire-string branch from c5fd6f4 to 626b4f4 Compare October 18, 2022 16:17
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2022
@jackfrancis
Copy link
Contributor Author

@vincepri done

@sbueringer
Copy link
Member

Thank you very much!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 18, 2022
Copy link
Contributor

@CecileRobertMichon CecileRobertMichon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 18, 2022
@k8s-ci-robot k8s-ci-robot merged commit 4b6f3c3 into kubernetes-sigs:main Oct 18, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.3 milestone Oct 18, 2022
@jackfrancis jackfrancis deleted the providerID-equality-match-entire-string branch October 18, 2022 21:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ProviderID set by capi infra providers should match the one set by the controller manager cloud-provider
10 participants