ProviderID set by capi infra providers should match the one set by the controller manager cloud-provider #4526

Closed
enxebre opened this issue Apr 26, 2021 · 24 comments · Fixed by #6412 or #6971
Assignees
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@enxebre
Member

enxebre commented Apr 26, 2021

What steps did you take and what happened:
The existing providerID equality logic compares only the cloudProvider and ID segments and skips the other segments of the providerID. See https://github.com/kubernetes-sigs/cluster-api/blob/master/controllers/noderefutil/providerid.go#L86

This makes the assumption - not necessarily true - that IDs in different regions/zones won't be reused by the cloud provider.
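
To make the concern concrete, here is a minimal sketch of an ID-only comparison. It is not the noderefutil code: `looseKey`, `looseEqual`, and the `example://` provider IDs are hypothetical names and values chosen for illustration, and the parsing is deliberately simplified.

```go
package main

import (
	"fmt"
	"strings"
)

// looseKey returns the two pieces an ID-only comparison looks at:
// the cloud provider name and the last path segment.
// Hypothetical helper for illustration; not the noderefutil implementation.
func looseKey(providerID string) (string, string, error) {
	provider, rest, found := strings.Cut(providerID, "://")
	if !found || provider == "" {
		return "", "", fmt.Errorf("invalid providerID %q", providerID)
	}
	rest = strings.TrimRight(rest, "/")
	id := rest[strings.LastIndex(rest, "/")+1:]
	if id == "" {
		return "", "", fmt.Errorf("providerID %q has no ID segment", providerID)
	}
	return provider, id, nil
}

// looseEqual ignores every segment between the provider name and the ID.
func looseEqual(a, b string) bool {
	pa, ia, errA := looseKey(a)
	pb, ib, errB := looseKey(b)
	if errA != nil || errB != nil {
		return false
	}
	return pa == pb && ia == ib
}

func main() {
	// Made-up IDs: same instance ID, different zones.
	a := "example://zone-a/instance-1"
	b := "example://zone-b/instance-1"
	fmt.Println(looseEqual(a, b)) // the zone segment is ignored
	fmt.Println(a == b)           // the full strings differ
}
```

If a cloud provider ever reuses an ID across zones, a comparison shaped like this would treat the two instances as the same machine.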

What did you expect to happen:
I'd be in favour of changing the expectation so that cluster-api providers set exactly the same providerID that the controller manager cloud-provider sets.

We actually ensured this in AWS a while ago in kubernetes-sigs/cluster-api-provider-aws#1693.
Though there might be other provider-specific reasons I'm not aware of that make this impossible.

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 26, 2021
@enxebre
Member Author

enxebre commented Apr 26, 2021

/assign

@sbueringer
Member

sbueringer commented Apr 26, 2021

Seems reasonable to me, but I don't know about other providers.

I'm also not sure if this would lead to problems with CAPO providerIDs.

They currently look like this in CAPO and in the cloud provider openstack: openstack:///e85a1e5e-0340-423a-be12-23d3c52c9e10

(EDIT: for completeness, in OpenStack it's just openstack:/// + the server ID, which is a UUID.)
I'm wondering if we're missing a slash there :)
xref:
https://github.com/kubernetes-sigs/cluster-api/blob/master/controllers/noderefutil/providerid.go#L27

@CecileRobertMichon
Contributor

What are you proposing the change be in cluster-api itself?

The providerIDs definitely need to be consistent with the ones expected by cloud-provider. We actually ran into cluster-autoscaler issues with the Azure provider recently because of an extra slash in the ID (kubernetes-sigs/cluster-api-provider-azure#1293). kubernetes-sigs/cluster-api-provider-azure#655 was merged a long time ago to make it "consistent", but it actually wasn't right: it assumed the format was the same as AWS, which isn't true because Azure has an extra leading slash in the ID.

This is what a providerID looks like in Azure right now: azure:///subscriptions/85d99e6d-f6d6-408f-a9f1-b7a97237d5c4/resourceGroups/default-template/providers/Microsoft.Compute/virtualMachines/default-template-control-plane-fhrvh. Note that we obtain this by doing azure:// + resource ID, unlike AWS which does aws:/// + resourceID (Azure resource ID starts with /).
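
A small sketch of the difference described above, assuming only what this comment states: the Azure resource ID already starts with `/`, so the prefix is `azure://`. The helper names `azureProviderID` and `tripleSlashProviderID` are hypothetical, and the second one shows just one way an extra slash could creep in if the AWS-style `:///` prefix were applied to an Azure resource ID.

```go
package main

import "fmt"

// azureProviderID prepends "azure://" to a resource ID that already
// begins with "/", giving three slashes in total.
func azureProviderID(resourceID string) string {
	return "azure://" + resourceID
}

// tripleSlashProviderID applies the "<provider>:///" + ID pattern.
// Hypothetical helper: with an ID that has a leading "/", it yields four slashes.
func tripleSlashProviderID(provider, resourceID string) string {
	return provider + ":///" + resourceID
}

func main() {
	// Resource ID taken from the example in this comment.
	resourceID := "/subscriptions/85d99e6d-f6d6-408f-a9f1-b7a97237d5c4" +
		"/resourceGroups/default-template/providers/Microsoft.Compute" +
		"/virtualMachines/default-template-control-plane-fhrvh"

	fmt.Println(azureProviderID(resourceID))                // azure:///subscriptions/...
	fmt.Println(tripleSlashProviderID("azure", resourceID)) // azure:////subscriptions/...
}
```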

@enxebre
Member Author

enxebre commented Apr 26, 2021

Thanks for that context @CecileRobertMichon.

What are you proposing the change be in cluster-api itself?

If we agree that "The providerIDs definitely need to be consistent with the ones expected by cloud-provider",
what I'm proposing is for CAPI to treat that as a contract in the equality check, i.e. to change this method so that it compares the whole string instead of:

return p.CloudProvider() == o.CloudProvider() && p.ID() == o.ID()
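
A rough sketch of what a whole-string comparison could look like. The `providerID` type and its `original` field are hypothetical stand-ins, not the actual noderefutil type, and the `example://` values are made up.

```go
package main

import "fmt"

// providerID is a hypothetical minimal type that keeps the string exactly
// as it was set by the infrastructure provider or the cloud provider.
type providerID struct {
	original string
}

// Equals compares the whole string, so every segment has to match.
func (p providerID) Equals(o providerID) bool {
	return p.original == o.original
}

func main() {
	a := providerID{original: "example://zone-a/instance-1"}
	b := providerID{original: "example://zone-b/instance-1"}
	c := providerID{original: "example:///instance-1"}
	d := providerID{original: "example://instance-1"}

	fmt.Println(a.Equals(b)) // same ID, different zone
	fmt.Println(a.Equals(a)) // identical strings
	fmt.Println(c.Equals(d)) // differ only by one slash
}
```

Note that a check like this also treats IDs that differ only in the number of slashes as different, so it only works if the infrastructure provider writes exactly the string the cloud provider writes.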

@vincepri
Member

IIRC the equality check was checking the entire string in the past, but we had to change it to match only on CloudProvider and the identifier given that some of the ProviderID information might be missing when the infrastructure provider provisions the machine and the cloud provider assigns the identifier later on.

By contract, the ProviderID's ID part (the last chunk after /) should be unique. Is uniqueness not guaranteed across multiple deployments? That sounds like an infrastructure provider issue that should be tackled separately.

The comparison method change proposed is also a breaking change, and I'm quite sure most infrastructure providers would break.

@enxebre
Member Author

enxebre commented Apr 27, 2021

IIRC the equality check was checking the entire string in the past, but we had to change it to match only on CloudProvider and the identifier given that some of the ProviderID information might be missing when the infrastructure provider provisions the machine and the cloud provider assigns the identifier later on.

This is a good point, we should probably revisit and verify this is still the case.

By contract, the ProviderID's ID part (the last chunk after /) should be unique. Is uniqueness not guaranteed across multiple deployments? That sounds like an infrastructure provider issue that should be tackled separately.

Isn't this contract something we just assumed to be true for convenience because of the limitation you describe above? There is actually no reason or guarantee that cloud providers satisfy it.

My concern is that uniqueness might not hold for all cloud providers. E.g., it seems not to be the case in GCP, where you can get the same ID in different zones: https://cloud.google.com/compute/docs/instances/verifying-instance-identity
Why would this be an "infrastructure provider issue"? Our equality check would be the one wrongly returning true for instances with legitimate ProviderIDs.

Happy to close this if this concern proves to not be justified.

@vincepri
Member

vincepri commented Jul 6, 2021

/milestone Next

@k8s-ci-robot k8s-ci-robot added this to the Next milestone Jul 6, 2021
@vincepri
Member

vincepri commented Jul 6, 2021

/lifecycle backlog

@LochanRn
Member

@enxebre @vincepri any progress on this issue?

@vincepri
Member

/assign @alexeldeib @randomvariable

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 9, 2021
@k8s-triage-robot

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 9, 2021
@k8s-triage-robot

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

@alexeldeib
Contributor

/reopen
/unassign
/remove-lifecycle rotten

Still valid I think; never tackled this, unfortunately.

@k8s-ci-robot
Contributor

@alexeldeib: Reopened this issue.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 29, 2022
@alexeldeib
Contributor

I think comparing the full string, as suggested in #4526 (comment), is correct.

By contract, the ProviderID's ID part (the last chunk after /) should be unique. Is uniqueness not guaranteed across multiple deployments? That sounds like an infrastructure provider issue that should be tackled separately.

Isn't this contract something we just assumed to be true for convenience because of the limitation you describe above? There is actually no reason or guarantee that cloud providers satisfy it.

My concern is that uniqueness might not hold for all cloud providers. E.g., it seems not to be the case in GCP, where you can get the same ID in different zones: https://cloud.google.com/compute/docs/instances/verifying-instance-identity
Why would this be an "infrastructure provider issue"? Our equality check would be the one wrongly returning true for instances with legitimate ProviderIDs.

+1, as the assumptions made here are totally false for Azure: these IDs are not even unique across multiple VMSS in the same region. Every VMSS has instances identified by integers, starting at 0, that are unique only within that scale set (i.e., reused for every VMSS).
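
As a hedged illustration of the VMSS case: the provider IDs below are made up, with placeholder subscription and resource group names and an assumed scale set path layout. `lastSegment` is a hypothetical helper that mimics taking only the last chunk after `/`.

```go
package main

import (
	"fmt"
	"strings"
)

// lastSegment returns the chunk after the final "/" of a providerID.
// Hypothetical helper for illustration only.
func lastSegment(providerID string) string {
	trimmed := strings.TrimRight(providerID, "/")
	return trimmed[strings.LastIndex(trimmed, "/")+1:]
}

func main() {
	// Assumed, illustrative layout with placeholder names.
	const prefix = "azure:///subscriptions/<sub>/resourceGroups/<rg>" +
		"/providers/Microsoft.Compute/virtualMachineScaleSets/"

	a := prefix + "pool-a/virtualMachines/0"
	b := prefix + "pool-b/virtualMachines/0"

	fmt.Println(lastSegment(a) == lastSegment(b)) // both end in "0"
	fmt.Println(a == b)                           // the scale set names differ
}
```

Two instances from different scale sets share the last segment, so only the full string tells them apart.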

/help

@k8s-ci-robot
Contributor

@alexeldeib:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Apr 4, 2022
@jackfrancis
Contributor

/assign

@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 3, 2022
@fabriziopandini
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 4, 2022
@CecileRobertMichon
Contributor

/reopen

#6971 didn't actually fix this; we'll want #6412 to fully fix this issue.

@k8s-ci-robot
Contributor

@CecileRobertMichon: Reopened this issue.

@k8s-ci-robot k8s-ci-robot reopened this Jul 25, 2022
@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini fabriziopandini removed this from the Next milestone Jul 29, 2022
@fabriziopandini fabriziopandini removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Sep 30, 2022