Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MachinePools cannot be upgraded atomically when used with managed providers #7248

Open
shyamradhakrishnan opened this issue Sep 20, 2022 · 13 comments
Labels
area/machinepool Issues or PRs related to machinepools help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@shyamradhakrishnan
Copy link

What steps did you take and what happened:
Many providers have built their managed node pool based on CPI machinepools. In most cases, customers can provide a custom vm image to be used by the node pool. for example here - https://github.com/oracle/cluster-api-provider-oci/blob/63f38727a6fba8ef78668fcb8187922ddf3cbfa8/exp/api/v1beta1/ocimanagedmachinepool_types.go#L180

The nodepool version is take from the machinepool spec like follows MachinePool.Spec.Template.Spec.Version.

This results in an inability to upgrade the machinepool(node pools) atomically. Because if we have to upgrade the kubernetes version, the version in CAPI MachinePool spec as well as the custom image infrastructure machinepool spec has to be changed and since they are in different objects, they cannot be changed atomically.

What did you expect to happen:

Machinepools can be upgraded tomically.

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

  • Cluster-api version:
  • minikube/kind version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

/kind bug
[One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels]

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 20, 2022
@shyamradhakrishnan
Copy link
Author

/cc @richardcase

@sbueringer
Copy link
Member

sbueringer commented Sep 20, 2022

Hm, on KCP/MD/MS it would work like this:
An additional InfrastructureMachineTemplate would be created and then the KCP/MD/MS would be updated to the new Kubernetes version and the new InfrastructureMachineTemplate at the same time.

I assume something similar doesn't work with the MachinePool as the InfraMachinePool is representing the "machines" directly so there is no rotation of the InfraMachinePool?

@shyamradhakrishnan
Copy link
Author

@sbueringer yeah, the current implementation of machinepools does not have a rotation, the machinepool references a direct infra machine pool. for example https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/main/templates/cluster-template-aks.yaml#L58
and its the same for all providers I saw.

@fabriziopandini
Copy link
Member

/cc @CecileRobertMichon @jackfrancis

@CecileRobertMichon
Copy link
Contributor

This isn't something we've encountered with AzureManagedMachinePools as there is no image reference and the image is managed by AKS, so changing the k8s version on the MachinePool is enough to trigger a k8s version + image upgrade of the managed service. However, the same think might be true with non-managed clusters that specify a custom image when using MachinePools.

@mboersma wonder if you have any thoughts on this one, and whether #4063 would improve this situation or maybe make it easier to improve in the future?

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 19, 2022
@shyamradhakrishnan
Copy link
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 20, 2022
@fabriziopandini
Copy link
Member

/triage accepted

Just for my understanding, is it possible to use the same "template rotation" approach used by other abstractions:

  • create a new template/new infra machine pool object
  • change MP setting both version and the pointer to the new template/new infra machine pool object (atomic operation)
  • delete the old template/olde infra machine pool object

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 28, 2022
@shyamradhakrishnan
Copy link
Author

@fabriziopandini the approach can be used but not currently as we dont use the concept of templates in machinepools, CAPI directly uses MachinePool infra objects. I think what you are suggesting is the right way forward and in sync with other places in CAPI.

@shyamradhakrishnan
Copy link
Author

@fabriziopandini One more advantage is machinepool support in ClusterClass can be added with your suggestion.

@k8s-triage-robot
Copy link

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 20, 2024
@fabriziopandini
Copy link
Member

/priority important-longterm
/kind api-change

@k8s-ci-robot k8s-ci-robot added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Apr 12, 2024
@fabriziopandini
Copy link
Member

@mboersma @willie-yao @Jont828
I think that addressing this issue - or the concerns behind it - should be taken into consideration before graduating MP to GA

@sbueringer sbueringer added the area/machinepool Issues or PRs related to machinepools label Jul 10, 2024
@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 17, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jul 17, 2024
@fabriziopandini fabriziopandini added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jul 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/machinepool Issues or PRs related to machinepools help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

No branches or pull requests

6 participants