
MachineSet version changes, upgrading from v1alpha3 to v1alpha4 #5405

Closed
abhinavnagaraj opened this issue Oct 7, 2021 · 15 comments · Fixed by #5406
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.

Comments

@abhinavnagaraj
Contributor

What steps did you take and what happened:
Launch a cluster in v1alpha3 (capi v0.3.19), with one MachineDeployment containing 2 replicas and with strategyType='RollingUpdate'.

Upgrade the cluster to v1alpha4(capi v0.4.0), without any changes in MachineDeployment spec.

This results in the creation of new MachineSets and hence new machines.

What did you expect to happen:
If there is no change in the MachineDeployment spec, there should be no rolling-upgrade of MachineSets.

Anything else you would like to add:
In v1alpha4, the MachineDeployment and MachineSet template version (spec.template.spec.version) is expected to be prefixed with a 'v'.
A Default() function in the MachineDeployment webhook adds this prefix - #4670
But this is missing in the MachineSet webhook.
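For illustration, the defaulting in question can be sketched like this (a simplified stand-in for the webhook's Default() behaviour; the function name and shape are assumptions, not the actual cluster-api code):

```go
package main

import (
	"fmt"
	"strings"
)

// defaultVersion sketches what the MachineDeployment webhook's defaulting
// does to spec.template.spec.version: prefix it with "v" when missing.
func defaultVersion(version string) string {
	if version != "" && !strings.HasPrefix(version, "v") {
		return "v" + version
	}
	return version
}

func main() {
	fmt.Println(defaultVersion("1.21.1"))  // prints "v1.21.1"
	fmt.Println(defaultVersion("v1.21.1")) // prints "v1.21.1" (unchanged)
}
```

Because the MachineSet webhook lacked this, a MachineSet could end up differing from its parent MachineDeployment's template by exactly this prefix.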

Environment:

  • Cluster-api version: v0.4.0
  • Minikube/KIND version: v0.11.1
  • Kubernetes version: (use kubectl version): v1.21.1
  • OS (e.g. from /etc/os-release):

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 7, 2021
@abhinavnagaraj
Contributor Author

/assign @abhinavnagaraj

@sbueringer
Member

sbueringer commented Oct 7, 2021

I would assume defaulting in the MachineSet leads to an infinite rollout of the MD: the MD tries to roll out a MachineSet without the v prefix, but after creation the MachineSet gets the prefix through the webhook. (afaik there is diff logic that compares the template in the MD with the MachineSet to figure out if the MD must be rolled out)
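The suspected failure mode can be illustrated with a toy comparison (this is an assumption about the shape of the diff logic, not the real cluster-api code):

```go
package main

import "fmt"

// templatesMatch is a toy stand-in for the diff logic that decides whether
// an existing MachineSet already matches the MachineDeployment's template.
// A plain string comparison treats "1.21.1" and "v1.21.1" as different.
func templatesMatch(mdVersion, msVersion string) bool {
	return mdVersion == msVersion
}

func main() {
	// The MD webhook defaults "1.21.1" to "v1.21.1"; if a MachineSet only
	// gets the same defaulting after creation, the MD would see a mismatch
	// on every comparison and keep rolling out new MachineSets.
	fmt.Println(templatesMatch("v1.21.1", "1.21.1")) // prints "false" -> rollout
}
```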

@ykakarap
Contributor

ykakarap commented Oct 7, 2021

@abhinavnagaraj I tried the flow you explained in the issue and I have not been able to reproduce the issue.
Steps I took:

  • Use the latest release 0.3 version of clusterctl (v0.3.23) to initialize a local kind cluster (infrastructure = docker)
  • Use clusterctl config cluster to create a cluster YAML
  • Remove the v prefix from the spec.template.spec.version of the MachineDeployment in the cluster YAML.
  • Apply the cluster YAML (and the CNI) and wait for the cluster to come up.
  • Verify that the API version of the machine deployments is cluster.x-k8s.io/v1alpha3
  • Verify that the version in the MachineDeployment and the MachineSet did not have the v prefix
  • Download the newest version of the release 0.4 version of clusterctl (v0.4.4)
  • Run clusterctl upgrade apply --contract v1alpha4
  • Wait for the upgrade to finish

At the end the machine deployments and the machine sets were updated to the new API version (cluster.x-k8s.io/v1alpha4) and no machineset rollout was triggered.

The version on the MachineDeployment and the MachineSet did not have the v prefix after the upgrade too.

If possible, could you please share some more details about the state of the cluster and the YAMLs used before and after the upgrade?
Any details that can help with reproducing the issue would be helpful 🙂

@abhinavnagaraj
Contributor Author

@ykakarap Yes, these steps are correct. Now, if we scale the MachineDeployment with API version v1alpha4, we would expect the corresponding MachineSet (v1alpha4) to also scale up. Instead, the MachineDeployment version updates to have the prefix 'v'. A new MachineSet is launched with the version set to have the prefix 'v'. The old MachineSet finally reconciles to have 0 replicas, but does not get deleted.

@sbueringer
Member

sbueringer commented Oct 11, 2021

Note: That the old MachineSet doesn't get deleted is intended behaviour because of revisionHistoryLimit (which defaults to 1)

From your description I would guess that the scale triggers the MD defaulting webhook (which runs on create/update) which adds the v prefix and thus triggers the rollout of a new MachineSet.

I wonder how adding the same defaulting to the MachineSet logic can fix the issue as the MS webhook is only run on create/update.

@vincepri vincepri reopened this Oct 13, 2021
@vincepri vincepri added this to the Next milestone Oct 22, 2021
@vincepri
Member

/priority awaiting-more-evidence

@k8s-ci-robot k8s-ci-robot added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Oct 22, 2021
@sbueringer
Member

sbueringer commented Oct 22, 2021

@vincepri I think you reopened the issue because of my comment here: #5406 (comment)

I confirmed it. So post-upgrade the CAPI controller upgrades the template references automatically and, during that, also updates the version field through the webhooks. I think this only leaves small edge cases, like when someone updates from v1alpha3 to v1beta1 without updating the infra or bootstrap provider at the same time. I guess that's unlikely, as most providers will update their code and API version between v1alpha3 and v1beta1.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 20, 2022
@fabriziopandini
Member

/remove-lifecycle stale
@sbueringer to re-assess

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 20, 2022
@sbueringer
Member

@fabriziopandini To summarize:

  • In a regular CAPI update we update core CAPI, the bootstrap, the controlplane and the infra provider.
  • for this issue it's relevant that either the bootstrap provider or the infra provider used in the MachineSet is updated
  • when the MachineSetReconciler updates either of the references in the MachineSet because of that, this issue is automatically mitigated (because an update to one of the refs, automatically adds the v prefix to the version field)

In my opinion, the issue can now only occur if someone upgrades only core CAPI, without upgrading the bootstrap or infra provider. That's a very uncommon case, so I personally wouldn't try to add handling for it, especially as we didn't get any reports that it is an issue.

(The PR which closed this issue added the v-prefix defaulting in the MS webhook, which made it possible for the reconciler to automatically mitigate)
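The mitigation path can be sketched as follows (the struct and function names here are illustrative assumptions, not the real cluster-api types):

```go
package main

import (
	"fmt"
	"strings"
)

// machineSetSpec is a minimal stand-in for the relevant MachineSet fields.
type machineSetSpec struct {
	Version      string
	BootstrapRef string
}

// defaultMachineSet sketches the v-prefix defaulting added to the
// MachineSet webhook; it runs on every create/update of the object.
func defaultMachineSet(ms *machineSetSpec) {
	if ms.Version != "" && !strings.HasPrefix(ms.Version, "v") {
		ms.Version = "v" + ms.Version
	}
}

func main() {
	// A MachineSet carried over from v1alpha3 without the prefix.
	ms := &machineSetSpec{Version: "1.21.1", BootstrapRef: "bootstrap-old"}

	// During a provider upgrade the reconciler updates a reference; that
	// update passes through the webhook, which also normalizes the version.
	ms.BootstrapRef = "bootstrap-new"
	defaultMachineSet(ms)

	fmt.Println(ms.Version) // prints "v1.21.1"
}
```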

@sbueringer
Member

If we want to mitigate that edge case, we could add code which adds the prefix to the version field during each reconcile (if necessary).

I think I would prefer avoiding that, if we don't have to.

@fabriziopandini
Member

the issue can only occur if someone upgrades only core CAPI, without upgrading the bootstrap or infra provider.

That's a use case we are not supporting, given that clusterctl ensures that all the providers move up at the same time.
I agree we should not fix it.

@sbueringer
Member

the issue can only occur if someone upgrades only core CAPI, without upgrading the bootstrap or infra provider.

That's a use case we are not supporting, given that clusterctl ensures that all the providers move up at the same time I agree we should not fix

As discussed in the grooming on 18th February, let's close this issue based on that it's an unsupported use case to only upgrade core CAPI.

@sbueringer
Member

/close

@k8s-ci-robot
Contributor

@sbueringer: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
