Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 normalize MachineSet version validation #5406

Merged
merged 1 commit into from
Oct 13, 2021

Conversation

abhinavnagaraj
Copy link
Contributor

What this PR does / why we need it:
This PR normalizes the MachineSet Template version, to match the pattern v.<major>.<minor>.<patch>
This prevents the creation of new machines when upgrading from v1alpha3 to v1alpha4, when there are no changes in the MachineDeployment spec.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5405

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 7, 2021
@k8s-ci-robot
Copy link
Contributor

Hi @abhinavnagaraj. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 7, 2021
@sbueringer
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 7, 2021
Copy link
Contributor

@CecileRobertMichon CecileRobertMichon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 7, 2021
@fabriziopandini
Copy link
Member

/hold
For the ongoing discussion on the issue

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 8, 2021
@enxebre
Copy link
Member

enxebre commented Oct 12, 2021

/lgtm

Copy link
Member

@vincepri vincepri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 12, 2021
@sbueringer
Copy link
Member

As stated in #5405 (comment) I would really prefer to understand how this PR can fix the issue.

@vincepri
Copy link
Member

We should probably remove the Fixes #issue from the PR description, the change on its own seems a good one?

@sbueringer
Copy link
Member

sbueringer commented Oct 13, 2021

This change adds the v prefix on MachineSets which will make sure all new and updated MachineSets have the prefix. But all new MachineSets will have the prefix anyway because we enforce it on the MachineDeployment, so I think the change itself shouldn't be needed except to solve the edge case described in the issue (and imho it doesn't solve it).

So I wonder what happens after an upgrade. Initially (if I understood it correctly) you will have a MD and a MS without prefix. When you scale up the MD the v prefix will be added. I wonder who/what will trigger an update on the pre-existing MS to trigger the webhook.

Do status updates also trigger mutating webhooks configured for UPDATE? Or is our controller doing full updates on the MS resource?

If it is our controller which triggers the update, imho we will now have a race condition after this PR. More or less the MD and the MS have to get the prefix at the same time (or at least we should never run into a MD reconcile loop where only one of them has the prefix).

I'm probably missing something, but it sounds to me like the PR will make the issue harder to reproduce, not solve it. I think it could be solved for example by also adding the v prefix in the conversion webhook, because then we would always have the v prefix, which is our goal.

@sbueringer
Copy link
Member

sbueringer commented Oct 13, 2021

I played a bit around with this PR and I think it only helps when somehow the MS also is updated at the same time as the MD (but I couldn't reproduce this case)

Setup

  • Deploy controllers based on the current PR (ideally run the capi controller in the IDE to be able to inspect what the reconciler does)
  • Create an MD and a corresponding MS without the v prefix.

I had the following results

Option 1: Run kubectl scale to scale up the MD

  • MD will not get the v prefix because the defaulting webhook is not run on the scale subresource
  • MD controller tries to scale up the MS => the MS will get the v prefix as the MS defaulting webhook is run
  • Now we have a MD without prefix and a MS with the prefix
  • The MD will not be able to reconcile the MS without prefix as the MS webhook will always add the prefix
  • The MD controller now always runs into MS already exists errors as its not possible to reconcile the MS anymore (this return)

Option 2: Run kubectl edit and change replicas

  • MD will now get the prefix as the webhook is triggered when kubectl patches the MD
  • MD will create a new MS as the MD has the prefix but the existing MS does not (becaise the MS webhook was never triggered)

Option 3: Trigger updates on MD and MS concurrently

I guess if the MD and MS are updated both before a MD reconcile runs both will get the prefix and the reconcile will only scale up.

I think it might be a good idea to additionally adjust the conversion webhooks, so whenever a v1beta1 MD or MS is retrieved it will have the v prefix.
I think this should be safer as whenever an old MD or MS is read it will get the v prefix and as soon as the MD or MS is written to etcd it will actually get the prefix.
(Note: the only exception is MS which already have been converted now to v1beta1, but I would ignore that case)

@enxebre
Copy link
Member

enxebre commented Oct 13, 2021

, so I think the change itself shouldn't be needed except to solve the edge case described in the issue (and imho it doesn't solve it).

I think the change is legit orthogonally to the issue which I don't think it fixes. One for consistency and to eventually remove the edge cases you describe here #5406 (comment) and two because the MachineSet is nothing less than a user facing API which UX we should care same as a MachineDeployment, e.g as a user I might choose to run a MachineSet directly because I need more granular control than I have with a MachineDeployment and from a UX pov I'd be surprise if the fields differently.

I think it might be a good idea to additionally adjust the conversion webhooks, so whenever a v1beta1 MD or MS is retrieved it will have the v prefix.

sounds reasonable to me.

@vincepri
Copy link
Member

This change adds the v prefix on MachineSets which will make sure all new and updated MachineSets have the prefix. But all new MachineSets will have the prefix anyway because we enforce it on the MachineDeployment, so I think the change itself shouldn't be needed except to solve the edge case described in the issue (and imho it doesn't solve it).

Users might want to use MachineSet without a MachineDeployment though, which has been a valid use case for quite a while. There are cases where you need strict control on how MachineTemplate is rolled out, which you might want to leverage MachineSet for rather than a MachineDeployment resource. The change as-is seems valid from that point of view?

Do status updates also trigger mutating webhooks configured for UPDATE? Or is our controller doing full updates on the MS resource?

We usually don't configure our admission or validation webhooks to trigger on status (or any other subresource) changes

resources:
- machinesets

I'm probably missing something, but it sounds to me like the PR will make the issue harder to reproduce, not solve it. I think it could be solved for example by also adding the v prefix in the conversion webhook, because then we would always have the v prefix, which is our goal.

Wouldn't this cause a rollout?

To clarify, I don't think that the change in this PR solves the related issue, if it does it might be by chance or because something else gets triggered. We'd have to dig a little deeper on it.

@sbueringer
Copy link
Member

sbueringer commented Oct 13, 2021

Agree, the change itself is fine so that the MS itself works correctly.

| Wouldn't this cause a rollout?
I would assume it doesn't as MD and MS would be in sync

Ah I think I missed something. Maybe it fixes the issue because after an upgrade the MachineSetReconciler reconciles every MS at least once (after the list call (?)). During reconcile it reconciles the external references (aka upgrades the API versions of at least the bootstrap template ref) at least in our case because we've also upgrade the kubeadm types. Those updates should then trigger the defaulting.

Update:
I confirmed it. So post-upgrade the capi-controller is upgrading the template references automatically and during that also updates the version field through the webhooks. I think this only leaves small edge cases like when someone updates from v1alpha3 to v1beta1 without updating the infra or bootstrap provider at the same time. I guess that's unlikely as most providers will update their code and api version between v1alpha3 and v1beta1.

So I think it could be good enough to just fix it with this PR.

@sbueringer
Copy link
Member

/lgtm

@vincepri
Copy link
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 13, 2021
@k8s-ci-robot k8s-ci-robot merged commit af2feb1 into kubernetes-sigs:main Oct 13, 2021
@k8s-ci-robot k8s-ci-robot added this to the v0.4 milestone Oct 13, 2021
@sbueringer
Copy link
Member

@vincepri We should make sure this PR gets into the next v1.0 release. Do we need a cherry-pick? I'm not sure if there was a consensus around fast-forward.

@vincepri
Copy link
Member

/cherrypick release-1.0

@k8s-infra-cherrypick-robot

@vincepri: failed to push cherry-picked changes in GitHub: pushing failed, output: "To https://github.com/k8s-infra-cherrypick-robot/cluster-api\n ! [remote rejected] cherry-pick-5406-to-release-1.0 -> cherry-pick-5406-to-release-1.0 (refusing to allow a Personal Access Token to create or update workflow .github/workflows/golangci-lint.yml without workflow scope)\nerror: failed to push some refs to 'https://github.com/k8s-infra-cherrypick-robot/cluster-api'\n", error: exit status 1

In response to this:

/cherrypick release-1.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer
Copy link
Member

/cherrypick release-1.0

@k8s-infra-cherrypick-robot

@sbueringer: failed to push cherry-picked changes in GitHub: pushing failed, output: "To https://github.com/k8s-infra-cherrypick-robot/cluster-api\n ! [remote rejected] cherry-pick-5406-to-release-1.0 -> cherry-pick-5406-to-release-1.0 (refusing to allow a Personal Access Token to create or update workflow .github/workflows/golangci-lint.yml without workflow scope)\nerror: failed to push some refs to 'https://github.com/k8s-infra-cherrypick-robot/cluster-api'\n", error: exit status 1

In response to this:

/cherrypick release-1.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer
Copy link
Member

I've opened a Slack thread: https://kubernetes.slack.com/archives/CDECRSC5U/p1634797739002500

@sbueringer
Copy link
Member

/cherrypick release-1.0

@k8s-infra-cherrypick-robot

@sbueringer: failed to push cherry-picked changes in GitHub: pushing failed, output: "To https://github.com/k8s-infra-cherrypick-robot/cluster-api\n ! [remote rejected] cherry-pick-5406-to-release-1.0 -> cherry-pick-5406-to-release-1.0 (refusing to allow a Personal Access Token to create or update workflow .github/workflows/golangci-lint.yml without workflow scope)\nerror: failed to push some refs to 'https://github.com/k8s-infra-cherrypick-robot/cluster-api'\n", error: exit status 1

In response to this:

/cherrypick release-1.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer
Copy link
Member

New thread in #testing-ops: https://kubernetes.slack.com/archives/C7J9RP96G/p1634827485007700

@sbueringer
Copy link
Member

/cherrypick release-1.0

@k8s-infra-cherrypick-robot

@sbueringer: failed to push cherry-picked changes in GitHub: pushing failed, output: "To https://github.com/k8s-infra-cherrypick-robot/cluster-api\n ! [remote rejected] cherry-pick-5406-to-release-1.0 -> cherry-pick-5406-to-release-1.0 (refusing to allow a Personal Access Token to create or update workflow .github/workflows/golangci-lint.yml without workflow scope)\nerror: failed to push some refs to 'https://github.com/k8s-infra-cherrypick-robot/cluster-api'\n", error: exit status 1

In response to this:

/cherrypick release-1.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer
Copy link
Member

/cherrypick release-0.4

@k8s-infra-cherrypick-robot

@sbueringer: new pull request created: #5482

In response to this:

/cherrypick release-0.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer
Copy link
Member

/cherrypick release-1.0

@k8s-infra-cherrypick-robot

@sbueringer: new pull request created: #5560

In response to this:

/cherrypick release-1.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MachineSet version changes, upgrading from v1alpha3 to v1alpha4
8 participants