
✨ Add rolloutAfter for RollingUpdate strategy #4596

Closed
wants to merge 1 commit

Conversation

enxebre
Member

@enxebre enxebre commented May 11, 2021

What this PR does / why we need it:
If the reconciliation time is after spec.rolloutAfter then a rollout should happen or has already happened.
A new MachineSet is needed the first time the reconciliation time is after spec.rolloutAfter.
Otherwise we pick the oldest with creation timestamp > lastRolloutAfter.
If a new MachineSet is needed, we include the rolloutAfter hash in the MachineSet name, so the new MachineSet's name does not clash with the existing MachineSet that has the same template, and the rollout can be orchestrated as usual.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4536

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 11, 2021
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 11, 2021
@enxebre enxebre changed the title WIP: Add rolloutAfter for RollingUpdate strategy WIP: Add rolloutAfter/rolloutOlderThan for RollingUpdate strategy May 11, 2021
@enxebre
Member Author

enxebre commented May 11, 2021

@vincepri
Member

@enxebre Do you think we could tackle rolloutOlderThan separately?

@enxebre
Member Author

enxebre commented May 11, 2021

/retitle WIP: Add rolloutAfter for RollingUpdate strategy

@k8s-ci-robot k8s-ci-robot changed the title WIP: Add rolloutAfter/rolloutOlderThan for RollingUpdate strategy WIP: Add rolloutAfter for RollingUpdate strategy May 11, 2021
Member

@sbueringer sbueringer left a comment


I could be totally wrong with most of these comments. I don't have a lot of experience with the up and downscale logic.

controllers/machinedeployment_controller.go — 7 outdated review threads (resolved)
@enxebre
Member Author

enxebre commented May 14, 2021

I could be totally wrong with most of these comments. I don't have a lot of experience with the up and downscale logic.

Thanks @sbueringer, this is still kind of pseudocode; I wanted to hear others' feedback to see if the algorithm makes sense.
I'll give it another pass to get it to a mergeable state.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 27, 2021
@enxebre
Member Author

enxebre commented Sep 9, 2021

Back to this. We can probably simplify by using a label/annotation approach similar to what kubectl rollout restart does. I'll give it a go.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 9, 2021
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 9, 2021
@enxebre
Member Author

enxebre commented Sep 9, 2021

@sbueringer @vincepri what do you think? If we want to go this path, I'll get the PR to a mergeable state.

@enxebre enxebre force-pushed the rollout-after branch 3 times, most recently from 750c9cd to c5d806c Compare September 9, 2021 11:21
Comment on lines 38 to 40
// RolledoutAfterAnnotation annotates the MachineTemplateSpec to trigger a
// MachineDeployment rollout when the RolloutAfter criteria is met.
RolledoutAfterAnnotation = "machinedeployment.clusters.x-k8s.io/RolledoutAfter"
Member

Instead of relying on a label, could we add additional logic like we have in KCP to take into account the Spec.RolloutAfter field?

Member Author

@enxebre enxebre Sep 9, 2021


MachineDeployment.Spec.RolloutAfter is the source of truth and the annotation is an impl detail https://github.com/kubernetes-sigs/cluster-api/pull/4596/files#diff-34f64847a48c9fc1b3222c9bec6c93f30767a93262215eb4305d78c3b9097c63R225.

Orchestrating this "manually" as KCP does doesn't seem optimal here, since Machines are owned by MachineSets and subject to their reconciliation logic and to MachineDeployment replicas, so leveraging the MachineDeployment's ability to roll out might be the most viable path.

Not sure if you were suggesting something different?

Member

The rollout is determined in this function https://github.com/kubernetes-sigs/cluster-api/blob/master/controllers/mdutil/util.go#L416-L431

Today we're only looking at the MachineTemplate, we could add some logic there to compare CreationTimestamp with RolloutAfter?

Member Author

I thought about that, but I couldn't find a working way in that direction so far because mdutil.FindNewMachineSet(d, msList) prefers the oldest MachineSet in case of equal templates; see my reasoning here #4596 (comment)

Member

The issue with the above is that mdutil.FindNewMachineSet(d, msList) seems to prefer oldest MachineSet in case of equal templates.

From the text in there, the assumption made is that usually there should be only one MachineSet with the same template, otherwise we wouldn't need to roll out changes; there is an edge case where there could be more than one MachineSet with the same template, and the code always picks the oldest one (which might already be scaled).

The above shouldn't impact adding new logic that looks at the creation timestamp and compares with a rolloutAfter field?

Member Author

@enxebre enxebre Sep 10, 2021

Wouldn't we also check the CreationTimestamp now at every iteration though? If there is a RolloutAfter, the MachineSet CreationTimestamp must be after the RolloutAfter's value. Even if templates are the same, we shouldn't pick the oldest one in that case.

Yes, we could do that, but then we'd be changing FindNewMachineSet to always pick the newest even in the duplicated-template edge case. We have no means to differentiate between the duplicated-template edge case and the rolloutAfter use case, so for both we'd be picking the newest one. If we don't care about the edge case, then it might work. I tried to summarise that reasoning here #4596 (comment).

Adding an annotation or label to a MachineDeployment to rollout a template is truthfully something that might change in the future. I'd rather find a better way to rollout changes without using any behaviors, but rather be explicit about it when we need to calculate changes. In the future we might want to add more logic to determine if a MachineSet needs to be deployed, and relying on annotations is probably not going to work out long term.

I'm approaching the rolling upgrade logic here as a working black-box building block. We can do any calculation we want outside it to decide we want a rolling upgrade; the annotation is just how we signal it from a level above. I see the decision to do a rolling upgrade and the rollout logic as two different levels. Though I can also see your angle.

Member

@vincepri vincepri Sep 10, 2021

Yes we could do that but then we'll be changing FindNewMachineSet to always pick the newest even in the duplicated template edge case.

The duplicate one is truthfully an edge case that should be audited, but that's probably something for a different issue/PR. The code is the same as the ReplicaSet code upstream, which probably has this issue because api-server and controller-manager could be restarting and duplicating ReplicaSet(s). I don't know if this can happen for MachineSets.

To further clarify though, we wouldn't be changing the code to pick the newest one, but the oldest one with these conditions:

  • ms.metadata.creationTimestamp > md.spec.rolloutAfter
  • EqualMachineTemplate(md, ms) == true

We have no means to differentiate between the duplicated template edge use case and the rolloutAfter use case, so for both we'll be picking the newest one.

This shouldn't apply in case of RolloutAfter. If a rollout time is in effect, even if we have duplicates, we've already rolled out a new MachineSet at that point.

If there are duplicates (depending on whatever RevisionHistory is set to), cleanupDeployment function takes care of deleting outdated MachineSets.

Member Author

@enxebre enxebre Sep 10, 2021

The edge case issue was because different kube versions resulted in different hashes for ReplicaSet names resulting in duplicated ReplicaSets.

To further clarify though, we wouldn't be changing the code to pick the newest one, but the oldest one with these conditions:

  • ms.metadata.creationTimestamp > md.spec.rolloutAfter
  • EqualMachineTemplate(md, ms) == true

Ok this makes sense let me implement it and keep discussing through code.

Member Author

@vincepri @sbueringer updated PTAL.

Member Author

Updated to test and support the following possible cases:

  • RolloutAfter == nil
    • Pick oldest
  • RolloutAfter < Now
    • Oldest
    • Created > last RolloutAfter
  • RolloutAfter > Now
    • Oldest
    • Created > last RolloutAfter

To keep track of the last rolloutAfter I'm using an annotation, LastRolloutAfterAnnotation. We could use a status field, but I'd rather wait for the implementation to mature and see how/if it combines with things like maxAge.

@vincepri @sbueringer PTAL.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 13, 2021
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 20, 2021
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 22, 2021
@enxebre
Member Author

enxebre commented Sep 22, 2021

Let's wait for #5297 to rebase and proceed with this

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 23, 2021
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 28, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign timothysc after the PR has been reviewed.
You can assign the PR to them by writing /assign @timothysc in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@enxebre
Member Author

enxebre commented Sep 28, 2021

@vincepri @sbueringer I rebased this. Given the timelines and the lack of feedback/testing in real use I think we should probably punt on it until we release v1beta1.

@enxebre enxebre force-pushed the rollout-after branch 2 times, most recently from 5d0f71e to 9a2964d Compare September 28, 2021 15:16
Comment on lines +437 to +441
Name: "When rolloutAfter is not required but another one took place in the past it should return oldest created after the last rolloutAfter",
deployment: deploymentRolloutAfterHappened,
msList: []*clusterv1.MachineSet{&machineSetBeforeLastRolloutAfter, &machineSetAfterLastRolloutAfter, &oldestMachineSetAfterLastRolloutAfter},
expected: &oldestMachineSetAfterLastRolloutAfter,
},
Member

I'm a bit confused about when this unit test case would happen. If I understood the code correctly, this use case adds the requirement for the annotation. Wouldn't machineSetBeforeLastRolloutAfter be gone, though, after the rollout takes place?

Member Author

@enxebre enxebre Sep 29, 2021

It won't necessarily be gone; it might just be scaled down to zero, right? Whether it's deleted depends only on .spec.RevisionHistoryLimit, right?

Comment on lines +61 to +62
// LastRolloutAfterAnnotation is the last MachineDeployment.Spec.RolloutAfter that was met.
LastRolloutAfterAnnotation = "machinedeployment.clusters.x-k8s.io/lastRollout"
Member

Is the annotation applied and stored only on MachineSets in this case?

Comment on lines +396 to 397
// Deprecated: this does not work for rolloutAfter. Use FindNewMachineSetWithRolloutAfter.
func FindNewMachineSet(deployment *clusterv1.MachineDeployment, msList []*clusterv1.MachineSet) *clusterv1.MachineSet {
Member

We can probably change these functions' signatures now instead?

Member Author

Wouldn't that be a breaking change? Don't we need to target this to v1beta2 now, since this is an API change?

Member

@sbueringer sbueringer Oct 12, 2021

Not 100% sure, but I think the idea is that we can have code-breaking changes between v1.0 and v1.1, i.e. only the API types and their behavior must not have breaking changes.

Or from another point of view: v1beta1 guards our API types and the corresponding implementation, and the release version guards the code (+ we're doing breaking changes there on minor versions as we're not incrementing the major version, similar to Kubernetes).

Member Author

Makes sense, though I'm not sure a new field can be added without creating a subsequent API version.

Member

I think the idea was to allow new fields if they are non-breaking, but let's see if somebody else knows :)

Member

Internal packages are not exposed APIs, so we can break them even within patch releases, given that only internal code might use them.

Member

@sbueringer sbueringer Oct 14, 2021

I'm not sure, but I was assuming we started talking at some point about adding the RolloutAfter field to the v1beta1 API type.

@k8s-ci-robot
Contributor

@enxebre: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 25, 2021
@randomvariable
Member

Are we still trying to get this one in?

@k8s-ci-robot
Contributor

@enxebre: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-cluster-api-e2e-main
Commit: 98600a6
Required: true
Rerun command: /test pull-cluster-api-e2e-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 8, 2022
@vincepri
Member

/close

Closing due to age, @enxebre feel free to reopen after rebase if this is still a change we want to pursue

@k8s-ci-robot
Contributor

@vincepri: Closed this PR.

In response to this:

/close

Closing due to age, @enxebre feel free to reopen after rebase if this is still a change we want to pursue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
  • cncf-cla: yes — Indicates the PR's author has signed the CNCF CLA.
  • do-not-merge/hold — Indicates that a PR should not merge because someone has issued a /hold command.
  • lifecycle/stale — Denotes an issue or PR has remained open with no activity and has become stale.
  • needs-rebase — Indicates a PR cannot be merged because it has merge conflicts with HEAD.
  • size/L — Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for RolloutAfter to MachineDeployments
7 participants