✨ Add rolloutAfter for RollingUpdate strategy #4596
Conversation
@enxebre Do you think we could tackle
/retitle WIP: Add rolloutAfter for RollingUpdate strategy
I could be totally wrong with most of these comments. I don't have a lot of experience with the up and downscale logic.
Thanks @sbueringer, this is still kind of pseudocode; I wanted to hear others' feedback to see if the algorithm makes sense.
Back to this. We can probably simplify by using a label/annotation approach similar to what
Force-pushed fe925d4 to d2072ce
Force-pushed d2072ce to 7182629
@sbueringer @vincepri what do you think? If we want to go this path I'll get the PR to a mergeable state.
Force-pushed 750c9cd to c5d806c
// RolledoutAfterAnnotation annotates the MachineTemplateSpec to trigger a
// MachineDeployment rollout when the RolloutAfter criteria is met.
RolledoutAfterAnnotation = "machinedeployment.clusters.x-k8s.io/RolledoutAfter"
Instead of relying on a label, could we add additional logic like we have in KCP to take into account the Spec.RolloutAfter field?
MachineDeployment.Spec.RolloutAfter is the source of truth and the annotation is an implementation detail: https://github.com/kubernetes-sigs/cluster-api/pull/4596/files#diff-34f64847a48c9fc1b3222c9bec6c93f30767a93262215eb4305d78c3b9097c63R225.
Orchestrating this "manually" as KCP does doesn't seem optimal here, since Machines are owned by MachineSets and subject to their reconciliation logic and to the MachineDeployment replicas, so leveraging the MachineDeployment's ability to roll out might be the most viable path.
Not sure if you were suggesting something different?
The rollout is determined in this function: https://github.com/kubernetes-sigs/cluster-api/blob/master/controllers/mdutil/util.go#L416-L431
Today we're only looking at the MachineTemplate; we could add some logic there to compare CreationTimestamp with RolloutAfter?
I thought about that, but I couldn't think of a working way in that direction so far because mdutil.FindNewMachineSet(d, msList) prefers the oldest MachineSet in case of equal templates; see my reasoning here #4596 (comment)
The issue with the above is that mdutil.FindNewMachineSet(d, msList) seems to prefer the oldest MachineSet in case of equal templates.
From the text in there, the assumption is that usually there should be only one MachineSet with a given template, otherwise we wouldn't need to roll out a change; there is an edge case where there could be more than one MachineSet with the same template, and the code always picks the oldest one (which might already be scaled).
The above shouldn't impact adding new logic that looks at the creation timestamp and compares it with a rolloutAfter field?
Wouldn't we also check the CreationTimestamp now at every iteration though? If there is a RolloutAfter, the MachineSet CreationTimestamp must be after the RolloutAfter's value. Even if templates are the same, we shouldn't pick the oldest one in that case.
Yes we could do that, but then we'd be changing FindNewMachineSet to always pick the newest even in the duplicated-template edge case. We have no means to differentiate between the duplicated-template edge case and the rolloutAfter use case, so for both we'd be picking the newest one. If we don't care about the edge case, then it might work. I tried to summarise that reasoning here #4596 (comment).
Adding an annotation or label to a MachineDeployment to roll out a template is truthfully something that might change in the future. I'd rather find a better way to roll out changes without relying on implicit behaviors, but rather be explicit about it when we need to calculate changes. In the future we might want to add more logic to determine if a MachineSet needs to be deployed, and relying on annotations is probably not going to work out long term.
I'm approaching the rolling-upgrade logic here as a working black-box building block. We can do any calculation we want outside it to decide we want a rolling upgrade; the annotation is just how we signal it from a level above. I'm seeing the decision to do a rolling upgrade and the rollout logic as two different levels. Though I can also see your angle.
Yes we could do that but then we'll be changing FindNewMachineSet to always pick the newest even in the duplicated template edge case.
The duplicate one is truthfully an edge case that should be audited, but that's probably something for a different issue/PR. The code is the same as the ReplicaSet code upstream, which probably has this issue because api-server and controller-manager could be restarting and duplicating ReplicaSet(s). I don't know if this can happen for MachineSets.
To further clarify though, we wouldn't be changing the code to pick the newest one, but the oldest one with these conditions:
ms.metadata.creationTimestamp > md.spec.rolloutAfter
EqualMachineTemplate(md, ms) == true
We have no means to differentiate between the duplicated template edge use case and the rolloutAfter use case, so for both we'll be picking the newest one.
This shouldn't apply in case of RolloutAfter. If a rollout time is in effect, even if we have duplicates, we've already rolled out a new MachineSet at that point.
If there are duplicates (depending on whatever RevisionHistory is set to), the cleanupDeployment function takes care of deleting outdated MachineSets.
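A rough standalone sketch of that cleanup behavior (simplified stand-in types, not the actual cleanupDeployment signature): among MachineSets that are fully scaled down, only the newest revisionHistoryLimit are kept.

```go
package main

import (
	"fmt"
	"sort"
)

// oldSet is a stand-in for a MachineSet with just the fields the cleanup needs.
type oldSet struct {
	name     string
	revision int
	replicas int
}

// cleanupOldSets returns the names of scaled-down MachineSets that exceed the
// history limit, oldest revisions first, so they can be deleted.
func cleanupOldSets(sets []oldSet, historyLimit int) []string {
	var down []oldSet
	for _, s := range sets {
		if s.replicas == 0 {
			down = append(down, s)
		}
	}
	sort.Slice(down, func(i, j int) bool { return down[i].revision < down[j].revision })
	var doomed []string
	if len(down) > historyLimit {
		for _, s := range down[:len(down)-historyLimit] {
			doomed = append(doomed, s.name)
		}
	}
	return doomed
}

func main() {
	sets := []oldSet{{"ms-1", 1, 0}, {"ms-2", 2, 0}, {"ms-3", 3, 3}}
	fmt.Println(cleanupOldSets(sets, 1)) // ms-3 still has replicas; keep the newest scaled-down set
}
```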
The edge case issue was because different kube versions resulted in different hashes for ReplicaSet names resulting in duplicated ReplicaSets.
To further clarify though, we wouldn't be changing the code to pick the newest one, but the oldest one with these conditions:
ms.metadata.creationTimestamp > md.spec.rolloutAfter
EqualMachineTemplate(md, ms) == true
Ok this makes sense let me implement it and keep discussing through code.
@vincepri @sbueringer updated PTAL.
Updated to test and support the following possible cases:
- RolloutAfter == nil: pick the oldest MachineSet.
- RolloutAfter < Now: pick the oldest MachineSet created after the last RolloutAfter.
- RolloutAfter > Now: pick the oldest MachineSet created after the last RolloutAfter.
To keep track of the last rolloutAfter I'm using an annotation, LastRolloutAfterAnnotation. We could use a status field, but I'd rather wait for the implementation to mature and see how/if it combines with things like maxAge.
@vincepri @sbueringer PTAL.
Force-pushed c5d806c to c6edd12
Force-pushed c6edd12 to 761be61
Force-pushed fa8000f to 425e0bb
Force-pushed 425e0bb to 6208c03
Force-pushed 6208c03 to 8ed5286
Let's wait for #5297 to rebase and proceed with this
Force-pushed 8ed5286 to de831b2
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@vincepri @sbueringer I rebased this. Given the timelines and the lack of feedback/testing in real use, I think we should probably punt on it until we release v1beta1.
Force-pushed 5d0f71e to 9a2964d
Name: "When rolloutAfter is not required but another one took place in the past it should return oldest created after the last rolloutAfter",
deployment: deploymentRolloutAfterHappened,
msList: []*clusterv1.MachineSet{&machineSetBeforeLastRolloutAfter, &machineSetAfterLastRolloutAfter, &oldestMachineSetAfterLastRolloutAfter},
expected: &oldestMachineSetAfterLastRolloutAfter,
},
I'm a bit confused about when this unit test case would happen. If I understood the code correctly, this use case adds the requirement for the annotation. Wouldn't machineSetBeforeLastRolloutAfter be gone after the rollout takes place?
It won't necessarily be gone; it might be scaled down to zero, right? That depends only on .spec.RevisionHistoryLimit, right?
If the reconciliation time is after spec.rolloutAfter then a rollout should happen or has already happened. A new MachineSet is needed the first time the reconciliation time is after spec.rolloutAfter. Otherwise we pick the oldest with creation timestamp > lastRolloutAfter. If a new MachineSet is needed we include the rolloutAfter hash into the MachineSet name so when a new MachineSet is created the name does not clash with the one for the existing MachineSet with the same template and the rollout can be orchestrated as usual.
Force-pushed 9a2964d to 98600a6
// LastRolloutAfterAnnotation is the last MachineDeployment.Spec.RolloutAfter that was met.
LastRolloutAfterAnnotation = "machinedeployment.clusters.x-k8s.io/lastRollout"
Is the annotation applied and stored only on MachineSets in this case?
// Deprecated: this does not account for rolloutAfter. Use FindNewMachineSetWithRolloutAfter instead.
func FindNewMachineSet(deployment *clusterv1.MachineDeployment, msList []*clusterv1.MachineSet) *clusterv1.MachineSet {
We can probably change these functions' signatures now instead?
Wouldn't that be a breaking change? Don't we need to target this to v1beta2 now, since this is an API change?
Not 100% sure, but I think the idea is that we can have code breaking changes between v1.0 and v1.1, i.e. only the API types and their behavior must not have breaking changes.
Or from another point of view: v1beta1 guards our API types and corresponding implementation, and the release version guards the code (+ we're doing breaking changes there on minor versions as we're not incrementing major, similar to Kubernetes).
Makes sense, though I'm not sure a new field can be added without creating a subsequent API version.
I think the idea was to allow new fields if they are non-breaking, but let's see if somebody else knows :)
Internal packages are not exposed APIs, so we can break even within patch releases given that only internal code might use them
I'm not sure, but I was assuming we started talking at some point about the addition of the RolloutAfter field to the v1beta1 API type.
@enxebre: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Are we still trying to get this one in?
@enxebre: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/close Closing due to age. @enxebre feel free to reopen after rebase if this is still a change we want to pursue.
@vincepri: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What this PR does / why we need it:
If the reconciliation time is after spec.rolloutAfter then a rollout should happen or has already happened.
A new MachineSet is needed the first time the reconciliation time is after spec.rolloutAfter.
Otherwise we pick the oldest with creation timestamp > lastRolloutAfter.
If a new MachineSet is needed we include the rolloutAfter hash into the MachineSet name so when a new MachineSet is created the name does not clash with the one for the existing MachineSet with the same template and the rollout can be orchestrated as usual.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #4536