fix: reset the progress condition when a pod is restarted #1649
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master    #1649      +/-   ##
==========================================
+ Coverage   81.97%   81.99%   +0.01%
==========================================
  Files         115      115
  Lines       15913    15925      +12
==========================================
+ Hits        13045    13057      +12
  Misses       2198     2198
  Partials      670      670

Continue to review the full report at Codecov.
@huikang Do you know if this problem happens in the general sense? If so, I think this won't solve the problem in the generic case (e.g., when pods restart out-of-band from a rollout restart action), because this code is only in the pod restarter Reconcile.
If we can reproduce this outside of the restart action, then we will need to fix this in a more generic way. I think one change that might be needed is: at the time we transition from Healthy to Progressing, we also need to reset the progress counter so that the deadline is reset after we transition out of Healthy.
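For illustration, here is a minimal standalone Go sketch of that idea: refresh the Progressing condition's last-update timestamp when a Healthy rollout transitions back to Progressing, so the progressDeadlineSeconds clock restarts. The types and names are hypothetical stand-ins, not the actual argo-rollouts types.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a minimal stand-in for a rollout status condition (hypothetical type).
type Condition struct {
	Type           string
	Status         string
	LastUpdateTime time.Time
}

// resetProgressDeadline refreshes the Progressing condition's LastUpdateTime so the
// progressDeadlineSeconds clock restarts when a Healthy rollout becomes Progressing again.
func resetProgressDeadline(conds []Condition, now time.Time) {
	for i := range conds {
		if conds[i].Type == "Progressing" {
			conds[i].LastUpdateTime = now
		}
	}
}

func main() {
	conds := []Condition{{Type: "Progressing", Status: "True", LastUpdateTime: time.Now().Add(-2 * time.Hour)}}
	resetProgressDeadline(conds, time.Now()) // called on the Healthy -> Progressing transition
	fmt.Println(conds[0].LastUpdateTime)
}
```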
Pods can transition to a non-running state for a variety of reasons, for example because a k8s liveness probe starts failing.
I think we should think a little bit more about the behavior. Do we expect an already completed rollout to transition to a "Progressing" state if one of the stable replicas goes from "Running" to "Pending"? In my mind, I wouldn't care what happens to the stable replicas once the rollout is completed.
cc @huikang @jessesuen What do you think?
@jessesuen, you are right. This only fixes the restart case. @agrawroh, if the pods of a completed rollout become unready (e.g., restart, failed liveness probe), we should move the rollout from healthy to progressing. So a generic solution, as @jessesuen mentioned, is to reset the progress counter so that conditions.RolloutTimedOut doesn't return true based on an old LastUpdate time.
@huikang Thanks for the explanation. This might be a naive question: do we expect the rollout to create another AnalysisRun if it goes from "Completed" -> "Progressing"? Also, if somehow the …
No. We only trigger analysis when we are in the middle of an update. Transitioning to progressing does not qualify in that regard. But if the user "retries" the update, analysis will start again. We know we are in the middle of an update when the stable hash != desired hash.
The intended way progressDeadlineAbort is meant to work is: if we are in the middle of an update, we will abort and go back to stable if we fail to make progress. This does not apply when we are in a fully rolled-out state (stable is already desired). That said, if we exceed progressDeadlineSeconds when we are fully promoted (stable == desired), the rollout should be in a degraded status (just like Deployment).
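As an illustration of those semantics, a small standalone Go sketch (hypothetical helper names, not the controller's actual code) distinguishing the mid-update case from the fully promoted case once the deadline is exceeded:

```go
package main

import "fmt"

// midUpdate reports whether the rollout is in the middle of an update:
// the stable hash differs from the desired (current) hash.
func midUpdate(stableHash, desiredHash string) bool {
	return stableHash != desiredHash
}

// outcomeAfterDeadline sketches what should happen once progressDeadlineSeconds is exceeded.
func outcomeAfterDeadline(stableHash, desiredHash string) string {
	if midUpdate(stableHash, desiredHash) {
		return "abort and roll back to stable" // progressDeadlineAbort applies mid-update
	}
	return "Degraded" // fully promoted (stable == desired) but failing to make progress
}

func main() {
	fmt.Println(outcomeAfterDeadline("abc123", "def456")) // mid-update
	fmt.Println(outcomeAfterDeadline("abc123", "abc123")) // fully promoted
}
```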
Force-pushed 3033e19 -> 5241e56
The logic is as follows:

if a complete rollout becomes incomplete and is not in the middle of an update {
    reset the progress LastUpdate time to t2
}
if (t3 - t2) > progressDeadlineSeconds (i.e., the stable RS of a fully promoted rollout only becomes ready again after `progressDeadlineSeconds`) {
    the rollout is degraded
} else {
    the rollout is healthy
}
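A runnable Go sketch of that decision, using hypothetical variable names (t2 is the reset LastUpdate time, t3 is when the stable ReplicaSet becomes ready again); this is only an illustration of the logic above, not the controller code:

```go
package main

import (
	"fmt"
	"time"
)

// rolloutPhase applies the logic above: if the fully promoted rollout takes longer than
// progressDeadlineSeconds to become ready again, it is degraded; otherwise it is healthy.
func rolloutPhase(t2, t3 time.Time, progressDeadlineSeconds int32) string {
	deadline := time.Duration(progressDeadlineSeconds) * time.Second
	if t3.Sub(t2) > deadline {
		return "Degraded"
	}
	return "Healthy"
}

func main() {
	t2 := time.Now()                       // progress LastUpdate time, reset when the rollout became incomplete
	t3 := t2.Add(30 * time.Second)         // stable RS becomes ready again 30s later
	fmt.Println(rolloutPhase(t2, t3, 600)) // Healthy: ready again within progressDeadlineSeconds
	fmt.Println(rolloutPhase(t2, t2.Add(700*time.Second), 600)) // Degraded: deadline exceeded
}
```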
Force-pushed 5241e56 -> f4cb704
Looks good overall. Just some nits.
rollout/sync.go (outdated)

// if any rs status changes (e.g., pod restarted, evicted -> recreated) to a previous completed rollout,
// we need to reset the progressCondition to avoid timeout
if changed && c.stableRS != nil && c.newRS != nil && (replicasetutil.GetPodTemplateHash(c.stableRS) == replicasetutil.GetPodTemplateHash(c.newRS)) {
    existProgressingCondition := conditions.GetRolloutCondition(newStatus, v1alpha1.RolloutProgressing)
Minor: existProgressingCondition --> isRolloutConditionInProgressing
isRolloutConditionInProgressing sounds like a bool variable. existProgressingCondition is consistent with other places where GetRolloutCondition is called. What do you think?
rollout/restart.go (outdated)

@@ -122,6 +124,18 @@ func (p *RolloutPodRestarter) Reconcile(roCtx *rolloutContext) error {
    }
    canRestart -= 1
    restarted += 1

    // TODO: remove
Do you still want to keep this?
Removed. Thanks.
Force-pushed 8c8ef2c -> 1bc4f43
rollout/sync.go (outdated)

// If the ReplicaSet status changes (e.g., one of the pods restarts, evicted -> recreated) for a previously
// completed rollout, we'll need to reset the rollout's condition to `PROGRESSING` to avoid any timeouts.
if changed && c.stableRS != nil && c.newRS != nil && (replicasetutil.GetPodTemplateHash(c.stableRS) == replicasetutil.GetPodTemplateHash(c.newRS)) {
It seems we only do this when we are fully promoted. But why isn't this check just:

if changed {

In all cases, if we move from Completed to not Completed, don't we want to reset the progressing condition timestamp?
True. When a rollout becomes incomplete from complete, we always want to reset the progressing condition.
I made the change and tested it successfully with the plugin's rollout restart and with out-of-band pod recreation.
@jessesuen, I remember the reason for using

if changed && c.stableRS != nil && c.newRS != nil && (replicasetutil.GetPodTemplateHash(c.stableRS) == replicasetutil.GetPodTemplateHash(c.newRS))

is to only reset the progress condition when (the rollout becomes incomplete from complete) AND (it is not in the middle of an update).
During an update, we don't need to reset the progressing condition here, because it has already been reset. So using only the condition `if changed` causes duplicate resetting. WDYT?
Reverted to the condition of (rollout becomes incomplete from complete) AND (not in the middle of an update) because the e2e test TestFunctionalSuite/TestWorkloadRef failed when using only `if changed {`:
ERRO[2021-11-13T00:20:14-05:00] Recovered from panic: runtime error: invalid memory address or nil pointer dereference
goroutine 185 [running]:
runtime/debug.Stack(0xc0006be678, 0x2f60a40, 0x472d5e0)
/usr/local/Cellar/go/1.16.3/libexec/src/runtime/debug/stack.go:24 +0x9f
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1.1(0xc000ba6070, 0xc0006bfaf0)
/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/utils/controller/controller.go:149 +0x5b
panic(0x2f60a40, 0x472d5e0)
/usr/local/Cellar/go/1.16.3/libexec/src/runtime/panic.go:965 +0x1b9
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).calculateRolloutConditions(0xc000bd3c00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0008e58e4, 0xa, 0x0, ...)
/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/rollout/sync.go:597 +0x13c0
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).syncRolloutStatusBlueGreen(0xc000bd3c00, 0x0, 0xc000b30280, 0x0, 0x0)
/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/rollout/bluegreen.go:312 +0x538
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).rolloutBlueGreen(0xc000bd3c00, 0x32fa72f, 0x17)
/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/rollout/bluegreen.go:35 +0x2f5
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcile(0xc000bd3c00, 0xc000026c00, 0xc000bd3c00)
/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/rollout/context.go:82 +0x1e5
github.com/argoproj/argo-rollouts/rollout.(*Controller).syncHandler(0xc000c12000, 0xc00094a040, 0x1e, 0x0, 0x0)
/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/rollout/controller.go:396 +0x630
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1(0x0, 0x0)
/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/utils/controller/controller.go:153 +0x7c
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1(0x36dec60, 0xc000b6ed80, 0x32df9ad, 0x7, 0xc00149de60, 0xc000490e00, 0x2e4c200, 0xc001192850, 0x0, 0x0)
/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/utils/controller/controller.go:157 +0x326
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem(0x36dec60, 0xc000b6ed80, 0x32df9ad, 0x7, 0xc00149de60, 0xc000490e00, 0x1)
/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/utils/controller/controller.go:171 +0x9a
github.com/argoproj/argo-rollouts/utils/controller.RunWorker(...)
/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/utils/controller/controller.go:104
github.com/argoproj/argo-rollouts/rollout.(*Controller).Run.func1()
/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/rollout/controller.go:326 +0xa5
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000e4a050)
/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000e4a050, 0x366f0e0, 0xc0005d6c00, 0x1, 0xc0005a0300)
/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000e4a050, 0x3b9aca00, 0x0, 0x1, 0xc0005a0300)
/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc000e4a050, 0x3b9aca00, 0xc0005a0300)
/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by github.com/argoproj/argo-rollouts/rollout.(*Controller).Run
/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/rollout/controller.go:325 +0xac namespace=default rollou
Hmmm. I was under the impression that we would not enter into a completed condition when the rollout was in the middle of an update, which is why I suggested the change. Are you finding that is not the case?
There might be some bug with workloadRef that causes entering a completed condition in the middle of an update. Let me dig deeper into this case.
OK. Let me know what you find. I think we might be masking some underlying issue by having the second check. If it turns out it is really necessary, I think a better option is to use the convenience method:
conditions.RolloutComplete(c.rollout, &newStatus)
@jessesuen, it turns out the panic was due to a nil pointer in c.stableRS for workloadRef. `if changed` works fine, and I also added a new progressing reason and its associated message.
Force-pushed 43926aa -> 0992906
rollout/sync.go (outdated)

newProgressingCondition := conditions.NewRolloutCondition(v1alpha1.RolloutProgressing, corev1.ConditionTrue, conditions.RolloutBecomesIncomplete, conditions.RolloutBecomesIncompleteMessage)
conditions.SetRolloutCondition(&newStatus, *newProgressingCondition)
@huikang I think this message and reason will get clobbered by the switch statement that comes later.
Instead of updating the Progressing condition here (where we detect that we became incomplete), can we update it at the place in the code where we already manage the Progressing condition (in the switch statement)? This will require a new boolean (e.g. becameIncomplete) set in the completed check:
var becameIncomplete bool // remember if we transitioned from completed:

if !isPaused && conditions.RolloutComplete(c.rollout, &newStatus) {
    ...
} else {
    if completeCond != nil {
        updateCompletedCond := conditions.NewRolloutCondition(v1alpha1.RolloutCompleted, corev1.ConditionFalse, conditions.RolloutCompletedReason, conditions.RolloutCompletedReason)
        becameIncomplete = conditions.SetRolloutCondition(&newStatus, *updateCompletedCond)
        changed := conditions.SetRolloutCondition(&newStatus, *updateCompletedCond)
        ...

if !isCompleteRollout && !isAborted {
    switch {
    ...
    case conditions.RolloutProgressing(c.rollout, &newStatus):
        ...
        // Give a more accurate reason for the Progressing condition
        if newStatus.StableRS == newStatus.CurrentPodHash {
            reason = conditions.ReplicaSetNotAvailableReason
            msg = conditions.NotAvailableMessage // re-use existing message
        } else {
            reason = conditions.ReplicaSetUpdatedReason
        }
        condition := conditions.NewRolloutCondition(v1alpha1.RolloutProgressing, corev1.ConditionTrue, reason, msg)
        if currentCond != nil || becameIncomplete {
            if currentCond.Status == corev1.ConditionTrue {
                condition.LastTransitionTime = currentCond.LastTransitionTime
            }
            conditions.RemoveRolloutCondition(&newStatus, v1alpha1.RolloutProgressing)
        }
        conditions.SetRolloutCondition(&newStatus, *condition)
Thanks for the suggestion. Separating the processing of the progressing and completed conditions sounds better. Let me update the PR and test the logic.
utils/conditions/conditions.go (outdated)

// RolloutBecomesIncomplete is added when a fully promoted rollout becomes incomplete, e.g.,
// due to pod restarts, evicted -> recreated. In this case, we'll need to reset the rollout's
// condition to `PROGRESSING` to avoid any timeouts.
RolloutBecomesIncomplete = "RolloutBecomesIncomplete"
// RolloutBecomesIncompleteMessage is added when a fully promoted rollout becomes incomplete, e.g.,
// due to pod restarts, evicted -> recreated. In this case, we'll need to reset the rollout's
// condition to `PROGRESSING` to avoid any timeouts.
RolloutBecomesIncompleteMessage = "Fully promoted rollout becomes incomplete"
Let's create a new reason:

ReplicaSetNotAvailableReason = "ReplicaSet is not available"

and instead of RolloutBecomesIncompleteMessage, we can use the existing NotAvailableMessage message.
rollout/sync.go (outdated)

reason := conditions.ReplicaSetUpdatedReason

// When a fully promoted rollout becomes Incomplete, e.g., due to the ReplicaSet status changes like
// pod restarts, evicted -> recreated, we'll need to reset the rollout's condition to `PROGRESSING` to
// avoid any timeouts.
if becameIncomplete {
@huikang I have a feeling that this check will cause the Progressing reason to be ReplicaSetUpdatedReason after two reconciliations, because on the next reconciliation becameIncomplete will be false. Which is why in my suggestion I had the check:

if newStatus.StableRS == newStatus.CurrentPodHash {

Do you agree?
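To illustrate why the hash comparison is the more durable signal (a standalone sketch with hypothetical constants, not the real conditions package): becameIncomplete is only true on the single reconciliation where the Completed condition flips, whereas StableRS == CurrentPodHash holds on every reconciliation while the rollout is fully promoted, so the Progressing reason does not flip back to ReplicaSetUpdatedReason.

```go
package main

import "fmt"

// Hypothetical reason constants mirroring the ones discussed above.
const (
	ReplicaSetUpdatedReason      = "ReplicaSetUpdated"
	ReplicaSetNotAvailableReason = "ReplicaSetNotAvailable"
)

// progressingReason picks the Progressing reason from the pod-template hashes,
// which is stable across reconciliations, unlike a one-shot becameIncomplete flag.
func progressingReason(stableRS, currentPodHash string) string {
	if stableRS == currentPodHash {
		// Fully promoted rollout that became unavailable (e.g., a pod restarted).
		return ReplicaSetNotAvailableReason
	}
	// In the middle of an update to a new ReplicaSet.
	return ReplicaSetUpdatedReason
}

func main() {
	fmt.Println(progressingReason("abc123", "abc123")) // fully promoted -> ReplicaSetNotAvailable
	fmt.Println(progressingReason("abc123", "def456")) // mid-update     -> ReplicaSetUpdated
}
```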
- if the progress condition is not reset, the timeout check returns true for a healthy rollout Signed-off-by: Hui Kang <[email protected]>
… in the middle of update Signed-off-by: Hui Kang <[email protected]>
- add log message for state change Co-authored-by: Rohit Agrawal <[email protected]> Signed-off-by: Hui Kang <[email protected]>
Signed-off-by: Hui Kang <[email protected]>
…essing Condition Signed-off-by: Hui Kang <[email protected]>
Signed-off-by: Hui Kang <[email protected]>
Signed-off-by: Hui Kang <[email protected]>
Force-pushed dbca4da -> 91e1a03
Looks like my suggestion caused a unit test to fail because of the change in message.
Force-pushed 2fb4be4 -> e4e713b
Signed-off-by: Hui Kang <[email protected]>
Force-pushed e4e713b -> 0c2ba6a
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No Coverage information.
Hi @jessesuen, added a new unit test. Please take a look. Thanks.
LGTM
fix: reset the progress condition when a pod is restarted (#1649) Signed-off-by: Hui Kang <[email protected]> Co-authored-by: Hui Kang <[email protected]> Co-authored-by: Rohit Agrawal <[email protected]>
If the progress condition is not reset, the timeout check returns true for a healthy rollout.
Signed-off-by: Hui Kang [email protected]
fix: #1624