fix: reset the progress condition when a pod is restarted #1649

Merged

Conversation

huikang
Member

@huikang huikang commented Nov 11, 2021

  • if the progress condition is not reset, the timeout check
    returns true for a healthy rollout

Signed-off-by: Hui Kang [email protected]

fix: #1624

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

@codecov

codecov bot commented Nov 11, 2021

Codecov Report

Merging #1649 (0c2ba6a) into master (55c70a1) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1649      +/-   ##
==========================================
+ Coverage   81.97%   81.99%   +0.01%     
==========================================
  Files         115      115              
  Lines       15913    15925      +12     
==========================================
+ Hits        13045    13057      +12     
  Misses       2198     2198              
  Partials      670      670              
Impacted Files                   Coverage Δ
utils/conditions/conditions.go   80.76% <ø> (ø)
rollout/sync.go                  78.29% <100.00%> (+0.36%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Member

@jessesuen jessesuen left a comment

@huikang Do you know if this problem happens in the general sense? If so, I think this won't solve the problem in the generic case (e.g. when pods restart out-of-band from a rollout restart action), because this code is only in the pod restarter Reconcile.

If we can reproduce this outside of the restart action, then we will need to fix this in a more generic way. I think one change that might be needed is: at the time we transition from Healthy to Progressing, we also reset the progress counter so that the deadline is reset after we transition out of Healthy.

Member

@agrawroh agrawroh left a comment

Pods can transition to a non-running state for a variety of reasons, for example when a k8s liveness probe starts failing.
I think we should think a little bit more about the behavior. Do we expect an already completed rollout to transition to a "Progressing" state if one of the stable replicas goes from "Running" to "Pending"? In my mind, I wouldn't care if anything happens to the stable replicas once the rollout is completed.
cc @huikang @jessesuen What do you think?

@huikang
Member Author

huikang commented Nov 11, 2021

@jessesuen, you are right. This only fixes the restarted-pod case by resetting the progress counter, not the general case.

@agrawroh, if the pods of a completed rollout become unready (e.g., restart, failed liveness probe), we should move the rollout from healthy to progressing, because the rollout controller is reconciling the state.

So a generic solution, as @jessesuen mentioned, is to reset the progress counter so that conditions.RolloutTimedOut doesn't return true based on an old LastUpdateTime. Will update the PR.
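
For illustration only, here is a minimal sketch of the timeout check being discussed; the types and helper below are hypothetical stand-ins for conditions.RolloutTimedOut and only show why a stale LastUpdateTime makes a healthy rollout look timed out.

```go
package main

import (
	"fmt"
	"time"
)

// progressingCondition stands in for the rollout's Progressing condition; only the
// timestamp matters for this sketch.
type progressingCondition struct {
	LastUpdateTime time.Time
}

// timedOut mirrors the idea behind conditions.RolloutTimedOut: if the Progressing
// condition was last updated longer ago than progressDeadlineSeconds, report a timeout.
func timedOut(cond progressingCondition, progressDeadlineSeconds int32) bool {
	deadline := cond.LastUpdateTime.Add(time.Duration(progressDeadlineSeconds) * time.Second)
	return time.Now().After(deadline)
}

func main() {
	// The rollout finished long ago, so LastUpdateTime is stale. If a pod restarts now
	// and the condition is not reset, the check returns true even though the rollout
	// is otherwise healthy, which is what marks it Degraded.
	stale := progressingCondition{LastUpdateTime: time.Now().Add(-2 * time.Hour)}
	fmt.Println(timedOut(stale, 600)) // true
}
```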

@huikang huikang marked this pull request as draft November 11, 2021 18:48
@agrawroh
Member

> @jessesuen, you are right. This only fixes the restarted-pod case by resetting the progress counter, not the general case.
>
> @agrawroh, if the pods of a completed rollout become unready (e.g., restart, failed liveness probe), we should move the rollout from healthy to progressing, because the rollout controller is reconciling the state.
>
> So a generic solution, as @jessesuen mentioned, is to reset the progress counter so that conditions.RolloutTimedOut doesn't return true based on an old LastUpdateTime. Will update the PR.

@huikang Thanks for the explanation. This might be a naive question: do we expect the rollout to create another AnalysisRun if it goes from the "Completed" to the "Progressing" state? Also, if somehow the progressDeadlineAbort gets fired while the rollout is still in the "Progressing" state, will we mark it as degraded?

@jessesuen
Member

> Do we expect the rollout to create another AnalysisRun if it goes from the "Completed" to the "Progressing" state?

No. We only trigger analysis when we are in the middle of an update. Transitioning to progressing does not qualify in that regard. But if the user "retries" the update, analysis will start again. We know we are in the middle of an update when the stable hash != desired hash.

> Also, if somehow the progressDeadlineAbort gets fired while the rollout is still in the "Progressing" state, will we mark it as degraded?

The way progressDeadlineAbort is meant to work is: if we are in the middle of an update and fail to make progress, we abort and go back to stable. This does not apply when we are in a fully rolled out state (stable is already desired). That said, if we exceed progressDeadlineSeconds when we are fully promoted (stable == desired), the rollout should be in a degraded status (just like a Deployment).
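
A minimal sketch of the "middle of an update" rule described above, with a hypothetical helper rather than the controller's actual code: analysis and progressDeadlineAbort only apply while the stable pod-template hash differs from the desired one.

```go
package main

import "fmt"

// midUpdate is a hypothetical helper expressing the rule above: the rollout is in the
// middle of an update only while the stable hash differs from the desired hash.
func midUpdate(stableHash, desiredHash string) bool {
	return stableHash != "" && desiredHash != "" && stableHash != desiredHash
}

func main() {
	fmt.Println(midUpdate("68f45df865", "68f45df865")) // false: fully promoted, no new analysis or abort
	fmt.Println(midUpdate("68f45df865", "7bf9c85d9c")) // true: mid-update, deadline abort applies
}
```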

@huikang huikang force-pushed the 1624-fix-degraded-rollouts-podRestarted branch from 3033e19 to 5241e56 on November 12, 2021 06:31
@huikang
Member Author

huikang commented Nov 12, 2021

t1:  Complete and healthy Rollout (stable RS == desired RS)
t2:  Something happens to the stable RS, e.g., a pod is evicted.
     The rollout becomes progressing.

     ... after some time ...

t3:  The evicted pod becomes ready, so the stable RS becomes available again.

The logic is as follows:

if a complete rollout becomes incomplete and is not in the middle of an update {
    reset the progress LastUpdateTime to t2
}

if (t3 - t2) > progressDeadlineSeconds (i.e., the stable RS of the fully promoted rollout only becomes ready after progressDeadlineSeconds) {
    the rollout is degraded
} else {
    the rollout is healthy
}
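
As a sketch of the logic above (with hypothetical types and field names, not the PR's actual implementation): reset the progress timestamp at t2, then compare t3 - t2 against progressDeadlineSeconds.

```go
package main

import (
	"fmt"
	"time"
)

// rolloutState is a hypothetical stand-in holding just what this sketch needs.
type rolloutState struct {
	progressLastUpdate      time.Time
	progressDeadlineSeconds int32
}

// onBecameIncomplete is called at t2, when the stable RS of a complete rollout loses availability:
// the progress timestamp is reset so the deadline is measured from t2, not from the original update.
func (r *rolloutState) onBecameIncomplete(now time.Time) {
	r.progressLastUpdate = now
}

// degradedAt reports whether the rollout should be Degraded at time t (e.g. t3, when the pod is ready again).
func (r *rolloutState) degradedAt(t time.Time) bool {
	return t.Sub(r.progressLastUpdate) > time.Duration(r.progressDeadlineSeconds)*time.Second
}

func main() {
	r := rolloutState{progressDeadlineSeconds: 600}
	t2 := time.Now()
	r.onBecameIncomplete(t2)

	fmt.Println(r.degradedAt(t2.Add(2 * time.Minute)))  // false: stable RS recovered within the deadline -> healthy
	fmt.Println(r.degradedAt(t2.Add(20 * time.Minute))) // true: still unavailable past the deadline -> degraded
}
```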

@huikang huikang marked this pull request as ready for review November 12, 2021 06:36
@huikang huikang force-pushed the 1624-fix-degraded-rollouts-podRestarted branch from 5241e56 to f4cb704 on November 12, 2021 14:46
Member

@agrawroh agrawroh left a comment

Looks good overall. Left some nits.

rollout/sync.go Outdated
// if any rs status changes (e.g., pod restarted, evicted -> recreated ) to a previous completed rollout,
// we need to reset the progressCondition to avoid timeout
if changed && c.stableRS != nil && c.newRS != nil && (replicasetutil.GetPodTemplateHash(c.stableRS) == replicasetutil.GetPodTemplateHash(c.newRS)) {
existProgressingCondition := conditions.GetRolloutCondition(newStatus, v1alpha1.RolloutProgressing)
Member

Minor: existProgressingCondition --> isRolloutConditionInProgressing

Member Author

isRolloutConditionInProgressing sounds like a bool variable. existProgressingCondition is consistent with other places where GetRolloutCondition is called. What do you think?

rollout/sync.go Outdated (resolved)
rollout/sync.go Outdated (resolved)
@@ -122,6 +124,18 @@ func (p *RolloutPodRestarter) Reconcile(roCtx *rolloutContext) error {
}
canRestart -= 1
restarted += 1

// TODO: remove
Member

Do you still want to keep this?

Member Author

Removed. Thanks.

@huikang huikang force-pushed the 1624-fix-degraded-rollouts-podRestarted branch 5 times, most recently from 8c8ef2c to 1bc4f43 on November 12, 2021 20:09
rollout/sync.go Outdated

// If the ReplicaSet status changes (e.g., one of the pod restarts, evicted -> recreated) for a previously
// completed rollout, we'll need to reset the rollout's condition to `PROGRESSING` to avoid any timeouts.
if changed && c.stableRS != nil && c.newRS != nil && (replicasetutil.GetPodTemplateHash(c.stableRS) == replicasetutil.GetPodTemplateHash(c.newRS)) {
Member

It seems we only do this when we are fully promoted. But why isn't this check just:

if changed {

In all cases, if we move from Completed to not Completed, don't we want to reset the progressing condition timestamp?

Member Author

True. When a rollout goes from complete to incomplete, we always want to reset the progressing condition.

I made the change and tested it successfully with the plugin's rollout restart command and with out-of-band pod recreation.

Member Author

@huikang huikang Nov 13, 2021

@jessesuen,
I remember the reason for using `if changed && c.stableRS != nil && c.newRS != nil && (replicasetutil.GetPodTemplateHash(c.stableRS) == replicasetutil.GetPodTemplateHash(c.newRS))` is to reset the progress condition only when (the rollout becomes incomplete from complete) AND (it is not in the middle of an update).

During an update, we don't need to reset the progressing condition here, because it has already been reset.
So using only the condition `if changed` causes duplicate resetting. WDYT?

Member Author

@huikang huikang Nov 13, 2021

Reverted to the condition of (rollout becomes incomplete from complete) AND (not in the middle of an update), because the e2e test TestFunctionalSuite/TestWorkloadRef failed when using only `if changed {`:

ERRO[2021-11-13T00:20:14-05:00] Recovered from panic: runtime error: invalid memory address or nil pointer dereference
goroutine 185 [running]:
runtime/debug.Stack(0xc0006be678, 0x2f60a40, 0x472d5e0)
	/usr/local/Cellar/go/1.16.3/libexec/src/runtime/debug/stack.go:24 +0x9f
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1.1(0xc000ba6070, 0xc0006bfaf0)
	/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/utils/controller/controller.go:149 +0x5b
panic(0x2f60a40, 0x472d5e0)
	/usr/local/Cellar/go/1.16.3/libexec/src/runtime/panic.go:965 +0x1b9
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).calculateRolloutConditions(0xc000bd3c00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0008e58e4, 0xa, 0x0, ...)
	/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/rollout/sync.go:597 +0x13c0
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).syncRolloutStatusBlueGreen(0xc000bd3c00, 0x0, 0xc000b30280, 0x0, 0x0)
	/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/rollout/bluegreen.go:312 +0x538
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).rolloutBlueGreen(0xc000bd3c00, 0x32fa72f, 0x17)
	/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/rollout/bluegreen.go:35 +0x2f5
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcile(0xc000bd3c00, 0xc000026c00, 0xc000bd3c00)
	/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/rollout/context.go:82 +0x1e5
github.com/argoproj/argo-rollouts/rollout.(*Controller).syncHandler(0xc000c12000, 0xc00094a040, 0x1e, 0x0, 0x0)
	/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/rollout/controller.go:396 +0x630
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1(0x0, 0x0)
	/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/utils/controller/controller.go:153 +0x7c
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1(0x36dec60, 0xc000b6ed80, 0x32df9ad, 0x7, 0xc00149de60, 0xc000490e00, 0x2e4c200, 0xc001192850, 0x0, 0x0)
	/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/utils/controller/controller.go:157 +0x326
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem(0x36dec60, 0xc000b6ed80, 0x32df9ad, 0x7, 0xc00149de60, 0xc000490e00, 0x1)
	/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/utils/controller/controller.go:171 +0x9a
github.com/argoproj/argo-rollouts/utils/controller.RunWorker(...)
	/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/utils/controller/controller.go:104
github.com/argoproj/argo-rollouts/rollout.(*Controller).Run.func1()
	/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/rollout/controller.go:326 +0xa5
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000e4a050)
	/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000e4a050, 0x366f0e0, 0xc0005d6c00, 0x1, 0xc0005a0300)
	/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000e4a050, 0x3b9aca00, 0x0, 0x1, 0xc0005a0300)
	/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc000e4a050, 0x3b9aca00, 0xc0005a0300)
	/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by github.com/argoproj/argo-rollouts/rollout.(*Controller).Run
	/Users/hui.kang/go/src/github.com/huikang/argo-rollouts-backup2/rollout/controller.go:325 +0xac  namespace=default rollou

Member

Hmmm. I was under the impression that we would not enter a completed condition when the rollout was in the middle of an update, which is why I suggested the change. Are you finding that is not the case?

Member Author

> Hmmm. I was under the impression that we would not enter a completed condition when the rollout was in the middle of an update, which is why I suggested the change. Are you finding that is not the case?

There might be some bug with workloadRef that causes entering a completed condition in the middle of an update. Let me dig deeper into this case.

Member

OK. Let me know what you find. I think we might be masking some underlying issue by having the second check. If it turns out it is really necessary, I think a better option is to use the convenience method:

conditions.RolloutComplete(c.rollout, &newStatus)

Member Author

@jessesuen, it turns out the panic was due to a nil pointer on `c.stableRS` for workloadRef. `if changed` works fine, and I also added a new progressing reason and its associated message.
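
Purely as an illustration of the nil-pointer issue (the helper, the local type, and the label key below are assumptions, not the controller's code; the PR's actual fix relies on the `changed` flag instead of this comparison):

```go
package main

import "fmt"

// replicaSet stands in for appsv1.ReplicaSet; only the labels matter for this sketch.
type replicaSet struct {
	Labels map[string]string
}

// samePodTemplateHash guards against nil ReplicaSets before comparing pod-template hashes,
// which is the kind of check that was missing for the workloadRef case where the stable RS is nil.
func samePodTemplateHash(stable, desired *replicaSet) bool {
	if stable == nil || desired == nil {
		return false // e.g. workloadRef before a stable RS exists: never dereference a nil RS
	}
	return stable.Labels["rollouts-pod-template-hash"] == desired.Labels["rollouts-pod-template-hash"]
}

func main() {
	rs := &replicaSet{Labels: map[string]string{"rollouts-pod-template-hash": "68f45df865"}}
	fmt.Println(samePodTemplateHash(nil, rs)) // false, no panic
	fmt.Println(samePodTemplateHash(rs, rs))  // true
}
```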

@huikang huikang force-pushed the 1624-fix-degraded-rollouts-podRestarted branch 3 times, most recently from 43926aa to 0992906 on November 16, 2021 03:23
rollout/sync.go Outdated
Comment on lines 598 to 599
newProgressingCondition := conditions.NewRolloutCondition(v1alpha1.RolloutProgressing, corev1.ConditionTrue, conditions.RolloutBecomesIncomplete, conditions.RolloutBecomesIncompleteMessage)
conditions.SetRolloutCondition(&newStatus, *newProgressingCondition)
Member

@huikang I think this message and reason will get clobbered by the switch statement that comes later.

Instead of updating the Progressing condition here (where we detect that we became incomplete), can we update the Progressing condition at the current place in the code where we are managing the Progressing condition (in the switch statement)? This will require some new boolean (e.g. becameIncomplete) set in the completed check:

	var becameIncomplete bool // remember if we transitioned from completed:
	if !isPaused && conditions.RolloutComplete(c.rollout, &newStatus) {
...
	} else {
		if completeCond != nil {
			updateCompletedCond := conditions.NewRolloutCondition(v1alpha1.RolloutCompleted, corev1.ConditionFalse, conditions.RolloutCompletedReason, conditions.RolloutCompletedReason)
			becameIncomplete = conditions.SetRolloutCondition(&newStatus, *updateCompletedCond)
			// (was: changed := conditions.SetRolloutCondition(&newStatus, *updateCompletedCond))
...

	if !isCompleteRollout && !isAborted {
		switch {
...
		case conditions.RolloutProgressing(c.rollout, &newStatus):
...
			// Give a more accurate reason for the Progressing condition
			if newStatus.StableRS == newStatus.CurrentPodHash {
				reason = conditions.ReplicaSetNotAvailableReason
				msg = conditions.NotAvailableMessage // re-use existing message
			} else {
				reason = conditions.ReplicaSetUpdatedReason
			}
			condition := conditions.NewRolloutCondition(v1alpha1.RolloutProgressing, corev1.ConditionTrue, reason, msg)

			if currentCond != nil  || becameIncomplete {
				if currentCond.Status == corev1.ConditionTrue {
					condition.LastTransitionTime = currentCond.LastTransitionTime
				}
				conditions.RemoveRolloutCondition(&newStatus, v1alpha1.RolloutProgressing)
			}
			conditions.SetRolloutCondition(&newStatus, *condition)

Member Author

Thanks for the suggestion. Separating the handling of the progressing and completed conditions sounds better. Let me update the PR to test the logic.

Comment on lines 89 to 94
// RolloutBecomesIncomplete is added when a fully promoted rollout becomes incomplete, e.g.,
// due to pod restarts, evicted -> recreated. In this case, we'll need to reset the rollout's
// condition to `PROGRESSING` to avoid any timeouts.
RolloutBecomesIncomplete = "RolloutBecomesIncomplete"
// RolloutBecomesIncompleteMessage is added when a fully promoted rollout becomes incomplete, e.g.,
// due to pod restarts, evicted -> recreated. In this case, we'll need to reset the rollout's
// condition to `PROGRESSING` to avoid any timeouts.
RolloutBecomesIncompleteMessage = "Fully promoted rollout becomes incomplete"

Member

Let's create a new reason:

ReplicaSetNotAvailableReason = "ReplicaSet is not available"

and instead of RolloutBecomesIncompleteMessage, we can use the existing NotAvailableMessage message

rollout/sync.go Outdated
Comment on lines 630 to 635
reason := conditions.ReplicaSetUpdatedReason

// When a fully promoted rollout becomes Incomplete, e.g., due to the ReplicaSet status changes like
// pod restarts, evicted -> recreated, we'll need to reset the rollout's condition to `PROGRESSING` to
// avoid any timeouts.
if becameIncomplete {
Member

@jessesuen jessesuen Nov 18, 2021

@huikang I have a feeling that this check will cause the Progressing reason to be ReplicaSetUpdatedReason after two reconciliations, because on the next reconciliation becameIncomplete will be false. Which is why in my suggestion I had the check:

if newStatus.StableRS == newStatus.CurrentPodHash {

Do you agree?

Hui Kang added 2 commits November 18, 2021 18:34
- if the progress condition is not reset, the timeout check
  returns true for a healthy rollout

Signed-off-by: Hui Kang <[email protected]>
huikang and others added 5 commits November 18, 2021 18:34
- add log message for state change

Co-authored-by: Rohit Agrawal <[email protected]>
Signed-off-by: Hui Kang <[email protected]>
Signed-off-by: Hui Kang <[email protected]>
@huikang huikang force-pushed the 1624-fix-degraded-rollouts-podRestarted branch from dbca4da to 91e1a03 on November 18, 2021 23:34
@jessesuen
Member

jessesuen commented Nov 18, 2021

Looks like my suggestion caused a unit test to fail because of the change in message:

-{"status":{"blueGreen":{"previewSelector":"68f45df865"},"conditions":[{"lastTransitionTime":"2021-11-18T23:43:38Z","lastUpdateTime":"2021-11-18T23:43:38Z","message":"ReplicaSet \"foo-68f45df865\" is progressing.","reason":"ReplicaSetUpdated","status":"True","type":"Progressing"},{"lastTransitionTime":"2021-11-18T23:43:38Z","lastUpdateTime":"2021-11-18T23:43:38Z","message":"Rollout does not have minimum availability","reason":"AvailableReason","status":"False","type":"Available"}],"message":"more replicas need to be updated","observedGeneration":"123","phase":"Progressing","selector":"foo=bar","stableRS":"68f45df865"}}
+{"status":{"blueGreen":{"previewSelector":"68f45df865"},"conditions":[{"lastTransitionTime":"2021-11-18T23:43:38Z","lastUpdateTime":"2021-11-18T23:43:38Z","message":"Rollout does not have minimum availability","reason":"ReplicaSetNotAvailable","status":"True","type":"Progressing"},{"lastTransitionTime":"2021-11-18T23:43:38Z","lastUpdateTime":"2021-11-18T23:43:38Z","message":"Rollout does not have minimum availability","reason":"AvailableReason","status":"False","type":"Available"}],"message":"more replicas need to be updated","observedGeneration":"123","phase":"Progressing","selector":"foo=bar","stableRS":"68f45df865"}}

@huikang huikang force-pushed the 1624-fix-degraded-rollouts-podRestarted branch 2 times, most recently from 2fb4be4 to e4e713b on November 19, 2021 05:29
@huikang huikang force-pushed the 1624-fix-degraded-rollouts-podRestarted branch from e4e713b to 0c2ba6a on November 19, 2021 15:00
@sonarcloud

sonarcloud bot commented Nov 19, 2021

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (A)
Vulnerabilities: 0 (A)
Security Hotspots: 0 (A)
Code Smells: 0 (A)

No coverage information
17.8% duplication

@huikang
Member Author

huikang commented Nov 19, 2021

Hi @jessesuen, I added a new unit test. Please take a look. Thanks.

Contributor

@alexmt alexmt left a comment

LGTM

@alexmt alexmt merged commit 615bce3 into argoproj:master Nov 19, 2021
alexmt pushed a commit that referenced this pull request Nov 29, 2021
fix: reset the progress condition when a pod is restarted (#1649)

Signed-off-by: Hui Kang <[email protected]>

Co-authored-by: Hui Kang <[email protected]>
Co-authored-by: Rohit Agrawal <[email protected]>
Successfully merging this pull request may close these issues.

progressDeadlineAbort causes degraded Rollout outside of a deployment