Panic while trying to reconcile Rollout, most stable pods restarted #1696

Closed
bpoland opened this issue Dec 9, 2021 · 2 comments · Fixed by #1699
Labels: bug (Something isn't working)

bpoland commented Dec 9, 2021

Summary

I'm not 100% sure how it got into this state, but I had a Rollout that was making the controller panic. I tried to follow the stack trace, which seems to point at this line saying that the latest ReplicaSet didn't have a pod hash label, but I checked and it did have a correct rollouts-pod-template-hash label. So I'm not sure what happened there.

I was able to get out of this situation, but it caused most of the stable pods to restart and left me with very few running pods, which would have caused an outage if this hadn't been a test environment.

Details

Here's what the rollout looked like while the controller was panicking:

Name:            name
Namespace:       default
Status:          ◌ Progressing
Message:         more replicas need to be updated
Strategy:        Canary
  Step:          0/16
  SetWeight:     1
  ActualWeight:  1
Images:          image:one (stable)
                 image:two
Replicas:
  Desired:       4
  Current:       5
  Updated:       0
  Ready:         5
  Available:     5

NAME                              KIND         STATUS         AGE    INFO
⟳ name                            Rollout      ◌ Progressing  27d
├──# revision:265
│  ├──⧉ name-b4889b8cc            ReplicaSet   • ScaledDown   175m   delay:passed
│  └──α name-b4889b8cc-265        AnalysisRun  ✖ Failed       175m   ✔ 40,✖ 3
├──# revision:264
│  ├──⧉ name-548cd6f455           ReplicaSet   • ScaledDown   3h36m
│  └──α name-548cd6f455-264       AnalysisRun  ✖ Failed       3h36m  ✔ 73,✖ 3
├──# revision:263
│  ├──⧉ name-5d768bfc49           ReplicaSet   ✔ Healthy      5h41m
│  │  └──□ name-5d768bfc49-x6w5t  Pod          ✔ Running      4h18m  ready:2/2
│  ├──α name-5d768bfc49-263       AnalysisRun  ✖ Failed       5h41m  ✔ 6,✖ 3
│  ├──α name-5d768bfc49-263.1     AnalysisRun  ✖ Failed       4h36m  ✔ 17,✖ 3
│  └──α name-5d768bfc49-263.2     AnalysisRun  ✔ Successful   4h18m  ✔ 81,✖ 1
└──# revision:261
   ├──⧉ name-575976c6dd           ReplicaSet   ✔ Healthy      7h13m  stable
   │  ├──□ name-575976c6dd-p5q7f  Pod          ✔ Running      7h13m  ready:2/2
   │  ├──□ name-575976c6dd-9dclk  Pod          ✔ Running      7h7m   ready:2/2
   │  ├──□ name-575976c6dd-g8s2f  Pod          ✔ Running      7h1m   ready:2/2
   │  └──□ name-575976c6dd-d9phs  Pod          ✔ Running      6h59m  ready:2/2
   └──α name-575976c6dd-261       AnalysisRun  ✔ Successful   7h13m  ✔ 29

I was able to get the controller to stop panicking by updating the source deployment back to image:one (I'm using workload referencing). Right after I did that, the rollout was degraded, but it picked up the fact that I had stable-version pods running:

Name:            name
Namespace:       default
Status:          ✖ Degraded
Message:         ProgressDeadlineExceeded: Rollout "name" has timed out progressing.
Strategy:        Canary
  Step:          0/16
  SetWeight:     1
  ActualWeight:  1
Images:          image:one (stable)
                 image:two
Replicas:
  Desired:       4
  Current:       5
  Updated:       0
  Ready:         5
  Available:     5

NAME                              KIND         STATUS        AGE    INFO
⟳ name                            Rollout      ✖ Degraded    27d
├──# revision:266
│  └──⧉ name-575976c6dd           ReplicaSet   ✔ Healthy     7h35m  stable
│     ├──□ name-575976c6dd-p5q7f  Pod          ✔ Running     7h35m  ready:2/2
│     ├──□ name-575976c6dd-9dclk  Pod          ✔ Running     7h30m  ready:2/2
│     ├──□ name-575976c6dd-g8s2f  Pod          ✔ Running     7h24m  ready:2/2
│     └──□ name-575976c6dd-d9phs  Pod          ✔ Running     7h22m  ready:2/2
├──# revision:265
│  ├──⧉ name-b4889b8cc            ReplicaSet   • ScaledDown  3h18m  delay:passed
│  └──α name-b4889b8cc-265        AnalysisRun  ✖ Failed      3h18m  ✔ 40,✖ 3
├──# revision:264
│  ├──⧉ name-548cd6f455           ReplicaSet   • ScaledDown  3h58m
│  └──α name-548cd6f455-264       AnalysisRun  ✖ Failed      3h58m  ✔ 73,✖ 3
├──# revision:263
│  ├──⧉ name-5d768bfc49           ReplicaSet   ✔ Healthy     6h4m
│  │  └──□ name-5d768bfc49-x6w5t  Pod          ✔ Running     4h41m  ready:2/2
│  ├──α name-5d768bfc49-263       AnalysisRun  ✖ Failed      6h4m   ✔ 6,✖ 3
│  ├──α name-5d768bfc49-263.1     AnalysisRun  ✖ Failed      4h59m  ✔ 17,✖ 3
│  └──α name-5d768bfc49-263.2     AnalysisRun  ✔ Successful  4h41m  ✔ 81,✖ 1
└──# revision:261
   └──α name-575976c6dd-261       AnalysisRun  ✔ Successful  7h35m  ✔ 29

Soon afterwards, the status and message changed to:

Status:          ◌ Progressing
Message:         waiting for rollout spec update to be observed

But then, immediately afterwards, all but one of the stable-version pods were restarted (this seems somewhat similar to problem 2 from #1681?). Note that an AnalysisRun was created but immediately marked as successful by the controller; I'm not sure if that's related.

Name:            name
Namespace:       default
Status:          ◌ Progressing
Message:         updated replicas are still becoming available
Strategy:        Canary
  Step:          16/16
  SetWeight:     100
  ActualWeight:  100
Images:          image:one (stable)
                 image:two
Replicas:
  Desired:       4
  Current:       5
  Updated:       4
  Ready:         2
  Available:     2

NAME                              KIND         STATUS         AGE    INFO
⟳ name                            Rollout      ◌ Progressing  27d
├──# revision:266
│  ├──⧉ name-575976c6dd           ReplicaSet   ◌ Progressing  7h36m  stable
│  │  ├──□ name-575976c6dd-p5q7f  Pod          ✔ Running      7h36m  ready:2/2
│  │  ├──□ name-575976c6dd-9dclk  Pod          ◌ Terminating  7h30m  ready:2/2
│  │  ├──□ name-575976c6dd-g8s2f  Pod          ◌ Terminating  7h24m  ready:2/2
│  │  ├──□ name-575976c6dd-d9phs  Pod          ◌ Terminating  7h22m  ready:2/2
│  │  ├──□ name-575976c6dd-pmqpj  Pod          ◌ Pending      5s     ready:0/2
│  │  ├──□ name-575976c6dd-tkcj5  Pod          ◌ Pending      5s     ready:0/2
│  │  └──□ name-575976c6dd-w58j5  Pod          ◌ Pending      5s     ready:0/2
│  └──α name-575976c6dd-266       AnalysisRun  ✔ Successful   5s
├──# revision:265
│  ├──⧉ name-b4889b8cc            ReplicaSet   • ScaledDown   3h18m  delay:passed
│  └──α name-b4889b8cc-265        AnalysisRun  ✖ Failed       3h18m  ✔ 40,✖ 3
├──# revision:264
│  ├──⧉ name-548cd6f455           ReplicaSet   • ScaledDown   3h58m
│  └──α name-548cd6f455-264       AnalysisRun  ✖ Failed       3h58m  ✔ 73,✖ 3
├──# revision:263
│  ├──⧉ name-5d768bfc49           ReplicaSet   ✔ Healthy      6h4m
│  │  └──□ name-5d768bfc49-x6w5t  Pod          ✔ Running      4h41m  ready:2/2
│  ├──α name-5d768bfc49-263       AnalysisRun  ✖ Failed       6h4m   ✔ 6,✖ 3
│  ├──α name-5d768bfc49-263.1     AnalysisRun  ✖ Failed       4h59m  ✔ 17,✖ 3
│  └──α name-5d768bfc49-263.2     AnalysisRun  ✔ Successful   4h41m  ✔ 81,✖ 1
└──# revision:261
   └──α name-575976c6dd-261       AnalysisRun  ✔ Successful   7h36m  ✔ 29

Then, once all the replacement pods had started, the "extra" pod in revision 263 was terminated and the rollout became healthy again:

Name:            name
Namespace:       default
Status:          ✔ Healthy
Strategy:        Canary
  Step:          16/16
  SetWeight:     100
  ActualWeight:  100
Images:          image:one (stable)
Replicas:
  Desired:       4
  Current:       5
  Updated:       4
  Ready:         5
  Available:     4

NAME                              KIND         STATUS         AGE    INFO
⟳ name                            Rollout      ✔ Healthy      27d
├──# revision:266
│  ├──⧉ name-575976c6dd           ReplicaSet   ◌ Progressing  7h39m  stable
│  │  ├──□ name-575976c6dd-p5q7f  Pod          ✔ Running      7h39m  ready:2/2
│  │  ├──□ name-575976c6dd-pmqpj  Pod          ✔ Running      3m44s  ready:2/2
│  │  ├──□ name-575976c6dd-tkcj5  Pod          ✔ Running      3m44s  ready:2/2
│  │  └──□ name-575976c6dd-w58j5  Pod          ✔ Running      3m44s  ready:2/2
│  └──α name-575976c6dd-266       AnalysisRun  ✔ Successful   3m44s
├──# revision:265
│  ├──⧉ name-b4889b8cc            ReplicaSet   • ScaledDown   3h21m  delay:passed
│  └──α name-b4889b8cc-265        AnalysisRun  ✖ Failed       3h21m  ✔ 40,✖ 3
├──# revision:264
│  ├──⧉ name-548cd6f455           ReplicaSet   • ScaledDown   4h2m
│  └──α name-548cd6f455-264       AnalysisRun  ✖ Failed       4h2m   ✔ 73,✖ 3
├──# revision:263
│  └──α name-5d768bfc49-263       AnalysisRun  ✖ Failed       6h7m   ✔ 6,✖ 3
└──# revision:261
   └──α name-575976c6dd-261       AnalysisRun  ✔ Successful   7h39m  ✔ 29

Diagnostics

Controller v1.1.1

https://gist.github.com/bpoland/7e3ef6ab03d42668ae8f0d1e867d4fa4


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

jessesuen (Member) commented:

Thanks for the detailed information in the bug report; it is helpful.

Here's a more readable stack trace:

goroutine 285 [running]:
runtime/debug.Stack(0xc001782d88, 0x1f8de60, 0x33ed3c0)
	/usr/local/go/src/runtime/debug/stack.go:24 +0x9f
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1.1(0xc0017cd490, 0xc001783af0)
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:149 +0x5b
panic(0x1f8de60, 0x33ed3c0)
	/usr/local/go/src/runtime/panic.go:965 +0x1b9
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).createAnalysisRun(0xc0020e8c00, 0xc000ffecc0, 0x0, 0x0, 0xc002808900, 0x1d37a6f, 0x1f6b0a0, 0xc00063aea0)
	/go/src/github.com/argoproj/argo-rollouts/rollout/analysis.go:353 +0x28a
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcileBackgroundAnalysisRun(0xc0020e8c00, 0x0, 0x0, 0x0)
	/go/src/github.com/argoproj/argo-rollouts/rollout/analysis.go:334 +0x2bb
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcileAnalysisRuns(0xc0020e8c00, 0xc0020e8c00, 0x0)
	/go/src/github.com/argoproj/argo-rollouts/rollout/analysis.go:92 +0xd77
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).syncReplicasOnly(0xc0020e8c00, 0x22a0901, 0x17, 0x0)
	/go/src/github.com/argoproj/argo-rollouts/rollout/sync.go:311 +0x167
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcile(0xc0020e8c00, 0xc0000aa600, 0xc0020e8c00)
	/go/src/github.com/argoproj/argo-rollouts/rollout/context.go:78 +0x193
github.com/argoproj/argo-rollouts/rollout.(*Controller).syncHandler(0xc0000ec540, 0xc000c18000, 0x13, 0x0, 0x0)
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:391 +0x630
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1(0x0, 0x0)
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:153 +0x7c
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1(0x262c050, 0xc00017e3a0, 0x22876fd, 0x7, 0xc0006e1e60, 0xc00053ae80, 0x1edde80, 0xc000a5a2a0, 0x0, 0x0)
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:157 +0x326
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem(0x262c050, 0xc00017e3a0, 0x22876fd, 0x7, 0xc0006e1e60, 0xc00053ae80, 0x0)
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:171 +0x9a
github.com/argoproj/argo-rollouts/utils/controller.RunWorker(...)
	/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:104
github.com/argoproj/argo-rollouts/rollout.(*Controller).Run.func1()
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:321 +0xa5
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00196b410)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00196b410, 0x25c5a40, 0xc00195d7a0, 0x1, 0xc0000a2f60)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00196b410, 0x3b9aca00, 0x0, 0x1, 0xc0000a2f60)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc00196b410, 0x3b9aca00, 0xc0000a2f60)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90 +0x4d
created by github.com/argoproj/argo-rollouts/rollout.(*Controller).Run
	/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:320 +0xac

> I tried to follow the stack trace, which seems to point at this line saying that the latest ReplicaSet didn't have a pod hash label, but I checked and it did have a correct rollouts-pod-template-hash label. So I'm not sure what happened there.

It's not that the label was missing. The only way it could have panicked is if c.newRS is nil. The fact that we got to this line of code with c.newRS being nil is unexpected. And since you mentioned you examined the ReplicaSet, it must have existed, so something is amiss.

func (c *rolloutContext) createAnalysisRun(rolloutAnalysis *v1alpha1.RolloutAnalysis, infix string, labels map[string]string) (*v1alpha1.AnalysisRun, error) {
	args := analysisutil.BuildArgumentsForRolloutAnalysisRun(rolloutAnalysis.Args, c.stableRS, c.newRS, c.rollout)
	podHash := replicasetutil.GetPodTemplateHash(c.newRS)
	if podHash == "" {
		return nil, fmt.Errorf("Latest ReplicaSet '%s' has no pod hash in the labels", c.newRS.Name) // nil pointer dereference! c.newRS must be nil.
	}
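
To make the failure mode concrete, here is a small standalone Go program (not argo-rollouts code; the types and helper names below are invented for illustration) showing how a hash helper that tolerates a nil ReplicaSet lets execution reach an error-formatting line that then dereferences the nil pointer:

package main

import "fmt"

// Illustrative stand-ins only; these are not the real argo-rollouts types.
type ReplicaSet struct {
	Name   string
	Labels map[string]string
}

// getPodTemplateHash tolerates a nil ReplicaSet, mirroring how execution can
// reach the error-formatting line above even when no new ReplicaSet exists.
func getPodTemplateHash(rs *ReplicaSet) string {
	if rs == nil {
		return ""
	}
	return rs.Labels["rollouts-pod-template-hash"]
}

func buildError(rs *ReplicaSet) error {
	if hash := getPodTemplateHash(rs); hash == "" {
		if rs == nil {
			// Guarding here avoids dereferencing rs.Name on a nil pointer.
			return fmt.Errorf("no new ReplicaSet exists yet")
		}
		return fmt.Errorf("latest ReplicaSet %q has no pod hash in the labels", rs.Name)
	}
	return nil
}

func main() {
	// Without the nil guard inside buildError, this call would panic.
	fmt.Println(buildError(nil))
}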

There is one interesting point in the logs, which is isScaling: true:

time="2021-12-07T22:27:51Z" level=info msg="Syncing replicas only (userPaused false, isScaling: true)" namespace=default rollout=name

Scaling events are not as common, and they are detected when the rollout spec.replicas changes from its previous value. There is a separate code path to handle scaling events, which could be exposing a corner case.
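
For illustration only, a "scaling event" in this sense boils down to: the pod template is unchanged but spec.replicas differs from what was last observed. The sketch below uses invented types and a hypothetical isScalingEvent helper; the controller's real detection logic is more involved:

package main

import "fmt"

// Hypothetical, simplified stand-ins for the rollout spec and observed state.
type RolloutSpec struct {
	Replicas     int32
	TemplateHash string // stands in for the full pod template
}

type RolloutStatus struct {
	ObservedReplicas     int32
	ObservedTemplateHash string
}

// isScalingEvent is true when only the replica count changed: the pod template
// matches what was last reconciled, so no new canary rollout is needed and the
// separate scaling code path is taken instead of the full reconcile.
func isScalingEvent(spec RolloutSpec, status RolloutStatus) bool {
	return spec.TemplateHash == status.ObservedTemplateHash &&
		spec.Replicas != status.ObservedReplicas
}

func main() {
	spec := RolloutSpec{Replicas: 5, TemplateHash: "575976c6dd"}
	status := RolloutStatus{ObservedReplicas: 4, ObservedTemplateHash: "575976c6dd"}
	fmt.Println(isScalingEvent(spec, status)) // true: replicas changed, template did not
}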


jessesuen commented Dec 11, 2021

I feel the syncReplicasOnly() method is doing too much work, not to mention that it duplicates effort with the normal reconcile code path. Since the scaling path is not executed as frequently, I feel it could be a source of corner-case bugs. syncReplicasOnly() was originally supposed to handle only scaling events, but I see it trying to reconcile analysis, introduced by this change:

cb26133

So a fundamental fix is to reduce the work done here, so that we have fewer branches in the code path.
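
As a rough, hypothetical sketch of that direction (invented names, not the change that ultimately landed in #1699, and ignoring canary weights and surge handling for brevity): the scaling-only path would just resize the existing ReplicaSets and leave analysis reconciliation entirely to the full reconcile loop, so a nil new ReplicaSet could never reach the analysis code from here.

package main

import "fmt"

// Illustrative stand-ins; not the real argo-rollouts types or method bodies.
type replicaSet struct {
	name     string
	replicas int32
}

type rolloutContext struct {
	stableRS *replicaSet
	newRS    *replicaSet
	desired  int32
}

// syncReplicasOnly handles nothing but the scale change. It does not create or
// reconcile AnalysisRuns, which keeps the rarely exercised scaling path small.
func (c *rolloutContext) syncReplicasOnly() error {
	for _, rs := range []*replicaSet{c.stableRS, c.newRS} {
		if rs == nil {
			continue // e.g. no new ReplicaSet yet; simply skip it
		}
		rs.replicas = c.desired
		fmt.Printf("scaled %s to %d replicas\n", rs.name, rs.replicas)
	}
	return nil
}

func main() {
	c := &rolloutContext{stableRS: &replicaSet{name: "name-575976c6dd"}, desired: 4}
	_ = c.syncReplicasOnly()
}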
