stuck on progressing after upgrade to v1.0.1 #1359

Closed
eilonmonday opened this issue Jul 18, 2021 · 10 comments · Fixed by #1378
Labels
bug Something isn't working cherry-pick/release-1.0

Comments

@eilonmonday

eilonmonday commented Jul 18, 2021

We have been using Argo Rollouts for 2 years (version 0.8.2), and recently decided to upgrade our production environment to v1.0.1. Because production is very sensitive, we upgraded the staging environment first, 1 month ago.

Everything went well, and we ran hundreds of deployments to staging with Argo Rollouts v1.0.1.
Then, all of a sudden, it stopped working: after a deployment, once all pods are up, the rollout is stuck on Progressing while both blue and green are healthy.

It happened in 2 different clusters.

[Image: Screen Shot 2021-07-18 at 12 42 10]

I tried to investigate logs:
time="2021-07-18T09:46:17Z" level=error msg="Recovered from panic: runtime error: invalid memory address or nil pointer dereference\ngoroutine 160 [running]:\nruntime/debug.Stack(0xc000b4d438, 0x1c18d40, 0x2d971f0)\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x9f\ngithub.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1.1(0xc002f69d50, 0xc000b4db30)\n\t/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:149 +0x5b\npanic(0x1c18d40, 0x2d971f0)\n\t/usr/local/go/src/runtime/panic.go:965 +0x1b9\ngithub.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcileBlueGreenPause(0xc002bcaa80, 0xc001caa000, 0xc00260d900)\n\t/go/src/github.com/argoproj/argo-rollouts/rollout/bluegreen.go:177 +0x5fe\ngithub.com/argoproj/argo-rollouts/rollout.(*rolloutContext).rolloutBlueGreen(0xc002bcaa80, 0x1ea0d24, 0x17)\n\t/go/src/github.com/argoproj/argo-rollouts/rollout/bluegreen.go:48 +0x17d\ngithub.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcile(0xc002bcaa80, 0xc00001d800, 0xc002bcaa80)\n\t/go/src/github.com/argoproj/argo-rollouts/rollout/context.go:79 +0x1e5\ngithub.com/argoproj/argo-rollouts/rollout.(*Controller).syncHandler(0xc0009ae000, 0xc000c0d560, 0x19, 0x0, 0x0)\n\t/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:387 +0x51a\ngithub.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1(0x0, 0x0)\n\t/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:153 +0x7c\ngithub.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1(0x21621d0, 0xc00007c1e0, 0x1e8bdd7, 0x7, 0xc001c4de60, 0xc00076bec0, 0x1b8c1a0, 0xc003010710, 0x0, 0x0)\n\t/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:157 +0x323\ngithub.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem(0x21621d0, 0xc00007c1e0, 0x1e8bdd7, 0x7, 0xc001c4de60, 0xc00076bec0, 0xc0005cce01)\n\t/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:169 
+0x9a\ngithub.com/argoproj/argo-rollouts/utils/controller.RunWorker(...)\n\t/go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:104\ngithub.com/argoproj/argo-rollouts/rollout.(*Controller).Run.func1()\n\t/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:319 +0xa5\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc001e04450)\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x5f\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001e04450, 0x2107e40, 0xc001c236b0, 0x7e6e01, 0xc0000a2f00)\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0x9b\nk8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001e04450, 0x3b9aca00, 0x0, 0x1, 0xc0000a2f00)\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x98\nk8s.io/apimachinery/pkg/util/wait.Until(0xc001e04450, 0x3b9aca00, 0xc0000a2f00)\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90 +0x4d\ncreated by github.com/argoproj/argo-rollouts/rollout.(*Controller).Run\n\t/go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:318 +0xac\n" namespace=monday rollout=staging-monday-api time="2021-07-18T09:46:17Z" level=error msg="rollout syncHandler error: Recovered from Panic" namespace=monday rollout=staging-monday-api

I tried deleting Argo Rollouts and its namespace and then reinstalling; the same issue occurred.
Then I uninstalled again and installed v0.8.2, and the deployment passed.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@eilonmonday eilonmonday added the bug Something isn't working label Jul 18, 2021
@huikang
Member

huikang commented Jul 19, 2021

There have been many changes in both the controller's logic and CRD since v0.8.2, so you may need to migrate the rollout's CRs to the latest version to make it work.

@jessesuen
Member

Stack trace, formatted:

time="2021-07-18T09:46:17Z" level=error msg="Recovered from panic: runtime error: invalid memory address or nil pointer dereference
goroutine 160 [running]:
runtime/debug.Stack(0xc000b4d438, 0x1c18d40, 0x2d971f0)
    /usr/local/go/src/runtime/debug/stack.go:24 +0x9f
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1.1(0xc002f69d50, 0xc000b4db30)
    /go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:149 +0x5b
panic(0x1c18d40, 0x2d971f0)
    /usr/local/go/src/runtime/panic.go:965 +0x1b9
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcileBlueGreenPause(0xc002bcaa80, 0xc001caa000, 0xc00260d900)
    /go/src/github.com/argoproj/argo-rollouts/rollout/bluegreen.go:177 +0x5fe
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).rolloutBlueGreen(0xc002bcaa80, 0x1ea0d24, 0x17)
    /go/src/github.com/argoproj/argo-rollouts/rollout/bluegreen.go:48 +0x17d
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcile(0xc002bcaa80, 0xc00001d800, 0xc002bcaa80)
    /go/src/github.com/argoproj/argo-rollouts/rollout/context.go:79 +0x1e5
github.com/argoproj/argo-rollouts/rollout.(*Controller).syncHandler(0xc0009ae000, 0xc000c0d560, 0x19, 0x0, 0x0)
    /go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:387 +0x51a
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1(0x0, 0x0)
    /go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:153 +0x7c
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1(0x21621d0, 0xc00007c1e0, 0x1e8bdd7, 0x7, 0xc001c4de60, 0xc00076bec0, 0x1b8c1a0, 0xc003010710, 0x0, 0x0)
    /go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:157 +0x323
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem(0x21621d0, 0xc00007c1e0, 0x1e8bdd7, 0x7, 0xc001c4de60, 0xc00076bec0, 0xc0005cce01)
    /go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:169 +0x9a
github.com/argoproj/argo-rollouts/utils/controller.RunWorker(...)
    /go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:104
github.com/argoproj/argo-rollouts/rollout.(*Controller).Run.func1()
    /go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:319 +0xa5
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc001e04450)
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001e04450, 0x2107e40, 0xc001c236b0, 0x7e6e01, 0xc0000a2f00)
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001e04450, 0x3b9aca00, 0x0, 0x1, 0xc0000a2f00)
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc001e04450, 0x3b9aca00, 0xc0000a2f00)
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90 +0x4d
created by github.com/argoproj/argo-rollouts/rollout.(*Controller).Run
    /go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:318 +0xac
" namespace=monday rollout=staging-monday-api time="2021-07-18T09:46:17Z" level=error msg="rollout syncHandler error: Recovered from Panic" namespace=monday rollout=staging-monday-api

@jessesuen
Member

jessesuen commented Jul 19, 2021

The only possible explanation is that pauseCond is nil in this code block:

func (c *rolloutContext) reconcileBlueGreenPause(activeSvc, previewSvc *corev1.Service) {
...
	pauseCond := getPauseCondition(c.rollout, v1alpha1.PauseReasonBlueGreenPause)
	if pauseCond == nil && !c.rollout.Status.ControllerPause {
		if pauseCond == nil {
			c.log.Info("pausing")
		}
		c.pauseContext.AddPauseCondition(v1alpha1.PauseReasonBlueGreenPause)
		return
	}

	if !c.pauseContext.CompletedBlueGreenPause() {
		c.log.Info("pause incomplete")
		if c.rollout.Spec.Strategy.BlueGreen.AutoPromotionSeconds > 0 {
			c.checkEnqueueRolloutDuringWait(pauseCond.StartTime, c.rollout.Spec.Strategy.BlueGreen.AutoPromotionSeconds)  // <<<< panic. 
		}
	} else {
		c.log.Infof("pause completed")
		c.pauseContext.RemovePauseCondition(v1alpha1.PauseReasonBlueGreenPause)
	}
}

@huikang
Member

huikang commented Jul 20, 2021

I was trying to reproduce the error with the following steps, but couldn't get the same results.

  1. Install argo-rollouts v0.8.2
  2. Deploy the following Rollout CR and Services:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: rollout-bluegreen
spec:
  replicas: 2
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: rollout-bluegreen
  template:
    metadata:
      labels:
        app: rollout-bluegreen
    spec:
      containers:
      - name: rollouts-demo
        image: argoproj/rollouts-demo:blue
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
  strategy:
    blueGreen: 
      activeService: rollout-bluegreen-active
      previewService: rollout-bluegreen-preview
      autoPromotionEnabled: false
---
apiVersion: v1
kind: Service
metadata:
  name: rollout-bluegreen-active
spec:
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app: rollout-bluegreen
---
apiVersion: v1
kind: Service
metadata:
  name: rollout-bluegreen-preview
spec:
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app: rollout-bluegreen
  3. Delete the argo-rollouts v0.8.2 Deployment
  4. Install the latest argo-rollouts (master branch) 'kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/master/download/install.yaml'
  5. Promote the rollout
kubectl argo rollouts set image rollout-bluegreen rollouts-demo=argoproj/rollouts-demo:yellow
kubectl argo rollouts promote rollout-bluegreen  
  6. The rollout is healthy and the controller runs fine
Name:            rollout-bluegreen
Namespace:       default
Status:          ✔ Healthy
Strategy:        BlueGreen
Images:          argoproj/rollouts-demo:blue
                 argoproj/rollouts-demo:yellow (stable, active)
Replicas:
  Desired:       2
  Current:       4
  Updated:       2
  Ready:         2
  Available:     2

NAME                                           KIND        STATUS     AGE    INFO
⟳ rollout-bluegreen                            Rollout     ✔ Healthy  4m35s  
├──# revision:2                                                              
│  └──⧉ rollout-bluegreen-6b5dc99488           ReplicaSet  ✔ Healthy  38s    stable,active
│     ├──□ rollout-bluegreen-6b5dc99488-gx2px  Pod         ✔ Running  38s    ready:1/1
│     └──□ rollout-bluegreen-6b5dc99488-s57d8  Pod         ✔ Running  38s    ready:1/1
└──# revision:1                                                              
   └──⧉ rollout-bluegreen-5f49884f5c           ReplicaSet  ✔ Healthy  3m14s  delay:2s
      ├──□ rollout-bluegreen-5f49884f5c-cb2mn  Pod         ✔ Running  3m14s  ready:1/1
      └──□ rollout-bluegreen-5f49884f5c-svbc2  Pod         ✔ Running  3m14s  ready:1/1

@eilonmonday, could you provide more details about how to reproduce the error?

@eilonmonday
Author

Guys, this seems to be an issue with the autoPromotionEnabled flag.
When I changed it from true to false, everything worked. Maybe we can investigate from here.

@huikang
Member

huikang commented Jul 21, 2021

@eilonmonday, I tested with autoPromotionEnabled: true in the above steps, but still can't reproduce the error. Could you provide more detail about how you produced the error?

@BillyMorgan

Hi guys, I'm experiencing a similar issue with v1.0.2+7a23fe5. I've got autoPromotionEnabled: false and autoPromotionSeconds: 45 but the auto promotion never happens. The rollout sits there happily paused with all the replicasets showing healthy status. Lemme know what info would be useful to debug.

@jessesuen
Member

@BillyMorgan are you seeing the stack trace in the logs as well?

@jessesuen
Member

This panic should be resolved in v1.0.3. Please give it a try.
https://github.com/argoproj/argo-rollouts/releases/tag/v1.0.3

@jessesuen
Member

There is another issue when using the previewReplicaCount feature. It is being fixed and will be in a v1.0.4 release:
#1383
