Backoff is applied even when retryStrategy.limit has been reached #7588
Comments
@dimitri-fert Are you submitting the ClusterWorkflowTemplate directly, or referring to it via templateRef in the workflow?
@sarabala1979 I'm referring to it via templateRef in the workflow.
It seems to me that the limit is off by one. Can you confirm?
Sure, I'll try it on a simpler workflow and see how it ends up. Will share the results here.
Performed a 2nd test with a simple workflow; here's my histogram:

I'm not expecting to wait the extra 8m here, as there is no retry left. Here's my test workflow and the associated logs from the workflow controller:
```yaml
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: test-backoff
  namespace: argo-events
spec:
  securityContext:
    fsGroup: 1001
  serviceAccountName: workflow-executor
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: 2
        backoff:
          duration: 2m
          factor: 2
          maxDuration: 60m
      container:
        image: hashicorp/terraform:1.0.9
        imagePullPolicy: IfNotPresent
        command:
          - sh
        args:
          - -c
          - |
            exit 1
```

(associated controller logs truncated)
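For reference, with `duration: 2m` and `factor: 2` the backoff should grow as `duration * factor^retries`, i.e. 2m before retry 1 and 4m before retry 2; the extra 8m is simply the next value in that series, computed even though attempt 2 was the last one allowed. A minimal sketch of that arithmetic (just the documented exponential formula, not the controller code):

```go
package main

import (
	"fmt"
	"time"
)

// backoffAfter returns the wait applied once `retries` attempts have already
// run, following the exponential formula duration * factor^retries.
func backoffAfter(base time.Duration, factor float64, retries int) time.Duration {
	wait := float64(base)
	for i := 0; i < retries; i++ {
		wait *= factor
	}
	return time.Duration(wait)
}

func main() {
	base := 2 * time.Minute // retryStrategy.backoff.duration: 2m
	factor := 2.0           // retryStrategy.backoff.factor: 2
	limit := 2              // retryStrategy.limit: 2 -> attempts 0, 1 and 2

	for attempt := 0; attempt <= limit; attempt++ {
		fmt.Printf("backoff after attempt %d: %s\n", attempt, backoffAfter(base, factor, attempt))
	}
	// backoff after attempt 0: 2m0s
	// backoff after attempt 1: 4m0s
	// backoff after attempt 2: 8m0s  <- applied even though no retry is left
}
```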
Not familiar with the code yet, but regarding this function, it seems to me that we first compute and apply the backoff duration and only then check whether the limit has been reached after this duration. I guess we could avoid requeueing the workflow if the limit has already been reached (see here). What would you think?
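To make that ordering concrete, here is a small self-contained sketch of how I read the control flow (a toy model with made-up names such as `processRetrySketch`, not the actual `processNodeRetries` code): the backoff guard runs first, so the node keeps reporting "Backoff for …" even when the number of children already exceeds the limit.

```go
package main

import (
	"fmt"
	"time"
)

// Toy model of the ordering described above (an assumption about the flow,
// not the Argo source): the backoff deadline is checked and the node would be
// requeued *before* the child count is compared against retryStrategy.limit.
func processRetrySketch(children, limit int, lastAttemptEnd time.Time, backoff time.Duration) string {
	waitingDeadline := lastAttemptEnd.Add(backoff)
	if time.Now().Before(waitingDeadline) {
		// Requeue would happen here, even when no retry is left.
		return fmt.Sprintf("Backoff for %s", time.Until(waitingDeadline).Round(time.Second))
	}
	if children > limit {
		return "No more retries left"
	}
	return "schedule another attempt"
}

func main() {
	// limit: 2 and all three attempts (children 0, 1 and 2) have already failed,
	// yet the sketch still reports a backoff instead of failing immediately.
	fmt.Println(processRetrySketch(3, 2, time.Now(), 8*time.Minute))
}
```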
I'm not super familiar with the code, but that looks to be the right area. What I cannot see in that code is where the limit is actually enforced. Do we need to look wider?
The limit is enforced at the very end of the function:
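Roughly, the check there has the following shape (paraphrased rather than quoted verbatim; `lastChildNode` and the exact types are approximations):

```go
// Paraphrased shape of the end-of-function check (not a verbatim quote):
// once the number of child attempts exceeds the configured limit, the node is
// marked terminal with "No more retries left" instead of being requeued.
if retryStrategy.Limit != nil && int32(len(node.Children)) > int32(retryStrategy.Limit.IntValue()) {
    return woc.markNodePhase(node.Name, lastChildNode.Phase, "No more retries left"), true, nil
}
```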
Should it be a case of changing `>` to `>=`?
I don't think so, no. If we use
Do you mean
I'm not asking you to patch the function. It could translate from this (source):

```go
if time.Now().Before(waitingDeadline) {
    woc.requeueAfter(timeToWait)
    retryMessage := fmt.Sprintf("Backoff for %s", humanize.Duration(timeToWait))
    return woc.markNodePhase(node.Name, node.Phase, retryMessage), false, nil
}
```

to something like this (if it makes any sense):

```go
if time.Now().Before(waitingDeadline) && int32(len(node.Children)) <= *limit {
    woc.requeueAfter(timeToWait)
    retryMessage := fmt.Sprintf("Backoff for %s", humanize.Duration(timeToWait))
    return woc.markNodePhase(node.Name, node.Phase, retryMessage), false, nil
}
```
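One detail worth double-checking in a PR (this is just my reading, not something I have verified against the code): with `limit: 2` the retry node can have up to three children (attempts 0, 1 and 2), and the end-of-function check fails the node once the child count exceeds the limit, so the `<= *limit` guard above should line up with that comparison rather than reintroduce the off-by-one discussed earlier.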
Sure, do you want to submit a PR?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
…Fixes argoproj#7588 Signed-off-by: Rohan Kumar <[email protected]>
…7588 (#8090) Signed-off-by: Rohan Kumar <[email protected]>
Summary
What happened/what you expected to happen?
When using `retryStrategy.limit: 2` and all child nodes (0, 1 and 2) have already failed, I expect the parent node to fail immediately, since no more retries are allowed. Instead, after node 2 fails, a new backoff is still applied (12m in my case) and the parent node fails later than expected.

What version of Argo Workflows are you running?
v3.2.4
Diagnostics
I used this ClusterWorkflowTemplate (truncated) to emulate node durations on a failing terraform apply step. To be close to my use case, it takes longer on the first run (21m) than on retries (5m).

What Kubernetes provider are you using?
GKE v1.20.11-gke.1300
What executor are you running? Docker/K8SAPI/Kubelet/PNS/Emissary
Docker
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.