
Backoff is applied even when retryStrategy.limit has been reached #7588

Closed
dimitri-fert opened this issue Jan 19, 2022 · 14 comments · Fixed by #8090
Labels
area/controller (Controller issues, panics), good first issue (Good for newcomers), type/bug

Comments

@dimitri-fert

dimitri-fert commented Jan 19, 2022

Summary

What happened/what you expected to happen?

When using retryStrategy.limit: 2 and all child nodes (0, 1, and 2) have already failed, I expect the parent node to fail immediately, since no more retries are allowed. Instead, after node 2 fails, another backoff is still applied (12 min in my case) and the parent node fails later than expected.


What version of Argo Workflows are you running?
v3.2.4

Diagnostics

I used this ClusterWorkflowTemplate (truncated) to emulate the node durations of a failing terraform apply step. To stay close to my real use case, the first run takes longer (21 min) than the retries (5 min).

---
apiVersion: argoproj.io/v1alpha1
kind: ClusterWorkflowTemplate
metadata:
  name: bootstrap-sandbox-infrastructure
spec:
  entrypoint: main
  volumeClaimTemplates:
    - metadata:
        name: workdir
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi
  securityContext:
    fsGroup: 1001
  templates:
    - name: terraform-apply-infrastructure
      retryStrategy:
        limit: 2
        backoff:
          duration: 3m
          factor: 2
          maxDuration: 60m
      container:
        image: hashicorp/terraform:1.0.9
        imagePullPolicy: IfNotPresent
        workingDir: /sources/k8s-apps-descriptors/terraform
        command:
          - sh
        args:
          - -c
          - |
            set -eo pipefail
            mkdir -p /sources/tmp

            if [[ -f /sources/tmp/duration ]]; then
              duration="$(cat /sources/tmp/duration)"
            else
              duration="21m"
            fi
            sleep "${duration}"

            echo "5m" > /sources/tmp/duration
            exit 1
        securityContext:
          allowPrivilegeEscalation: false
          privileged: false
          readOnlyRootFilesystem: false  # terraform needs /tmp
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
          - name: workdir
            mountPath: /sources

What Kubernetes provider are you using?
GKE v1.20.11-gke.1300

What executor are you running? Docker/K8SAPI/Kubelet/PNS/Emissary
Docker

Logs from the workflow controller:

time="2022-01-19T15:25:24.189Z" level=info msg="node bootstrap-sandbox-infrastrucuture-sbx-qo-pr-4337-d29n2-3538259764 message: Backoff for 3 minutes 0 seconds" namespace=apps-sbx-qo-pr-4337 workflow=bootstrap-sandbox-infrastrucuture-sbx-qo-pr-4337-d29n2
time="2022-01-19T15:33:56.542Z" level=info msg="node bootstrap-sandbox-infrastrucuture-sbx-qo-pr-4337-d29n2-3538259764 message: Backoff for 6 minutes 0 seconds" namespace=apps-sbx-qo-pr-4337 workflow=bootstrap-sandbox-infrastrucuture-sbx-qo-pr-4337-d29n2
time="2022-01-19T15:45:28.966Z" level=info msg="node bootstrap-sandbox-infrastrucuture-sbx-qo-pr-4337-d29n2-3538259764 message: Backoff for 12 minutes 0 seconds" namespace=apps-sbx-qo-pr-4337 workflow=bootstrap-sandbox-infrastrucuture-sbx-qo-pr-4337-d29n2
[...]
time="2022-01-19T15:57:39.032Z" level=info msg="node has maxDuration set, setting executionDeadline to: Wed Jan 19 16:04:02 +0000 (6 minutes from now)" namespace=apps-sbx-qo-pr-4337 node="bootstrap-sandbox-infrastrucuture-sbx-qo-pr-4337-d29n2[0].main.terraform-apply-infrastructure" workflow=bootstrap-sandbox-infrastrucuture-sbx-qo-pr-4337-d29n2
time="2022-01-19T15:57:39.032Z" level=info msg="node bootstrap-sandbox-infrastrucuture-sbx-qo-pr-4337-d29n2-3538259764 message: " namespace=apps-sbx-qo-pr-4337 workflow=bootstrap-sandbox-infrastrucuture-sbx-qo-pr-4337-d29n2
time="2022-01-19T15:57:39.032Z" level=info msg="No more retries left. Failing..." namespace=apps-sbx-qo-pr-4337 workflow=bootstrap-sandbox-infrastrucuture-sbx-qo-pr-4337-d29n2
time="2022-01-19T15:57:39.032Z" level=info msg="node bootstrap-sandbox-infrastrucuture-sbx-qo-pr-4337-d29n2-3538259764 phase Running -> Failed" namespace=apps-sbx-qo-pr-4337 workflow=bootstrap-sandbox-infrastrucuture-sbx-qo-pr-4337-d29n2
time="2022-01-19T15:57:39.032Z" level=info msg="node bootstrap-sandbox-infrastrucuture-sbx-qo-pr-4337-d29n2-3538259764 message: No more retries left" namespace=apps-sbx-qo-pr-4337 workflow=bootstrap-sandbox-infrastrucuture-sbx-qo-pr-4337-d29n2
time="2022-01-19T15:57:39.032Z" level=info msg="node bootstrap-sandbox-infrastrucuture-sbx-qo-pr-4337-d29n2-3538259764 finished: 2022-01-19 15:57:39.032781319 +0000 UTC" namespace=apps-sbx-qo-pr-4337 workflow=bootstrap-sandbox-infrastrucuture-sbx-qo-pr-4337-d29n2

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@dimitri-fert changed the title from "Backoff is applied even when retryStrategy.backoff.limit has been reached" to "Backoff is applied even when retryStrategy.limit has been reached" Jan 19, 2022
@sarabala1979
Member

@dimitri-fert are you submitting the ClusterWorkflowTemplate directly, or referencing it with templateRef in a Workflow?

@dimitri-fert
Author

@sarabala1979 I'm referencing it with templateRef in a Workflow

@alexec
Contributor

alexec commented Feb 4, 2022

It seems to me that the limit is off by one. Can you confirm?

@alexec added the good first issue and area/controller labels Feb 4, 2022
@sarabala1979 removed their assignment Feb 7, 2022
@dimitri-fert
Author

Sure, I'll try it on a simpler workflow and see how it ends up. I'll share the results here.

@dimitri-fert
Author

dimitri-fert commented Feb 14, 2022

Performed a 2nd test with a simple exit 1.
I confirm the diagnosis. With limit: 2, back-off is still applied when the limit has been reached.

Here's my timeline:

  • 1st try --> Fails
  • back-off for 2m
  • 1st retry --> Fails
  • back-off for 2*2m=4m
  • 2nd retry --> Fails (limit reached)
  • back-off for 2*4m=8m

I'm not expecting to wait the extra 8m here, as there are no more retries left.
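For reference, the schedule above is just the configured duration multiplied by the factor after every failed attempt. A minimal sketch of that arithmetic (just the math, not the controller code):

package main

import (
	"fmt"
	"time"
)

// Sketch of the backoff arithmetic observed above (not the Argo controller code):
// the wait after failed attempt n is duration * factor^n.
func main() {
	duration := 2 * time.Minute // retryStrategy.backoff.duration
	factor := 2                 // retryStrategy.backoff.factor
	limit := 2                  // retryStrategy.limit

	wait := duration
	for attempt := 0; attempt <= limit; attempt++ {
		fmt.Printf("attempt %d fails -> backoff for %s\n", attempt+1, wait)
		wait *= time.Duration(factor)
	}
	// Prints 2m0s, 4m0s and 8m0s — the final 8m backoff buys nothing,
	// because limit: 2 means no further retry will ever run.
}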

Here's my test workflow and the associated logs from the workflow controller

---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: test-backoff
  namespace: argo-events
spec:
  securityContext:
    fsGroup: 1001
  serviceAccountName: workflow-executor
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: 2
        backoff:
          duration: 2m
          factor: 2
          maxDuration: 60m
      container:
        image: hashicorp/terraform:1.0.9
        imagePullPolicy: IfNotPresent
        command:
          - sh
        args:
          - -c
          - |
            exit 1

associated (truncated) logs

time="2022-02-14T10:27:59.885Z" level=info msg="node test-backoff message: Backoff for 2 minutes 0 seconds" namespace=argo-events workflow=test-backoff
time="2022-02-14T10:31:03.736Z" level=info msg="node test-backoff message: Backoff for 4 minutes 0 seconds" namespace=argo-events workflow=test-backoff
time="2022-02-14T10:35:33.813Z" level=info msg="node test-backoff message: Backoff for 8 minutes 0 seconds" namespace=argo-events workflow=test-backoff
time="2022-02-14T10:43:43.830Z" level=info msg="No more retries left. Failing..." namespace=argo-events workflow=test-backoff

@dimitri-fert
Author

dimitri-fert commented Feb 14, 2022

I'm not familiar with the code yet, but looking at this function, it seems to me that we first compute and apply the backoff duration, and only check whether the limit has been reached after that wait.

I guess we could avoid requeuing the workflow if the limit has already been reached (see here). What do you think?
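To make the ordering concrete, here is a small self-contained toy model of what I mean (hypothetical names, not the actual processNodeRetries source): the backoff branch runs before the retry limit is ever consulted.

package main

import (
	"fmt"
	"time"
)

// Toy model only: mirrors the order of operations described above,
// not the real controller code.
func retryDecision(children, limit int, waitingDeadline time.Time, timeToWait time.Duration) string {
	// 1) A backoff is applied whenever the deadline is still in the future...
	if time.Now().Before(waitingDeadline) {
		return fmt.Sprintf("Backoff for %s", timeToWait)
	}
	// 2) ...and only afterwards is the limit enforced.
	if children > limit {
		return "No more retries left"
	}
	return "schedule another retry"
}

func main() {
	// limit: 2 with three failed children: a backoff is still reported as long
	// as the deadline lies ahead, matching the extra 8m wait in the logs above.
	fmt.Println(retryDecision(3, 2, time.Now().Add(8*time.Minute), 8*time.Minute))
}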

@alexec
Contributor

alexec commented Feb 14, 2022

I'm not super familiar with the code, but that looks to be the right area. What I cannot see in that code is where limit is actually enforced. Do we need to look wider?

@dimitri-fert
Author

dimitri-fert commented Feb 15, 2022

The limit is enforced at the very end of the function:

if retryStrategy.Limit != nil && limit != nil && int32(len(node.Children)) > *limit {
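For context, that condition is what produces the "No more retries left" lines in the controller logs above; the surrounding block looks roughly like this (paraphrased from memory, not an exact quote of the source):

if retryStrategy.Limit != nil && limit != nil && int32(len(node.Children)) > *limit {
	woc.log.Infoln("No more retries left. Failing...")
	return woc.markNodePhase(node.Name, lastChildNode.Phase, "No more retries left"), true, nil
}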

@alexec
Contributor

alexec commented Feb 15, 2022

Should it be a case of changing ">" to ">="?

@dimitri-fert
Author

I don't think so. If we use >= here, we'll skip a retry. We only want to skip the backoff, so we should instead change the condition on the requeue (i.e. don't requeue if the limit has already been reached).
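A quick worked illustration of the two comparisons, assuming node.Children holds the original attempt plus each retry (so at most limit + 1 entries); this is a standalone snippet, not controller code:

package main

import "fmt"

// Compares ">" and ">=" against limit: 2, assuming node.Children contains
// the first attempt plus every retry (up to limit+1 = 3 children).
func main() {
	limit := int32(2)
	for children := int32(1); children <= 3; children++ {
		fmt.Printf("children=%d  fail with '>': %-5v  fail with '>=': %v\n",
			children, children > limit, children >= limit)
	}
	// '>'  fails the node only at children=3, i.e. after both retries, as intended.
	// '>=' would fail it at children=2, i.e. after a single retry.
}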

@alexec
Contributor

alexec commented Feb 16, 2022

Do you mean woc.requeue()? If so, that function can be called for many reasons, so I don't think the fix can be there.

@dimitri-fert
Author

I'm not asking to patch woc.requeue() itself, no, but to change the if statement that guards the call to woc.requeueAfter() inside processNodeRetries().

It could go from this (source)

		if time.Now().Before(waitingDeadline) {
			woc.requeueAfter(timeToWait)
			retryMessage := fmt.Sprintf("Backoff for %s", humanize.Duration(timeToWait))
			return woc.markNodePhase(node.Name, node.Phase, retryMessage), false, nil
		}

to something like this (if it makes any sense)

		if time.Now().Before(waitingDeadline) && (limit == nil || int32(len(node.Children)) <= *limit) {
			woc.requeueAfter(timeToWait)
			retryMessage := fmt.Sprintf("Backoff for %s", humanize.Duration(timeToWait))
			return woc.markNodePhase(node.Name, node.Phase, retryMessage), false, nil
		}
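(The limit == nil guard in the suggestion above just mirrors the nil checks in the enforcement condition quoted earlier, so templates that configure a backoff without any limit would keep the current requeue behaviour.)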

@alexec
Contributor

alexec commented Feb 16, 2022

Sure, do you want to submit a PR?

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the problem/stale label Mar 2, 2022
rohankmr414 added a commit to rohankmr414/argo-workflows that referenced this issue Mar 7, 2022
@alexec removed the v3.2 label Mar 9, 2022
The stale bot removed the problem/stale label Mar 9, 2022
rohankmr414 added a commit to rohankmr414/argo-workflows that referenced this issue Apr 4, 2022
rohankmr414 added a commit to rohankmr414/argo-workflows that referenced this issue Apr 5, 2022
rohankmr414 added a commit to rohankmr414/argo-workflows that referenced this issue Apr 7, 2022
rohankmr414 added a commit to rohankmr414/argo-workflows that referenced this issue Apr 12, 2022
@sarabala1979 mentioned this issue Apr 14, 2022
@alexec mentioned this issue May 3, 2022