Backoff is applied even when retryStrategy.limit has been reached #7588
Comments
@dimitri-fert Are you submitting the ClusterWorkflowTemplate directly, or referring to it via templateRef in the workflow?
@sarabala1979 I'm referring to it via templateRef in the workflow.
It seems to me that the limit is off by one. Can you confirm?
Sure, I'll try it on a simpler workflow and see how it ends up. Will share the results here.
Performed a 2nd test with a simple workflow; here's my histogram:

I'm not expecting to wait the extra 8m here, as there is no retry left. Here's my test workflow and the associated logs from the workflow controller:
```yaml
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: test-backoff
  namespace: argo-events
spec:
  securityContext:
    fsGroup: 1001
  serviceAccountName: workflow-executor
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: 2
        backoff:
          duration: 2m
          factor: 2
          maxDuration: 60m
      container:
        image: hashicorp/terraform:1.0.9
        imagePullPolicy: IfNotPresent
        command:
          - sh
        args:
          - -c
          - |
            exit 1
```

(associated controller logs truncated)
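For reference, with `duration: 2m` and `factor: 2` the backoff should grow as `duration * factor^retries`, i.e. 2m before retry 1 and 4m before retry 2; the extra 8m is simply the next value in that series, computed even though attempt 2 was the last one allowed. A minimal sketch of that arithmetic (just the documented exponential formula, not the controller code):

```go
package main

import (
	"fmt"
	"time"
)

// backoffAfter returns the wait applied once `retries` attempts have already
// run, following the exponential formula duration * factor^retries.
func backoffAfter(base time.Duration, factor float64, retries int) time.Duration {
	wait := float64(base)
	for i := 0; i < retries; i++ {
		wait *= factor
	}
	return time.Duration(wait)
}

func main() {
	base := 2 * time.Minute // retryStrategy.backoff.duration: 2m
	factor := 2.0           // retryStrategy.backoff.factor: 2
	limit := 2              // retryStrategy.limit: 2 -> attempts 0, 1 and 2

	for attempt := 0; attempt <= limit; attempt++ {
		fmt.Printf("backoff after attempt %d: %s\n", attempt, backoffAfter(base, factor, attempt))
	}
	// backoff after attempt 0: 2m0s
	// backoff after attempt 1: 4m0s
	// backoff after attempt 2: 8m0s  <- applied even though no retry is left
}
```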
Not familiar with the code yet, but regarding this function, it seems to me that we first compute and apply the backoff duration and only then check whether the limit has been reached after this duration. I guess we could avoid requeueing the workflow if the limit has already been reached (see here). What would you think?
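To make that ordering concrete, here is a small self-contained sketch of how I read the control flow (a toy model with made-up names such as `processRetrySketch`, not the actual `processNodeRetries` code): the backoff guard runs first, so the node keeps reporting "Backoff for …" even when the number of children already exceeds the limit.

```go
package main

import (
	"fmt"
	"time"
)

// Toy model of the ordering described above (an assumption about the flow,
// not the Argo source): the backoff deadline is checked and the node would be
// requeued *before* the child count is compared against retryStrategy.limit.
func processRetrySketch(children, limit int, lastAttemptEnd time.Time, backoff time.Duration) string {
	waitingDeadline := lastAttemptEnd.Add(backoff)
	if time.Now().Before(waitingDeadline) {
		// Requeue would happen here, even when no retry is left.
		return fmt.Sprintf("Backoff for %s", time.Until(waitingDeadline).Round(time.Second))
	}
	if children > limit {
		return "No more retries left"
	}
	return "schedule another attempt"
}

func main() {
	// limit: 2 and all three attempts (children 0, 1 and 2) have already failed,
	// yet the sketch still reports a backoff instead of failing immediately.
	fmt.Println(processRetrySketch(3, 2, time.Now(), 8*time.Minute))
}
```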
I'm not super familiar with the code, but that looks to be the right area. What I cannot see in that code is where the limit is actually enforced. Do we need to look wider?
The limit is enforced at the very end of the function:
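Roughly, the check there has the following shape (paraphrased rather than quoted verbatim; `lastChildNode` and the exact types are approximations):

```go
// Paraphrased shape of the end-of-function check (not a verbatim quote):
// once the number of child attempts exceeds the configured limit, the node is
// marked terminal with "No more retries left" instead of being requeued.
if retryStrategy.Limit != nil && int32(len(node.Children)) > int32(retryStrategy.Limit.IntValue()) {
    return woc.markNodePhase(node.Name, lastChildNode.Phase, "No more retries left"), true, nil
}
```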
Should it be a case of changing `>` to `>=`?
I don't think so, no. If we use
Do you mean
I'm not asking you to patch the function. It could translate from this (source):

```go
if time.Now().Before(waitingDeadline) {
    woc.requeueAfter(timeToWait)
    retryMessage := fmt.Sprintf("Backoff for %s", humanize.Duration(timeToWait))
    return woc.markNodePhase(node.Name, node.Phase, retryMessage), false, nil
}
```

to something like this (if it makes any sense):

```go
if time.Now().Before(waitingDeadline) && int32(len(node.Children)) <= *limit {
    woc.requeueAfter(timeToWait)
    retryMessage := fmt.Sprintf("Backoff for %s", humanize.Duration(timeToWait))
    return woc.markNodePhase(node.Name, node.Phase, retryMessage), false, nil
}
```
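One detail worth double-checking in a PR (this is just my reading, not something I have verified against the code): with `limit: 2` the retry node can have up to three children (attempts 0, 1 and 2), and the end-of-function check fails the node once the child count exceeds the limit, so the `<= *limit` guard above should line up with that comparison rather than reintroduce the off-by-one discussed earlier.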
Sure, do you want to submit a PR?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
…Fixes argoproj#7588 Signed-off-by: Rohan Kumar <[email protected]>
…7588 (#8090) Signed-off-by: Rohan Kumar <[email protected]>
Summary
What happened/what you expected to happen?
When using `retryStrategy.limit: 2` and all child nodes (0, 1 and 2) have already failed, I expect the parent node to fail immediately, since no more retries are allowed. Instead, after node 2 fails, a new backoff is still applied (12m in my case) and the parent node fails later than expected.

What version of Argo Workflows are you running?
v3.2.4
Diagnostics
I used this ClusterWorkflowTemplate (truncated) to emulate node durations on a failing terraform apply step. To be close to my use case, it takes longer on the first run (21m) than on retries (5m).

What Kubernetes provider are you using?
GKE v1.20.11-gke.1300
What executor are you running? Docker/K8SAPI/Kubelet/PNS/Emissary
Docker
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.