HTTP Template Manual Retry Hangs After First Retry Failure #11889
Tested with multiple scenarios, including OnExit at the workflow level. All produce the same result.
Issue linked to this PR: #11839
I think the root cause is the … We can see the following logs when retrying:

time="2023-10-01T16:09:59.384Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=test-failure-2-l5f9g
time="2023-10-01T16:09:59.385Z" level=warning msg="[SPECIAL][DEBUG] returning but assumed validity before" namespace=argo workflow=test-failure-2-l5f9g
time="2023-10-01T16:09:59.385Z" level=error msg="[DEBUG] Was unable to obtain node for test-failure-2-l5f9g-2934565976" namespace=argo workflow=test-failure-2-l5f9g

If we delete …
Are you interested in submitting a PR? (I'm asking because you've checked the 3rd box. :) )
I'm unable to build (see #11936). I attempted it from dev containers and make start, on Mac and Windows.
Updated the above issue with errors.
I'll have to try building on my computer at home. |
Is it possible to access the UI in GitHub Codespaces? This is what I see after running:

■ port-forwa   running [9000]  Handling connection for 9000
■ controller   running [9090]  time="2023-10-30T15:30:47.272Z" level=debug msg="Update leases 200"
■ server       running [2746]  time="2023-10-30T15:27:11.585Z" level=info msg="Alloc=11589 TotalAlloc=19750 Sys=23141 NumGC=6 Goroutines=105"
■ ui           running [8080]  webpack 5.89.0 compiled with 42 warnings in 24484 ms

v0.1.14 8m44s  logs in logs  [1..4+Enter] enable logging at ERROR..DEBUG  [0+Enter] disable logging
@toyamagu-2021 I'm finally able to build using GitHub Codespaces. When deleting woc.deleteTaskSet(ctx) I get the following error:

time="2023-10-30T18:05:40.035Z" level=debug msg="ignore signal child exited" argo=true
time="2023-10-30T18:05:41.008Z" level=info msg="sub-process exited" argo=true error="<nil>"

I've also tried replacing …
More detailed logs from the controller:

time="2023-10-30T18:49:10.376Z" level=info msg="Processing workflow" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:10.376Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:10.376Z" level=debug msg="Evaluating node nss-test-failure-test-sgpnj: template: *v1alpha1.WorkflowStep (start-test-fail), boundaryID: " namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:10.376Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=,name=)" tmpl="*v1alpha1.WorkflowStep (start-test-fail)"
time="2023-10-30T18:49:10.376Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=,name=)" tmpl="*v1alpha1.WorkflowStep (start-test-fail)"
time="2023-10-30T18:49:10.376Z" level=debug msg="Getting the template by name: start-test-fail" base="*v1alpha1.Workflow (namespace=,name=)" tmpl="*v1alpha1.WorkflowStep (start-test-fail)"
time="2023-10-30T18:49:10.376Z" level=debug msg="Executing node nss-test-failure-test-sgpnj of Steps is Running" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:10.376Z" level=debug msg="Step group node &NodeStatus{ID:nss-test-failure-test-sgpnj-3770584575,Name:nss-test-failure-test-sgpnj[0],DisplayName:[0],Type:StepGroup,TemplateName:,TemplateRef:nil,Phase:Succeeded,BoundaryID:nss-test-failure-test-sgpnj,Message:,StartedAt:2023-10-30 18:48:18 +0000 UTC,FinishedAt:2023-10-30 18:48:26 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:nil,Children:[nss-test-failure-test-sgpnj-311221617],OutboundNodes:[],TemplateScope:local/,ResourcesDuration:ResourcesDuration{cpu: 5s,memory: 3s,},HostNodeName:,MemoizationStatus:nil,EstimatedDuration:7,SynchronizationStatus:nil,Progress:1/4,NodeFlag:&NodeFlag{Hooked:false,Retried:false,},} already marked completed" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:10.376Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=,name=)" tmpl="*v1alpha1.NodeStatus (node)"
time="2023-10-30T18:49:10.376Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=,name=)" tmpl="*v1alpha1.NodeStatus (node)"
time="2023-10-30T18:49:10.376Z" level=debug msg="Getting the template by name: node" base="*v1alpha1.Workflow (namespace=,name=)" tmpl="*v1alpha1.NodeStatus (node)"
time="2023-10-30T18:49:10.376Z" level=info msg="SG Outbound nodes of nss-test-failure-test-sgpnj-311221617 are [nss-test-failure-test-sgpnj-311221617]" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:10.376Z" level=debug msg="Evaluating node nss-test-failure-test-sgpnj[1].get-stores-status: template: *v1alpha1.WorkflowStep (http-retry), boundaryID: nss-test-failure-test-sgpnj" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:10.376Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:10.376Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=,name=)" tmpl="*v1alpha1.WorkflowStep (http-retry)"
time="2023-10-30T18:49:10.376Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=,name=)" tmpl="*v1alpha1.WorkflowStep (http-retry)"
time="2023-10-30T18:49:10.377Z" level=debug msg="Getting the template by name: http-retry" base="*v1alpha1.Workflow (namespace=,name=)" tmpl="*v1alpha1.WorkflowStep (http-retry)"
time="2023-10-30T18:49:10.377Z" level=debug msg="unresolved is allowed " error=unresolved
time="2023-10-30T18:49:10.377Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=,name=)" tmpl="*v1alpha1.NodeStatus (start-test-fail)"
time="2023-10-30T18:49:10.377Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=,name=)" tmpl="*v1alpha1.NodeStatus (start-test-fail)"
time="2023-10-30T18:49:10.377Z" level=debug msg="Getting the template by name: start-test-fail" base="*v1alpha1.Workflow (namespace=,name=)" tmpl="*v1alpha1.NodeStatus (start-test-fail)"
time="2023-10-30T18:49:10.377Z" level=debug msg="Inject a retry node for node nss-test-failure-test-sgpnj[1].get-stores-status" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:10.377Z" level=debug msg="Initializing node nss-test-failure-test-sgpnj[1].get-stores-status: template: *v1alpha1.WorkflowStep (http-retry), boundaryID: nss-test-failure-test-sgpnj" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:10.377Z" level=info msg="Retry node nss-test-failure-test-sgpnj-1314530950 initialized Running" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:10.377Z" level=debug msg="Initializing node nss-test-failure-test-sgpnj[1].get-stores-status(0): template: *v1alpha1.WorkflowStep (http-retry), boundaryID: nss-test-failure-test-sgpnj" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:10.377Z" level=info msg="HTTP node nss-test-failure-test-sgpnj-1520507653 initialized Pending" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:10.377Z" level=info msg="Workflow step group node nss-test-failure-test-sgpnj-2764074530 not yet completed" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:10.377Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:10.377Z" level=warning msg="[SPECIAL][DEBUG] returning but assumed validity before" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:10.377Z" level=error msg="[DEBUG] Was unable to obtain node for nss-test-failure-test-sgpnj-1318882035" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:10.377Z" level=error msg="error in workflowtaskset reconciliation" error="key was not found for nss-test-failure-test-sgpnj-1318882035" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:10.378Z" level=debug msg="Log changes patch: {\"status\":{\"nodes\":{\"nss-test-failure-test-sgpnj-1314530950\":{\"boundaryID\":\"nss-test-failure-test-sgpnj\",\"children\":[\"nss-test-failure-test-sgpnj-1520507653\"],\"displayName\":\"get-stores-status\",\"estimatedDuration\":12,\"finishedAt\":null,\"id\":\"nss-test-failure-test-sgpnj-1314530950\",\"inputs\":{\"parameters\":[{\"name\":\"url\",\"value\":\"https://google.com\"}]},\"name\":\"nss-test-failure-test-sgpnj[1].get-stores-status\",\"phase\":\"Running\",\"startedAt\":\"2023-10-30T18:49:10Z\",\"templateName\":\"http-retry\",\"templateScope\":\"local/\",\"type\":\"Retry\"},\"nss-test-failure-test-sgpnj-1520507653\":{\"boundaryID\":\"nss-test-failure-test-sgpnj\",\"displayName\":\"get-stores-status(0)\",\"estimatedDuration\":19,\"finishedAt\":null,\"id\":\"nss-test-failure-test-sgpnj-1520507653\",\"inputs\":{\"parameters\":[{\"name\":\"url\",\"value\":\"https://google.com\"}]},\"name\":\"nss-test-failure-test-sgpnj[1].get-stores-status(0)\",\"nodeFlag\":{\"retried\":true},\"phase\":\"Pending\",\"startedAt\":\"2023-10-30T18:49:10Z\",\"templateName\":\"http-retry\",\"templateScope\":\"local/\",\"type\":\"HTTP\"},\"nss-test-failure-test-sgpnj-2764074530\":{\"children\":[\"nss-test-failure-test-sgpnj-1314530950\"]}}}}"
time="2023-10-30T18:49:10.378Z" level=warning msg="Coudn't obtain child for nss-test-failure-test-sgpnj-2732484172, panicking"
time="2023-10-30T18:49:10.378Z" level=info msg="Workflow to be dehydrated" Workflow Size=4782
time="2023-10-30T18:49:10.384Z" level=debug msg="Update workflows 200"
time="2023-10-30T18:49:10.386Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=13385 workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:10.386Z" level=debug msg="Event(v1.ObjectReference{Kind:\"Workflow\", Namespace:\"argo\", Name:\"nss-test-failure-test-sgpnj\", UID:\"9b8ab2ae-511d-48ef-855e-7ffe7611439c\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"13385\", FieldPath:\"\"}): type: 'Normal' reason: 'WorkflowNodeRunning' Running node nss-test-failure-test-sgpnj[1].get-stores-status"
time="2023-10-30T18:49:10.397Z" level=debug msg="Patch events 200"
time="2023-10-30T18:49:10.400Z" level=debug msg="Patch workflowtasksets 200"
time="2023-10-30T18:49:11.376Z" level=info msg="Processing workflow" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:11.377Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:11.377Z" level=debug msg="Evaluating node nss-test-failure-test-sgpnj: template: *v1alpha1.WorkflowStep (start-test-fail), boundaryID: " namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:11.377Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=,name=)" tmpl="*v1alpha1.WorkflowStep (start-test-fail)"
time="2023-10-30T18:49:11.377Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=,name=)" tmpl="*v1alpha1.WorkflowStep (start-test-fail)"
time="2023-10-30T18:49:11.377Z" level=debug msg="Getting the template by name: start-test-fail" base="*v1alpha1.Workflow (namespace=,name=)" tmpl="*v1alpha1.WorkflowStep (start-test-fail)"
time="2023-10-30T18:49:11.377Z" level=debug msg="Executing node nss-test-failure-test-sgpnj of Steps is Running" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:11.377Z" level=debug msg="Step group node &NodeStatus{ID:nss-test-failure-test-sgpnj-3770584575,Name:nss-test-failure-test-sgpnj[0],DisplayName:[0],Type:StepGroup,TemplateName:,TemplateRef:nil,Phase:Succeeded,BoundaryID:nss-test-failure-test-sgpnj,Message:,StartedAt:2023-10-30 18:48:18 +0000 UTC,FinishedAt:2023-10-30 18:48:26 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:nil,Children:[nss-test-failure-test-sgpnj-311221617],OutboundNodes:[],TemplateScope:local/,ResourcesDuration:ResourcesDuration{cpu: 5s,memory: 3s,},HostNodeName:,MemoizationStatus:nil,EstimatedDuration:7,SynchronizationStatus:nil,Progress:1/2,NodeFlag:&NodeFlag{Hooked:false,Retried:false,},} already marked completed" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:11.377Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=,name=)" tmpl="*v1alpha1.NodeStatus (node)"
time="2023-10-30T18:49:11.377Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=,name=)" tmpl="*v1alpha1.NodeStatus (node)"
time="2023-10-30T18:49:11.377Z" level=debug msg="Getting the template by name: node" base="*v1alpha1.Workflow (namespace=,name=)" tmpl="*v1alpha1.NodeStatus (node)"
time="2023-10-30T18:49:11.377Z" level=info msg="SG Outbound nodes of nss-test-failure-test-sgpnj-311221617 are [nss-test-failure-test-sgpnj-311221617]" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:11.377Z" level=debug msg="Evaluating node nss-test-failure-test-sgpnj[1].get-stores-status: template: *v1alpha1.WorkflowStep (http-retry), boundaryID: nss-test-failure-test-sgpnj" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:11.377Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=,name=)" tmpl="*v1alpha1.WorkflowStep (http-retry)"
time="2023-10-30T18:49:11.377Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=,name=)" tmpl="*v1alpha1.WorkflowStep (http-retry)"
time="2023-10-30T18:49:11.377Z" level=debug msg="Getting the template by name: http-retry" base="*v1alpha1.Workflow (namespace=,name=)" tmpl="*v1alpha1.WorkflowStep (http-retry)"
time="2023-10-30T18:49:11.377Z" level=debug msg="unresolved is allowed " error=unresolved
time="2023-10-30T18:49:11.377Z" level=debug msg="Executing node nss-test-failure-test-sgpnj[1].get-stores-status of Retry is Running" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:11.377Z" level=info msg="Workflow step group node nss-test-failure-test-sgpnj-2764074530 not yet completed" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:11.377Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:11.377Z" level=warning msg="[SPECIAL][DEBUG] returning but assumed validity before" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:11.377Z" level=error msg="[DEBUG] Was unable to obtain node for nss-test-failure-test-sgpnj-1318882035" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:11.377Z" level=error msg="error in workflowtaskset reconciliation" error="key was not found for nss-test-failure-test-sgpnj-1318882035" namespace=argo workflow=nss-test-failure-test-sgpnj
time="2023-10-30T18:49:11.442Z" level=debug msg="Syncing all CronWorkflows"
time="2023-10-30T18:49:12.004Z" level=debug msg="Get leases 200"
time="2023-10-30T18:49:12.007Z" level=debug msg="Update leases 200"
time="2023-10-30T18:49:17.010Z" level=debug msg="Get leases 200"
time="2023-10-30T18:49:17.013Z" level=debug msg="Update leases 200"
time="2023-10-30T18:49:21.443Z" level=debug msg="Syncing all CronWorkflows"
time="2023-10-30T18:49:22.016Z" level=debug msg="Get leases 200"
time="2023-10-30T18:49:22.019Z" level=debug msg="Update leases 200"
time="2023-10-30T18:49:23.244Z" level=info msg="cleaning up pod" action=killContainers key=argo/nss-test-failure-test-sgpnj-node-2560466722/killContainers
time="2023-10-30T18:49:23.249Z" level=info msg="cleaning up pod" action=killContainers key=argo/nss-test-failure-test-sgpnj-node-3806041575/killContainers |
Yes, the reason is the completed node status in …
This PR #12620 fixes the …
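For illustration only (the actual change in #12620 may differ), one plausible shape for such a fix is for taskset reconciliation to tolerate entries whose node is already completed or no longer present, pruning them instead of returning the "key was not found" error. A self-contained toy model of that idea:

package main

import "fmt"

func main() {
	// Toy model of status.nodes after a manual retry (nodeID -> completed?).
	nodes := map[string]bool{
		"nss-test-failure-test-sgpnj-1520507653": false, // live HTTP node
	}

	// The WorkflowTaskSet still holds an entry keyed by the pruned node ID.
	taskSet := map[string]string{
		"nss-test-failure-test-sgpnj-1318882035": "stale entry from before the retry",
		"nss-test-failure-test-sgpnj-1520507653": "live HTTP task",
	}

	for nodeID := range taskSet {
		completed, ok := nodes[nodeID]
		if !ok || completed {
			// Drop the stale or completed entry rather than failing
			// the whole reconciliation pass.
			delete(taskSet, nodeID)
			fmt.Printf("pruned stale taskset entry %s\n", nodeID)
			continue
		}
		fmt.Printf("reconciling %s\n", nodeID)
	}
}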
Can you paste a small workflow to reproduce it? @wesleyscholl
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: test-failure-test
  generateName: test-failure-test-
  namespace: new-setup
spec:
  templates:
    - name: start-test-fail
      inputs: {}
      outputs: {}
      metadata: {}
      steps:
        - - name: get-stores-status
            template: http-retry
            arguments:
              parameters:
                - name: url
                  value: http://httpstat.us/Random/400-404,500-504
    - name: http-retry
      inputs:
        parameters:
          - name: url
      outputs: {}
      metadata: {}
      http:
        method: GET
        url: '{{inputs.parameters.url}}'
        timeoutSeconds: 20
        successCondition: response.statusCode == 200
      retryStrategy:
        limit: 3
        retryPolicy: Always
        backoff:
          duration: 10s
          factor: 1
          maxDuration: 10m
  entrypoint: start-test-fail
  arguments: {}
@wesleyscholl Sorry, I can't reproduce it with the workflow you provided.
@jswxstw @shuangkun @toyamagu-2021 Apologies, I forgot to include the exit handler, and we are using v3.5.4. Reproducible (run the workflow, then retry the HTTP template; it then hangs):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: test-failure-test
spec:
  templates:
    - name: start-test-fail
      steps:
        - - name: get-status
            template: http-retry
            arguments:
              parameters:
                - name: url
                  value: http://httpstat.us/Random/400-404,500-504
    - name: http-retry
      inputs:
        parameters:
          - name: url
      http:
        method: GET
        url: '{{inputs.parameters.url}}'
        timeoutSeconds: 20
        successCondition: response.statusCode == 200
      retryStrategy:
        limit: 3
        retryPolicy: Always
        backoff:
          duration: 10s
          factor: 1
          maxDuration: 10m
    - name: exit-handler
      steps:
        - - name: log-workflow-status
            template: log-message
        - - name: conditional-alert
            template: log-message
            when: '{{workflow.status}} != "Succeeded"'
    - name: log-message
      container:
        name: ''
        image: alpine:latest
        command:
          - echo
  entrypoint: start-test-fail
  arguments: {}
  onExit: exit-handler
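To reproduce with this workflow: submit it, let the get-status step exhaust its retry limit against the failing endpoint, then manually retry the failed workflow (for example with the argo retry CLI command); the retried HTTP node is the one that hangs.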
Please try v3.5.5 or above; this issue has been fixed.
Upgraded to v3.5.5 and tested. Confirmed that the manual retry works for workflow task sets. 👍🏻 (attachment: ArgoWorkflowsRetrySuccessful.mov)
Fixed by #12620 based on the above comment. Thanks all!
Pre-requisites
Tested with the :latest image tag.
What happened/what you expected to happen?
HTTP Template Step Manual Retry Hangs After First Retry Failure

When manually retrying a failed step in an Argo workflow, if the step fails again the workflow hangs, regardless of the configured retry strategy. Is the global onExit hook causing the issue? Ideally, we want the complete retry strategy to run again if the manual retry fails.
Version
v3.4.5
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container