Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Fixed hanging http template manual retry with previous outputs. Fixes #11889 #12122

Closed
wants to merge 9 commits into from

Conversation

wesleyscholl
Copy link

@wesleyscholl wesleyscholl commented Nov 1, 2023

This PR Fixes #11889.

Motivation

Created the initial issue and indicated that I'd be willing to submit a PR to fix this issue.

Changes

As suggested here, Replaced return err with continue to allow TaskSets to be reconciled when manually retrying with previous outputs.

Screenshots

Step Workflow With Exit Hook:

ManualRetryWithPreviousOutputsFix720.mov

Step Workflow Without Exit Hook:
Screenshot 2023-11-01 at 1 43 16 PM

Single Workflow With Exit Hook:
Screenshot 2023-11-01 at 2 12 03 PM

Single Workflow Without Exit Hook:
Screenshot 2023-11-01 at 1 59 22 PM

Multistep Workflow With Exit Hook:
Screenshot 2023-11-01 at 2 05 20 PM

Multistep Workflow Without Exit Hook:
Screenshot 2023-11-01 at 2 10 39 PM

Verification

Tested and verified manual retry does not hang for:

  • Http retry step workflow with exit hook.
  • Http retry step workflow without exit hook.
  • Http retry single workflow with exit hook.
  • Http retry single workflow without exit hook.
  • Http retry multistep workflow with exit hook.
  • Http retry multistep workflow without exit hook.

@terrytangyuan
Copy link
Member

cc @toyamagu-2021 Could you help review this?

@wesleyscholl wesleyscholl marked this pull request as ready for review November 1, 2023 20:40
@wesleyscholl
Copy link
Author

Apologies for multiple commits, I'm having GitHub Codespaces gpg key issues.

@agilgur5 agilgur5 added area/controller Controller issues, panics area/templates/http area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries labels Nov 2, 2023
server/static/files.go Outdated Show resolved Hide resolved
Copy link
Member

@toyamagu-2021 toyamagu-2021 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for submitting PR.
We should delete unnecessary file (likely from codespace?) and improve logging, but this PR fixes #11189.

FYI: root cause is described in original issue

workflow/controller/taskset.go Outdated Show resolved Hide resolved
@wesleyscholl
Copy link
Author

wesleyscholl commented Nov 6, 2023

Thank you for submitting PR. We should delete unnecessary file (likely from codespace?) and improve logging, but this PR fixes #11189.

FYI: root cause is described in original issue

Removed the modifications from files.go, unintentionally created by GitHub Codespaces when running make pre-commit -B.

Output from the CLI:

webpack 5.89.0 compiled with 17 warnings in 82357 ms
Done in 85.07s.
# Pack UI into a Go file
/home/vscode/go/bin/staticfiles -o server/static/files.go ui/dist/app

@agilgur5
Copy link
Contributor

agilgur5 commented Nov 7, 2023

Removed the modifications from files.go, unintentionally created by GitHub Codespaces when running make pre-commit -B.

Huh we do have an exact version specified in the Makefile, odd that it produced different output.
staticfiles is also deprecated in favor of native embed functionality in later versions of Go (#11654), so I also don't think it's had updates either. Strange 🤔

woc.log.Warnf("[SPECIAL][DEBUG] returning but assumed validity before")
woc.log.Errorf("[DEBUG] Was unable to obtain node for %s", nodeID)
return err
woc.log.Info("Continuing to retry")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should definitely be more specific here -- if you read this message in the Controller's logs, it would be pretty ambiguous.

Suggested change
woc.log.Info("Continuing to retry")
woc.log.Debugf("Unable to obtain node for %s during TaskSet Reconciliation. Expected during a retry, continuing", nodeId)

I also think debug level makes more sense here, since this is now an expected error

@wesleyscholl
Copy link
Author

After further testing, I discovered that the outputs from the failed pods were saved and not cleared.

As mentioned here,TaskSets needs to be deleted before retrying. - #11889 (comment)

woc.patchTaskSet(ctx, woc.getDeleteTaskAndNodePatch(), types.MergePatchType)

Adding this line at the beginning of reconcileTaskSet deletes TaskSets and clears previous outputs from failed pods. Then the workflow retries as expected on manual retry.

ArgoWorkflowsManualRetryClearedOutputsFixed.mov

@wesleyscholl
Copy link
Author

wesleyscholl commented Jun 28, 2024

This issue has been fixed in >= v3.5.5.

ArgoMultiStepFailRetrySucccess720.mov

@agilgur5 agilgur5 added the solution/superseded This PR or issue has been superseded by another one (slightly different from a duplicate) label Jun 28, 2024
@agilgur5
Copy link
Contributor

agilgur5 commented Jun 28, 2024

Superseded by #12620 based on this issue comment. Thanks all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Controller issues, panics area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries area/templates/http solution/superseded This PR or issue has been superseded by another one (slightly different from a duplicate)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HTTP Template Manual Retry Hangs After First Retry Failure
4 participants