This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Abort subworkflow on subnode failure #468

Merged · hamersaw merged 7 commits into master from bug/subworkflow-abort on Sep 8, 2022

Conversation

hamersaw
Contributor

@hamersaw hamersaw commented Aug 9, 2022

TL;DR

Currently, if an internal node in a subworkflow fails, the parent workflow aborts. However, this abort does not traverse into the subworkflow to abort other nodes that may still be running. As a result, Flyte Pods may be left orphaned; this is especially problematic when finalizers are set, because Flyte never cleans up those Pods.
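The practical impact is that a Pod carrying a finalizer can never leave the Terminating state until something removes that finalizer. Below is a minimal sketch (not FlytePropeller code; the finalizer name and client-go wiring are assumptions for illustration) of the kind of cleanup an abort is expected to trigger for each still-running pod:

```go
// Minimal sketch of finalizer cleanup on abort. The finalizer name and the
// in-cluster config are placeholders, not FlytePropeller's actual values.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// placeholderFinalizer is an assumed name used only for this example.
const placeholderFinalizer = "example.com/flyte-cleanup"

// removeFinalizer strips the finalizer from a pod so Kubernetes can finish
// deleting it. Without this step, an aborted-but-never-visited pod stays in
// Terminating forever, which is the orphaning described above.
func removeFinalizer(ctx context.Context, cs kubernetes.Interface, namespace, name string) error {
	pod, err := cs.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	kept := pod.Finalizers[:0]
	for _, f := range pod.Finalizers {
		if f != placeholderFinalizer {
			kept = append(kept, f)
		}
	}
	pod.Finalizers = kept
	_, err = cs.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	if err := removeFinalizer(context.Background(), cs, "flyte", "orphaned-pod"); err != nil {
		log.Fatal(err)
	}
}
```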

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

We updated the failure handling in subworkflows to match the current dynamic task implementation. If an internal node (within the subworkflow) fails, we mark the WorkflowNodeState as failing but keep the NodeState as running. In the next evaluation round, FlytePropeller processes the failing subworkflow by first aborting it and then processing the failure node (if one exists). Errors are propagated up the workflow identically for failing subworkflows and dynamic tasks.
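A minimal sketch of this two-round handling, using placeholder types rather than FlytePropeller's real WorkflowNodeState/NodeState structures: the first round only records the failing phase, and the following round aborts whatever is still running before the failure node (if any) is processed and the error is surfaced to the parent.

```go
// Simplified model of the failing-subworkflow handling; names and types are
// illustrative placeholders, not FlytePropeller APIs.
package main

import (
	"errors"
	"fmt"
)

type phase int

const (
	phaseRunning phase = iota
	phaseFailing // a subnode failed; remaining nodes still need to be aborted
	phaseFailed
)

type subWorkflow struct {
	phase        phase
	runningNodes []string
	failureNode  string // optional on-failure node
	err          error
}

// handleRound models one FlytePropeller evaluation round over a subworkflow.
func handleRound(sw *subWorkflow, subnodeErr error) {
	switch sw.phase {
	case phaseRunning:
		if subnodeErr != nil {
			// Record the failure and mark the subworkflow as failing, but
			// leave the parent node "running" so the next round can clean up.
			sw.err = subnodeErr
			sw.phase = phaseFailing
		}
	case phaseFailing:
		// First abort every node that is still running so no pods are orphaned.
		for _, n := range sw.runningNodes {
			fmt.Printf("aborting node %s\n", n)
		}
		sw.runningNodes = nil
		// Then run the failure node, if one exists, and surface the original
		// error to the parent workflow.
		if sw.failureNode != "" {
			fmt.Printf("executing failure node %s\n", sw.failureNode)
		}
		sw.phase = phaseFailed
	}
}

func main() {
	sw := &subWorkflow{phase: phaseRunning, runningNodes: []string{"n1", "n2"}}
	handleRound(sw, errors.New("subnode n0 failed")) // round 1: mark failing
	handleRound(sw, nil)                             // round 2: abort + failure node
	fmt.Println("final phase:", sw.phase, "error:", sw.err)
}
```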

Tracking Issue

fixes flyteorg/flyte#2533
fixes flyteorg/flyte#2574

Follow-up issue

NA

@codecov

codecov bot commented Aug 9, 2022

Codecov Report

Merging #468 (0ef221c) into master (aa3e9a6) will decrease coverage by 0.03%.
The diff coverage is 0.00%.

@kumare3
Contributor

kumare3 commented Aug 9, 2022

Ohh man this is a very good catch

@honnix
Member

honnix commented Aug 9, 2022

This is a very good finding! Unfortunately I do not have sufficient knowledge to do a constructive review, but I'm looking forward to testing it after it is merged.

@ckiosidis
Contributor

ckiosidis commented Aug 9, 2022

Hey @hamersaw, I confirmed this by checking some stuck Pods currently in our cluster.
The executions contain subworkflows.
The pods are indeed orphans: the flyteworkflow k8s resources are no longer in the cluster and the executions finished with errors.

@honnix
Member

honnix commented Sep 1, 2022

Shall we get this going? It would be great to get this fix so we don't need to have a hacky way to clean up terminating pods. Thanks.

@hamersaw hamersaw merged commit 8b8e5a8 into master Sep 8, 2022
@hamersaw hamersaw deleted the bug/subworkflow-abort branch September 8, 2022 13:44
eapolinario pushed a commit to eapolinario/flytepropeller that referenced this pull request Aug 9, 2023
* using 'failing' state to handle subworkflow aborts

Signed-off-by: Daniel Rammer <[email protected]>

* propagating node failure in subworkflow to subworkflow failure message in UI

Signed-off-by: Daniel Rammer <[email protected]>

* working with other failure scenarios

Signed-off-by: Daniel Rammer <[email protected]>

* fixed lint issue

Signed-off-by: Daniel Rammer <[email protected]>

* updated error message to match

Signed-off-by: Daniel Rammer <[email protected]>

Signed-off-by: Daniel Rammer <[email protected]>
Development

Successfully merging this pull request may close these issues.

  • [BUG] Subtasks status is not updated on abort
  • [BUG] Pods stuck on Terminating with finalizer