This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Abort subworkflow on subnode failure #468

Merged · hamersaw merged 7 commits into master from bug/subworkflow-abort on Sep 8, 2022

Conversation

hamersaw
Contributor

@hamersaw hamersaw commented Aug 9, 2022

TL;DR

Currently, if an internal node in a subworkflow fails, the parent workflow aborts. However, this abort does not traverse into the subworkflow to abort other nodes that may still be running. As a result, Flyte Pods may be left orphaned; this is especially problematic when finalizers are set, because Flyte never cleans up those Pods.
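The practical impact is that a Pod carrying a finalizer can never leave the Terminating state until something removes that finalizer. Below is a minimal sketch (not FlytePropeller code; the finalizer name and client-go wiring are assumptions for illustration) of the kind of cleanup an abort is expected to trigger for each still-running pod:

```go
// Minimal sketch of finalizer cleanup on abort. The finalizer name and the
// in-cluster config are placeholders, not FlytePropeller's actual values.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// placeholderFinalizer is an assumed name used only for this example.
const placeholderFinalizer = "example.com/flyte-cleanup"

// removeFinalizer strips the finalizer from a pod so Kubernetes can finish
// deleting it. Without this step, an aborted-but-never-visited pod stays in
// Terminating forever, which is the orphaning described above.
func removeFinalizer(ctx context.Context, cs kubernetes.Interface, namespace, name string) error {
	pod, err := cs.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	kept := pod.Finalizers[:0]
	for _, f := range pod.Finalizers {
		if f != placeholderFinalizer {
			kept = append(kept, f)
		}
	}
	pod.Finalizers = kept
	_, err = cs.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	if err := removeFinalizer(context.Background(), cs, "flyte", "orphaned-pod"); err != nil {
		log.Fatal(err)
	}
}
```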

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

We updated the failure handling in subworkflows to match the current dynamic task implementation. If an internal node (within the subworkflow) fails, we mark the WorkflowNodeState as failing but keep the NodeState as running. In the next evaluation round, FlytePropeller processes the failing subworkflow by first aborting it and then processing the failure node (if one exists). Errors are propagated up the workflow identically for failing subworkflows and dynamic tasks.
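A minimal sketch of this two-round handling, using placeholder types rather than FlytePropeller's real WorkflowNodeState/NodeState structures: the first round only records the failing phase, and the following round aborts whatever is still running before the failure node (if any) is processed and the error is surfaced to the parent.

```go
// Simplified model of the failing-subworkflow handling; names and types are
// illustrative placeholders, not FlytePropeller APIs.
package main

import (
	"errors"
	"fmt"
)

type phase int

const (
	phaseRunning phase = iota
	phaseFailing // a subnode failed; remaining nodes still need to be aborted
	phaseFailed
)

type subWorkflow struct {
	phase        phase
	runningNodes []string
	failureNode  string // optional on-failure node
	err          error
}

// handleRound models one FlytePropeller evaluation round over a subworkflow.
func handleRound(sw *subWorkflow, subnodeErr error) {
	switch sw.phase {
	case phaseRunning:
		if subnodeErr != nil {
			// Record the failure and mark the subworkflow as failing, but
			// leave the parent node "running" so the next round can clean up.
			sw.err = subnodeErr
			sw.phase = phaseFailing
		}
	case phaseFailing:
		// First abort every node that is still running so no pods are orphaned.
		for _, n := range sw.runningNodes {
			fmt.Printf("aborting node %s\n", n)
		}
		sw.runningNodes = nil
		// Then run the failure node, if one exists, and surface the original
		// error to the parent workflow.
		if sw.failureNode != "" {
			fmt.Printf("executing failure node %s\n", sw.failureNode)
		}
		sw.phase = phaseFailed
	}
}

func main() {
	sw := &subWorkflow{phase: phaseRunning, runningNodes: []string{"n1", "n2"}}
	handleRound(sw, errors.New("subnode n0 failed")) // round 1: mark failing
	handleRound(sw, nil)                             // round 2: abort + failure node
	fmt.Println("final phase:", sw.phase, "error:", sw.err)
}
```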

Tracking Issue

fixes flyteorg/flyte#2533
fixes flyteorg/flyte#2574

Follow-up issue

NA

@codecov

codecov bot commented Aug 9, 2022

Codecov Report

Merging #468 (0ef221c) into master (aa3e9a6) will decrease coverage by 0.03%.
The diff coverage is 0.00%.

@kumare3
Contributor

kumare3 commented Aug 9, 2022

Ohh man this is a very good catch

@honnix
Member

honnix commented Aug 9, 2022

This is a very good finding! Unfortunately I do not have sufficient knowledge to do a constructive review, but I'm looking forward to testing it after it is merged.

@ckiosidis
Contributor

ckiosidis commented Aug 9, 2022

Hey @hamersaw, I confirmed this by checking some stuck Pods currently in our cluster.
The executions contain subworkflows.
The pods are indeed orphans: the flyteworkflow k8s resources are no longer in the cluster and the executions finished with errors.

@honnix
Member

honnix commented Sep 1, 2022

Shall we get this going? It would be great to get this fix so we don't need to have a hacky way to clean up terminating pods. Thanks.

@hamersaw hamersaw merged commit 8b8e5a8 into master Sep 8, 2022
@hamersaw hamersaw deleted the bug/subworkflow-abort branch September 8, 2022 13:44
eapolinario pushed a commit to eapolinario/flytepropeller that referenced this pull request Aug 9, 2023
* using 'failing' state to handle subworkflow aborts

Signed-off-by: Daniel Rammer <[email protected]>

* propagating node failure in subworkflow to subworkflow failure message in UI

Signed-off-by: Daniel Rammer <[email protected]>

* working with other failure scenarios

Signed-off-by: Daniel Rammer <[email protected]>

* fixed lint issue

Signed-off-by: Daniel Rammer <[email protected]>

* updated error message to match

Signed-off-by: Daniel Rammer <[email protected]>

Signed-off-by: Daniel Rammer <[email protected]>
Development

Successfully merging this pull request may close these issues.

  • [BUG] Subtasks status is not updated on abort
  • [BUG] Pods stuck on Terminating with finalizer