Implement workflow execution recovery #290

katrogan · 2021-07-13T21:51:18Z

TL;DR

Implement workflow execution recovery. For ordinary task nodes this modifies pre-execute to attempt to recover node executions when the overall workflow execution is run in recovery mode, helping elide needless re-computation for previously succeeded node executions.

For workflow nodes (that is, those that call out to create an execution from a launch plan) this sets the triggered executions to recover from the original node execution which created an execution using the same launch plan.

For dynamic nodes (that is, those that contain a subworkflow) the behavior remains the same as for task nodes. If the original dynamic node succeeded, great, we'll recover it. Otherwise the dynamic workflow runs again.
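The task-node rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Phase`, `PriorExecution`, `lookupFn`, and `preExecute` are hypothetical, simplified stand-ins, and the real pre-execute step works against propeller's node context and the admin service rather than an in-memory map.

```go
package main

import "fmt"

// Phase is a hypothetical, simplified stand-in for a node execution phase.
type Phase int

const (
	PhaseUnknown Phase = iota
	PhaseSucceeded
	PhaseFailed
)

// PriorExecution is a hypothetical record of a node execution from the
// original workflow execution that is being recovered.
type PriorExecution struct {
	Phase   Phase
	Outputs map[string]string
}

// lookupFn is an assumed lookup from node ID to its prior execution, if any.
type lookupFn func(nodeID string) (PriorExecution, bool)

// preExecute returns the prior outputs and true when the node can be skipped.
// It returns false when the node has to run as usual: recovery mode is off,
// no prior execution exists, or the prior execution did not succeed.
func preExecute(recoveryMode bool, nodeID string, lookup lookupFn) (map[string]string, bool) {
	if !recoveryMode {
		return nil, false
	}
	prior, ok := lookup(nodeID)
	if !ok || prior.Phase != PhaseSucceeded {
		return nil, false
	}
	return prior.Outputs, true
}

func main() {
	// Made-up prior state: n0 succeeded, n1 failed, n2 never ran.
	prior := map[string]PriorExecution{
		"n0": {Phase: PhaseSucceeded, Outputs: map[string]string{"o0": "value-from-original-run"}},
		"n1": {Phase: PhaseFailed},
	}
	lookup := func(id string) (PriorExecution, bool) {
		p, ok := prior[id]
		return p, ok
	}
	for _, id := range []string{"n0", "n1", "n2"} {
		_, recovered := preExecute(true, id, lookup)
		fmt.Printf("%s recovered=%v\n", id, recovered)
	}
}
```

In this sketch only `n0` is recovered; `n1` and `n2` fall through to normal execution, which mirrors the "otherwise it runs again" behavior described for task and dynamic nodes.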

Type

Bug Fix
Feature
Plugin

Are all requirements met?

Complete description

How did you fix the bug, make the feature etc. Link to any design docs etc

Tracking Issue

flyteorg/flyte#1151

Follow-up issue

NA

Signed-off-by: Katrina Rogan <[email protected]>

kumare3 · 2021-07-13T21:57:38Z

pkg/apis/flyteworkflow/v1alpha1/iface.go

@@ -92,6 +94,10 @@ func (p NodePhase) String() string {
 		return "RetryableFailure"
 	case NodePhaseDynamicRunning:
 		return "DynamicRunning"
+	case NodePhaseRecovering:


you have to do op_code_generate

codecov · 2021-07-15T21:50:32Z

Codecov Report

Merging #290 (95fcbf8) into master (9d0d964) will decrease coverage by 0.27%.
The diff coverage is 44.07%.

kumare3 · 2021-07-16T23:51:33Z

pkg/controller/nodes/subworkflow/launchplan/admin_test.go

@@ -193,6 +195,42 @@ func TestAdminLaunchPlanExecutor_Launch(t *testing.T) {
 		assert.NoError(t, err)
 	})

+	t.Run("happy recover", func(t *testing.T) {


what if recover fails?

updated to call create execution when applicable
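One way to picture the fallback discussed in this thread is the sketch below. It is an assumption-laden illustration rather than the launch plan executor from this PR: `executionClient`, `errNotRecoverable`, `launch`, and `fakeClient` are hypothetical names, and the exact condition under which the real code falls back to creating an execution ("when applicable") is not spelled out in the conversation.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotRecoverable is a hypothetical sentinel meaning that no prior
// execution is available to recover from.
var errNotRecoverable = errors.New("execution not recoverable")

// executionClient is an assumed, minimal interface for this sketch.
// It is not the flyteadmin client API.
type executionClient interface {
	Recover(name string) error
	Create(name string) error
}

// launch tries to recover first when recoveryMode is set and falls back to a
// fresh execution only if recovery reports errNotRecoverable. Any other
// recovery error is returned to the caller.
func launch(c executionClient, name string, recoveryMode bool) (string, error) {
	if recoveryMode {
		err := c.Recover(name)
		if err == nil {
			return "recovered", nil
		}
		if !errors.Is(err, errNotRecoverable) {
			return "", fmt.Errorf("recover %q: %w", name, err)
		}
	}
	if err := c.Create(name); err != nil {
		return "", fmt.Errorf("create %q: %w", name, err)
	}
	return "created", nil
}

// fakeClient is a test double that records which executions were created.
type fakeClient struct {
	recoverErr error
	created    []string
}

func (f *fakeClient) Recover(string) error { return f.recoverErr }

func (f *fakeClient) Create(name string) error {
	f.created = append(f.created, name)
	return nil
}

func main() {
	c := &fakeClient{recoverErr: errNotRecoverable}
	how, err := launch(c, "child-exec", true)
	fmt.Println(how, err)
}
```

A "recovery fails" test case like the one requested here would assert that `Create` is called exactly once when `Recover` returns the sentinel error, and not at all when recovery succeeds.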

kumare3 · 2021-07-16T23:54:25Z

pkg/controller/nodes/executor.go

+	} else {
+		logger.Debugf(ctx, "No outputs found for recovered node [%+v]", nCtx.NodeExecutionMetadata().GetNodeExecutionID())
+	}
+	outputFile := v1alpha1.GetOutputsFile(nCtx.NodeStatus().GetOutputDir())


ohh no are we still using this?

@kumare3 what should i be using instead?

i think we use this now - https://github.com/flyteorg/flyteplugins/blob/60b94c688ef2b98aa53a9224b529ac672af04540/go/tasks/pluginmachinery/ioutils/paths.go

@kumare3 this doesn't include output file paths though. should i update it?

kumare3 · 2021-07-16T23:57:21Z

pkg/controller/nodes/executor.go

@@ -429,14 +554,19 @@ func (c *nodeExecutor) handleQueuedOrRunningNode(ctx context.Context, nCtx *node
 		np = v1alpha1.NodePhaseSucceeded
 		finalStatus = executors.NodeStatusSuccess
 	}
+	if np == v1alpha1.NodePhaseRecovering && !h.FinalizeRequired() {


if it is recovering we do not even care about the handler or finalization right?

do we really need Recovering?

kumare3 · 2021-07-16T23:59:43Z

It's amazing how clean this change is. It really fits well into the premise.

EngHabu · 2021-07-17T15:35:24Z

pkg/controller/controller.go

+	// The admin client might not be initialized if EnableAdminLauncher is set to False.
+	if adminClient == nil {
+		var err error
+		adminClient, err = getAdminClient(ctx)


If we now always want an admin client, let's remove the condition and just always initialize it and use it up there if needed and down here all the time...

EngHabu · 2021-07-17T15:39:41Z

pkg/controller/nodes/subworkflow/launchplan.go

@@ -17,7 +21,8 @@ import (
 )

 type launchPlanHandler struct {
-	launchPlan launchplan.Executor
+	launchPlan     launchplan.Executor
+	recoveryClient recovery.Client


I'm a bit confused about why we need the recovery client at this layer... won't that be handled at the Node executor layer?

Isn't the reason that we want to call the recover execution endpoint instead in case the child workflow node has failed, so that we can recover a partially failed child workflow?

That's what I thought.. but looking at the interface and implementation of recovery client... it retrieves Node executions from admin... I think... I obviously could be missing something since I'm half looking at this and half prepping my bags :-D

yes, we retrieve the node execution - which has target metadata of type workflow node, which we can use to fetch the originally-launched child execution
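The lookup described in this reply can be sketched as below. The types are simplified, hypothetical stand-ins (they are not the flyteidl messages), and `childExecutionOf` is an illustrative helper name: the point is only that a recovered node execution carrying workflow-node metadata leads to the child execution that was originally launched.

```go
package main

import (
	"errors"
	"fmt"
)

// ExecutionID is a simplified, hypothetical workflow execution identifier.
type ExecutionID struct {
	Project, Domain, Name string
}

// WorkflowNodeMetadata is an assumed stand-in for the target metadata attached
// to a node execution that launched a child execution from a launch plan.
type WorkflowNodeMetadata struct {
	ChildExecution ExecutionID
}

// NodeExecution is a hypothetical, trimmed-down view of a node execution.
// WorkflowNode is nil for nodes that did not launch a child execution.
type NodeExecution struct {
	NodeID       string
	WorkflowNode *WorkflowNodeMetadata
}

// childExecutionOf returns the originally launched child execution for a
// workflow node, or an error when the node carries no such metadata.
func childExecutionOf(ne NodeExecution) (ExecutionID, error) {
	if ne.WorkflowNode == nil {
		return ExecutionID{}, errors.New("node execution has no workflow node metadata")
	}
	return ne.WorkflowNode.ChildExecution, nil
}

func main() {
	ne := NodeExecution{
		NodeID: "n0",
		WorkflowNode: &WorkflowNodeMetadata{
			ChildExecution: ExecutionID{Project: "proj", Domain: "dev", Name: "child-exec"},
		},
	}
	child, err := childExecutionOf(ne)
	if err != nil {
		fmt.Println("cannot recover child:", err)
		return
	}
	fmt.Println("recover from child execution:", child.Name)
}
```

With the child execution identifier in hand, the launch plan handler can ask for that execution to be recovered instead of starting a new one from scratch.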

katrogan · 2021-07-19T20:54:08Z

PTAL @kumare3

katrogan added 3 commits June 25, 2021 14:25

wip

3a5970e

Signed-off-by: Katrina Rogan <[email protected]>

conflicts

c5ce2f7

Signed-off-by: Katrina Rogan <[email protected]>

wip

5c83167

Signed-off-by: Katrina Rogan <[email protected]>

katrogan requested review from EngHabu and kumare3 as code owners July 13, 2021 21:51

kumare3 reviewed Jul 13, 2021

katrogan added 4 commits July 14, 2021 10:26

gen

ee5096c

Signed-off-by: Katrina Rogan <[email protected]>

conflicts

2584e9e

Signed-off-by: Katrina Rogan <[email protected]>

gen

37b9389

Signed-off-by: Katrina Rogan <[email protected]>

try, try again

bd81dde

Signed-off-by: Katrina Rogan <[email protected]>

katrogan changed the title ~~wip: node execution recovery~~ Implement workflow execution recovery Jul 15, 2021

lint

2260e06

Signed-off-by: Katrina Rogan <[email protected]>

katrogan added 3 commits July 16, 2021 11:45

proto changes

ff8e156

Signed-off-by: Katrina Rogan <[email protected]>

idl release version

06de8a8

Signed-off-by: Katrina Rogan <[email protected]>

Merge branch 'master' into recovery

d1bea21

kumare3 reviewed Jul 16, 2021

EngHabu reviewed Jul 17, 2021

katrogan added 4 commits July 19, 2021 12:35

wip

3354c82

Signed-off-by: Katrina Rogan <[email protected]>

mock

a6e3f98

Signed-off-by: Katrina Rogan <[email protected]>

sigh

7f117f4

Signed-off-by: Katrina Rogan <[email protected]>

recovery fails test case

36066f3

Signed-off-by: Katrina Rogan <[email protected]>

fix code

95fcbf8

Signed-off-by: Katrina Rogan <[email protected]>

kumare3 self-requested a review July 20, 2021 21:06

kumare3 approved these changes Jul 20, 2021

katrogan merged commit 59069d2 into master Jul 20, 2021

eapolinario pushed a commit to eapolinario/flytepropeller that referenced this pull request Aug 9, 2023

Implement workflow execution recovery (flyteorg#290)

72525a2
