Skip to content
This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Implement workflow execution recovery #290

Merged
merged 16 commits into from
Jul 20, 2021
Merged

Implement workflow execution recovery #290

merged 16 commits into from
Jul 20, 2021

Conversation

katrogan
Copy link
Contributor

@katrogan katrogan commented Jul 13, 2021

TL;DR

Implement workflow execution recovery. For ordinary task nodes this modifies pre-execute to attempt to recover node executions when the overall workflow execution is run in recovery mode, helping elide needless re-computation for previously succeeded node executions.

For workflow nodes (that is, those that call out to create an execution from a launch plan) this sets the triggered executions to recover from the original node execution which created an execution using the same launch plan.

For dynamic nodes (that is those that contain a subworkflow) the behavior remains the same as task nodes. If the original dynamic node succeeded, great, we'll recover it. Otherwise the dynamic workflow runs again.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

How did you fix the bug, make the feature etc. Link to any design docs etc

Tracking Issue

flyteorg/flyte#1151

Follow-up issue

NA

katrogan added 3 commits June 25, 2021 14:25
Signed-off-by: Katrina Rogan <[email protected]>
Signed-off-by: Katrina Rogan <[email protected]>
Signed-off-by: Katrina Rogan <[email protected]>
@katrogan katrogan requested review from EngHabu and kumare3 as code owners July 13, 2021 21:51
@@ -92,6 +94,10 @@ func (p NodePhase) String() string {
return "RetryableFailure"
case NodePhaseDynamicRunning:
return "DynamicRunning"
case NodePhaseRecovering:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have to do op_code_generate

katrogan added 4 commits July 14, 2021 10:26
Signed-off-by: Katrina Rogan <[email protected]>
Signed-off-by: Katrina Rogan <[email protected]>
Signed-off-by: Katrina Rogan <[email protected]>
Signed-off-by: Katrina Rogan <[email protected]>
@katrogan katrogan changed the title wip: node execution recovery Implement workflow execution recovery Jul 15, 2021
Signed-off-by: Katrina Rogan <[email protected]>
@codecov
Copy link

codecov bot commented Jul 15, 2021

Codecov Report

Merging #290 (95fcbf8) into master (9d0d964) will decrease coverage by 0.27%.
The diff coverage is 44.07%.

@@ -193,6 +195,42 @@ func TestAdminLaunchPlanExecutor_Launch(t *testing.T) {
assert.NoError(t, err)
})

t.Run("happy recover", func(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if recover fails?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to call create execution when applicable

} else {
logger.Debugf(ctx, "No outputs found for recovered node [%+v]", nCtx.NodeExecutionMetadata().GetNodeExecutionID())
}
outputFile := v1alpha1.GetOutputsFile(nCtx.NodeStatus().GetOutputDir())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohh no are we still using this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kumare3 what should i be using instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kumare3 this doesn't include output file paths though. should i update it?

@@ -429,14 +554,19 @@ func (c *nodeExecutor) handleQueuedOrRunningNode(ctx context.Context, nCtx *node
np = v1alpha1.NodePhaseSucceeded
finalStatus = executors.NodeStatusSuccess
}
if np == v1alpha1.NodePhaseRecovering && !h.FinalizeRequired() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it is recovering we do not even care about the handler or finalization right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need Recovering?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@kumare3
Copy link
Contributor

kumare3 commented Jul 16, 2021

Its amazing how clean this change is. It really fits in well into the premise?

// The admin client might not be initialized if EnableAdminLauncher is set to False.
if adminClient == nil {
var err error
adminClient, err = getAdminClient(ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we now always want an admin client, let's remove the condition and just always initialize it and use it up there if needed and down here all the time...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -17,7 +21,8 @@ import (
)

type launchPlanHandler struct {
launchPlan launchplan.Executor
launchPlan launchplan.Executor
recoveryClient recovery.Client
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused about why do we need the recovery client at this layer... won't that be handled at the Node executor layer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the reason that we want to call the recover execution endpoint instead in case the child workflow node has failed, so that we can recover a partially failed child workflow?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I thought.. but looking at the interface and implementation of recovery client... it retrieves Node executions from admin... I think... I obviously could be missing something since I'm half looking at this and half prepping my bags :-D

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we retrieve the node execution - which has target metadata of type workflow node, which we can use to fetch the originally-launched child execution

katrogan added 4 commits July 19, 2021 12:35
Signed-off-by: Katrina Rogan <[email protected]>
Signed-off-by: Katrina Rogan <[email protected]>
Signed-off-by: Katrina Rogan <[email protected]>
Signed-off-by: Katrina Rogan <[email protected]>
@katrogan
Copy link
Contributor Author

PTAL @kumare3

Signed-off-by: Katrina Rogan <[email protected]>
@kumare3 kumare3 self-requested a review July 20, 2021 21:06
@katrogan katrogan merged commit 59069d2 into master Jul 20, 2021
eapolinario pushed a commit to eapolinario/flytepropeller that referenced this pull request Aug 9, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants