[Core Feature] Introduce a `Recovery` mode for executions #1151

EngHabu · 2021-06-15T20:10:11Z

Motivation: Why do you think this is important?
Despite the robustness of the platform, infrastructure failures are inevitable. When they occur, it's crucial that in-flight executions that failed because of them are able to recover.

There are a few things to consider:

Successful nodes should not be run again, regardless of their cacheable setting, to avoid undesired behavior.
Users should not have to modify their code in order to enter recovery mode.

Goal: What should the final outcome look like, ideally?
Ideally, Flyte should offer a few knobs to control such behavior:

A control plane command to recover executions that failed due to system errors during a specific time period.
A UI button that allows users to Rerun any failed execution in recovery mode.
An Admin API to trigger Rerun in recovery mode programmatically.

I think the work can be staged to ship/test incremental changes:

A FlytePropeller change that adds a field to the Workflow CRD to indicate that the workflow should recover from an existing execution. The field should hold the full identifier of the referenced execution.
During execution in FlytePropeller, before executing a node in recovery mode, propeller should call admin (abstracted behind an interface similar to DataCatalog) to check whether the same node already has its outputs populated in the prior execution. If it does, propeller should send an event to admin to indicate success as it normally would.
If the node does not have outputs yet in Admin, it should start to execute normally.
If the node is a workflow/LP node, it gets a bit interesting: propeller needs to compute/retrieve the execution that was previously launched from that node and propagate the recovery information down.
The FlyteAdmin Launch API needs a way to indicate that the execution should recover from a particular, prior, terminated execution.
We can then add a flag to flytectl: flytectl create execution --recoverFromExecutionId=<id of an execution in the same project/domain>
We can then write a script (or maybe add a flytectl command) that, given a time range, retrieves all system-failed executions and invokes recover on them.
Somewhere along the way, we can add a button to flyteconsole to Recover a failed execution.

Open questions:

How to handle Dynamic Tasks/WF?
I don't think we need to do anything specific for these. If the node is not done, the generation will run again and if it happened to generate the same futures spec, it'll follow the same logic above to run...
Execution URN (yes, I'm bringing this up again): when we discussed flytectl create execution --recoverFrom=<execution urn>, we got stuck because we don't have a way to serialize the identifier in a human-readable form. Should we solve that now? Maybe just restrict it to executions within the same project/domain.
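If the restriction to the same project/domain were adopted, the CLI would only need the prior execution's name, and the rest of the identifier could be inherited. A minimal sketch of that idea, assuming a three-part identifier; `ExecutionID` and `ResolveRecoverFrom` are hypothetical names for illustration, not existing Flyte types:

```go
package main

import (
	"errors"
	"fmt"
)

// ExecutionID is a hypothetical three-part execution identifier.
type ExecutionID struct {
	Project string
	Domain  string
	Name    string
}

// ResolveRecoverFrom builds the identifier of the prior execution by
// reusing the project and domain of the execution being created.
func ResolveRecoverFrom(current ExecutionID, priorName string) (ExecutionID, error) {
	if priorName == "" {
		return ExecutionID{}, errors.New("prior execution name must not be empty")
	}
	return ExecutionID{
		Project: current.Project,
		Domain:  current.Domain,
		Name:    priorName,
	}, nil
}

func main() {
	current := ExecutionID{Project: "demo", Domain: "development", Name: "new-run"}
	prior, err := ResolveRecoverFrom(current, "failed-run")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", prior)
}
```

This sidesteps the serialization question rather than answering it: cross-project recovery would still need a full, human-readable identifier format.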


katrogan · 2021-06-15T20:51:04Z

@EngHabu A few questions on recovered nodes:

Is checking for outputs necessarily the correct way to ascertain prior successful completion? What happens with nodes that have no outputs but do have side effects? Should we consider a terminal 'SUCCESS' status instead? (Or is the answer simply that the output URI will never be empty in this scenario?)
For previously run nodes in the recovery execution, do we want to copy these outputs (in the db and/or on s3) for visualization/interaction/debugging?

EngHabu · 2021-06-15T20:56:37Z

You are right, I misspoke. There should not be any assumption about the existence of an output URL, etc. So yes, checking whether the node has succeeded is sufficient. Moreover, if the node has produced outputs, they should be replicated in the current execution as the outputs of this node, right?
Yes, for all intents and purposes, they should look like they ran again.
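The rule this exchange lands on (recover a node when its prior run succeeded, and copy its outputs only if it produced any) can be sketched as below. Everything here is hypothetical and for illustration only: `RecoveryClient`, `PriorNode`, `DecideNode`, and the in-memory `fakeClient` are assumed names, not FlytePropeller or FlyteAdmin APIs.

```go
package main

import "fmt"

// Phase is a hypothetical terminal phase of a node execution.
type Phase int

const (
	PhaseUnknown Phase = iota
	PhaseSucceeded
	PhaseFailed
)

// PriorNode describes a node as it ended in the prior execution.
type PriorNode struct {
	Phase   Phase
	Outputs map[string]string // nil when the node produced no outputs
}

// RecoveryClient is a hypothetical abstraction over the admin lookup.
type RecoveryClient interface {
	LookupNode(nodeID string) (PriorNode, bool)
}

// Decision says whether a node is recovered and which outputs to replicate.
type Decision struct {
	Recovered bool
	Outputs   map[string]string
}

// DecideNode recovers a node only if it succeeded before. Outputs are
// optional, so a side-effect-only node is still treated as recovered.
func DecideNode(c RecoveryClient, nodeID string) Decision {
	prior, found := c.LookupNode(nodeID)
	if !found || prior.Phase != PhaseSucceeded {
		return Decision{} // not recoverable: execute normally
	}
	var outputs map[string]string
	if prior.Outputs != nil {
		// Copy so the current execution owns its own outputs.
		outputs = make(map[string]string, len(prior.Outputs))
		for k, v := range prior.Outputs {
			outputs[k] = v
		}
	}
	return Decision{Recovered: true, Outputs: outputs}
}

// fakeClient is an in-memory stand-in for the admin lookup.
type fakeClient map[string]PriorNode

func (f fakeClient) LookupNode(nodeID string) (PriorNode, bool) {
	p, ok := f[nodeID]
	return p, ok
}

func main() {
	c := fakeClient{
		"n0": {Phase: PhaseSucceeded, Outputs: map[string]string{"o0": "s3://bucket/o0"}},
		"n1": {Phase: PhaseSucceeded}, // side effects only
		"n2": {Phase: PhaseFailed},
	}
	for _, id := range []string{"n0", "n1", "n2", "n3"} {
		d := DecideNode(c, id)
		fmt.Printf("%s recovered=%v outputs=%d\n", id, d.Recovered, len(d.Outputs))
	}
}
```

Keying the decision on the phase rather than on the presence of outputs is what lets the side-effect-only node `n1` be skipped on recovery.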

EngHabu added enhancement New feature or request untriaged This issues has not yet been looked at by the Maintainers labels Jun 15, 2021

EngHabu added this to the 0.15.0 milestone Jun 15, 2021

katrogan self-assigned this Jun 15, 2021

kumare3 removed the untriaged This issues has not yet been looked at by the Maintainers label Jul 2, 2021

kumare3 modified the milestones: 0.15.0, 0.16.0 Jul 2, 2021

EngHabu modified the milestones: 0.16.0, 0.17.0 Aug 2, 2021

This was referenced Aug 11, 2021

Fetch node interface from static workflow closure flyteorg/flytekit#583

Merged

Fix copy paste bug in recovering node exec inputs flyteorg/flytepropeller#303

Merged

katrogan mentioned this issue Sep 2, 2021

[Core Feature] Implement recovery for flytekit remote #1420

Open

katrogan closed this as completed Sep 2, 2021

sbrunk mentioned this issue Mar 14, 2022

[Core feature] Include Intratask Checkpoints in Recovery Mode #2254

Closed

2 tasks
