[Core Feature] Introduce a Recovery mode for executions #1151

Closed
EngHabu opened this issue Jun 15, 2021 · 2 comments

EngHabu commented Jun 15, 2021

Motivation: Why do you think this is important?
Despite the robustness of the platform, infrastructure failures are inevitable. When they occur, it's crucial that in-flight executions that failed as a result can recover.

There are a few things to consider:

  1. Successful nodes should not be run again regardless of their cacheable setting to avoid undesired behavior.
  2. Users should not have to modify their code in order to enter recovery mode.

Goal: What should the final outcome look like, ideally?
Ideally, Flyte should offer a few knobs to control such behavior:

  1. A control plane command to recover executions that failed due to system errors during a specific time period.
  2. A UI button that allows users to Rerun any failed execution in recovery mode.
  3. An Admin API to programmatically trigger a Rerun in recovery mode.
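
A minimal sketch of the selection step behind the first knob: picking the executions that failed due to system errors within a time period. The `Execution` record, its `error_kind` field, and the `select_recoverable` helper are hypothetical names for illustration only, not Flyte APIs.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class Execution:
    # Hypothetical record; real execution metadata would come from FlyteAdmin.
    name: str
    phase: str        # assumed values such as "SUCCEEDED" or "FAILED"
    error_kind: str   # assumed values: "SYSTEM", "USER", or "" when no error
    ended_at: datetime


def select_recoverable(executions: Iterable[Execution],
                       start: datetime, end: datetime) -> List[Execution]:
    """Return executions that failed with a system error within [start, end)."""
    return [
        e for e in executions
        if e.phase == "FAILED"
        and e.error_kind == "SYSTEM"
        and start <= e.ended_at < end
    ]
```

User-caused failures are filtered out on purpose: rerunning them in recovery mode would most likely fail the same way.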

I think the work can be staged to ship/test incremental changes:

  1. A FlytePropeller change to add a field on the Workflow CRD indicating that the workflow should recover from an existing execution. The field should hold the full identifier of the referenced execution.
  2. During execution in FlytePropeller, before executing a node in recovery mode, it should call Admin (abstracted behind an interface similar to DataCatalog) to check whether the same node already has its outputs populated in the prior execution. If it does, it should send an event to Admin to indicate success as it normally would.
  3. If the node does not have outputs yet in Admin, it should start to execute normally.
  4. If the node is a workflow/LP node, it gets a bit interesting: propeller needs to compute/retrieve the execution that was previously launched from that node and propagate the recovery information down.
  5. The FlyteAdmin Launch API needs a way to indicate that the execution should recover from a particular prior, terminated execution.
  6. We can then add a flag to flytectl: flytectl create execution --recoverFromExecutionId=<id of an execution in the same project/domain>
  7. We can then script (or maybe add a flytectl command for) something that, given a time range, retrieves all system-failed executions and invokes recover on them.
  8. Somewhere along the way, we can add a button to flyteconsole to Recover a failed execution.
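
Steps 2 and 3 above can be sketched roughly as follows. This is an illustration under stated assumptions, not FlytePropeller code: `PriorExecutionLookup`, `InMemoryLookup`, `run_node`, and the event callback are all hypothetical names standing in for the admin-backed interface the list describes.

```python
from typing import Callable, Dict, Optional, Protocol

Outputs = Dict[str, object]


class PriorExecutionLookup(Protocol):
    """Hypothetical abstraction over admin, similar in spirit to a catalog client."""

    def get_node_outputs(self, execution_id: str, node_id: str) -> Optional[Outputs]:
        ...


class InMemoryLookup:
    """Stand-in for an admin-backed lookup, used here only for illustration."""

    def __init__(self, outputs: Dict[tuple, Outputs]):
        self._outputs = outputs  # keyed by (execution_id, node_id)

    def get_node_outputs(self, execution_id: str, node_id: str) -> Optional[Outputs]:
        return self._outputs.get((execution_id, node_id))


def run_node(node_id: str,
             recover_from: Optional[str],
             lookup: PriorExecutionLookup,
             execute: Callable[[str], Outputs],
             emit_event: Callable[[str, str], None]) -> Outputs:
    """Reuse prior outputs in recovery mode, otherwise execute the node normally."""
    if recover_from is not None:
        prior = lookup.get_node_outputs(recover_from, node_id)
        if prior is not None:
            # Report success as if the node had run, without running it.
            emit_event(node_id, "SUCCEEDED")
            return prior
    outputs = execute(node_id)
    emit_event(node_id, "SUCCEEDED")
    return outputs
```

Keeping the lookup behind an interface means the node executor does not need to know whether the prior results come from admin or from some other store.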

Open questions:

  1. How to handle Dynamic Tasks/WF?
    I don't think we need to do anything specific for these. If the node is not done, the generation will run again, and if it happens to generate the same futures spec, it'll follow the same logic above to run...

  2. Execution URN (yes, I'm bringing this up again): when we discussed flytectl create execution --recoverFrom=<execution urn>, we got stuck because we don't have a way to serialize the identifier in a human-readable form. Should we solve that now? Maybe just restrict it to executions within the same project/domain...
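
The "same project/domain" restriction could look something like the sketch below, which sidesteps the serialization question by accepting only a bare execution name. `ExecutionId` and `resolve_recover_from` are hypothetical, and the rejection of `/` and `:` is an assumption made for this example rather than an agreed URN format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionId:
    # Hypothetical stand-in for a full execution identifier.
    project: str
    domain: str
    name: str


def resolve_recover_from(value: str, project: str, domain: str) -> ExecutionId:
    """Interpret a bare execution name relative to the current project/domain."""
    # Assumption: anything that looks like a qualified identifier is rejected,
    # since no human-readable serialization has been agreed on.
    if not value or "/" in value or ":" in value:
        raise ValueError("expected a bare execution name in the same project/domain")
    return ExecutionId(project=project, domain=domain, name=value)
```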

@EngHabu EngHabu added enhancement New feature or request untriaged This issues has not yet been looked at by the Maintainers labels Jun 15, 2021
@EngHabu EngHabu added this to the 0.15.0 milestone Jun 15, 2021
@katrogan katrogan self-assigned this Jun 15, 2021
katrogan commented

@EngHabu A few questions on recovered nodes:

  1. Is checking for outputs necessarily the correct way to ascertain prior successful completion? What happens with nodes with no outputs but side effects? Should we consider a terminal 'SUCCESS' status instead? (Or is the answer simply that the output URI will never be empty in this scenario?)
  2. For previously run nodes in the recovery execution, do we want to copy these outputs (in the db and/or on s3) for visualization/interaction/debugging?

EngHabu commented Jun 15, 2021

  1. You are right, I misspoke. There should not be any assumption about the existence of an output URL, etc. So yes, checking whether the node has succeeded is sufficient. Moreover, if the node has produced outputs, they should be replicated in the current execution as the outputs of this node, right?
  2. Yes, for all intents and purposes, they should look like they ran again...
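
The revised rule from this exchange (decide on the prior node's terminal success, and replicate outputs only if there are any) might be sketched like this. `NodeRecord` and `recover_node` are hypothetical names, and the `"SUCCEEDED"` phase string is an assumption for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class NodeRecord:
    # Hypothetical view of a node execution as stored by admin.
    phase: str
    outputs: Optional[Dict[str, object]] = None


def recover_node(prior: Optional[NodeRecord]) -> Optional[NodeRecord]:
    """Return a record for the current execution if the node can be recovered, else None."""
    if prior is None or prior.phase != "SUCCEEDED":
        return None  # nothing to recover; the node should execute normally
    # Replicate outputs, if any, so the recovered node looks like it ran again.
    copied = dict(prior.outputs) if prior.outputs is not None else None
    return NodeRecord(phase="SUCCEEDED", outputs=copied)
```

A node that succeeded with side effects but no outputs is still recovered here, which is the case the output-based check would have missed.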
