[Core Feature] Introduce a Recovery mode for executions #1151
EngHabu added the enhancement and untriaged labels on Jun 15, 2021.
@EngHabu A few questions on recovered nodes:
kumare3 removed the untriaged label on Jul 2, 2021.
Motivation: Why do you think this is important?
Despite the robustness of the platform, infrastructure failures are inevitable. When they occur, it's crucial that in-flight executions that failed because of them are able to recover.
There are a few things to consider:
cacheable setting to avoid undesired behavior.
Goal: What should the final outcome look like, ideally?
Ideally, Flyte should offer a few knobs to control such behavior:
Rerun any failed execution in recovery mode.
Rerun in recovery mode.
I think the work can be staged to ship/test incremental changes:
flytectl create execution --recoverFromExecutionId=<id of an execution in the same project/domain>
recover on them.
Recover a failed execution.
Open questions:
How to handle Dynamic Tasks/WF?
I don't think we need to do anything specific for these. If the node is not done, the generation will run again, and if it happens to generate the same futures spec, it'll follow the same logic above to run...
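To make the dynamic-node reasoning concrete, here is a minimal Python sketch of the decision described above. It is not Flyte's implementation: `PriorNode`, `Action`, and `plan_dynamic_node` are hypothetical names, and the futures spec is assumed to be comparable as a serialized string.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Action(Enum):
    REUSE_OUTPUTS = "reuse_outputs"        # node finished before; copy its outputs
    RECOVER_CHILDREN = "recover_children"  # same futures spec; recover child nodes
    RUN_FRESH = "run_fresh"                # nothing comparable to recover from


@dataclass(frozen=True)
class PriorNode:
    """Hypothetical record of a dynamic node in the execution being recovered."""
    succeeded: bool
    futures_spec: Optional[str] = None  # assumed: serialized spec, if one was generated


def plan_dynamic_node(prior: Optional[PriorNode],
                      generate_spec: Callable[[], str]) -> Action:
    # A node that already completed needs no regeneration at all.
    if prior is not None and prior.succeeded:
        return Action.REUSE_OUTPUTS
    # The node was not done, so the generation step runs again.
    new_spec = generate_spec()
    # Only an identical spec lets the children follow the normal recovery path.
    if prior is not None and prior.futures_spec == new_spec:
        return Action.RECOVER_CHILDREN
    return Action.RUN_FRESH
```

Under these assumptions, a dynamic node is only special in that its spec has to be regenerated and compared before its children can be treated like any other recoverable nodes.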
Execution URN (yes, I'm bringing this up again)... when we discussed
flytectl create execution --recoverFrom=<execution urn>
we got stuck because we don't have a way to serialize the identifier in a human-readable form. Should we solve that now? Maybe we just restrict it to executions within the same project/domain...
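As one possible shape for the serialization question, the sketch below round-trips an execution identifier through a colon-separated string and falls back to a bare execution name within the caller's project/domain. The `ex:` prefix, the `ExecutionId` class, and every function name here are assumptions for illustration, not an agreed Flyte format.

```python
from dataclasses import dataclass

URN_PREFIX = "ex"  # hypothetical prefix, chosen only for this sketch


@dataclass(frozen=True)
class ExecutionId:
    project: str
    domain: str
    name: str


def to_urn(execution: ExecutionId) -> str:
    """Serialize as 'ex:<project>:<domain>:<name>'."""
    parts = (execution.project, execution.domain, execution.name)
    for part in parts:
        # A colon inside a component would make the string ambiguous to parse.
        if not part or ":" in part:
            raise ValueError(f"invalid identifier component: {part!r}")
    return ":".join((URN_PREFIX, *parts))


def from_urn(urn: str) -> ExecutionId:
    pieces = urn.split(":")
    if len(pieces) != 4 or pieces[0] != URN_PREFIX or not all(pieces[1:]):
        raise ValueError(f"not a valid execution URN: {urn!r}")
    return ExecutionId(*pieces[1:])


def resolve_recover_from(value: str, project: str, domain: str) -> ExecutionId:
    """Accept a full URN, or a bare name restricted to the same project/domain."""
    if value.startswith(URN_PREFIX + ":"):
        return from_urn(value)
    return ExecutionId(project, domain, value)
```

Accepting both forms would let the restricted same-project/domain variant ship first, with the fully qualified form added later without changing the flag.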