[Core feature] Include Intratask Checkpoints in Recovery Mode
#2254
Labels: enhancement, stale, untriaged
Motivation: Why do you think this is important?
Intratask checkpointing is a very useful feature for fast recovery when long-running tasks, such as model training runs, fail.
However, intratask checkpoints only work within the same execution, i.e. when a pod fails and is retried. If the complete execution fails and we rerun it in Recovery mode, intratask checkpoints are not used by the new execution.
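For reference, a minimal sketch of how intratask checkpointing is used from flytekit today (the task name, epoch counting, and payload format are illustrative, not from this issue); the checkpoint written here survives pod retries within the same execution, but is not picked up by a rerun in Recovery mode:

```python
from flytekit import current_context, task


@task(retries=3)
def train(n_epochs: int) -> int:
    cp = current_context().checkpoint
    start_epoch = 0
    prev = cp.read()  # bytes written by a previous retry of this execution, if any
    if prev is not None:
        start_epoch = int(prev.decode())
    for epoch in range(start_epoch, n_epochs):
        # ... one epoch of training ...
        cp.write(str(epoch + 1).encode())  # persist progress for the next retry
    return n_epochs
```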
Goal: What should the final outcome look like, ideally?
If a task fails and we run it again in Recovery mode, Flyte should look for checkpoints in the original execution and make them available to the recovered execution.
In addition, we should also provide checkpoints as task meta output so that the user can resume from a previously failed execution, even from a different workflow version. This is mostly to support fast iteration during development, i.e. when the execution has failed due to a bug in user code. The user could then optionally provide the checkpoint path as a task input to speed up a new training run (a sketch of what that could look like follows below).
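A hedged sketch of the "checkpoint path as task input" idea from the user's side; the `prev_checkpoint` parameter and the fallback logic are hypothetical and not an existing Flyte API, while `current_context().checkpoint` read/write is the current flytekit intratask checkpoint interface:

```python
from typing import Optional

from flytekit import current_context, task
from flytekit.types.file import FlyteFile


@task(retries=3)
def train(n_epochs: int, prev_checkpoint: Optional[FlyteFile] = None) -> int:
    cp = current_context().checkpoint
    state = cp.read()  # intratask checkpoint from a retry of this execution
    if state is None and prev_checkpoint is not None:
        # Hypothetical fallback: resume from a checkpoint of an earlier,
        # failed execution that the user passed in explicitly.
        with open(prev_checkpoint.download(), "rb") as f:
            state = f.read()
    start_epoch = int(state.decode()) if state is not None else 0
    for epoch in range(start_epoch, n_epochs):
        # ... one epoch of training ...
        cp.write(str(epoch + 1).encode())
    return n_epochs
```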
By providing the checkpoint URL via an API during execution, we could also better support use cases like experiment tracking. For example, we could save the TensorBoard run directory on each checkpoint and then serve it directly from the bucket location of the checkpoint during training.
Describe alternatives you've considered
An explicit "Recover from" option to recover from a different version, but this is considered too error-prone due to versioning issues.
Propose: Link/Inline OR Additional context
No response