[Core feature] Include Intratask Checkpoints in Recovery Mode
#2254
Labels: enhancement, stale, untriaged
Motivation: Why do you think this is important?
Intratask checkpointing is a very useful feature for fast recovery when long-running tasks, such as model training runs, fail.
However, intratask checkpoints only work within the same execution, i.e. when a pod fails and is retried. If the complete execution fails and we rerun it in Recovery mode, intratask checkpoints are not used by the new execution.
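For reference, a minimal sketch of how intratask checkpointing is used from flytekit today (the task name, epoch counting, and payload format are illustrative, not from this issue); the checkpoint written here survives pod retries within the same execution, but is not picked up by a rerun in Recovery mode:

```python
from flytekit import current_context, task


@task(retries=3)
def train(n_epochs: int) -> int:
    cp = current_context().checkpoint
    start_epoch = 0
    prev = cp.read()  # bytes written by a previous retry of this execution, if any
    if prev is not None:
        start_epoch = int(prev.decode())
    for epoch in range(start_epoch, n_epochs):
        # ... one epoch of training ...
        cp.write(str(epoch + 1).encode())  # persist progress for the next retry
    return n_epochs
```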
Goal: What should the final outcome look like, ideally?
If a task fails and we run it again in Recovery mode, Flyte should look for checkpoints in the original execution and make them available to the recovered execution.
In addition, we should also provide checkpoints as task meta output so that the user can resume from a previously failed execution, even from a different workflow version. This is mostly to support fast iteration during development, i.e. when the execution has failed due to a bug in user code. The user could then optionally provide the checkpoint path as a task input to speed up a new training run (a sketch of what that could look like follows below).
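A hedged sketch of the "checkpoint path as task input" idea from the user's side; the `prev_checkpoint` parameter and the fallback logic are hypothetical and not an existing Flyte API, while `current_context().checkpoint` read/write is the current flytekit intratask checkpoint interface:

```python
from typing import Optional

from flytekit import current_context, task
from flytekit.types.file import FlyteFile


@task(retries=3)
def train(n_epochs: int, prev_checkpoint: Optional[FlyteFile] = None) -> int:
    cp = current_context().checkpoint
    state = cp.read()  # intratask checkpoint from a retry of this execution
    if state is None and prev_checkpoint is not None:
        # Hypothetical fallback: resume from a checkpoint of an earlier,
        # failed execution that the user passed in explicitly.
        with open(prev_checkpoint.download(), "rb") as f:
            state = f.read()
    start_epoch = int(state.decode()) if state is not None else 0
    for epoch in range(start_epoch, n_epochs):
        # ... one epoch of training ...
        cp.write(str(epoch + 1).encode())
    return n_epochs
```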
By providing the checkpoint URL via an API during execution, we could also better support use cases like experiment tracking. For example, we could save the TensorBoard run directory on each checkpoint and then serve it directly from the bucket location of the checkpoint during training.
Describe alternatives you've considered
An explicit "Recover from" option to recover from a different version, but this is considered too error-prone due to versioning issues.
Propose: Link/Inline OR Additional context
No response