
[Core feature] Include Intratask Checkpoints in Recovery Mode #2254

Closed
2 tasks done
sbrunk opened this issue Mar 14, 2022 · 4 comments
Assignees: andrewwdye
Labels: enhancement (New feature or request), stale, untriaged (This issue has not yet been looked at by the Maintainers)

Comments

@sbrunk (Member) commented Mar 14, 2022

Motivation: Why do you think this is important?

Intratask checkpointing is a very useful feature for fast recovery when long-running tasks such as model training runs fail.

However, intratask checkpoints only work within the same execution, i.e. when a pod fails and is retried. If the complete execution fails and we rerun it in Recovery mode, intratask checkpoints are not used by the new execution.
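
For context, this is roughly how a task consumes intratask checkpoints today, via flytekit's `current_context().checkpoint` API (a minimal sketch; encoding the epoch counter as bytes is purely illustrative):

```python
import flytekit
from flytekit import task


@task(retries=3)
def train(epochs: int) -> int:
    cp = flytekit.current_context().checkpoint

    # On a retry within the same execution, read() returns the previously
    # written checkpoint bytes; on the first attempt (and, today, on a
    # recovered execution) it returns None.
    prev = cp.read()
    start_epoch = int(prev.decode()) if prev is not None else 0

    for epoch in range(start_epoch, epochs):
        # ... one epoch of training ...
        cp.write(str(epoch + 1).encode())  # persist progress after each epoch

    return epochs
```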

Goal: What should the final outcome look like, ideally?

If a task fails and we run it again in Recovery mode, Flyte should look for checkpoints in the original execution and make them available to the recovered execution.

In addition, we should provide checkpoints as a task meta output, so that the user can resume from a previously failed execution even from a different workflow version. This is mostly to support fast iteration during development, i.e. when the execution has failed due to a bug in user code. The user could then optionally provide the checkpoint path as a task input to speed up a new training run.
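
A hypothetical sketch of the "checkpoint path as task input" idea from the user's side (the `prev_checkpoint` parameter and the resume logic are illustrative, not an existing Flyte feature):

```python
from typing import Optional

import flytekit
from flytekit import task
from flytekit.types.directory import FlyteDirectory


@task
def train(epochs: int, prev_checkpoint: Optional[FlyteDirectory] = None) -> FlyteDirectory:
    ctx = flytekit.current_context()
    out_dir = ctx.working_directory

    if prev_checkpoint is not None:
        # Hypothetical: the user passes the checkpoint location emitted by an
        # earlier (failed) execution, possibly from a different workflow version.
        local_ckpt = prev_checkpoint.download()
        # ... load model/optimizer state from local_ckpt ...

    # ... run training, periodically writing state into out_dir ...
    return FlyteDirectory(path=out_dir)
```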

By providing the checkpoint URL via the API during execution, we could also better support use cases like experiment tracking. For example, we could save the TensorBoard run directory with each checkpoint and then serve it directly from the checkpoint's bucket location during training.
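
For example (a sketch, assuming the checkpoint's `save()` accepts a local directory path), the TensorBoard run directory could be uploaded alongside each checkpoint so a tracking UI can read it from the bucket while training is still running:

```python
import os

import flytekit
from flytekit import task


@task(retries=3)
def train_with_tensorboard(epochs: int) -> int:
    ctx = flytekit.current_context()
    cp = ctx.checkpoint

    tb_dir = os.path.join(ctx.working_directory, "tensorboard")
    os.makedirs(tb_dir, exist_ok=True)

    for epoch in range(epochs):
        # ... one epoch of training; the TensorBoard writer logs into tb_dir ...

        # Upload the run directory as part of the checkpoint so it could be
        # served directly from the checkpoint's bucket location mid-training.
        cp.save(tb_dir)

    return epochs
```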

Describe alternatives you've considered

An explicit "recover from" option that allows recovering from a different workflow version, but this is considered too error-prone due to versioning issues.

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
sbrunk added the enhancement (New feature or request) and untriaged (This issue has not yet been looked at by the Maintainers) labels Mar 14, 2022
andrewwdye self-assigned this Sep 19, 2022
@andrewwdye (Contributor)

I'm tackling the first part of this feature request: passing the previous checkpoint to a recovered execution (like we do for retried tasks). I intend to include the path in the node execution closure and store it in the admin DB. This will require changes in both propeller and admin.

Treating checkpoints as meta outputs, and optionally passing them as meta inputs to a separate workflow, has much larger implications, so I will defer that to future work.

@github-actions

Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏

github-actions bot added the stale label Aug 28, 2023
@github-actions bot commented Sep 5, 2023

Hello 👋, This issue has been inactive for over 9 months and hasn't received any updates since it was marked as stale. We'll be closing this issue for now, but if you believe this issue is still relevant, please feel free to reopen it. Thank you for your contribution and understanding! 🙏

github-actions bot closed this as not planned Sep 5, 2023
eapolinario reopened this Nov 2, 2023
@hamersaw (Contributor) commented Nov 9, 2023

Using intra-task checkpointing across workflow executions could be very dangerous.

hamersaw closed this as completed Nov 9, 2023