integrations: Load model_file for resuming #140

Closed
2 of 10 tasks
daavoo opened this issue Aug 24, 2021 · 7 comments
Assignees
Labels
A: docs Area: user documentation A: frameworks Area: ML Framework integration feature request

Comments

@daavoo
Contributor

daavoo commented Aug 24, 2021

In order to properly resume training with dvc checkpoints, the user needs to load the existing model_file at the beginning of training.

Given that DVCLive integrations already take care of saving the model_file, I think it makes sense to also include some logic to load the model_file, if it already exists, either on callback instantiation or in on_train_begin.

This would simplify the usage of dvc checkpoints for resuming training.
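To make the proposal concrete, here is a minimal, framework-agnostic sketch of a "load if it exists" hook. Everything in it is hypothetical and for illustration only: `ResumeFromModelFile`, `save_state`, `load_state`, and the toy `train` loop are not part of DVCLive or of any ML framework, and a real integration would use the framework's own save/load calls instead of `pickle`.

```python
import os
import pickle


class ResumeFromModelFile:
    """Hypothetical callback (not a DVCLive class) that saves and reloads a model file."""

    def __init__(self, model_file, load_fn, save_fn):
        self.model_file = model_file
        self.load_fn = load_fn
        self.save_fn = save_fn

    def on_train_begin(self, state):
        # Resume from the existing model_file if there is one;
        # otherwise keep the freshly initialized state.
        if os.path.exists(self.model_file):
            return self.load_fn(self.model_file)
        return state

    def on_epoch_end(self, state):
        # Mirrors what the integrations already do: persist the model each epoch.
        self.save_fn(state, self.model_file)


def save_state(state, path):
    with open(path, "wb") as f:
        pickle.dump(state, f)


def load_state(path):
    with open(path, "rb") as f:
        return pickle.load(f)


def train(callback, epochs):
    """Toy training loop standing in for a real framework."""
    state = callback.on_train_begin({"epoch": 0, "weights": 0.0})
    for _ in range(epochs):
        state = {"epoch": state["epoch"] + 1, "weights": state["weights"] + 0.5}
        callback.on_epoch_end(state)
    return state
```

With this shape, calling `train` a second time with the same `model_file` continues from the saved epoch instead of starting over, which is the behavior the user currently has to implement by hand.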

@daavoo daavoo added feature request discussion requires active participation to reach a conclusion A: frameworks Area: ML Framework integration labels Aug 24, 2021
@daavoo daavoo changed the title integrations: Load model_file on init for resuming integrations: Load model_file for resuming Aug 24, 2021
@daavoo daavoo added the p1-important Include in the next sprint label Aug 24, 2021
@daavoo daavoo self-assigned this Aug 24, 2021
@daavoo
Contributor Author

daavoo commented Aug 26, 2021

It looks like, for ML frameworks that already have a "resuming" argument, this should be a documentation update.

@daavoo daavoo added A: docs Area: user documentation and removed discussion requires active participation to reach a conclusion labels Aug 26, 2021
@daavoo daavoo added p2-medium and removed p1-important Include in the next sprint labels Sep 10, 2021
@daavoo
Contributor Author

daavoo commented Sep 10, 2021

Lowering the priority, as this depends on deciding and documenting the recommended workflow for DVC checkpoints.

@daavoo daavoo added p1-important Include in the next sprint and removed p2-medium labels Oct 6, 2021
@daavoo daavoo mentioned this issue Oct 7, 2021
1 task
@daavoo
Contributor Author

daavoo commented Oct 7, 2021

Back to p1 as it is relevant for simplifying the "recovering with DVC_EXP_AUTO_PUSH" scenario

@daavoo
Contributor Author

daavoo commented Apr 26, 2022

Relevant for https://github.com/iterative/terraform-provider-iterative use case

@casperdcl

also related: iterative/example-repos-dev#83 (comment)

@daavoo
Contributor Author

daavoo commented Oct 17, 2022

Inside each integration, we should look for the ML-Framework specific flags and handle Live's argument resume=True properly
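A rough sketch of what "handling `resume=True` per framework" could look like follows. The helper `resume_kwargs` and its lookup table are hypothetical and not part of DVCLive. The two flag names are assumptions based on PyTorch Lightning's `Trainer.fit(ckpt_path=...)` and Hugging Face's `Trainer.train(resume_from_checkpoint=...)`; they should be checked against each framework's current documentation.

```python
import os

# Assumed framework-specific resume flags (verify against each framework's docs).
RESUME_FLAG_BY_FRAMEWORK = {
    "lightning": "ckpt_path",                 # assumed: Trainer.fit(ckpt_path=...)
    "huggingface": "resume_from_checkpoint",  # assumed: Trainer.train(resume_from_checkpoint=...)
}


def resume_kwargs(framework, resume, checkpoint_path):
    """Hypothetical helper: translate a resume flag into framework-specific kwargs."""
    if framework not in RESUME_FLAG_BY_FRAMEWORK:
        raise ValueError(f"No known resume flag for framework: {framework!r}")
    # Only ask the framework to resume when requested AND a checkpoint exists.
    if not resume or not os.path.exists(checkpoint_path):
        return {}
    return {RESUME_FLAG_BY_FRAMEWORK[framework]: checkpoint_path}
```

An integration could then pass the returned dict straight to the framework's training entry point, so a first run (no checkpoint yet) and a resumed run share the same code path.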

@daavoo
Contributor Author

daavoo commented May 2, 2023

Not planned.
We will drop model_file (#499).
Resuming should be discussed/handled in #505

@daavoo daavoo closed this as not planned Won't fix, can't repro, duplicate, stale May 2, 2023
2 participants