Remote training recovery from interruptions #505
Comments
See also https://docs.wandb.ai/guides/runs/resuming for ideas/comparison.
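For comparison, W&B resuming roughly looks like this (a minimal sketch; the project name and run id here are placeholders, and in practice the id would come from something stable like the commit SHA):

```python
import wandb

# Reusing a fixed run id lets a restarted job attach to the same W&B run.
run = wandb.init(project="example-project", id="my-training-run", resume="allow")

# run.step continues from where the previous (interrupted) run left off.
for step in range(run.step, 100):
    wandb.log({"loss": 1.0 / (step + 1)}, step=step)

run.finish()
```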
To clarify, you mean that we have all the pieces to implement it, right?
I think we could just:
Since this issue is open, let me share how I handle that problem at the moment, as an idea/comparison. Training is done on AWS EC2 instances using Keras/TensorFlow. The backup and restoration of the models is handled by the built-in Keras backup/restore callback.
Now the final issue: assuming the training is triggered by a GitHub Action which uses CML to deploy the EC2 instance, ... what is the way to find the correct backup on EFS? Easy: use the commit SHA, so for instance, store the backups on EFS inside a directory named after the commit SHA. And a final polish:
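As an illustration of that setup, a minimal sketch assuming the EFS share is mounted at `/mnt/efs` and the backup/restore is done with `tf.keras.callbacks.BackupAndRestore` (the mount point, paths, and model are placeholders; GitHub Actions exposes the commit SHA as `GITHUB_SHA`):

```python
import os
import numpy as np
import tensorflow as tf

# Assumed EFS mount point on the EC2 instance; adjust to the actual mount.
EFS_ROOT = "/mnt/efs/training-backups"

# GitHub Actions sets GITHUB_SHA, so a restarted runner for the same commit
# resolves to the same backup directory on EFS.
commit_sha = os.environ.get("GITHUB_SHA", "local-dev")
backup_dir = os.path.join(EFS_ROOT, commit_sha)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
model.compile(optimizer="adam", loss="mse")

callbacks = [
    # BackupAndRestore writes checkpoints to backup_dir and, if the job is
    # interrupted and restarted, resumes training from the last saved epoch.
    tf.keras.callbacks.BackupAndRestore(backup_dir=backup_dir),
]

# Dummy data standing in for the real training set.
x, y = np.random.rand(64, 10), np.random.rand(64, 1)
model.fit(x, y, epochs=5, callbacks=callbacks)
```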
Related:
- `checkpoints` dvc#9221
- `model_file` for resuming #140
- `dvc`: Enable model loading/saving based on `checkpoint: true` #191

If you are training remotely and the machine shuts down, there's often no way to recover the last saved checkpoint on the new remote machine.
We have the tools to make it possible to recover in that scenario (without using DVC checkpoints) if we do something like:

1. During training, save info about the model.
2. On `Live(resume=True)`, DVCLive can fetch the model using the info saved in step 1 if there is no model in the workspace.

We need some mechanism to tie the resumed experiment to the interrupted experiment. Is the experiment revision consistent between them? Should we require that an experiment name be passed to tie them together?
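A rough sketch of what that flow could look like from a training script today, under stated assumptions (this is not an existing DVCLive feature; the model path, the explicit `dvc pull`, and recording the model with `log_artifact` are all assumptions for illustration):

```python
import os
import subprocess
from dvclive import Live

MODEL_PATH = "model.pt"  # assumed location of the periodically saved checkpoint

with Live(resume=True) as live:
    # On a fresh machine the workspace has no model; try to fetch the one
    # pushed by the interrupted run (assumes it was tracked and pushed with DVC).
    if not os.path.exists(MODEL_PATH):
        subprocess.run(["dvc", "pull", MODEL_PATH], check=False)

    # resume=True restores the step counter from the previous run's logs.
    start_step = live.step

    for step in range(start_step, 100):
        # ... train one step/epoch and save the checkpoint to MODEL_PATH ...
        live.log_metric("loss", 1.0 / (step + 1))
        live.next_step()

    # Record the model so a future resumed run can locate it.
    live.log_artifact(MODEL_PATH, type="model")
```

The open question from the issue still applies: something (an experiment name, or a consistent revision) has to tie the resumed run to the interrupted one so the pulled checkpoint is the right one.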