Feature: exp run: DRYer resume within the CI #6823
Comments
Posting an old message before it gets lost. The upshot of auto-pull for checkpoints is that we need to:
EXP_NAME=${BASE}-cml-run-${SHA} # similar convention as cml-pr
if dvc exp pull --run-cache origin "$EXP_NAME" &>/dev/null; then  # test the exit status, not the (discarded) output
  echo "# resuming interrupted experiment"
  dvc exp apply "$EXP_NAME"
  DVC_EXP_AUTO_PUSH=1 DVC_EXP_GIT_REMOTE=origin dvc exp run ...  # prefix both vars so they reach dvc
else
  echo "# first time running experiment"
  DVC_EXP_AUTO_PUSH=1 DVC_EXP_GIT_REMOTE=origin dvc exp run -n "$EXP_NAME" ...
fi
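The resume-or-start branch depends on two shell details that are easy to get wrong. A test of the form `[[ $(cmd &>/dev/null) ]]` checks whether the command's *captured output* is non-empty; since all output is discarded, the capture is always empty and the resume branch could never run. Branching on the exit status (`if cmd; then`) avoids this. Likewise, an assignment such as `DVC_EXP_AUTO_PUSH=1` on its own line creates an unexported shell variable that a child process like `dvc` never sees; it has to be exported or written as a prefix on the same command line. The sketch below illustrates both with stand-in commands instead of `dvc` (the function names are hypothetical), written in POSIX sh so it runs under any `/bin/sh`.

```shell
#!/bin/sh
# Stand-ins for `dvc exp pull ...` (hypothetical names): one succeeds, one fails.
pull_succeeds() { echo "pulled"; }
pull_fails() { echo "experiment not found" >&2; return 1; }

# Pitfall: tests whether the *captured output* is non-empty. All output is sent
# to /dev/null, so the capture is always empty and this always answers "first".
mode_by_output() {
  if [ -n "$("$1" >/dev/null 2>&1)" ]; then echo resume; else echo first; fi
}

# Idiom: branch on the exit status of the command itself.
mode_by_status() {
  if "$1" >/dev/null 2>&1; then echo resume; else echo first; fi
}

# Pitfall: a bare assignment is a shell variable, invisible to child processes.
unset DVC_EXP_AUTO_PUSH
DVC_EXP_AUTO_PUSH=1
bare=$(sh -c 'echo "${DVC_EXP_AUTO_PUSH:-unset}"')
# A prefix assignment puts the variable into that one command's environment.
prefixed=$(DVC_EXP_AUTO_PUSH=1 sh -c 'echo "${DVC_EXP_AUTO_PUSH:-unset}"')

echo "output test: $(mode_by_output pull_succeeds)"  # prints "output test: first"
echo "status test: $(mode_by_status pull_succeeds)"  # prints "status test: resume"
echo "bare: $bare, prefixed: $prefixed"              # prints "bare: unset, prefixed: 1"
```

Note that the output-based test reports "first" even though the stand-in pull succeeded, which is exactly the failure mode to avoid.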
One minor note: you should be able to use …
With that in mind, the workflow could look something like this:
Also related: iterative/example-repos-dev#83 (comment)
Closing since checkpoints have been deprecated. For discussion about resuming experiments, see iterative/dvclive#505.
Issue
In the CI, to resume training from preexisting checkpoints, we currently have to do something like:
It would be nice if we had:
dvc exp run -n $EXP_NAME
to be able to pull and apply the experiment. So it would become:
Additional issue
Please note:
This is because dvc exp pull --run-cache origin $EXP_NAME will throw an error if no previous experiments are present.
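Until something like this exists in `dvc exp run` itself, the requested behavior can be approximated on the user side. Below is a minimal sketch, assuming only the commands already used in this thread (`dvc exp pull --run-cache`, `dvc exp apply`, `dvc exp run -n`) and treating a failed pull as "no previous experiment". The function name `exp_run_resumable` and the `DVC` override variable are hypothetical, added only so the logic can be exercised with a stub; this is not an existing DVC option.

```shell
#!/bin/sh
# Hypothetical wrapper (not a DVC feature): resume experiment $1 if it can be
# pulled from the `origin` Git remote, otherwise start it fresh under that name.
# Remaining arguments are passed through to `dvc exp run`.
# DVC may be set to a stub command for testing; it defaults to the real CLI.
exp_run_resumable() {
  exp_name=$1; shift            # POSIX sh has no `local`; these are globals
  dvc=${DVC:-dvc}
  if $dvc exp pull --run-cache origin "$exp_name" >/dev/null 2>&1; then
    echo "# resuming interrupted experiment"
    $dvc exp apply "$exp_name" || return
    DVC_EXP_AUTO_PUSH=1 DVC_EXP_GIT_REMOTE=origin $dvc exp run "$@"
  else
    # Any pull failure is treated as "no previous experiment".
    echo "# first time running experiment"
    DVC_EXP_AUTO_PUSH=1 DVC_EXP_GIT_REMOTE=origin $dvc exp run -n "$exp_name" "$@"
  fi
}
```

In CI this would be called as `exp_run_resumable "${BASE}-cml-run-${SHA}"`. Because every pull failure is read as "not found", a network or authentication error would silently start a fresh run; telling those cases apart is one reason a built-in option would be preferable to a wrapper.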