-
Notifications
You must be signed in to change notification settings - Fork 394
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd: document updated exp run --reset
behavior
#2286
Conversation
core DVC change for this has been merged and is now available in master |
Use `--reset` to roll-back the workspace to `HEAD` and restart the whole | ||
experiment. Alternatively, you can use `--rev` to continue from a specific | ||
(previous) checkpoint. | ||
Use `--reset` to reset (remove) any existing `checkpoint` outputs in your | ||
workspace before running the experiment. Note that `--reset` overrides any |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clearer, here are the steps implied in this section (checkpoints workflow with reset):
- enable checkpoints in the stage ✅
- update code to write signal files ✅
exo run
- saves a bunch of checkpoints ✅ (possibly interrupt)- >
exp run --reset
< - We are here
Why would we want to reset? Again, to restart the experiment, right? (to go back to / try again step 2.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For checkpoint
outs, reset
removes the specified outs entirely (so that the specified intermediate model can be retrained from scratch and not from the HEAD state). It used to roll back to HEAD
, but this was changed (as of the core PR for these docs).
see iterative/dvc#5553 for the product-side discussion on why this change was made
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks for the context.
For checkpoint outs
Is there another use case for --reset
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we move exp run --reset
up in the steps or say it's an alternative to plain exp run
from the get-go?
0, 1, (same as above)
2a. exp run
- runs experiment from HEAD
state
2b. exp run --reset
- runs experiment from scratch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's a new stage with new checkpoint
outputs, both 2a
and 2b
are the same thing here (since there would not be any HEAD dvc.lock
entries for the checkpoint outs).
Is there another use case for --reset?
No, --reset
does not do anything unless a stage has checkpoint
outs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK so the first exp run
of a checkpoint pipeline is the same as exp run --reset
, got it. So yeah this whole section needs a bit of work to include all these key points. Please address my other comments/suggestions when you can and lmk to take this over a some final reorg of ideas (I'll try it in the meantime to confirm I got it 100% now).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more Q @pmrowla. exp run --reset
still deletes existing checkpoints also, right? Meaning the Git commits (I'm guessing the cache is left alone).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The experiment ref for the previously existing checkpoint run is what gets "deleted". In practice, as long as the user's stage is deterministic the old ref is just moved to point to the new run.
Git commits are left alone, git will garbage collect them on its own once they are no longer pointed to by any references.
Cache is also left alone, it should be cleaned up using gc
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted.
workspace before running the experiment. Note that `--reset` overrides any | ||
existing `dvc.lock` entries for `checkpoint` outputs. Alternatively, you can use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
--reset
overrides any existingdvc.lock
entries forcheckpoint
outputs
I'm not getting this note, what are we trying to warn about?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If your dvc.lock
file is git-committed and the HEAD version of dvc.lock
has a hash for the specified checkpoint
out, that hash will be ignored on --reset
, and the out will be completely removed before reproducing the pipeline (rather than being checked out to HEAD's hash).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To put this in user terms, if you save a model at each checkpoint so that its weights get loaded as the starting point for the next training epoch, --reset
will drop the model so that training will start without any existing model weights.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I think with an updated explanation as suggested in #2286 (comment) we shouldn't need this warning here (it should be evident). But we can preserve it elsewhere, perhaps in the actual option description (which I haven't gotten to but will soon).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left another round of commends and some specific suggestions ☝️
Also, from iterative/dvc#5586, I'm not seeing this currently reflected/emphaiszed:
- "When --reset is used, all other (non-checkpoint) outs in dvc.lock will be unchanged
Previously, --reset would reset the entire dvc.lock file to HEAD" - "--reset is now mutually exclusive with --rev ... we now explicitly error out"
Should we further clarify on those points?
p.s. I can take this over at some point if needed.
The new behavior would be the expected default for anyone in ML using checkpoints (at least according to Dave+Dmitry), so I don't think it needs any more clarification. Also the old behavior wasn't documented anywhere in the first place, and it was changed early enough in the 2.0 release cycle that it is unlikely that anyone will be running dvc while also intentionally expecting the original behavior. |
"Previously, --reset would reset the entire dvc.lock file to HEAD" is in https://dvc.org/doc/command-reference/exp/run#checkpoints as "Use --reset to roll-back the workspace to HEAD" for example. But it's only been public for a little while indeed. If this is deemed obvi by the ML guys cc @dberenbaum then sure, one less detail 👍 |
@pmrowla I guess at this point I should officially take it over. I'll ping you with more Qs if any, thanks! p.s. if possible in the future pls use branches directly on the upstream for this repo (easier to review and take over) 🙂 |
OK guys do you want to double check the latest text? Thanks |
You can now use `dvc exp run` to begin the experiment. This removes any | ||
`checkpoint` outputs before running the experiment (regardless of whether they | ||
have cached versions). When the process finishes or gets interrupted (e.g. with | ||
Ctrl + `C`), DVC will [apply](/doc/command-reference/exp/apply) the last | ||
checkpoint to the <abbr>workspace</abbr> (overwriting any further changes done | ||
by the stage). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think wording it this way might end up confusing users?
This removes any
checkpoint
outputs before running the experiment (regardless of whether they have cached versions).
checkpoint
outs are only removed before running the experiment if --reset
is specified. In a normal dvc exp run
with no other arguments, we continue the experiment using the most recent version of that output.
If there is no previous run at all and the output has never existed before, then yes dvc exp run
will do the same thing as dvc exp run --reset
, but only in that specific case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this sentence get deleted? There's nothing special about what happens when you begin a checkpoint experiment. It's only different after at least one checkpoint has been generated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about what Peter just mentioned, @dberenbaum? "If there is no previous run at all and the output has never existed before, then yes dvc exp run will do the same thing as dvc exp run --reset" 🙂
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not really a special case though, and even in that case nothing is being removed (since there is no initial data to even remove in the first place)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I think my latest update should make it very clear. It's getting a little long/convoluted but at least it's complete and correct (I think). See 9e1a7e9.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
even in that case nothing is being removed (since there is no initial data to even remove in the first place
What if you just changed regular outputs (previously cached) into checkpoint outputs in HEAD
? (to migrate from regular pipeline to a checkpoint experiment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the updated sentence is:
On this very first use,
any `checkpoint` outputs are deleted before running the experiment (regardless
of whether they have <abbr>cached</abbr> versions).
It's getting a little long/convoluted but at least it's complete and correct (I think).
Agreed on all counts 😄 . I think I get your point now that it's unclear whether an initial checkpoint run treats existing outs as checkpoints or not. Ideally, this is called out as an edge case since it might add confusion for the typical case where no outs exist yet, but 🤷 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the typical case where no outs exist yet
OK if its an edge case we can leave it out. Always good to simplify docs. Removed that sentence for now... Not sure I fully understand all the possible scenarios but will try to QA separately.
@jorgeorpinel one thing that I wasn't sure about, otherwise LGTM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with the comments from @pmrowla and added one about the "overwriting any further changes" language. I already approved, so let me know if you want me to follow up again on anything, or else I'll leave merging to your discretion.
You can now use `dvc exp run` to begin the experiment. This removes any | ||
`checkpoint` outputs before running the experiment (regardless of whether they | ||
have cached versions). When the process finishes or gets interrupted (e.g. with | ||
Ctrl + `C`), DVC will [apply](/doc/command-reference/exp/apply) the last | ||
checkpoint to the <abbr>workspace</abbr> (overwriting any further changes done | ||
by the stage). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this sentence get deleted? There's nothing special about what happens when you begin a checkpoint experiment. It's only different after at least one checkpoint has been generated.
checkpoint to the <abbr>workspace</abbr> (overwriting any further changes done | ||
by the stage). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checkpoint to the <abbr>workspace</abbr> (overwriting any further changes done | |
by the stage). | |
checkpoint to the <abbr>workspace</abbr> (overwriting any further changes done | |
by the stage if the process was interrupted). |
Alternatively:
When the process finishes, DVC will [apply](/doc/command-reference/exp/apply) the last
checkpoint to the <abbr>workspace</abbr>. If the process gets interrupted (e.g. with
Ctrl + `C`), DVC will overwrite any changes done by the stage after the last checkpoint.
I find "overwriting any further changes" confusing unless it's specifically tied to the process being interrupted. If the process isn't interrupted, I think a final checkpoint is automatically generated to save the final outcome, so it's not overwriting further changes (@pmrowla can you confirm if that's correct?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, there's an additional commit generated at the end of the stage (assuming anything changed since the last time the user called make_checkpoint
), and that is what gets applied. Even if the process is interrupted, we don't overwrite anything in the user's workspace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So should (overwriting any further changes done by the stage)
get dropped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's an additional commit generated at the end of the stage
Ah this I didn't realize. OK I rewrote this based on that and Dave's suggestions (thanks) for now.
Even if the process is interrupted, we don't overwrite anything in the user's workspace.
Woah so why did I think that? A bit confused on this, so a final checkpoint that reflects the resulting workspace "is applied" (what is there to apply?) but the last checkpoint for interrupted runs is NOT applied?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the latest language for reference:
If the process gets
interrupted (e.g. with Ctrl + `C`), DVC will
[apply](/doc/command-reference/exp/apply) the last checkpoint to the
<abbr>workspace</abbr> (overwriting any further changes).
Pending confirmation from @pmrowla, maybe we should remove any reference to apply
or the workspace if nothing gets overwritten? Maybe this sentence isn't needed at all? Do we feel it necessary to call out process interruption to make explicit that this is an expected workflow? If so, maybe something like:
If the process gets
interrupted (e.g. with Ctrl + `C`), DVC will
keep all checkpoints recorded before interruption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Removed apply
and workspace references for now per Peter's comment above but it would def. be great to confirm!
per should (overwriting any further changes done by the stage) get dropped?
Co-authored-by: Peter Rowlands (변기호) <[email protected]>
I don't know if it's obvious, but it's a sane default and it's only been public for a little while, so I don't think we need to note the change from previous behavior. |
Seems like the text addresses all the comments although I still have some questions but we can manage them separately. Thanks guys, merging! |
❗ Please read the guidelines in the Contributing to the Documentation list if you make any substantial changes to the documentation or JS engine.
🐛 Please make sure to mention
Fix #issue
(if applicable) in the description of the PR. This causes GitHub to close it automatically when the PR is merged.Please choose to allow us to edit your branch when creating the PR.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏
Documents updated
exp run --reset
behavior and clarifies some checkpoints-related issues around--queue/--run-all
.See iterative/dvc#5553 (comment) for behavior/change explanation