cmd: document updated `exp run --reset` behavior #2286

pmrowla · 2021-03-10T07:14:14Z

Documents updated exp run --reset behavior and clarifies some checkpoints-related issues around --queue/--run-all.

See iterative/dvc#5553 (comment) for behavior/change explanation

pmrowla · 2021-03-12T16:32:30Z

core DVC change for this has been merged and is now available in master

jorgeorpinel · 2021-03-17T03:02:38Z

content/docs/command-reference/exp/run.md

-Use `--reset` to roll-back the workspace to `HEAD` and restart the whole
-experiment. Alternatively, you can use `--rev` to continue from a specific
-(previous) checkpoint.
+Use `--reset` to reset (remove) any existing `checkpoint` outputs in your
+workspace before running the experiment. Note that `--reset` overrides any


To be clearer, here are the steps implied in this section (checkpoints workflow with reset):

enable checkpoints in the stage ✅

update code to write signal files ✅

exp run - saves a bunch of checkpoints ✅ (possibly interrupt)

> exp run --reset < - We are here

Why would we want to reset? Again, to restart the experiment, right? (to go back to / try again step 2.)

For checkpoint outs, reset removes the specified outs entirely (so that the specified intermediate model can be retrained from scratch and not from the HEAD state). It used to roll back to HEAD, but this was changed (as of the core PR for these docs).

see iterative/dvc#5553 for the product-side discussion on why this change was made

I see, thanks for the context.

For checkpoint outs

Is there another use case for --reset?

Should we move exp run --reset up in the steps or say it's an alternative to plain exp run from the get-go?

0, 1, (same as above)
2a. exp run - runs experiment from HEAD state
2b. exp run --reset - runs experiment from scratch

If it's a new stage with new checkpoint outputs, both 2a and 2b are the same thing here (since there would not be any HEAD dvc.lock entries for the checkpoint outs).

Is there another use case for --reset?

No, --reset does not do anything unless a stage has checkpoint outs.
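As a rough illustration of the 2a/2b distinction above, here is a toy Python model (not DVC code). The function name `starting_outs`, the dict used as a stand-in for `dvc.lock`, and the file names are all hypothetical. It assumes, per the description in this thread, that a plain run starts from the HEAD entries while `--reset` drops only the checkpoint outs.

```python
def starting_outs(head_lock, checkpoint_outs, reset):
    """Return the out -> hash mapping a run would start from (toy model).

    head_lock: dict standing in for the outs recorded in HEAD's dvc.lock
    checkpoint_outs: set of out paths marked as checkpoint outputs
    reset: True to mimic `exp run --reset`
    """
    start = dict(head_lock)  # plain run: begin from the HEAD state
    if reset:
        for out in checkpoint_outs:
            # Only checkpoint outs are dropped; every other out is untouched.
            start.pop(out, None)
    return start


# Hypothetical lock contents: one regular out and one checkpoint out.
head_lock = {"data.csv": "aaa", "model.pt": "bbb"}

plain = starting_outs(head_lock, {"model.pt"}, reset=False)
from_scratch = starting_outs(head_lock, {"model.pt"}, reset=True)
```

In this sketch a stage without checkpoint outs makes `reset=True` a no-op, and a brand-new stage with no HEAD entry for its checkpoint out gives the same result either way, which matches the two points made above.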

OK so the first exp run of a checkpoint pipeline is the same as exp run --reset, got it. So yeah this whole section needs a bit of work to include all these key points. ~~Please address my other comments/suggestions when you can and lmk to take this over for some final reorg of ideas~~ (I'll try it in the meantime to confirm I got it 100% now).

One more Q @pmrowla. exp run --reset still deletes existing checkpoints also, right? Meaning the Git commits (I'm guessing the cache is left alone).

The experiment ref for the previously existing checkpoint run is what gets "deleted". In practice, as long as the user's stage is deterministic the old ref is just moved to point to the new run.

Git commits are left alone, git will garbage collect them on its own once they are no longer pointed to by any references.

Cache is also left alone, it should be cleaned up using gc.

jorgeorpinel · 2021-03-17T02:56:47Z

content/docs/command-reference/exp/run.md

+workspace before running the experiment. Note that `--reset` overrides any
+existing `dvc.lock` entries for `checkpoint` outputs. Alternatively, you can use


--reset overrides any existing dvc.lock entries for checkpoint outputs

I'm not getting this note, what are we trying to warn about?

If your dvc.lock file is git-committed and the HEAD version of dvc.lock has a hash for the specified checkpoint out, that hash will be ignored on --reset, and the out will be completely removed before reproducing the pipeline (rather than being checked out to HEAD's hash).

To put this in user terms, if you save a model at each checkpoint so that its weights get loaded as the starting point for the next training epoch, --reset will drop the model so that training will start without any existing model weights.
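The user-facing effect described above can be sketched with a toy training loop in plain Python. This is not DVC code: the `train` function, the JSON "model" file, and the `reset` argument are hypothetical stand-ins, with `reset=True` simply deleting the saved file the way `--reset` is described as removing a checkpoint out.

```python
import json
import os
import tempfile


def train(model_path, epochs, reset=False):
    """Toy training loop that resumes from a saved model file if one exists."""
    if reset and os.path.exists(model_path):
        # Stand-in for --reset: drop the checkpoint out before training.
        os.remove(model_path)

    if os.path.exists(model_path):
        with open(model_path) as f:
            state = json.load(f)  # continue from the saved weights
    else:
        state = {"epoch": 0, "weight": 0.0}  # no weights: start from scratch

    for _ in range(epochs):
        state["epoch"] += 1
        state["weight"] += 0.5
        with open(model_path, "w") as f:
            json.dump(state, f)  # save the model at each "checkpoint"
    return state


model = os.path.join(tempfile.mkdtemp(), "model.json")
first = train(model, epochs=3)                # starts from scratch
resumed = train(model, epochs=2)              # picks up at epoch 3
fresh = train(model, epochs=2, reset=True)    # saved weights are dropped first
```

Here `resumed` ends at epoch 5 because it loads the existing file, while `fresh` ends at epoch 2 because the file was removed before training.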

Got it. I think with an updated explanation as suggested in #2286 (comment) we shouldn't need this warning here (it should be evident). But we can preserve it elsewhere, perhaps in the actual option description (which I haven't gotten to but will soon).

jorgeorpinel

Left another round of comments and some specific suggestions ☝️

Also, from iterative/dvc#5586, I'm not seeing this currently reflected/emphasized:

"When --reset is used, all other (non-checkpoint) outs in dvc.lock will be unchanged
Previously, --reset would reset the entire dvc.lock file to HEAD"
"--reset is now mutually exclusive with --rev ... we now explicitly error out"

Should we further clarify on those points?

p.s. I can take this over at some point if needed.
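For readers unfamiliar with what "mutually exclusive ... we now explicitly error out" looks like in practice, a minimal sketch with Python's standard `argparse` follows. This is an assumption-laden illustration only: the parser name `exp-run-sketch` and the helper `parse` are hypothetical, and nothing here claims to reflect how DVC itself wires up these options.

```python
import argparse


def build_parser():
    """Hypothetical parser with --reset and --rev declared mutually exclusive."""
    parser = argparse.ArgumentParser(prog="exp-run-sketch")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--reset", action="store_true")
    group.add_argument("--rev")
    return parser


def parse(argv):
    """Return the parsed namespace, or None if argparse rejected the arguments."""
    try:
        return build_parser().parse_args(argv)
    except SystemExit:
        # argparse prints a usage error to stderr and exits on conflicting flags.
        return None


only_reset = parse(["--reset"])
only_rev = parse(["--rev", "abc123"])
both = parse(["--reset", "--rev", "abc123"])
```

Either option is accepted on its own, while passing both is rejected up front instead of one silently winning.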

pmrowla · 2021-03-18T00:46:38Z

Should we further clarify on those points?

The new behavior would be the expected default for anyone in ML using checkpoints (at least according to Dave+Dmitry), so I don't think it needs any more clarification.

Also the old behavior wasn't documented anywhere in the first place, and it was changed early enough in the 2.0 release cycle that it is unlikely that anyone will be running dvc while also intentionally expecting the original behavior.

jorgeorpinel · 2021-03-18T04:35:40Z

"Previously, --reset would reset the entire dvc.lock file to HEAD" is in https://dvc.org/doc/command-reference/exp/run#checkpoints as "Use --reset to roll-back the workspace to HEAD" for example. But it's only been public for a little while indeed. If this is deemed obvi by the ML guys cc @dberenbaum then sure, one less detail 👍

jorgeorpinel · 2021-03-18T04:38:24Z

@pmrowla I guess at this point I should officially take it over. I'll ping you with more Qs if any, thanks!

p.s. if possible in the future pls use branches directly on the upstream for this repo (easier to review and take over) 🙂

Took this over

jorgeorpinel · 2021-03-18T07:11:07Z

OK guys do you want to double check the latest text? Thanks

pmrowla · 2021-03-18T10:41:32Z

content/docs/command-reference/exp/run.md

+You can now use `dvc exp run` to begin the experiment. This removes any
+`checkpoint` outputs before running the experiment (regardless of whether they
+have cached versions). When the process finishes or gets interrupted (e.g. with
+Ctrl + `C`), DVC will [apply](/doc/command-reference/exp/apply) the last
+checkpoint to the <abbr>workspace</abbr> (overwriting any further changes done
+by the stage).


I think wording it this way might end up confusing users?

This removes any checkpoint outputs before running the experiment (regardless of whether they have cached versions).

checkpoint outs are only removed before running the experiment if --reset is specified. In a normal dvc exp run with no other arguments, we continue the experiment using the most recent version of that output.

If there is no previous run at all and the output has never existed before, then yes dvc exp run will do the same thing as dvc exp run --reset, but only in that specific case.

Can this sentence get deleted? There's nothing special about what happens when you begin a checkpoint experiment. It's only different after at least one checkpoint has been generated.

What about what Peter just mentioned, @dberenbaum? "If there is no previous run at all and the output has never existed before, then yes dvc exp run will do the same thing as dvc exp run --reset" 🙂

That's not really a special case though, and even in that case nothing is being removed (since there is no initial data to even remove in the first place)

OK I think my latest update should make it very clear. It's getting a little long/convoluted but at least it's complete and correct (I think). See 9e1a7e9.

even in that case nothing is being removed (since there is no initial data to even remove in the first place)

What if you just changed regular outputs (previously cached) into checkpoint outputs in HEAD? (to migrate from regular pipeline to a checkpoint experiment)

So the updated sentence is:

On this very first use, any `checkpoint` outputs are deleted before running the experiment (regardless of whether they have <abbr>cached</abbr> versions).

It's getting a little long/convoluted but at least it's complete and correct (I think).

Agreed on all counts 😄 . I think I get your point now that it's unclear whether an initial checkpoint run treats existing outs as checkpoints or not. Ideally, this is called out as an edge case since it might add confusion for the typical case where no outs exist yet, but 🤷 .

the typical case where no outs exist yet

OK if it's an edge case we can leave it out. Always good to simplify docs. Removed that sentence for now... Not sure I fully understand all the possible scenarios but will try to QA separately.

pmrowla · 2021-03-18T10:42:51Z

@jorgeorpinel one thing that I wasn't sure about, otherwise LGTM

dberenbaum

Agree with the comments from @pmrowla and added one about the "overwriting any further changes" language. I already approved, so let me know if you want me to follow up again on anything, or else I'll leave merging to your discretion.

dberenbaum · 2021-03-18T13:03:45Z

content/docs/command-reference/exp/run.md

+You can now use `dvc exp run` to begin the experiment. This removes any
+`checkpoint` outputs before running the experiment (regardless of whether they
+have cached versions). When the process finishes or gets interrupted (e.g. with
+Ctrl + `C`), DVC will [apply](/doc/command-reference/exp/apply) the last
+checkpoint to the <abbr>workspace</abbr> (overwriting any further changes done
+by the stage).


Can this sentence get deleted? There's nothing special about what happens when you begin a checkpoint experiment. It's only different after at least one checkpoint has been generated.

dberenbaum · 2021-03-18T13:10:01Z

content/docs/command-reference/exp/run.md

+checkpoint to the <abbr>workspace</abbr> (overwriting any further changes done
+by the stage).


Suggested change

-checkpoint to the <abbr>workspace</abbr> (overwriting any further changes done
-by the stage).
+checkpoint to the <abbr>workspace</abbr> (overwriting any further changes done
+by the stage if the process was interrupted).

Alternatively:

When the process finishes, DVC will [apply](/doc/command-reference/exp/apply) the last checkpoint to the <abbr>workspace</abbr>. If the process gets interrupted (e.g. with Ctrl + `C`), DVC will overwrite any changes done by the stage after the last checkpoint.

I find "overwriting any further changes" confusing unless it's specifically tied to the process being interrupted. If the process isn't interrupted, I think a final checkpoint is automatically generated to save the final outcome, so it's not overwriting further changes (@pmrowla can you confirm if that's correct?).

Yeah, there's an additional commit generated at the end of the stage (assuming anything changed since the last time the user called make_checkpoint), and that is what gets applied. Even if the process is interrupted, we don't overwrite anything in the user's workspace.
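The finished-versus-interrupted behavior described here can be modeled with a small Python sketch. It is a toy under stated assumptions, not DVC's implementation: `run_stage`, `checkpoint_every`, and `interrupt_at` are hypothetical names, integers stand in for workspace state, and a list stands in for the recorded checkpoint commits.

```python
def run_stage(steps, checkpoint_every, interrupt_at=None):
    """Return the list of recorded checkpoint states for a toy stage run."""
    commits = []
    state = 0        # stand-in for the stage's current outputs
    last_saved = 0   # state captured by the most recent checkpoint

    for step in range(1, steps + 1):
        if interrupt_at is not None and step == interrupt_at:
            # Interrupted: checkpoints recorded so far are kept as they are.
            return commits
        state = step
        if step % checkpoint_every == 0:
            commits.append(state)  # the stage signals a checkpoint
            last_saved = state

    if state != last_saved:
        # Stage finished with changes since the last checkpoint:
        # one additional commit captures the final outcome.
        commits.append(state)
    return commits


finished = run_stage(steps=5, checkpoint_every=2)
finished_clean = run_stage(steps=4, checkpoint_every=2)
interrupted = run_stage(steps=5, checkpoint_every=2, interrupt_at=4)
```

In this model a finished run gains a final commit only when something changed after the last checkpoint, and an interrupted run simply keeps whatever was recorded before the interruption.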

So should (overwriting any further changes done by the stage) get dropped?

there's an additional commit generated at the end of the stage

Ah this I didn't realize. OK I rewrote this based on that and Dave's suggestions (thanks) for now.

Even if the process is interrupted, we don't overwrite anything in the user's workspace.

Woah so why did I think that? A bit confused on this, so a final checkpoint that reflects the resulting workspace "is applied" (what is there to apply?) but the last checkpoint for interrupted runs is NOT applied?

Here's the latest language for reference:

If the process gets interrupted (e.g. with Ctrl + `C`), DVC will [apply](/doc/command-reference/exp/apply) the last checkpoint to the <abbr>workspace</abbr> (overwriting any further changes).

Pending confirmation from @pmrowla, maybe we should remove any reference to apply or the workspace if nothing gets overwritten? Maybe this sentence isn't needed at all? Do we feel it necessary to call out process interruption to make explicit that this is an expected workflow? If so, maybe something like:

If the process gets interrupted (e.g. with Ctrl + `C`), DVC will keep all checkpoints recorded before interruption.

OK. Removed apply and workspace references for now per Peter's comment above but it would def. be great to confirm!

dberenbaum · 2021-03-19T12:23:40Z

"Previously, --reset would reset the entire dvc.lock file to HEAD" is in https://dvc.org/doc/command-reference/exp/run#checkpoints as "Use --reset to roll-back the workspace to HEAD" for example. But it's only been public for a little while indeed. If this is deemed obvi by the ML guys cc @dberenbaum then sure, one less detail 👍

I don't know if it's obvious, but it's a sane default and it's only been public for a little while, so I don't think we need to note the change from previous behavior.

jorgeorpinel · 2021-03-19T20:37:59Z

Seems like the text addresses all the comments although I still have some questions but we can manage them separately. Thanks guys, merging!

cmd: document updated exp run --reset behavior

4a31247

pmrowla mentioned this pull request Mar 10, 2021

checkpoints: completely remove checkpoint outs on exp run --reset iterative/dvc#5586

Merged

pmrowla requested review from jorgeorpinel and dberenbaum March 10, 2021 07:15

dberenbaum approved these changes Mar 10, 2021

jorgeorpinel reviewed Mar 17, 2021

jorgeorpinel reviewed Mar 17, 2021

jorgeorpinel reviewed Mar 17, 2021

Update content/docs/command-reference/exp/run.md

d18e64d

jorgeorpinel reviewed Mar 17, 2021

jorgeorpinel previously requested changes Mar 17, 2021

Merge branch 'master' into checkpoint-reset

92292a0

ref: rewrite new --reset behavior edits

0733e32

jorgeorpinel added the 2.0 release label Mar 18, 2021

pmrowla commented Mar 18, 2021

pmrowla commented Mar 18, 2021

dberenbaum reviewed Mar 18, 2021

jorgeorpinel added 4 commits March 18, 2021 14:42

Merge branch 'master' into checkpoint-reset

3ab4363

ref: exp run: note aboug gc after --reset

3e17e5d

per iterative#2286 (comment)

ref: copy edits to exp run

953dd0c

ref: clarify about finished vs interrupted checkpoint exp runs

136846b

per should (overwriting any further changes done by the stage) get dropped?

jorgeorpinel and others added 3 commits March 18, 2021 15:36

ref: update note about gc in exp run

3fd6650

per iterative#2286 (review)

Update content/docs/command-reference/exp/run.md

6947d59

Co-authored-by: Peter Rowlands (변기호) <[email protected]>

ref: clarify about first exp run for checkpoints

9e1a7e9

per iterative#2286 (review)

pmrowla commented Mar 19, 2021

jorgeorpinel added 2 commits March 19, 2021 05:02

ref: remove sentence about first checkpoit exp run

cda047f

per iterative#2286 (comment)

ref: remove apply ref from ckpt exp run

92c3109

per iterative#2286 (comment)

jorgeorpinel merged commit b099934 into iterative:master Mar 19, 2021

pmrowla deleted the checkpoint-reset branch March 20, 2021 00:19

This was referenced Mar 20, 2021

ref: general updates to Experiments #2300

Merged

exp: review implementation details (run, remove, gc) #2325

Merged

jorgeorpinel added a commit that referenced this pull request Apr 25, 2021

ref: clarify about first checkpoint exp run

1dafbd8

rel #2286 (review)

jorgeorpinel mentioned this pull request Apr 25, 2021

ref: corrections to exp run/apply/branch + #2417

Merged
