Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd: document updated exp run --reset behavior #2286

Merged
merged 13 commits into from
Mar 19, 2021

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Mar 10, 2021

You may disregard these recommendations if you used the Edit on GitHub button from dvc.org to improve a doc in place.

❗ Please read the guidelines in the Contributing to the Documentation list if you make any substantial changes to the documentation or JS engine.

🐛 Please make sure to mention Fix #issue (if applicable) in the description of the PR. This causes GitHub to close it automatically when the PR is merged.

Please choose to allow us to edit your branch when creating the PR.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

Documents updated exp run --reset behavior and clarifies some checkpoints-related issues around --queue/--run-all.

See iterative/dvc#5553 (comment) for behavior/change explanation

@pmrowla
Copy link
Contributor Author

pmrowla commented Mar 12, 2021

core DVC change for this has been merged and is now available in master

Comment on lines 88 to 89
Use `--reset` to roll-back the workspace to `HEAD` and restart the whole
experiment. Alternatively, you can use `--rev` to continue from a specific
(previous) checkpoint.
Use `--reset` to reset (remove) any existing `checkpoint` outputs in your
workspace before running the experiment. Note that `--reset` overrides any

This comment was marked as resolved.

Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clearer, here are the steps implied in this section (checkpoints workflow with reset):

  1. enable checkpoints in the stage ✅
  2. update code to write signal files ✅
  3. exo run - saves a bunch of checkpoints ✅ (possibly interrupt)
  4. > exp run --reset < - We are here

Why would we want to reset? Again, to restart the experiment, right? (to go back to / try again step 2.)

Copy link
Contributor Author

@pmrowla pmrowla Mar 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For checkpoint outs, reset removes the specified outs entirely (so that the specified intermediate model can be retrained from scratch and not from the HEAD state). It used to roll back to HEAD, but this was changed (as of the core PR for these docs).

see iterative/dvc#5553 for the product-side discussion on why this change was made

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for the context.

For checkpoint outs

Is there another use case for --reset?

Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move exp run --reset up in the steps or say it's an alternative to plain exp run from the get-go?

0, 1, (same as above)
2a. exp run - runs experiment from HEAD state
2b. exp run --reset - runs experiment from scratch

Copy link
Contributor Author

@pmrowla pmrowla Mar 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's a new stage with new checkpoint outputs, both 2a and 2b are the same thing here (since there would not be any HEAD dvc.lock entries for the checkpoint outs).

Is there another use case for --reset?

No, --reset does not do anything unless a stage has checkpoint outs.

Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK so the first exp run of a checkpoint pipeline is the same as exp run --reset, got it. So yeah this whole section needs a bit of work to include all these key points. Please address my other comments/suggestions when you can and lmk to take this over a some final reorg of ideas (I'll try it in the meantime to confirm I got it 100% now).

Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more Q @pmrowla. exp run --reset still deletes existing checkpoints also, right? Meaning the Git commits (I'm guessing the cache is left alone).

Copy link
Contributor Author

@pmrowla pmrowla Mar 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The experiment ref for the previously existing checkpoint run is what gets "deleted". In practice, as long as the user's stage is deterministic the old ref is just moved to point to the new run.

Git commits are left alone, git will garbage collect them on its own once they are no longer pointed to by any references.

Cache is also left alone, it should be cleaned up using gc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted.

Comment on lines 89 to 90
workspace before running the experiment. Note that `--reset` overrides any
existing `dvc.lock` entries for `checkpoint` outputs. Alternatively, you can use
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--reset overrides any existing dvc.lock entries for checkpoint outputs

I'm not getting this note, what are we trying to warn about?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If your dvc.lock file is git-committed and the HEAD version of dvc.lock has a hash for the specified checkpoint out, that hash will be ignored on --reset, and the out will be completely removed before reproducing the pipeline (rather than being checked out to HEAD's hash).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To put this in user terms, if you save a model at each checkpoint so that its weights get loaded as the starting point for the next training epoch, --reset will drop the model so that training will start without any existing model weights.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I think with an updated explanation as suggested in #2286 (comment) we shouldn't need this warning here (it should be evident). But we can preserve it elsewhere, perhaps in the actual option description (which I haven't gotten to but will soon).

Copy link
Contributor

@jorgeorpinel jorgeorpinel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left another round of commends and some specific suggestions ☝️

Also, from iterative/dvc#5586, I'm not seeing this currently reflected/emphaiszed:

  • "When --reset is used, all other (non-checkpoint) outs in dvc.lock will be unchanged
    Previously, --reset would reset the entire dvc.lock file to HEAD"
  • "--reset is now mutually exclusive with --rev ... we now explicitly error out"

Should we further clarify on those points?

p.s. I can take this over at some point if needed.

@pmrowla
Copy link
Contributor Author

pmrowla commented Mar 18, 2021

Should we further clarify on those points?

The new behavior would be the expected default for anyone in ML using checkpoints (at least according to Dave+Dmitry), so I don't think it needs any more clarification.

Also the old behavior wasn't documented anywhere in the first place, and it was changed early enough in the 2.0 release cycle that it is unlikely that anyone will be running dvc while also intentionally expecting the original behavior.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Mar 18, 2021

"Previously, --reset would reset the entire dvc.lock file to HEAD" is in https://dvc.org/doc/command-reference/exp/run#checkpoints as "Use --reset to roll-back the workspace to HEAD" for example. But it's only been public for a little while indeed. If this is deemed obvi by the ML guys cc @dberenbaum then sure, one less detail 👍

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Mar 18, 2021

@pmrowla I guess at this point I should officially take it over. I'll ping you with more Qs if any, thanks!

p.s. if possible in the future pls use branches directly on the upstream for this repo (easier to review and take over) 🙂

@jorgeorpinel
Copy link
Contributor

OK guys do you want to double check the latest text? Thanks

Comment on lines 81 to 86
You can now use `dvc exp run` to begin the experiment. This removes any
`checkpoint` outputs before running the experiment (regardless of whether they
have cached versions). When the process finishes or gets interrupted (e.g. with
Ctrl + `C`), DVC will [apply](/doc/command-reference/exp/apply) the last
checkpoint to the <abbr>workspace</abbr> (overwriting any further changes done
by the stage).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think wording it this way might end up confusing users?

This removes any checkpoint outputs before running the experiment (regardless of whether they have cached versions).

checkpoint outs are only removed before running the experiment if --reset is specified. In a normal dvc exp run with no other arguments, we continue the experiment using the most recent version of that output.

If there is no previous run at all and the output has never existed before, then yes dvc exp run will do the same thing as dvc exp run --reset, but only in that specific case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this sentence get deleted? There's nothing special about what happens when you begin a checkpoint experiment. It's only different after at least one checkpoint has been generated.

Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about what Peter just mentioned, @dberenbaum? "If there is no previous run at all and the output has never existed before, then yes dvc exp run will do the same thing as dvc exp run --reset" 🙂

This comment was marked as resolved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not really a special case though, and even in that case nothing is being removed (since there is no initial data to even remove in the first place)

Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I think my latest update should make it very clear. It's getting a little long/convoluted but at least it's complete and correct (I think). See 9e1a7e9.

Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even in that case nothing is being removed (since there is no initial data to even remove in the first place

What if you just changed regular outputs (previously cached) into checkpoint outputs in HEAD? (to migrate from regular pipeline to a checkpoint experiment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the updated sentence is:

On this very first use,
any `checkpoint` outputs are deleted before running the experiment (regardless
of whether they have <abbr>cached</abbr> versions). 

It's getting a little long/convoluted but at least it's complete and correct (I think).

Agreed on all counts 😄 . I think I get your point now that it's unclear whether an initial checkpoint run treats existing outs as checkpoints or not. Ideally, this is called out as an edge case since it might add confusion for the typical case where no outs exist yet, but 🤷 .

Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the typical case where no outs exist yet

OK if its an edge case we can leave it out. Always good to simplify docs. Removed that sentence for now... Not sure I fully understand all the possible scenarios but will try to QA separately.

@pmrowla
Copy link
Contributor Author

pmrowla commented Mar 18, 2021

@jorgeorpinel one thing that I wasn't sure about, otherwise LGTM

Copy link
Contributor

@dberenbaum dberenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with the comments from @pmrowla and added one about the "overwriting any further changes" language. I already approved, so let me know if you want me to follow up again on anything, or else I'll leave merging to your discretion.

content/docs/command-reference/exp/run.md Outdated Show resolved Hide resolved
Comment on lines 81 to 86
You can now use `dvc exp run` to begin the experiment. This removes any
`checkpoint` outputs before running the experiment (regardless of whether they
have cached versions). When the process finishes or gets interrupted (e.g. with
Ctrl + `C`), DVC will [apply](/doc/command-reference/exp/apply) the last
checkpoint to the <abbr>workspace</abbr> (overwriting any further changes done
by the stage).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this sentence get deleted? There's nothing special about what happens when you begin a checkpoint experiment. It's only different after at least one checkpoint has been generated.

Comment on lines 85 to 86
checkpoint to the <abbr>workspace</abbr> (overwriting any further changes done
by the stage).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
checkpoint to the <abbr>workspace</abbr> (overwriting any further changes done
by the stage).
checkpoint to the <abbr>workspace</abbr> (overwriting any further changes done
by the stage if the process was interrupted).

Alternatively:

When the process finishes, DVC will [apply](/doc/command-reference/exp/apply) the last
checkpoint to the <abbr>workspace</abbr>. If the process gets interrupted  (e.g. with
Ctrl + `C`), DVC will overwrite any changes done by the stage after the last checkpoint.

I find "overwriting any further changes" confusing unless it's specifically tied to the process being interrupted. If the process isn't interrupted, I think a final checkpoint is automatically generated to save the final outcome, so it's not overwriting further changes (@pmrowla can you confirm if that's correct?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there's an additional commit generated at the end of the stage (assuming anything changed since the last time the user called make_checkpoint), and that is what gets applied. Even if the process is interrupted, we don't overwrite anything in the user's workspace.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So should (overwriting any further changes done by the stage) get dropped?

Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's an additional commit generated at the end of the stage

Ah this I didn't realize. OK I rewrote this based on that and Dave's suggestions (thanks) for now.

Even if the process is interrupted, we don't overwrite anything in the user's workspace.

Woah so why did I think that? A bit confused on this, so a final checkpoint that reflects the resulting workspace "is applied" (what is there to apply?) but the last checkpoint for interrupted runs is NOT applied?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the latest language for reference:

If the process gets
interrupted (e.g. with Ctrl + `C`), DVC will
[apply](/doc/command-reference/exp/apply) the last checkpoint to the
<abbr>workspace</abbr> (overwriting any further changes).

Pending confirmation from @pmrowla, maybe we should remove any reference to apply or the workspace if nothing gets overwritten? Maybe this sentence isn't needed at all? Do we feel it necessary to call out process interruption to make explicit that this is an expected workflow? If so, maybe something like:

If the process gets
interrupted (e.g. with Ctrl + `C`), DVC will
keep all checkpoints recorded before interruption.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Removed apply and workspace references for now per Peter's comment above but it would def. be great to confirm!

content/docs/command-reference/exp/run.md Outdated Show resolved Hide resolved
@dberenbaum
Copy link
Contributor

"Previously, --reset would reset the entire dvc.lock file to HEAD" is in https://dvc.org/doc/command-reference/exp/run#checkpoints as "Use --reset to roll-back the workspace to HEAD" for example. But it's only been public for a little while indeed. If this is deemed obvi by the ML guys cc @dberenbaum then sure, one less detail 👍

I don't know if it's obvious, but it's a sane default and it's only been public for a little while, so I don't think we need to note the change from previous behavior.

@jorgeorpinel
Copy link
Contributor

Seems like the text addresses all the comments although I still have some questions but we can manage them separately. Thanks guys, merging!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants