Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exp run: reset when the lock file is in Git #5553

Closed
dmpetrov opened this issue Mar 5, 2021 · 16 comments · Fixed by #5586
Closed

exp run: reset when the lock file is in Git #5553

dmpetrov opened this issue Mar 5, 2021 · 16 comments · Fixed by #5586
Assignees
Labels
feature is a feature

Comments

@dmpetrov
Copy link
Member

dmpetrov commented Mar 5, 2021

Currently, the dvc exp run --reset option sets the checkpoints files to the latest commited dvc.lock (if that file is in Git). In many cases (probably in most of the cases), the training needs to be started from scratch but that seems not possible in the current design.

We should introduce something like --hard-reset to start training from scratch even for commited dvc.lock?

Also, I'm wondering about inconsistency between commited dvc.lock and not commited. It looks like --reset has a different behavior. Is there a way to unify it? The --hard-reser option does not unify it and I'm not sure it is the best way to solve the problem.

@dmpetrov dmpetrov added the feature is a feature label Mar 5, 2021
@dberenbaum
Copy link
Collaborator

Also, I'm wondering about inconsistency between commited dvc.lock and not commited. It looks like --reset has a different behavior. Is there a way to unify it? The --hard-reser option does not unify it and I'm not sure it is the best way to solve the problem.

Could you be more specific with this point? What is the inconsistency/different behavior you see?

@dmpetrov
Copy link
Member Author

dmpetrov commented Mar 5, 2021

@dberenbaum from user's point of view, dvc exp run --reset has a different behavior when dvc.lock is commited and when it is not.

An example: if I need to start training from scratch is not enough to just run dvc exp run --reset. I need to check if lock file is commited or not.

After more thinking about this scenario... It feels like removing model and starting training from scratch is more frequent use case then starting from the previous commit. Instead of introducing --hard-reset we can change the default behavior to model removal while introduce --reset-to-previous for the case when the latest commited model is needed.

@dberenbaum
Copy link
Collaborator

It seems reasonable for the default --reset to always run the experiment from scratch (as if the stage has never been run before). One concern I would have is to make sure it doesn't just drop dvc.lock in case there are other pipelines that are not part of the experiment.

Rather than have a --reset-to-previous flag, maybe it makes more sense to have an option to drop all experiments in the workspace? Something like dvc exp remove --workspace. People may want to clear their workspace and start over, even in experiments without checkpoints. I have run into lots of times where I thought, "Whoops, didn't mean to do that, let me get back to a clean workspace." See also #5437.

Thoughts @pmrowla ?

@dmpetrov
Copy link
Member Author

dmpetrov commented Mar 5, 2021

default --reset to always run the experiment from scratch

Yes, it seems like the best option.

Something like dvc exp remove --workspace

Is it equal to git checkout . && dvc checkout? If os, it is a good idea (this shortcut should be implemented I think) but it might not solve the issue because a user might need to train with previous model but the current code in the workspace.

It feels like a part of a bigger problem - how to granularly checkout a particular checkpoint and start training from it (without modifying workspace).

@dberenbaum
Copy link
Collaborator

Is it equal to git checkout . && dvc checkout? If os, it is a good idea (this shortcut should be implemented I think) but it might not solve the issue because a user might need to train with previous model but the current code in the workspace.

No, I mean dropping all experiments since the last commit. I probably shouldn't have called it the workspace (I was thinking of it as the opposite of dvc exp gc --workspace).

It feels like a part of a bigger problem - how to granularly checkout a particular checkpoint and start training from it (without modifying workspace).

Do you think this is common? I think it would be easy to implement but maybe confusing as a UI.

@dmpetrov
Copy link
Member Author

dmpetrov commented Mar 6, 2021

how to granularly checkout a particular checkpoint and start training from it (without modifying workspace).

Do you think this is common? I think it would be easy to implement but maybe confusing as a UI.

Yes, I think so. For example, after a few code changes the model is still diverging. So, you decided to get the model from 50 epochs back but train on the latest code changes (in the current workspace). In this case, using dvc exp apply will destroy the workspace which you might not like. It can be something like dvc exp get HEAD^^ model.pth logs/

@pmrowla
Copy link
Contributor

pmrowla commented Mar 9, 2021

So to clarify action points here, the desired behavior would be:

  • On dvc exp run --reset [target], any checkpoint outputs in target (or the entire pipeline if no target is provided) will be removed (regardless of whether or not dvc.lock contains an existing hash for that checkpoint out)
  • Everything else in the workspace's dvc.lock will remain untouched

Rather than have a --reset-to-previous flag, maybe it makes more sense to have an option to drop all experiments in the workspace? Something like dvc exp remove --workspace. People may want to clear their workspace and start over, even in experiments without checkpoints. I have run into lots of times where I thought, "Whoops, didn't mean to do that, let me get back to a clean workspace." See also #5437.

I can see how this would be useful, but it's a bit of a separate issue from the checkpoint reset one.

Currently, if you have a "clean" workspace (with no existing experiments derived from HEAD), as long as HEAD contains a dvc.lock with a hash for some checkpoint output, that existing hash will be used as the starting point for that checkpoint out the next time you run exp run. In this scenario, it does not matter whether or not --reset is used, since there is no existing experiment to resume - any run would be a "new" checkpoint experiment.

However, with the proposed --reset change, the --reset flag would affect behavior in this scenario:

  • without --reset, the new checkpoint experiment would use the checkpoint out hash from HEAD's dvc.lock as the starting point.
  • with --reset, the new checkpoint experiment would start with a removed checkpoint out (ignoring the hash from dvc.lock

@dberenbaum
Copy link
Collaborator

Thanks, @pmrowla. That's a great summary. In the case of a clean workspace that contains a dvc.lock file, changing the behavior of --reset as you suggested makes sense to me (and makes the flag at least meaningful in this scenario). Any concerns?

@pmrowla
Copy link
Contributor

pmrowla commented Mar 10, 2021

This behavior will also make exp run more consistent with dvc repro, where a given run will respect any values present in dvc.lock at the start of the run (other than checkpoints with --reset). If a user wants a completely "clean" workspace before repro (and now exp run), they need to explicitly do it with git reset --hard.

@dberenbaum
Copy link
Collaborator

For example, after a few code changes the model is still diverging. So, you decided to get the model from 50 epochs back but train on the latest code changes (in the current workspace).

I believe that dvc exp run --rev would allow for this, right? Maybe I'm misunderstanding your point. As a different viewpoint, I'll also say that I'm unlikely to use this kind of workflow because changing code or params in the middle of an experiment kind of breaks clean experimentation.

This behavior will also make exp run more consistent with dvc repro, where a given run will respect any values present in dvc.lock at the start of the run (other than checkpoints with --reset). If a user wants a completely "clean" workspace before repro (and now exp run), they need to explicitly do it with git reset --hard.

I'm unclear on how exp run is more consistent with dvc repro than before. Can you explain? Are there cases outside of --reset where behavior has changed?

Getting back to a clean workspace requires removing all tracked and untracked files, so I think even git reset --hard may be insufficent if untracked files are present (git stash --include-untracked should work). Also, to clarify, this is not necessarily what I meant in the comment above. I meant something more like rm -rf .git/refs/exps but for only experiments based off HEAD. I can document that in a separate issue.

@pmrowla
Copy link
Contributor

pmrowla commented Mar 11, 2021

I'm unclear on how exp run is more consistent with dvc repro than before. Can you explain? Are there cases outside of --reset where behavior has changed?

exp run would previously ignore any uncommitted modifications to dvc.lock files (i.e. modified dependency hashes) and would always check out the HEAD version of the dependency.

Related: 2nd point in #5593
(workspace modifications to data dependencies still need to be dvc commited, but the dvc.lock or .dvc files themselves do not need to be git committed now)

@dberenbaum
Copy link
Collaborator

dberenbaum commented Mar 11, 2021

Do you have an example? I've been playing around with this feature in https://github.com/dberenbaum/dvc-checkpoint, and I was finding that I was getting the same behavior with and without this PR. I tried the following:

$ dvc -V
2.0.3+4a7506 # before this PR
$ dvc exp run -q
$ echo 100 > bar
$ dvc commit bar
outputs ['bar'] of stage: 'bar.dvc' changed. Are you sure you want to commit it? [y/n] y
$ dvc exp run -q
$ dvc exp show --no-pager
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━┓
┃ Experiment            ┃ Created      ┃ epoch ┃ mult ┃ params.yaml:start ┃ checkpoint.py:EPOCHS ┃ … ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━┩
│ workspace             │ -            │     3 │  300 │ 0                 │ 2                    │ 1 │
│ main                  │ Mar 10, 2021 │     - │    0 │ 2                 │ 1                    │   │
│ │ ╓ exp-04c46         │ 08:55 AM     │     3 │  300 │ 0                 │ 2                    │ 1 │
│ │ ╟ 39978e4 (670c8ee) │ 08:55 AM     │     2 │  200 │ 0                 │ 2                    │ 1 │
│ │ ╓ exp-e44da         │ 08:55 AM     │     1 │    1 │ 0                 │ 2                    │ 1 │
│ ├─╨ 83e6201           │ 08:55 AM     │     0 │    0 │ 0                 │ 2                    │ 1 │
└───────────────────────┴──────────────┴───────┴──────┴───────────────────┴──────────────────────┴───┘

The results with this PR are the same.

@pmrowla
Copy link
Contributor

pmrowla commented Mar 12, 2021

@dberenbaum your example doesn't have a committed dvc.lock and does not use --reset.

Using this test script:

#!/bin/bash

set -e
# set -x

REPO=dvc-checkpoint
rm -rf $REPO

git clone [email protected]:dberenbaum/dvc-checkpoint.git
cd $REPO
# repo does not contain an intial cache entry for bar (or a remote to pull
# from, but the git committed bar.dvc has a hash for "1"
echo 1 > bar
dvc commit -f bar

dvc exp run -q
git add .
git commit -m 'commit dvc.lock'

dvc -V
dvc exp run --reset -q
dvc exp show --no-pager

In our HEAD (main branch tip) committed dvc.lock file contains a hash for foo so that epoch starts at "1" for the final dvc exp run. Since there are no existing checkpoints to resume based on HEAD, using --reset matters (as outlined above #5553 (comment))

In current DVC master we get an experiment that starts with a foo containing 1 (since dvc.lock is git reset --hard to the HEAD state) and iterates through 2, 3

2.0.5+2db550
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Experiment    ┃ Created  ┃ epoch ┃ mult ┃ params.yaml:start ┃ checkpoint.py:EPOCHS ┃ checkpoint.py:STEP ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ workspace     │ -        │     3 │    3 │ 0                 │ 2                    │ 1                  │
│ main          │ 02:35 PM │     1 │    1 │ 0                 │ 2                    │ 1                  │
│ │ ╓ exp-bc2fa │ 02:35 PM │     3 │    3 │ 0                 │ 2                    │ 1                  │
│ ├─╨ 77b4801   │ 02:35 PM │     2 │    2 │ 0                 │ 2                    │ 1                  │
└───────────────┴──────────┴───────┴──────┴───────────────────┴──────────────────────┴────────────────────┘

After this PR we get an experiment that starts with a completely removed foo and iterate through epoch 0, 1

2.0.5+2c57a7
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Experiment    ┃ Created  ┃ epoch ┃ mult ┃ params.yaml:start ┃ checkpoint.py:EPOCHS ┃ checkpoint.py:STEP ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ workspace     │ -        │     1 │    1 │ 0                 │ 2                    │ 1                  │
│ main          │ 02:42 PM │     1 │    1 │ 0                 │ 2                    │ 1                  │
│ │ ╓ exp-e44da │ 02:42 PM │     1 │    1 │ 0                 │ 2                    │ 1                  │
│ ├─╨ 2900c0d   │ 02:42 PM │     0 │    0 │ 0                 │ 2                    │ 1                  │
└───────────────┴──────────┴───────┴──────┴───────────────────┴──────────────────────┴────────────────────┘

@dberenbaum
Copy link
Collaborator

Are there cases outside of --reset where behavior has changed?

Thanks, I get the changes to --reset. I still don't understand if or how behavior has changed outside of --reset?

@pmrowla
Copy link
Contributor

pmrowla commented Mar 12, 2021

@dberenbaum for resume runs, we will now respect dvc commit'ed workspace modifications to a dvc.lock file (for all outputs including checkpoint). Prior to this change, dvc.lock was always git reset --hard to the last checkpoint git commit. This was a non-intuitive before, since data deps tracked in .dvc files (like bar in your example) would not be git reset --hard in this way.

So before this change, dvc commited changes to .dvc files were respected, but committed changes to dvc.lock were not. After the change, both cases are treated the same.

Note that workspace changes which are not dvc commit'ed (i.e. written to dvc.lock files) will still be lost, with or without this PR (since exp runs start with dvc checkout).

#!/bin/bash

set -e
# set -x

REPO=dvc-checkpoint
rm -rf $REPO

git clone [email protected]:dberenbaum/dvc-checkpoint.git
cd $REPO
echo 1 > bar
dvc commit -f bar

dvc exp run -q
git add .
git commit -m 'commit dvc.lock'

dvc -V
echo 100 > bar
dvc commit -f bar
dvc exp run -q

echo 100 > foo
dvc commit -f foo
dvc exp run -q
dvc exp show --no-pager

In master:

2.0.5+2db550
Updating lock file 'dvc.lock'
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Experiment    ┃ Created  ┃ epoch ┃ mult ┃ params.yaml:start ┃ checkpoint.py:EPOCHS ┃ checkpoint.py:STEP ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ workspace     │ -        │     5 │  500 │ 0                 │ 2                    │ 1                  │
│ main          │ 01:11 AM │     1 │    1 │ 0                 │ 2                    │ 1                  │
│ │ ╓ exp-04c46 │ 01:11 AM │     5 │  500 │ 0                 │ 2                    │ 1                  │
│ │ ╟ 8363e81   │ 01:11 AM │     4 │  400 │ 0                 │ 2                    │ 1                  │
│ │ ╟ f09ff5b   │ 01:11 AM │     3 │  300 │ 0                 │ 2                    │ 1                  │
│ ├─╨ 28f86ed   │ 01:11 AM │     2 │  200 │ 0                 │ 2                    │ 1                  │
└───────────────┴──────────┴───────┴──────┴───────────────────┴──────────────────────┴────────────────────┘

After this PR:

2.0.5+2c57a7
Updating lock file 'dvc.lock'
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Experiment            ┃ Created  ┃ epoch ┃  mult ┃ params.yaml:start ┃ checkpoint.py:EPOCHS ┃ checkpoint.py:STEP ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ workspace             │ -        │   102 │ 10200 │ 0                 │ 2                    │ 1                  │
│ main                  │ 01:16 AM │     1 │     1 │ 0                 │ 2                    │ 1                  │
│ │ ╓ exp-85601         │ 01:16 AM │   102 │ 10200 │ 0                 │ 2                    │ 1                  │
│ │ ╟ d80eb24 (3c24322) │ 01:16 AM │   101 │ 10100 │ 0                 │ 2                    │ 1                  │
│ │ ╓ exp-04c46         │ 01:16 AM │     3 │   300 │ 0                 │ 2                    │ 1                  │
│ ├─╨ 2b81a26           │ 01:16 AM │     2 │   200 │ 0                 │ 2                    │ 1                  │
└───────────────────────┴──────────┴───────┴───────┴───────────────────┴──────────────────────┴────────────────────┘

Also note that since dvc.lock was modified in between runs, this is treated as a "resume with modifications", so the final exp run gets put in its own new ref/branch, rather than extending the original branch.

@dberenbaum
Copy link
Collaborator

Perfect, thanks for the explanation!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature is a feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants