exp run: reset when the lock file is in Git #5553

dmpetrov · 2021-03-05T08:30:12Z

Currently, the dvc exp run --reset option sets the checkpoints files to the latest commited dvc.lock (if that file is in Git). In many cases (probably in most of the cases), the training needs to be started from scratch but that seems not possible in the current design.

We should introduce something like --hard-reset to start training from scratch even for commited dvc.lock?

Also, I'm wondering about inconsistency between commited dvc.lock and not commited. It looks like --reset has a different behavior. Is there a way to unify it? The --hard-reser option does not unify it and I'm not sure it is the best way to solve the problem.

The text was updated successfully, but these errors were encountered:

dberenbaum · 2021-03-05T15:35:06Z

Also, I'm wondering about inconsistency between commited dvc.lock and not commited. It looks like --reset has a different behavior. Is there a way to unify it? The --hard-reser option does not unify it and I'm not sure it is the best way to solve the problem.

Could you be more specific with this point? What is the inconsistency/different behavior you see?

dmpetrov · 2021-03-05T17:46:18Z

@dberenbaum from user's point of view, dvc exp run --reset has a different behavior when dvc.lock is commited and when it is not.

An example: if I need to start training from scratch is not enough to just run dvc exp run --reset. I need to check if lock file is commited or not.

After more thinking about this scenario... It feels like removing model and starting training from scratch is more frequent use case then starting from the previous commit. Instead of introducing --hard-reset we can change the default behavior to model removal while introduce --reset-to-previous for the case when the latest commited model is needed.

dberenbaum · 2021-03-05T21:20:26Z

It seems reasonable for the default --reset to always run the experiment from scratch (as if the stage has never been run before). One concern I would have is to make sure it doesn't just drop dvc.lock in case there are other pipelines that are not part of the experiment.

Rather than have a --reset-to-previous flag, maybe it makes more sense to have an option to drop all experiments in the workspace? Something like dvc exp remove --workspace. People may want to clear their workspace and start over, even in experiments without checkpoints. I have run into lots of times where I thought, "Whoops, didn't mean to do that, let me get back to a clean workspace." See also #5437.

Thoughts @pmrowla ?

dmpetrov · 2021-03-05T23:02:26Z

default --reset to always run the experiment from scratch

Yes, it seems like the best option.

Something like dvc exp remove --workspace

Is it equal to git checkout . && dvc checkout? If os, it is a good idea (this shortcut should be implemented I think) but it might not solve the issue because a user might need to train with previous model but the current code in the workspace.

It feels like a part of a bigger problem - how to granularly checkout a particular checkpoint and start training from it (without modifying workspace).

dberenbaum · 2021-03-06T00:19:15Z

Is it equal to git checkout . && dvc checkout? If os, it is a good idea (this shortcut should be implemented I think) but it might not solve the issue because a user might need to train with previous model but the current code in the workspace.

No, I mean dropping all experiments since the last commit. I probably shouldn't have called it the workspace (I was thinking of it as the opposite of dvc exp gc --workspace).

It feels like a part of a bigger problem - how to granularly checkout a particular checkpoint and start training from it (without modifying workspace).

Do you think this is common? I think it would be easy to implement but maybe confusing as a UI.

dmpetrov · 2021-03-06T00:42:51Z

how to granularly checkout a particular checkpoint and start training from it (without modifying workspace).

Do you think this is common? I think it would be easy to implement but maybe confusing as a UI.

Yes, I think so. For example, after a few code changes the model is still diverging. So, you decided to get the model from 50 epochs back but train on the latest code changes (in the current workspace). In this case, using dvc exp apply will destroy the workspace which you might not like. It can be something like dvc exp get HEAD^^ model.pth logs/

pmrowla · 2021-03-09T00:26:56Z

So to clarify action points here, the desired behavior would be:

On dvc exp run --reset [target], any checkpoint outputs in target (or the entire pipeline if no target is provided) will be removed (regardless of whether or not dvc.lock contains an existing hash for that checkpoint out)
Everything else in the workspace's dvc.lock will remain untouched

Rather than have a --reset-to-previous flag, maybe it makes more sense to have an option to drop all experiments in the workspace? Something like dvc exp remove --workspace. People may want to clear their workspace and start over, even in experiments without checkpoints. I have run into lots of times where I thought, "Whoops, didn't mean to do that, let me get back to a clean workspace." See also #5437.

I can see how this would be useful, but it's a bit of a separate issue from the checkpoint reset one.

Currently, if you have a "clean" workspace (with no existing experiments derived from HEAD), as long as HEAD contains a dvc.lock with a hash for some checkpoint output, that existing hash will be used as the starting point for that checkpoint out the next time you run exp run. In this scenario, it does not matter whether or not --reset is used, since there is no existing experiment to resume - any run would be a "new" checkpoint experiment.

However, with the proposed --reset change, the --reset flag would affect behavior in this scenario:

without --reset, the new checkpoint experiment would use the checkpoint out hash from HEAD's dvc.lock as the starting point.
with --reset, the new checkpoint experiment would start with a removed checkpoint out (ignoring the hash from dvc.lock

dberenbaum · 2021-03-09T13:34:46Z

Thanks, @pmrowla. That's a great summary. In the case of a clean workspace that contains a dvc.lock file, changing the behavior of --reset as you suggested makes sense to me (and makes the flag at least meaningful in this scenario). Any concerns?

pmrowla · 2021-03-10T07:19:57Z

This behavior will also make exp run more consistent with dvc repro, where a given run will respect any values present in dvc.lock at the start of the run (other than checkpoints with --reset). If a user wants a completely "clean" workspace before repro (and now exp run), they need to explicitly do it with git reset --hard.

dberenbaum · 2021-03-10T19:52:05Z

For example, after a few code changes the model is still diverging. So, you decided to get the model from 50 epochs back but train on the latest code changes (in the current workspace).

I believe that dvc exp run --rev would allow for this, right? Maybe I'm misunderstanding your point. As a different viewpoint, I'll also say that I'm unlikely to use this kind of workflow because changing code or params in the middle of an experiment kind of breaks clean experimentation.

This behavior will also make exp run more consistent with dvc repro, where a given run will respect any values present in dvc.lock at the start of the run (other than checkpoints with --reset). If a user wants a completely "clean" workspace before repro (and now exp run), they need to explicitly do it with git reset --hard.

I'm unclear on how exp run is more consistent with dvc repro than before. Can you explain? Are there cases outside of --reset where behavior has changed?

Getting back to a clean workspace requires removing all tracked and untracked files, so I think even git reset --hard may be insufficent if untracked files are present (git stash --include-untracked should work). Also, to clarify, this is not necessarily what I meant in the comment above. I meant something more like rm -rf .git/refs/exps but for only experiments based off HEAD. I can document that in a separate issue.

pmrowla · 2021-03-11T02:38:15Z

I'm unclear on how exp run is more consistent with dvc repro than before. Can you explain? Are there cases outside of --reset where behavior has changed?

exp run would previously ignore any uncommitted modifications to dvc.lock files (i.e. modified dependency hashes) and would always check out the HEAD version of the dependency.

Related: 2nd point in #5593
(workspace modifications to data dependencies still need to be dvc commited, but the dvc.lock or .dvc files themselves do not need to be git committed now)

dberenbaum · 2021-03-11T14:03:09Z

Do you have an example? I've been playing around with this feature in https://github.com/dberenbaum/dvc-checkpoint, and I was finding that I was getting the same behavior with and without this PR. I tried the following:

$ dvc -V
2.0.3+4a7506 # before this PR
$ dvc exp run -q
$ echo 100 > bar
$ dvc commit bar
outputs ['bar'] of stage: 'bar.dvc' changed. Are you sure you want to commit it? [y/n] y
$ dvc exp run -q
$ dvc exp show --no-pager
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━┓
┃ Experiment            ┃ Created      ┃ epoch ┃ mult ┃ params.yaml:start ┃ checkpoint.py:EPOCHS ┃ … ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━┩
│ workspace             │ -            │     3 │  300 │ 0                 │ 2                    │ 1 │
│ main                  │ Mar 10, 2021 │     - │    0 │ 2                 │ 1                    │   │
│ │ ╓ exp-04c46         │ 08:55 AM     │     3 │  300 │ 0                 │ 2                    │ 1 │
│ │ ╟ 39978e4 (670c8ee) │ 08:55 AM     │     2 │  200 │ 0                 │ 2                    │ 1 │
│ │ ╓ exp-e44da         │ 08:55 AM     │     1 │    1 │ 0                 │ 2                    │ 1 │
│ ├─╨ 83e6201           │ 08:55 AM     │     0 │    0 │ 0                 │ 2                    │ 1 │
└───────────────────────┴──────────────┴───────┴──────┴───────────────────┴──────────────────────┴───┘

The results with this PR are the same.

pmrowla · 2021-03-12T05:46:25Z

@dberenbaum your example doesn't have a committed dvc.lock and does not use --reset.

Using this test script:

#!/bin/bash

set -e
# set -x

REPO=dvc-checkpoint
rm -rf $REPO

git clone [email protected]:dberenbaum/dvc-checkpoint.git
cd $REPO
# repo does not contain an intial cache entry for bar (or a remote to pull
# from, but the git committed bar.dvc has a hash for "1"
echo 1 > bar
dvc commit -f bar

dvc exp run -q
git add .
git commit -m 'commit dvc.lock'

dvc -V
dvc exp run --reset -q
dvc exp show --no-pager

In our HEAD (main branch tip) committed dvc.lock file contains a hash for foo so that epoch starts at "1" for the final dvc exp run. Since there are no existing checkpoints to resume based on HEAD, using --reset matters (as outlined above #5553 (comment))

In current DVC master we get an experiment that starts with a foo containing 1 (since dvc.lock is git reset --hard to the HEAD state) and iterates through 2, 3

2.0.5+2db550
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Experiment    ┃ Created  ┃ epoch ┃ mult ┃ params.yaml:start ┃ checkpoint.py:EPOCHS ┃ checkpoint.py:STEP ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ workspace     │ -        │     3 │    3 │ 0                 │ 2                    │ 1                  │
│ main          │ 02:35 PM │     1 │    1 │ 0                 │ 2                    │ 1                  │
│ │ ╓ exp-bc2fa │ 02:35 PM │     3 │    3 │ 0                 │ 2                    │ 1                  │
│ ├─╨ 77b4801   │ 02:35 PM │     2 │    2 │ 0                 │ 2                    │ 1                  │
└───────────────┴──────────┴───────┴──────┴───────────────────┴──────────────────────┴────────────────────┘

After this PR we get an experiment that starts with a completely removed foo and iterate through epoch 0, 1

2.0.5+2c57a7
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Experiment    ┃ Created  ┃ epoch ┃ mult ┃ params.yaml:start ┃ checkpoint.py:EPOCHS ┃ checkpoint.py:STEP ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ workspace     │ -        │     1 │    1 │ 0                 │ 2                    │ 1                  │
│ main          │ 02:42 PM │     1 │    1 │ 0                 │ 2                    │ 1                  │
│ │ ╓ exp-e44da │ 02:42 PM │     1 │    1 │ 0                 │ 2                    │ 1                  │
│ ├─╨ 2900c0d   │ 02:42 PM │     0 │    0 │ 0                 │ 2                    │ 1                  │
└───────────────┴──────────┴───────┴──────┴───────────────────┴──────────────────────┴────────────────────┘

dberenbaum · 2021-03-12T13:48:20Z

Are there cases outside of --reset where behavior has changed?

Thanks, I get the changes to --reset. I still don't understand if or how behavior has changed outside of --reset?

pmrowla · 2021-03-12T16:20:48Z

@dberenbaum for resume runs, we will now respect dvc commit'ed workspace modifications to a dvc.lock file (for all outputs including checkpoint). Prior to this change, dvc.lock was always git reset --hard to the last checkpoint git commit. This was a non-intuitive before, since data deps tracked in .dvc files (like bar in your example) would not be git reset --hard in this way.

So before this change, dvc commited changes to .dvc files were respected, but committed changes to dvc.lock were not. After the change, both cases are treated the same.

Note that workspace changes which are not dvc commit'ed (i.e. written to dvc.lock files) will still be lost, with or without this PR (since exp runs start with dvc checkout).

#!/bin/bash

set -e
# set -x

REPO=dvc-checkpoint
rm -rf $REPO

git clone [email protected]:dberenbaum/dvc-checkpoint.git
cd $REPO
echo 1 > bar
dvc commit -f bar

dvc exp run -q
git add .
git commit -m 'commit dvc.lock'

dvc -V
echo 100 > bar
dvc commit -f bar
dvc exp run -q

echo 100 > foo
dvc commit -f foo
dvc exp run -q
dvc exp show --no-pager

In master:

2.0.5+2db550
Updating lock file 'dvc.lock'
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Experiment    ┃ Created  ┃ epoch ┃ mult ┃ params.yaml:start ┃ checkpoint.py:EPOCHS ┃ checkpoint.py:STEP ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ workspace     │ -        │     5 │  500 │ 0                 │ 2                    │ 1                  │
│ main          │ 01:11 AM │     1 │    1 │ 0                 │ 2                    │ 1                  │
│ │ ╓ exp-04c46 │ 01:11 AM │     5 │  500 │ 0                 │ 2                    │ 1                  │
│ │ ╟ 8363e81   │ 01:11 AM │     4 │  400 │ 0                 │ 2                    │ 1                  │
│ │ ╟ f09ff5b   │ 01:11 AM │     3 │  300 │ 0                 │ 2                    │ 1                  │
│ ├─╨ 28f86ed   │ 01:11 AM │     2 │  200 │ 0                 │ 2                    │ 1                  │
└───────────────┴──────────┴───────┴──────┴───────────────────┴──────────────────────┴────────────────────┘

After this PR:

2.0.5+2c57a7
Updating lock file 'dvc.lock'
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Experiment            ┃ Created  ┃ epoch ┃  mult ┃ params.yaml:start ┃ checkpoint.py:EPOCHS ┃ checkpoint.py:STEP ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ workspace             │ -        │   102 │ 10200 │ 0                 │ 2                    │ 1                  │
│ main                  │ 01:16 AM │     1 │     1 │ 0                 │ 2                    │ 1                  │
│ │ ╓ exp-85601         │ 01:16 AM │   102 │ 10200 │ 0                 │ 2                    │ 1                  │
│ │ ╟ d80eb24 (3c24322) │ 01:16 AM │   101 │ 10100 │ 0                 │ 2                    │ 1                  │
│ │ ╓ exp-04c46         │ 01:16 AM │     3 │   300 │ 0                 │ 2                    │ 1                  │
│ ├─╨ 2b81a26           │ 01:16 AM │     2 │   200 │ 0                 │ 2                    │ 1                  │
└───────────────────────┴──────────┴───────┴───────┴───────────────────┴──────────────────────┴────────────────────┘

Also note that since dvc.lock was modified in between runs, this is treated as a "resume with modifications", so the final exp run gets put in its own new ref/branch, rather than extending the original branch.

dberenbaum · 2021-03-12T16:26:57Z

Perfect, thanks for the explanation!

dmpetrov added the feature is a feature label Mar 5, 2021

dmpetrov assigned pmrowla and dberenbaum Mar 5, 2021

This was referenced Mar 10, 2021

checkpoints: completely remove checkpoint outs on exp run --reset #5586

Merged

cmd: document updated exp run --reset behavior iterative/dvc.org#2286

Merged

pmrowla closed this as completed in #5586 Mar 12, 2021

pmrowla mentioned this issue Mar 14, 2021

exp: dvc.yaml scheme (checkpoint field) #5530

Closed

dberenbaum mentioned this issue Mar 23, 2021

Remove all experiments in the workspace #5676

Closed

pmrowla mentioned this issue Apr 1, 2021

exp run --run-all: runs full pipeline, even stages that did not change #5628

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

exp run: reset when the lock file is in Git #5553

exp run: reset when the lock file is in Git #5553

dmpetrov commented Mar 5, 2021

dberenbaum commented Mar 5, 2021

dmpetrov commented Mar 5, 2021

dberenbaum commented Mar 5, 2021

dmpetrov commented Mar 5, 2021

dberenbaum commented Mar 6, 2021

dmpetrov commented Mar 6, 2021

pmrowla commented Mar 9, 2021

dberenbaum commented Mar 9, 2021

pmrowla commented Mar 10, 2021

dberenbaum commented Mar 10, 2021

pmrowla commented Mar 11, 2021

dberenbaum commented Mar 11, 2021 •

edited

Loading

pmrowla commented Mar 12, 2021

dberenbaum commented Mar 12, 2021

pmrowla commented Mar 12, 2021 •

edited

Loading

dberenbaum commented Mar 12, 2021

exp run: reset when the lock file is in Git #5553

exp run: reset when the lock file is in Git #5553

Comments

dmpetrov commented Mar 5, 2021

dberenbaum commented Mar 5, 2021

dmpetrov commented Mar 5, 2021

dberenbaum commented Mar 5, 2021

dmpetrov commented Mar 5, 2021

dberenbaum commented Mar 6, 2021

dmpetrov commented Mar 6, 2021

pmrowla commented Mar 9, 2021

dberenbaum commented Mar 9, 2021

pmrowla commented Mar 10, 2021

dberenbaum commented Mar 10, 2021

pmrowla commented Mar 11, 2021

dberenbaum commented Mar 11, 2021 • edited Loading

pmrowla commented Mar 12, 2021

dberenbaum commented Mar 12, 2021

pmrowla commented Mar 12, 2021 • edited Loading

dberenbaum commented Mar 12, 2021

dberenbaum commented Mar 11, 2021 •

edited

Loading

pmrowla commented Mar 12, 2021 •

edited

Loading