exp run: reset when the lock file is in Git #5553
Comments
Could you be more specific with this point? What is the inconsistency/different behavior you see?
@dberenbaum from the user's point of view, an example: if I need to start training from scratch, it is not enough to just run After more thinking about this scenario... It feels like removing the model and starting training from scratch is a more frequent use case than starting from the previous commit. Instead of introducing
It seems reasonable for the default Rather than have a Thoughts @pmrowla?
Yes, it seems like the best option.
Is it equal to It feels like part of a bigger problem: how to granularly check out a particular checkpoint and start training from it (without modifying the workspace).
No, I mean dropping all experiments since the last commit. I probably shouldn't have called it the workspace (I was thinking of it as the opposite of
Do you think this is common? I think it would be easy to implement but maybe confusing as a UI.
Yes, I think so. For example, after a few code changes the model is still diverging, so you decide to take the model from 50 epochs back but train it on the latest code changes (in the current workspace). In this case, using
So to clarify action points here, the desired behavior would be:
I can see how this would be useful, but it's a bit of a separate issue from the checkpoint reset one. Currently, if you have a "clean" workspace (with no existing experiments derived from HEAD), as long as HEAD contains a However, with the proposed
Thanks, @pmrowla. That's a great summary. In the case of a clean workspace that contains a
This behavior will also make
I believe that
I'm unclear on how Getting back to a clean workspace requires removing all tracked and untracked files, so I think even
Related: 2nd point in #5593
Do you have an example? I've been playing around with this feature in https://github.com/dberenbaum/dvc-checkpoint, and I was finding that I was getting the same behavior with and without this PR. I tried the following:
$ dvc -V
2.0.3+4a7506 # before this PR
$ dvc exp run -q
$ echo 100 > bar
$ dvc commit bar
outputs ['bar'] of stage: 'bar.dvc' changed. Are you sure you want to commit it? [y/n] y
$ dvc exp run -q
$ dvc exp show --no-pager
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━┓
┃ Experiment ┃ Created ┃ epoch ┃ mult ┃ params.yaml:start ┃ checkpoint.py:EPOCHS ┃ … ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━┩
│ workspace │ - │ 3 │ 300 │ 0 │ 2 │ 1 │
│ main │ Mar 10, 2021 │ - │ 0 │ 2 │ 1 │ │
│ │ ╓ exp-04c46 │ 08:55 AM │ 3 │ 300 │ 0 │ 2 │ 1 │
│ │ ╟ 39978e4 (670c8ee) │ 08:55 AM │ 2 │ 200 │ 0 │ 2 │ 1 │
│ │ ╓ exp-e44da │ 08:55 AM │ 1 │ 1 │ 0 │ 2 │ 1 │
│ ├─╨ 83e6201 │ 08:55 AM │ 0 │ 0 │ 0 │ 2 │ 1 │
└───────────────────────┴──────────────┴───────┴──────┴───────────────────┴──────────────────────┴───┘
The results with this PR are the same.
@dberenbaum your example doesn't have a committed Using this test script:
#!/bin/bash
set -e
# set -x
REPO=dvc-checkpoint
rm -rf $REPO
git clone git@github.com:dberenbaum/dvc-checkpoint.git
cd $REPO
# repo does not contain an initial cache entry for bar (or a remote to pull
# from), but the git-committed bar.dvc has a hash for "1"
echo 1 > bar
dvc commit -f bar
dvc exp run -q
git add .
git commit -m 'commit dvc.lock'
dvc -V
dvc exp run --reset -q
dvc exp show --no-pager
In our HEAD ( In current DVC master we get an experiment that starts with a
After this PR we get an experiment that starts with a completely removed
Thanks, I get the changes to
@dberenbaum for resume runs, we will now respect So before this change, Note that workspace changes which are not
#!/bin/bash
set -e
# set -x
REPO=dvc-checkpoint
rm -rf $REPO
git clone git@github.com:dberenbaum/dvc-checkpoint.git
cd $REPO
echo 1 > bar
dvc commit -f bar
dvc exp run -q
git add .
git commit -m 'commit dvc.lock'
dvc -V
echo 100 > bar
dvc commit -f bar
dvc exp run -q
echo 100 > foo
dvc commit -f foo
dvc exp run -q
dvc exp show --no-pager
In master:
After this PR:
Also note that since
Perfect, thanks for the explanation!
Currently, the dvc exp run --reset option sets the checkpoint files to the latest committed dvc.lock (if that file is in Git). In many cases (probably most), training needs to start from scratch, but that does not seem possible in the current design.
Should we introduce something like --hard-reset to start training from scratch even for a committed dvc.lock?
Also, I'm wondering about the inconsistency between a committed dvc.lock and an uncommitted one. It looks like --reset behaves differently in the two cases. Is there a way to unify it? The --hard-reset option does not unify it, and I'm not sure it is the best way to solve the problem.
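Since the behavior described here depends on whether dvc.lock is committed, it can help to check which case a given repo is in before running dvc exp run --reset. The following is only a rough sketch using plain Git; the helper name lock_in_head is hypothetical and not part of DVC, and it assumes "in Git" means "present in the HEAD commit".

```shell
#!/bin/bash
# Hypothetical helper (not part of DVC): succeeds if dvc.lock exists in the
# HEAD commit of the given repo directory (defaults to the current directory).
lock_in_head() {
    # `git cat-file -e` exits non-zero when the object does not exist,
    # including when there is no HEAD commit or no Git repo at all.
    git -C "${1:-.}" cat-file -e HEAD:dvc.lock 2>/dev/null
}

if lock_in_head; then
    echo "dvc.lock is committed at HEAD"
else
    echo "dvc.lock is not committed at HEAD"
fi
```

Running this in the dvc-checkpoint clone before and after the git commit step of the test scripts above should show the two cases the issue is contrasting.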