checkpoints: num(ber)/epoch awareness #113
Comments
I'm not sure I understand the issue here? For checkpoints edit: this might be a side effect of iterative/dvc#6180 - the checkpoint resume feature requires the |
This is more about the user code not having to look at the workspace to determine what checkpoint it's currently on. Ideally DVC itself should determine the checkpoint number and not bother running the user code if it's at the final checkpoint already. |
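A minimal sketch of the idea in this comment, not DVC's actual implementation: the driver, rather than the user code, tracks how many checkpoints exist and skips the user code entirely once the final one is reached. The JSON state file, `completed_checkpoints`, and `run_if_needed` are all hypothetical names made up for illustration.

```python
import json
from pathlib import Path
from typing import Callable


def completed_checkpoints(state_file: Path) -> int:
    """Return the number of checkpoints recorded so far (0 if none)."""
    if not state_file.exists():
        return 0
    return int(json.loads(state_file.read_text())["completed"])


def run_if_needed(
    train_one_checkpoint: Callable[[int], None],
    total: int,
    state_file: Path,
) -> int:
    """Invoke the user code only for checkpoints that are still missing.

    Hypothetical driver-side loop; returns how many times the user code ran.
    """
    done = completed_checkpoints(state_file)
    calls = 0
    while done < total:
        # The user code is told its checkpoint number instead of
        # inspecting the workspace to work it out.
        train_one_checkpoint(done)
        done += 1
        calls += 1
        # Record progress only after the checkpoint finished.
        state_file.write_text(json.dumps({"completed": done}))
    return calls
```

With this shape, a second invocation against the same state file finds nothing left to do and never calls the user code.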
This makes sense to me if the user code does not include iterations or If the user code includes some iteration over |
@casperdcl What do you think about a workflow like this?
    for _ in dvclive.range(100):
        ...
        dvclive.log("acc", acc)
See #68 |
yup that looks like a clean solution |
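To make the proposal concrete, here is one way a resumable range could behave. This is only a sketch under stated assumptions: `resumable_range` is a hypothetical stand-in for the proposed `dvclive.range`, and persisting the next step in a plain text file is an assumption of this example, not something dvclive is known to do.

```python
from pathlib import Path
from typing import Iterator


def resumable_range(total: int, step_file: Path) -> Iterator[int]:
    """Yield the steps in [0, total) that have not been completed yet.

    Hypothetical stand-in for the proposed `dvclive.range`; the step file
    is an assumption of this sketch.
    """
    start = int(step_file.read_text()) if step_file.exists() else 0
    for step in range(start, total):
        yield step
        # Execution resumes here only when the caller asks for the next
        # step, i.e. after the loop body for `step` finished without error.
        step_file.write_text(str(step + 1))


# Usage (the loop body is the user's training code):
#
#     for step in resumable_range(100, Path("next_step.txt")):
#         ...  # train one epoch, log metrics
```

If the loop body raises or the process is killed mid-step, that step is never recorded, so the next run starts from it again. Once all steps are recorded, the generator yields nothing and the training loop is skipped.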
Do I understand this use case correctly?
|
Should we move this to dvclive? Is there any underlying dvc issue to keep open if this dvclive feature is added? |
I'm not sure how someone using an ML framework for training could adapt its training code to use this. |
Since dvclive has a resume option, can this be handled using a |
not sure; |
Sorry @casperdcl, I was responding to @daavoo. I still think having
Not for long! iterative/dvc.org#2632 |
Ah I see. Well |
|
Yes, my concerns were about ML Frameworks where there is not an easy way to integrate the usage of
For some ML frameworks, where users write their own training loop, something like the code block below could be used instead of the hypothetical
Anyhow, it looks like adding a new public method |
Solves two problems:
1. dvc exp run && dvc exp run shouldn't re-run everything a second time, as with dvc repro && dvc repro. DVC should not re-run.
2. Interrupting dvc exp run (e.g. due to a runner timeout) and resuming shouldn't restart from checkpoint zero.