Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

checkpoints: num(ber)/epoch awareness #113

Closed
casperdcl opened this issue Jun 15, 2021 · 15 comments · Fixed by #142
Closed

checkpoints: num(ber)/epoch awareness #113

casperdcl opened this issue Jun 15, 2021 · 15 comments · Fixed by #142
Assignees
Labels
discussion requires active participation to reach a conclusion feature request p1-important Include in the next sprint

Comments

@casperdcl
Copy link

Solves two problems:

  • dvc exp run && dvc exp run shouldn't re-run everything a second time
    • Yes, the user code can check this but so what? Same argument for non-checkpointed dvc repro && dvc repro. DVC should non re-run.
  • interrupting dvc exp run (e.g. due to a runner timeout) and resuming shouldn't re-start from checkpoint zero.

@pmrowla
Copy link
Contributor

pmrowla commented Jun 16, 2021

I'm not sure I understand the issue here?

For checkpoints dvc exp run && dvc exp run should already not be restarting anything. The second exp run should already resume from the last generated checkpoint, even if the first run was interrupted.

edit: this might be a side effect of iterative/dvc#6180 - the checkpoint resume feature requires the dvc.lock for the checkpoint you want to resume, but if CML isn't seeing dvc.lock files for previously generated checkpoints, then that's why DVC will restart from checkpoint 0 each time

@casperdcl
Copy link
Author

This is more about the user code not having to look at the workspace to determine what checkpoint it's currently on. Ideally DVC itself should determine the checkpoint number and not bother running the user code if it's at the final checkpoint already.

@dberenbaum
Copy link
Collaborator

This is more about the user code not having to look at the workspace to determine what checkpoint it's currently on. Ideally DVC itself should determine the checkpoint number and not bother running the user code if it's at the final checkpoint already.

This makes sense to me if the user code does not include iterations or make_checkpoint and instead trains a single epoch and exits. In that case, running the stage a fixed number of times makes sense. This seems like the cleanest workflow, but it also seems atypical for many users, who may create checkpoints inside a callback to a fit method that iterates over a defined number of epochs.

If the user code includes some iteration over make_checkpoint, this idea seems confusing since the user code will define the number of iterations. What if the user code has fewer or more iterations than the number of checkpoints defined in dvc? In iterative/dvc#6104 (reply in thread), @pmrowla suggested how user code can be used to track the number of epochs using a parameter (that may not be considered a dependency?). However, that will still run the checkpoints stage every time, it just may exit immediately without doing anything if the max number of checkpoints has been reached.

@dberenbaum
Copy link
Collaborator

dberenbaum commented Jun 16, 2021

@casperdcl What do you think about a workflow like this?

for _ in dvclive.range(100):
    ....
    dvclive.log("acc", acc)

dvclive.range could start from the tracked epoch number and exit immediately if there are no remaining epochs. If dvc/dvclive handles the iteration, then this makes sense to me.

See #68

@casperdcl
Copy link
Author

yup that looks like a clean solution

@dberenbaum
Copy link
Collaborator

@pmrowla @pared Thoughts?

@pared
Copy link
Contributor

pared commented Jun 18, 2021

Do I understand this use case correctly?

  1. User uses dvclive.range to define max number of epochs
  2. If, for some reason execution stops and checkpoints are used, dvclive should remember latest epoch number
  3. Upon next run dvclive should pick up from latest checkpoint

@dberenbaum
Copy link
Collaborator

Should we move this to dvclive? Is there any underlying dvc issue to keep open if this dvclive feature is added?

@daavoo
Copy link
Contributor

daavoo commented Jul 16, 2021

@casperdcl What do you think about a workflow like this?

for _ in dvclive.range(100):
    ....
    dvclive.log("acc", acc)

dvclive.range could start from the tracked epoch number and exit immediately if there are no remaining epochs. If dvc/dvclive handles the iteration, then this makes sense to me.

See iterative/dvclive#68

I'm not sure how someone using some ML Framework for training could adapt it's training code to use this.

@dberenbaum
Copy link
Collaborator

Since dvclive has a resume option, can this be handled using a dvclive.step property that can be passed to the framework? For example, model.fit(initial_epoch=dvclive.step) in keras.

@dberenbaum dberenbaum transferred this issue from iterative/dvc Jul 16, 2021
@casperdcl
Copy link
Author

not sure; dvclive documentation doesn't really exist but sounds like it would work

@dberenbaum
Copy link
Collaborator

Sorry @casperdcl, I was responding to @daavoo. I still think having dvclive.range() would be nice for a manual training loop.

not sure; dvclive documentation doesn't really exist but sounds like it would work

Not for long! iterative/dvc.org#2632

@casperdcl
Copy link
Author

Ah I see. Well for step in dvclive.range(...) would make dvclive.step a bit redundant.

@dberenbaum
Copy link
Collaborator

dberenbaum commented Jul 16, 2021

dvclive.step is for use with frameworks where the code looks like model.fit(initial_epoch=dvclive.step, epochs=100) instead of for step in dvclive.range(100): train(model). There's no manual training loop, which I assumed was the concern raised by @daavoo.

@daavoo daavoo added discussion requires active participation to reach a conclusion feature request labels Jul 19, 2021
@daavoo
Copy link
Contributor

daavoo commented Jul 19, 2021

dvclive.step is for use with frameworks where the code looks like model.fit(initial_epoch=dvclive.step, epochs=100) instead of for step in dvclive.range(100): train(model). There's no manual training loop, which I assumed was the concern raised by @daavoo.

Yes, my concerns were about ML Frameworks where there is not an easy way to integrate the usage of dvclive.step, also was worried about the consistency between different dvclive<>ML Framework integrations given that not every ML Framework might have support for "resuming" or indicating an initial epoch that is not 0.

Ah I see. Well for step in dvclive.range(...) would make dvclive.step a bit redundant.

For some ML Frameworks, where the user writes it's own training loop, something like the code block bellow could be used instead of the hypothetical dvclive.range (unless we would want to wrap additional functionality under dvclive.range):

while dvclive.get_step() < 100:
    # train for one epoch

Anyhow, it looks that adding a new public method dvclive.get_step() would be trivial and has many potential usages, cc @dberenbaum

@daavoo daavoo added the p1-important Include in the next sprint label Aug 24, 2021
@daavoo daavoo self-assigned this Aug 25, 2021
@daavoo daavoo mentioned this issue Aug 25, 2021
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion requires active participation to reach a conclusion feature request p1-important Include in the next sprint
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants