experiments: checkpoints proof of concept #4591

Merged
merged 23 commits into from
Oct 6, 2020

Conversation

@pmrowla pmrowla commented Sep 22, 2020

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

Will close #4498.

  • dvc exp run --checkpoint can be used to reproduce a checkpoint experiment
    • Checkpoint stages should be generated with --always-changed, and with --outs-persist for the intermediate checkpoint outputs
    • While a checkpoint experiment is running, a new commit is generated in the experiment branch each time the running stage calls dvc.api.make_checkpoint() (or generates the appropriate .dvc/tmp/DVC_CHECKPOINT signal file)
    • Reproduction and checkpoint generation continue until the user code exits or is killed via Ctrl+C
  • dvc exp run --continue <checkpoint_exp_rev> can be used to resume a prior checkpoint experiment. Execution resumes from the tip of the checkpoint branch.
  • Checkpoint experiments can be viewed as normal in dvc exp show
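
As a rough illustration of the stage side of this workflow, the sketch below shows a training loop that persists its state and then signals a checkpoint after every epoch. It is a minimal sketch under stated assumptions, not code from this PR: `train`, `resolve_checkpoint_hook`, the `signal` parameter, and the JSON state file are all hypothetical names chosen for the example. Only `dvc.api.make_checkpoint` comes from the description above, and it is looked up lazily so the sketch still runs where DVC is not installed.

```python
import json
from pathlib import Path
from typing import Callable


def resolve_checkpoint_hook() -> Callable[[], None]:
    """Return dvc.api.make_checkpoint if it can be imported, else a no-op.

    Hypothetical helper: the fallback only exists so this sketch runs
    outside of a DVC environment.
    """
    try:
        from dvc.api import make_checkpoint  # function named in this PR
        return make_checkpoint
    except Exception:
        return lambda: None


def train(state_path: Path, epochs: int, signal: Callable[[], None]) -> int:
    """Run `epochs` more epochs, signalling a checkpoint after each one.

    `state_path` stands in for an output declared with --outs-persist:
    because it is kept between runs, a later run picks up where the
    previous one stopped.
    """
    if state_path.exists():
        state = json.loads(state_path.read_text())
    else:
        state = {"epoch": 0}

    for _ in range(epochs):
        state["epoch"] += 1  # placeholder for one real training epoch
        # Write the intermediate output first, then signal the checkpoint.
        state_path.write_text(json.dumps(state))
        signal()

    return state["epoch"]
```

A stage script would call something like `train(Path("model.json"), epochs=10, signal=resolve_checkpoint_hook())`. Writing the persisted output before signalling is a deliberate choice in this sketch, so that each checkpoint captures the state of the epoch that just finished.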

Known issues (needs further investigation after this PR):

  • GitPython sometimes throws broken pipe errors when checkpoint runs are killed via Ctrl+C
  • Reproducing a baseline commit does not work properly. For a checkpoint experiment to be generated correctly, there must be some change versus the baseline/parent commit (either in the workspace or via the repro --params option)
  • Sorting in dvc exp show will not work properly if the table contains checkpoint experiments, but filtering should work as expected

Features that will need follow up PR:

  • Branching from some commit in the middle of a checkpoint experiment and then resuming a "new" branch with potentially modified code/params is not yet supported

@pmrowla pmrowla added the A: experiments Related to dvc exp label Sep 22, 2020
@pmrowla pmrowla self-assigned this Sep 22, 2020
Comment on lines +616 to +632

EXPERIMENTS_RUN_HELP = (
"Reproduce complete or partial experiment pipelines."
)
experiments_run_parser = experiments_subparsers.add_parser(

All experiment run/repro behavior will be moved from dvc repro to dvc exp run in a follow-up PR. For now, repro -e is just duplicated here so that checkpoint runs work correctly.
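
For readers unfamiliar with how such a subcommand is wired up, here is a minimal, self-contained argparse sketch of an `exp run` parser that accepts the two flags named in the PR description. This is an illustration under assumptions, not the parser added by this PR: `build_parser`, the `dest` names, and the flag help strings are hypothetical, and the real command carries the options duplicated from repro -e, which are not reproduced here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy `dvc exp run` parser (hypothetical, for illustration)."""
    parser = argparse.ArgumentParser(prog="dvc")
    subparsers = parser.add_subparsers(dest="cmd")

    exp_parser = subparsers.add_parser("exp", help="Experiment commands.")
    exp_subparsers = exp_parser.add_subparsers(dest="exp_cmd")

    run_parser = exp_subparsers.add_parser(
        "run",
        help="Reproduce complete or partial experiment pipelines.",
    )
    run_parser.add_argument(
        "--checkpoint",
        action="store_true",
        default=False,
        help="Reproduce the pipeline as a checkpoint experiment.",
    )
    # `continue` is a Python keyword, so store the value under another name.
    run_parser.add_argument(
        "--continue",
        dest="checkpoint_continue",
        metavar="<checkpoint_exp_rev>",
        default=None,
        help="Resume a prior checkpoint experiment from this revision.",
    )
    return parser
```

With this sketch, `build_parser().parse_args(["exp", "run", "--continue", "abc123"])` yields a namespace whose `checkpoint_continue` attribute is `"abc123"`, where `abc123` is a made-up revision.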

pmrowla commented Oct 6, 2020

asciicast

@pmrowla pmrowla marked this pull request as ready for review October 6, 2020 09:29
@pmrowla pmrowla added the feature is a feature label Oct 6, 2020
@pmrowla pmrowla changed the title from "[WIP] checkpoints proof of concept" to "experiments: checkpoints proof of concept" Oct 6, 2020
@pmrowla pmrowla requested review from skshetry, efiop and pared October 6, 2020 11:26
@efiop efiop left a comment

Great start!

@efiop efiop merged commit f0a729d into iterative:master Oct 6, 2020
@pmrowla pmrowla deleted the checkpoints-poc branch October 6, 2020 12:43
Successfully merging this pull request may close these issues.

Experiments: checkpoints tracking