dvc repro: only download the necessary artefacts to re-run the pipeline #4870

Closed
ettadar opened this issue Nov 10, 2020 · 1 comment
Labels: feature request (Requesting a new feature), p3-nice-to-have (It should be done this or next sprint)

Comments

ettadar commented Nov 10, 2020

Let's say there is a three-stage pipeline, A -> B -> C:

  • A runs a.py and produces a.json,
  • B runs b.py, which takes a.json as input and produces b.json,
  • C runs c.py, which takes b.json as input and produces c.json.

It would be very handy to have a flag in dvc repro that downloads only what is necessary to rerun the stages that changed. For example, when working on a fresh clone of the repo:

  • if I change nothing and run dvc repro --new-flag -> dvc downloads nothing and runs nothing.
  • if I change c.py and run dvc repro --new-flag -> dvc only downloads b.json and runs stage C.
  • if I change b.py and run dvc repro --new-flag -> dvc only downloads a.json and runs stages B and C.
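
To make the expected behaviour concrete, here is a rough Python sketch of the selection logic I have in mind. It is not how DVC works internally. The `PIPELINE` table and the `plan` function are hypothetical names made up for this illustration, and `--new-flag` above is only a placeholder. The idea: a stage must rerun if any of its dependencies changed or was regenerated upstream, and the only artefacts worth downloading are stage outputs that a rerun stage reads but that no rerun stage regenerates.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical model of the pipeline from this issue: stage -> (deps, outs).
# The scripts (a.py, ...) live in Git; the .json files are cached outputs.
# Stages are listed in topological order (A before B before C).
PIPELINE: Dict[str, Tuple[List[str], List[str]]] = {
    "A": (["a.py"], ["a.json"]),
    "B": (["b.py", "a.json"], ["b.json"]),
    "C": (["c.py", "b.json"], ["c.json"]),
}


def plan(changed_files: Set[str]) -> Tuple[List[str], Set[str]]:
    """Return (stages to run, in order; cached artefacts to download)."""
    # Map every output to the stage that produces it.
    produced_by = {
        out: stage for stage, (_, outs) in PIPELINE.items() for out in outs
    }

    to_run: List[str] = []
    dirty: Set[str] = set(changed_files)
    # One pass in topological order propagates changes downstream.
    for stage, (deps, outs) in PIPELINE.items():
        if any(dep in dirty for dep in deps):
            to_run.append(stage)
            dirty.update(outs)  # outputs of a rerun stage are regenerated

    # Outputs that will be rebuilt locally never need to be downloaded.
    rerun_outs = {out for stage in to_run for out in PIPELINE[stage][1]}
    to_download = {
        dep
        for stage in to_run
        for dep in PIPELINE[stage][0]
        if dep in produced_by and dep not in rerun_outs
    }
    return to_run, to_download
```

With this sketch, `plan(set())` returns `([], set())`, `plan({"c.py"})` returns `(["C"], {"b.json"})`, and `plan({"b.py"})` returns `(["B", "C"], {"a.json"})`, which matches the three cases listed above.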

This becomes particularly useful when working with big pipelines that train multiple models. Downloading all the training data for every model can take a lot of time and a lot of disk space.

@dberenbaum
Collaborator

Seems like a duplicate of #4742, so I'll transfer the comment over there.
