Make that dvc repro --pull pulls all missing files.
#4742
Comments
Hi @peper0 ! It is the plan 🙂 For now it is limited to run-cache, but we'll add support for data sources soon. Stay tuned! 🙂 |
Comments from #4871:
|
Comment from #4870:
|
Another request on Discord: https://discord.com/channels/485586884165107732/485596304961962003/945713992825991188 As noted there, this is really important for remote execution, so I think we need to prioritize this soon. cc @pmrowla @karajan1001 |
Hi, I am leaving my comment as a user, as suggested by @dberenbaum; I hope it is helpful. Let's take the example A -> B -> C mentioned above. After defining the pipeline, I want to run each stage in a dedicated container using an orchestrator like Argo (or Kubeflow or Airflow or ...).
git clone git-repo-containing-the-dvc-files
cd git-repo-containing-the-dvc-files
dvc repro B --new-flag
dvc add -R .
dvc push -R .
git add -A .
git commit -m "executed stage B"
git push
( The command
Note from a design perspective: "There should be one-- and preferably only one --obvious way to do it".
dvc pull B --new-flag && dvc repro B ( |
Can't this already be done with The behavior for |
Actually |
Ah I see, so what we want would be something along the lines of Although I'm not sure I understand why pulling the output here is a problem. This is also what |
From the perspective of running a stage in a "clean" container, you want to run a stage only if:
|
In the initial comment for this issue, there is a good summary: "pull whatever is missing and necessary for this repro." I think it's related to #5369, and that may be a prerequisite for the best-case scenario. A good solution for #5369 would determine whether there are any modified stages, taking into account what can be checked out from the cache/run-cache or pulled from remote storage. Ideally, there should be an equivalent to do this but have the pipeline checkout/pull what's needed and run the necessary stages. For each stage, the command should do something like:
I don't think
|
I'm still not following
Determining whether or not the upstream stages are modified requires having the dependencies available (to see whether or not they have changed).
In this case, it sounds like you are talking about where I have a pipeline like DVC doesn't work this way right now, and supporting this would be a bigger change than just |
I would like to have an option to use the
You're right, it's a bigger change than a typical command option.
If I want to repro
|
@francesco086 You mentioned wanting to have the ability to pull a single stage's dependencies in I see two possible approaches:
|
Is it possible to explicitly refer Analogies from Bazel build system:
|
After #5369, we should have the ability to determine whether a stage needs to be run without pulling the data locally. At that point, solving this issue seems straightforward (although still more complex than a typical CLI flag):
|
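To make the idea of deciding what needs to run without having the data locally more concrete, here is a minimal sketch. It is a toy model, not DVC's implementation: the function `stale_stages` and all of its arguments are hypothetical. It assumes that each stage's dependency hashes from its last run are recorded somewhere (as a lock file would record them) and that the current hash of every path is known from metadata alone.

```python
def stale_stages(order, deps, outs, recorded, current):
    """Return the stages that need to run, using hashes only (no data files).

    order    -- stage names in topological order (hypothetical input)
    deps     -- stage name -> list of dependency paths
    outs     -- stage name -> list of output paths
    recorded -- stage name -> {path: hash} as seen at the stage's last run
    current  -- path -> hash known from metadata today
    """
    stale = []
    dirty = set()  # outputs of stages that will be re-run
    for name in order:
        changed = any(
            path in dirty or recorded[name].get(path) != current.get(path)
            for path in deps[name]
        )
        if changed:
            stale.append(name)
            dirty.update(outs[name])
    return stale


# The A -> B -> C pipeline used as an example in this thread.
order = ["A", "B", "C"]
deps = {"A": ["raw.csv"], "B": ["a.out"], "C": ["b.out"]}
outs = {"A": ["a.out"], "B": ["b.out"], "C": ["c.out"]}
recorded = {"A": {"raw.csv": "h1"}, "B": {"a.out": "h2"}, "C": {"b.out": "h3"}}

unchanged = {"raw.csv": "h1", "a.out": "h2", "b.out": "h3"}
raw_changed = {"raw.csv": "h9", "a.out": "h2", "b.out": "h3"}

print(stale_stages(order, deps, outs, recorded, unchanged))
print(stale_stages(order, deps, outs, recorded, raw_changed))
```

In this model a change to a source dependency marks every downstream stage as stale, while an unchanged pipeline needs neither a run nor a pull. A real solution would also have to consult the cache and run-cache, which the sketch ignores.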
According to the help for dvc repro:
and that's what it does. It pulls missing files that are outputs restored from the run-cache. But if there are outputs missing from "sources" (i.e. dvc files having only an output and no command at all), it won't pull them. These must be pulled separately with a pull command.
Why not change it into "pull whatever is missing and necessary for this repro"? Is there any use case where a user wants to download some missing files automatically, but not all of them?
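One way to read "pull whatever is missing and necessary for this repro" is sketched below. This is a toy model under stated assumptions, not DVC code: the `Stage` class and `missing_sources` function are hypothetical, a stage with no command stands in for a "source" (a dvc file with only an output), and `present` is the set of paths already in the workspace.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Stage:
    name: str
    deps: list = field(default_factory=list)  # paths this stage reads
    outs: list = field(default_factory=list)  # paths this stage writes
    cmd: Optional[str] = None                 # None models a "source" with no command


def missing_sources(stages, target, present):
    """Return source outputs that `target` needs, directly or transitively,
    and that are absent from the workspace (`present` is a set of paths)."""
    by_name = {s.name: s for s in stages}
    producer = {out: s for s in stages for out in s.outs}
    needed, seen, todo = set(), set(), [by_name[target]]
    while todo:
        stage = todo.pop()
        if stage.name in seen:
            continue
        seen.add(stage.name)
        if stage.cmd is None:
            needed.update(o for o in stage.outs if o not in present)
        # Walk upstream through whichever stage produces each dependency.
        todo.extend(producer[d] for d in stage.deps if d in producer)
    return sorted(needed)


stages = [
    Stage("raw", outs=["raw.csv"]),
    Stage("extra", outs=["extra.csv"]),
    Stage("A", deps=["raw.csv"], outs=["a.out"], cmd="run A"),
    Stage("B", deps=["a.out"], outs=["b.out"], cmd="run B"),
    Stage("C", deps=["b.out", "extra.csv"], outs=["c.out"], cmd="run C"),
]
print(missing_sources(stages, "B", present=set()))
```

Reproducing B in an empty workspace would need only `raw.csv`; `extra.csv` feeds C alone and would not be fetched. The sketch deliberately ignores the case where an intermediate output can be restored from the run-cache, which would make some upstream sources unnecessary.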