dvc pull -r "remote_a_data" / dvc.api.repo.pull(remote="remote_a_data"): both also try to pull another remote, "remote_a_model" #10458
Labels
A: data-sync
Related to dvc get/fetch/import/pull/push
breaking-change
p2-medium
Medium priority, should be done, but less important
Bug Report
dvc pull -r "remote_a_data" / dvc.api.repo.pull(remote="remote_a_data"): both also try to pull another remote, "remote_a_model"
Description
I want to preface this by saying I followed (and posted on) issues #10365, #2825, and #8298. I really appreciated the strong feedback/advice on those issues! I am now back with a similar error, and I am unsure whether this is a bug report, a feature request, or a gap in my own understanding, but I am leaning towards the latter.
Reproduce
I have multiple AWS S3 buckets that I save data and models to. Let's say I have 5 pairings total, and the remotes are named
"remote_{a-e}_{data/model}",
where 'a' through 'e' identifies the model/data pairing, and the data/model suffix refers to the data and model folders respectively.
I am setting up a training script that takes a parameter a-e, downloads the corresponding dataset with DVC, and then starts a training run. This folder is also copied into a Docker image, so I set it up with no SCM/Git tracking. An example of my folder setup is:
training-run
--.dvc
----.gitignore
----config
--data_a
----.gitignore
----data_folder.dvc
--model_a
----.gitignore
----model.pt.dvc
--data_b
--...and so on
My .dvc/config file is:
['remote "remote_a_data"']
    url = s3://a/data
['remote "remote_a_model"']
    url = s3://a/model
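As a sanity check on the config above, the section headers can be parsed with Python's stdlib configparser to recover the remote-name → URL mapping (a sketch only; this is not how DVC itself loads its config, and the config text below just mirrors what is shown in the report):

```python
import configparser

# Config text mirroring the .dvc/config shown above (names from the report).
CONFIG_TEXT = """
['remote "remote_a_data"']
    url = s3://a/data
['remote "remote_a_model"']
    url = s3://a/model
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# Build a {remote_name: url} mapping by pulling the quoted name
# out of each 'remote "<name>"' section header.
remotes = {
    section.split('"')[1]: parser[section]["url"]
    for section in parser.sections()
    if section.startswith("'remote")
}
print(remotes)
```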
data_folder.dvc is:
outs:
- md5: ~.dir
  size: ~
  nfiles: ~
  hash: md5
  path: data_folder
  remote: remote_a_data
model.pt.dvc is:
outs:
-md5: ~ (no .dir(?))
size: ~
hash: md5
path: model.pt
remote: remote_a_model
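Each output pins its own remote via the remote: field, so there is a per-file mapping from stage file to remote. That mapping can be sketched with a minimal, dependency-free scan (the stage texts here are hypothetical stand-ins for the redacted files above; real .dvc files should be parsed with a YAML library):

```python
# Minimal sketch: group each .dvc stage file by its pinned `remote:` field.
# Stage contents are hypothetical stand-ins for the redacted files above.
STAGES = {
    "data_a/data_folder.dvc": (
        "outs:\n- path: data_folder\n  remote: remote_a_data\n"
    ),
    "model_a/model.pt.dvc": (
        "outs:\n- path: model.pt\n  remote: remote_a_model\n"
    ),
}

def remote_of(stage_text):
    """Return the value of the first `remote:` key in a stage file, or None."""
    for line in stage_text.splitlines():
        stripped = line.strip().lstrip("- ").strip()
        if stripped.startswith("remote:"):
            return stripped.split(":", 1)[1].strip()
    return None

by_remote = {}
for stage_name, text in STAGES.items():
    by_remote.setdefault(remote_of(text), []).append(stage_name)

print(by_remote)
```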
I then use a Python script to pull from these remotes:
from dvc.repo import Repo

repo = Repo(".")
repo.pull(remote="remote_a_data")
repo.pull(remote="remote_a_model")
and when that didn't work I tried
dvc pull -r "remote_a_data"
and every time, no matter which order I ran the two commands in, the first one would fail. It would download its respective data, but for some reason it would still try to check out the other remote's data, and when that wasn't found in the cache, it would fail.
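(As a possible workaround, and this is an untested assumption on my part: passing explicit targets so that pull only considers those outputs might avoid the cross-remote checkout, e.g.

```
dvc pull -r remote_a_data data_a/data_folder.dvc
dvc pull -r remote_a_model model_a/model.pt.dvc
```

or, via the API, repo.pull(targets=["data_a/data_folder.dvc"], remote="remote_a_data"). But the unqualified pull behavior below still looks wrong to me.)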
So if I ran repo.pull(remote="remote_a_model"), it would give me:
Collecting
Fetching
Building workspace index
Comparing Indexes
Applying Changes
Traceback...
File...python3.10/site-packages/dvc/repo/checkout.py line 184 in checkout:
raise CheckoutError([relpath(out_path) for out_path in failed], stats)
dvc.exceptions.CheckoutError: Checkout failed for following targets:
data/data_folder
Is your cache up to date?
It would still download the model, but it would then fail because for some reason it was also trying to check out the data folder.
If I do the same thing with the CLI instead of the API, dvc pull -r "remote_a_model" gives me:
Collecting
Fetching
Building workspace index
Comparing Indexes
Applying Changes
A model_a/model.pt
1 file added and 1 file fetched
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
data_a/data_folder
Is your cache up to date?
File...python3.10/site-packages/dvc/repo/checkout.py line 184 in checkout:
raise CheckoutError([relpath(out_path) for out_path in failed], stats)
dvc.exceptions.CheckoutError: Checkout failed for following targets:
data/data_folder
Is your cache up to date?
Expected
I expect to pull ONLY from the remote I am calling, without DVC also trying to check out the other remote's folder.
Environment information
I am manually starting in a new container every time, so the cache and tmp folders are never initialized and are always empty!
Output of dvc doctor:

Platform: Python 3.10.12 on Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Subprojects:
Supports:
Config:
Cache types: hardlink, symlink
Cache directory: 9p on C:\
Caches: local
Remotes: s3, s3
Workspace directory: 9p on C:\
Repo: dvc (subdir), git
Thanks a lot!