dvc pull -r "remote_a_data"/ dvc.api.repo.pull(remote="remote_a_data") both trying to also pull another remote repository "remote_a_model" #10458

Closed
spaghevin opened this issue Jun 14, 2024 · 1 comment
Labels
A: data-sync (Related to dvc get/fetch/import/pull/push), breaking-change, p2-medium (Medium priority, should be done, but less important)

Comments

@spaghevin

Bug Report

dvc pull -r "remote_a_data" / dvc.api.repo.pull(remote="remote_a_data"): both also try to pull from another remote, "remote_a_model"

Description

I want to preface this by saying I have followed (and posted on) issues #10365 #2825 #8298. I really appreciated the strong feedback/advice on those issues! I am now back with a similar error, and I am unsure whether this is a bug report, a feature request, or a gap in my own understanding, but I am leaning towards the latter.

Reproduce

I have multiple AWS S3 buckets that I save data and models to. Let's say I have 5 total, and they are named
"remote_{a-e}_{data/model}",
where 'a'->'e' identifies the model/data pairing and the data/model suffix identifies the data and model folders respectively.

I am setting up a training script in which I pass a parameter from a to e, and the script then downloads that dataset with DVC and starts a training run (a sketch of this remote-selection logic follows the folder layout below). This folder is also copied into a Docker image, so I set the repo up without SCM/Git tracking. An example of my folder setup is:
training-run
--.dvc
----.gitignore
----config
--data_a
----.gitignore
----data_folder.dvc
--model_a
----.gitignore
----model.pt.dvc
--data_b ...and so on
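
A minimal sketch of that parameter-driven selection, assuming the remote names follow the remote_{a-e}_{data/model} pattern above; the argument parsing here is illustrative, not my exact script:

import argparse

from dvc.repo import Repo

parser = argparse.ArgumentParser()
parser.add_argument("pairing", choices=list("abcde"))
args = parser.parse_args()

repo = Repo(".")  # the repo is set up without SCM, as described above
repo.pull(remote=f"remote_{args.pairing}_data")
repo.pull(remote=f"remote_{args.pairing}_model")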

The .dvc/config file is:
['remote "remote_a_data"']
    url = s3://a/data
['remote "remote_a_model"']
    url = s3://a/model

data_folder.dvc is:
outs:
- md5: ~.dir
  size: ~
  nfiles: ~
  hash: md5
  path: data_folder
  remote: remote_a_data

model.pt.dvc is (note there is no .dir suffix on this md5):
outs:
- md5: ~
  size: ~
  hash: md5
  path: model.pt
  remote: remote_a_model

I then use a Python script to pull from these remotes:
from dvc.repo import Repo

repo = Repo(".")
repo.pull(remote="remote_a_data")
repo.pull(remote="remote_a_model")

and when that didn't work I tried

dvc pull -r "remote_a_data"

Every time, no matter which order I ran the two pulls in, the first one would fail. It downloads its respective data, but for some reason it still tries to check out the other remote's data, and when that is not found in the cache, it fails.

So if I ran repo.pull(remote="remote_a_model"), it would give me:
Collecting
Fetching
Building workspace index
Comparing Indexes
Applying Changes
Traceback...
File...python3.10/site-packages/dvc/repo/checkout.py line 184 in checkout:
raise CheckoutError([relpath(out_path) for out_path in failed], stats)
dvc.exceptions.CheckoutError: Checkout failed for following targets:
data/data_folder
Is your cache up to date?

It still downloads the model, but it fails because for some reason it also tries to check out the data folder.
If I do the same thing with the CLI instead of the API and run dvc pull -r "remote_a_model", it gives me:

Collecting
Fetching
Building workspace index
Comparing Indexes
Applying Changes
A model_a/model.pt
1 file added and 1 file fetched
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
data_a/data_folder
Is your cache up to date?
File...python3.10/site-packages/dvc/repo/checkout.py line 184 in checkout:
raise CheckoutError([relpath(out_path) for out_path in failed], stats)
dvc.exceptions.CheckoutError: Checkout failed for following targets:
data/data_folder
Is your cache up to date?

Expected

I expect DVC to pull JUST the remote I am calling from, without also trying to check out the folder that lives on the other remote.
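
One way to approximate this is to restrict the pull to explicit targets. A minimal sketch, assuming Repo.pull accepts a targets argument the way the CLI does (the target path is just this example's layout):

from dvc.repo import Repo

repo = Repo(".")
# Pull only the listed .dvc target from the given remote; other outputs are not checked out.
repo.pull(targets=["data_a/data_folder.dvc"], remote="remote_a_data")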

Environment information

I am manually starting in a new container every time so the cache and tmp folders are never initialized and always empty!

Output of dvc doctor:

Platform: Python 3.10.12 on Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Subprojects:
    dvc_data = 3.15.1
    dvc_objects = 5.1.0
    dvc_render = 1.0.2
    dvc_task = 0.4.0
    scmrepo = 3.3.5
Supports:
    http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
    s3 (s3fs = 2024.6.0, boto3 = 1.34.106)
Config:
    Global: /home/{user}/.config/dvc
    System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: 9p on C:\
Caches: local
Remotes: s3, s3
Workspace directory: 9p on C:\
Repo: dvc (subdir), git

Thanks a lot!

@dberenbaum
Collaborator

When you run repo.pull(remote="remote_a_data"), DVC will only pull from that remote, but it will still fail for any data that is set to another remote. To get around this, you can do repo.pull(remote="remote_a_data", allow_missing=True). You might also want to keep an eye on the idea here to add an option for data that is skipped by default unless it is explicitly passed as a target. I think it's worth revisiting this whole behavior in a future major release so that DVC doesn't fail for data that is set to a different remote than the one being pulled.
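
For reference, a minimal sketch of the suggested workaround in the API form used above:

from dvc.repo import Repo

repo = Repo(".")
# allow_missing=True skips outputs whose data is not in the selected remote
# instead of failing the whole pull with a CheckoutError.
repo.pull(remote="remote_a_data", allow_missing=True)
repo.pull(remote="remote_a_model", allow_missing=True)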

@dberenbaum added the p2-medium (Medium priority, should be done, but less important), breaking-change, and A: data-sync (Related to dvc get/fetch/import/pull/push) labels on Jun 20, 2024