import: local cache is ignored when importing data that already exist in cache #10255

Closed
Honzys opened this issue Jan 24, 2024 · 14 comments · Fixed by #10388
Labels
A: data-sync (Related to dvc get/fetch/import/pull/push), bug (Did we break something?), p1-important (Important, aka current backlog of things to do)

Comments

@Honzys
Contributor

Honzys commented Jan 24, 2024

Bug Report

DVC local cache is ignored when importing data that already exists in the local cache

Description

When using dvc import with a shared local cache, the cache is ignored if the .dvc files don't exist yet.

Reproduce

Step list of how to reproduce the bug

Example:

$ cd /tmp
$ dvc init --no-scm
$ echo "[core]
    no_scm = True
[cache]
    dir = /srv/dvc_cache/datasets_import
    type = "reflink,symlink,copy"
    shared = group" > .dvc/config
$ dvc import some_url some_data --force
$ rm -rf *.dvc
# Here I would expect to just link it from the local cache and not pull it from remote
$ dvc import some_url some_data --force

Expected

I would expect the second import to first check whether the data is already in the cache dir and, if it was pulled previously, just link it instead of fetching it from the remote again.

Environment information

This is required to ensure that we can reproduce the bug.

Output of dvc doctor:

$ dvc doctor
 ->  dvc doctor
DVC version: 3.42.0 (pip)
-------------------------
Platform: Python 3.10.11 on Linux-5.15.0-89-generic-x86_64-with-glibc2.31
Subprojects:
        dvc_data = 3.8.0
        dvc_objects = 3.0.6
        dvc_render = 1.0.1
        dvc_task = 0.3.0
        scmrepo = 2.0.4
Supports:
        http (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2023.12.2, boto3 = 1.34.22)
Config:
        Global: /root/.config/dvc
        System: /etc/xdg/dvc
Cache types: reflink, hardlink, symlink
Cache directory: xfs on /dev/mapper/data-srv
Caches: local
Remotes: None
Workspace directory: xfs on /dev/mapper/data-srv
Repo: dvc (no_scm)
Repo.site_cache_dir: /var/tmp/dvc/repo/736a03a99cffc314a8a900b728659998

Additional Information (if any):

@Honzys
Contributor Author

Honzys commented Jan 25, 2024

I guess it's similar to a regression that was already resolved in the past, #9385?

@Honzys
Contributor Author

Honzys commented Jan 25, 2024

Another note / info.

When I use dvc==3.x to import data that were created with dvc==2.x, it works as expected.

If I try to do the same with data created with dvc==3.x, it downloads the data from the remote every time (even if the exact files are in the cache).
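
To tell the two cases above apart by hand, one can check whether an object is already present in the shared cache. The following is only a minimal sketch under assumptions not confirmed in this thread: that DVC 3.x stores objects under files/md5/ inside the cache dir, keyed by the first two hash characters, while DVC 2.x stored them directly under the two-character prefix. The function name find_in_cache is hypothetical.

```shell
# Hypothetical helper: report where (if anywhere) an object with the given
# md5 lives inside a DVC cache directory.
# Assumed layouts (not confirmed in this thread):
#   DVC 3.x: <cache>/files/md5/<first 2 chars>/<remaining chars>
#   DVC 2.x: <cache>/<first 2 chars>/<remaining chars>
find_in_cache() {
    fic_cache=$1
    fic_md5=$2
    # Split the hash into its two-character prefix and the remainder.
    fic_prefix=$(printf '%s' "$fic_md5" | cut -c1-2)
    fic_rest=$(printf '%s' "$fic_md5" | cut -c3-)
    if [ -e "$fic_cache/files/md5/$fic_prefix/$fic_rest" ]; then
        echo "3.x layout: $fic_cache/files/md5/$fic_prefix/$fic_rest"
    elif [ -e "$fic_cache/$fic_prefix/$fic_rest" ]; then
        echo "2.x layout: $fic_cache/$fic_prefix/$fic_rest"
    else
        echo "not in cache"
        return 1
    fi
}
```

It would be called with the cache dir from the config above (/srv/dvc_cache/datasets_import) and the md5 value recorded in the relevant .dvc file. If the object is reported as present but the import still downloads, that points at the import path rather than at a missing cache entry.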

@12michi34

Just bit me as well. Thought the reason was my mix of dvc2/3 caches so recreated all in 3.0. Glad you found easy repro steps.

@pmrowla pmrowla added bug Did we break something? regression Ohh, we broke something :-( A: data-sync Related to dvc get/fetch/import/pull/push labels Jan 26, 2024
@pmrowla pmrowla added this to DVC Jan 26, 2024
@github-project-automation github-project-automation bot moved this to Backlog in DVC Jan 26, 2024
@dberenbaum dberenbaum added the p1-important Important, aka current backlog of things to do label Jan 26, 2024
@Honzys
Contributor Author

Honzys commented Feb 6, 2024

Hello,
Is there any update on this issue please?
Were you able to reproduce it, or do you need more input from my part?

@pmrowla
Contributor

pmrowla commented Feb 20, 2024

I'm unable to reproduce this in the latest DVC release (3.45.0). Can you try updating and verify whether or not you still see the issue?

@pmrowla pmrowla added the awaiting response we are waiting for your reply, please respond! :) label Feb 20, 2024
@12michi34

This is so odd. I am still seeing a "Downloading" message in my test case with 3.45.0, and I am running out of disk space, so I am quite sure the data gets downloaded from the remote and a symlink is only generated later. But this happens only in my case and not with the repo mentioned (although it looks to me like the exact same situation). Not sure if I have more time to look into this. I have been suffering from this issue for many months: https://discuss.dvc.org/t/help-with-upgrading-imported-via-dvc2-x-dvc-data-with-dvc3-0/1750/22

@dberenbaum dberenbaum moved this from Backlog to Todo in DVC Feb 20, 2024
@dberenbaum dberenbaum moved this from Todo to Backlog in DVC Feb 20, 2024
@12michi34

Not sure if the following just confuses this issue, but I noticed that when I do the "1st import" (similar to the repro steps), I get asked for the remote password and then the download starts. On the 2nd "import" (similar to the repro steps) I do NOT get asked for the password; "downloading" is displayed, BUT it's a lot faster (which makes me think it copies/downloads the data from the external cache). BUT (and this is the main issue) it still copies the files before creating a symlink, which is bad because I run out of HDD space in my real-world use case, although the data could just be symlinked from a shared cache.
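
A rough way to see what actually lands in the workspace is to count symlinks versus regular files under the imported path. This is only a sketch using plain find; the function name link_summary is hypothetical, and it cannot tell full copies apart from hardlinks or reflinks.

```shell
# Hypothetical helper: summarise how the files under a path are materialised.
# Symlinks point into the cache and take almost no extra space; regular files
# are full copies (or hardlinks/reflinks, which this does not distinguish).
link_summary() {
    ls_target=$1
    # find does not follow symlinks by default, so the two counts are disjoint.
    ls_links=$(find "$ls_target" -type l | wc -l | tr -d ' ')
    ls_files=$(find "$ls_target" -type f | wc -l | tr -d ' ')
    echo "symlinks=$ls_links regular_files=$ls_files"
}
```

Running it on the imported path (for example some_data from the repro steps) after the import shows the end state. Running it, or du -sh on the workspace, from a second terminal while the import is still in progress should show whether full copies appear before the symlinks do.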

@pmrowla
Contributor

pmrowla commented Feb 21, 2024

Yes, as you noted, it's not actually downloading from a remote; it's being copied from your existing cache. Ideally we should be preserving the link type in this case, and we can look into fixing that (but not using the symlink here is not a regression).

cc @efiop

@pmrowla pmrowla removed regression Ohh, we broke something :-( awaiting response we are waiting for your reply, please respond! :) labels Feb 21, 2024
@12michi34

Thanks for your quick reply. Was that behaviour already in place with dvc 2.0? The good news is that once the import happens, "dvc checkout" behaves way better, as it does not download from the cache before creating the symlink.

So the problem I am facing is
a) I have a projectA that stores ~4TB of data in a shared external cache (on a large external drive).
b) Many different projects live on a smaller drive (<4TB) and would like to import the data from projectA.

This all should just "work" because the data is already in the cache. But because "import" "download"s it from the cache before creating the symlink, I run out of disk space and the "import" fails.

> Ideally we should be preserving the link type in this case and we can look into fixing that (but not using the symlink here is not a regression)

That would be really great. I can't see the workflow to import large sets of data from a shared external cache otherwise.
How likely is this to happen? I am still confused why I didn't run into the issue with dvc 2.x.

Kindest regards

@Honzys
Contributor Author

Honzys commented Mar 1, 2024

Sorry for my late response, I've been quite busy lately.
From my experiments, it's the other way around now.

With dvc 3.x version I am able to skip the downloading of the file created by dvc 3.x, if it already exists in the cache - which is great.

But if I want to import files created by dvc 2.x the cache is ignored and it always downloads from the remote.

So I guess this state is actually better than the one previously reported, since we can migrate the data from 2.x to 3.x.
But I guess it would be worth fixing this issue too?

@Honzys
Contributor Author

Honzys commented Mar 1, 2024

Here's the output of dvc doctor:

DVC version: 3.48.0 (pip)
-------------------------
Platform: Python 3.11.7 on Linux-5.15.0-97-generic-x86_64-with-glibc2.31
Subprojects:
        dvc_data = 3.13.0
        dvc_objects = 5.1.0
        dvc_render = 1.0.1
        dvc_task = 0.3.0
        scmrepo = 3.2.0
Supports:
        http (aiohttp = 3.9.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.3, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2024.2.0, boto3 = 1.34.51)
Config:
        Global: /root/.config/dvc
        System: /etc/xdg/dvc
Cache types: symlink
Cache directory: xfs on /dev/mapper/data-srv
Caches: local
Remotes: None
Workspace directory: overlay on overlay
Repo: dvc (no_scm)
Repo.site_cache_dir: /var/tmp/dvc/repo/de89edf83a919aae8b7ee93ba17c75e0

@dberenbaum dberenbaum removed the p1-important Important, aka current backlog of things to do label Mar 5, 2024
@dberenbaum dberenbaum added the p1-important Important, aka current backlog of things to do label Apr 4, 2024
@dberenbaum
Collaborator

I'm able to reproduce and confirm that this is a regression introduced in #9246. See the reproduction script below. It creates a source repo with a shared cache. The initial dvc add takes ~10 minutes on my machine. Then it creates a new repo to import the source data, and the import takes ~15 minutes, even when using a shared cache that already contains all the data. Before that regression, the import took seconds.

set -eux

echo "setup source repo"
CACHE=$(mktemp -d)
REPO_SOURCE=$(mktemp -d)
cd $REPO_SOURCE
git init
dvc init -q
dvc cache dir $CACHE
dvc config cache.shared group
dvc config cache.type symlink

echo "generate data"
mkdir dir
for i in {1..100}; do
    head -c 1000000000 < /dev/urandom > dir/${i};
done

echo "dvc add data"
time dvc add dir
git add .
git commit -m "add data"

echo "dvc import data with shared cache"
REPO_IMPORT=$(mktemp -d)
cd $REPO_IMPORT
git init
dvc init -q
dvc cache dir $CACHE
dvc config cache.shared group
dvc config cache.type symlink
time dvc import $REPO_SOURCE dir

@skshetry
Member

skshetry commented Apr 29, 2024

I don't think local cache is being ignored here. @dberenbaum, in the above script, DVCFileSystem is copying from workspace files. If you add rm -rf dir after git commit, it'll use the cache.

Using --rev will force dvc to import from a specific Git revision. And this behaviour does not happen in the case of remote repositories.

@dberenbaum
Collaborator

> If you add rm -rf dir after git commit, it'll use the cache.

Yes, it will copy from the cache. However, this doesn't solve the underlying problem that this copy operation takes way longer than checkout. You can adjust the script above to just a few files (as long as each is large) and still see the difference between now and either before #9246 or with #10388.

Projects
No open projects
Status: Backlog
5 participants