import: local cache is ignored when importing data that already exist in cache #10255
Comments
I guess this is similar to a regression that was already resolved in the past, #9385?
Another note / info. When I am importing using If I try to do the same with data created with
Just bit me as well. Thought the reason was my mix of dvc2/3 caches so recreated all in 3.0. Glad you found easy repro steps.
Hello,
I'm unable to reproduce this in the latest DVC release (3.45.0). Can you try updating and verify whether or not you still see the issue?
This is so odd. I am still seeing a "Downloading" message in my test case with 3.45.0, and I am running out of disk space, so I am quite sure the data gets downloaded from the remote and the symlink is only generated later. But this happens only in my case and not with the repo mentioned (although it looks to me like the exact same situation). Not sure I will have more time to look into this. I have been suffering from this issue for many months: https://discuss.dvc.org/t/help-with-upgrading-imported-via-dvc2-x-dvc-data-with-dvc3-0/1750/22
Not sure if the following just confuses this issue, but I noticed that when I do the "1st import" (similar to the repro steps), I get asked for the remote password and then the download starts. On the 2nd "import" (similar to the repro steps) I do NOT get asked for the password; "downloading" is displayed, BUT it's a lot faster (which makes me think it copies/downloads from the external cache). BUT (and this is the main issue) it still copies the files before creating a symlink, which is bad because I run out of HDD space in my real-world use case, although the data could just be symlinked from a shared cache.
Yes, as you noted, it's not actually downloading from a remote; it's being copied from your existing cache. Ideally we should be preserving the link type in this case and we can look into fixing that (but not using the symlink here is not a regression). cc @efiop
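To illustrate why copying from the cache costs disk space while preserving the link type does not, here is a minimal sketch in plain Python. It does not use DVC's code or API: materialize, bytes_stored, and demo are hypothetical names, and the 1 KiB file merely stands in for a large cached object.

```python
import os
import shutil
import tempfile


def materialize(cache_path, workspace_path, link_type="symlink"):
    """Place a cached file into the workspace by linking or by copying."""
    if link_type == "symlink":
        os.symlink(cache_path, workspace_path)
    elif link_type == "copy":
        shutil.copyfile(cache_path, workspace_path)
    else:
        raise ValueError(f"unsupported link type: {link_type}")


def bytes_stored(path):
    """File data held by the workspace entry itself (a symlink holds none)."""
    return 0 if os.path.islink(path) else os.path.getsize(path)


def demo():
    """Return (bytes used by the symlinked entry, bytes used by the copied entry)."""
    with tempfile.TemporaryDirectory() as root:
        cached = os.path.join(root, "cache_object")
        with open(cached, "wb") as f:
            f.write(b"x" * 1024)  # stand-in for a large cached file

        linked = os.path.join(root, "linked.bin")
        copied = os.path.join(root, "copied.bin")
        materialize(cached, linked, "symlink")
        materialize(cached, copied, "copy")
        return bytes_stored(linked), bytes_stored(copied)


if __name__ == "__main__":
    print(demo())
```

The copied entry duplicates every byte of the cached file, so an import that copies first and links afterwards temporarily needs roughly twice the space of the dataset, which matches the out-of-disk-space failure described above.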
Thanks for your quick reply. Was that behaviour already in place with dvc 2.0? The good news is that once the import happens, "dvc checkout" behaves way better, as it does not download from the cache before creating the symlink. So the problem I am facing is This all should just "work" because the data is already in the cache. But because "import" "download"s it from the cache before creating the symlink, I will run out of disk space and the "import" fails.
That would be really great. I can't see the workflow to import large sets of data from a shared external cache otherwise. Kindest regards
Sorry for my late response, I've been quite busy lately. With dvc But if I want to import files created by So I guess this state is actually better than the previously reported one, since we can migrate the data from
Here's the output of
I'm able to reproduce and confirm that this is a regression introduced in #9246. See the reproduction script below. It creates a source repo with a shared cache. The initial
I don't think local cache is being ignored here. @dberenbaum, in the above script, DVCFileSystem is copying from workspace files. If you add Using
Yes, it will copy from the cache. However, this doesn't solve the underlying problem that this copy operation takes way longer than checkout. You can adjust the script above to just a few files (as long as each is large) and still see the difference between now and either before #9246 or with #10388.
Bug Report
DVC local cache is ignored when importing data that already exists in the local cache
Description
When using dvc import with a shared local cache, the cache is ignored when the .dvc files don't exist yet.
Reproduce
Step list of how to reproduce the bug
Example:
Expected
I would expect the second import to first check whether the data are already in the cache dir and, if they were pulled previously, just link them.
Environment information
This is required to ensure that we can reproduce the bug.
Output of dvc doctor:
Additional Information (if any):