Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvc import ignores cache and downloads from remote #9385

Closed
johnyaku opened this issue May 1, 2023 · 24 comments · Fixed by #9415 or #9423
Closed

dvc import ignores cache and downloads from remote #9385

johnyaku opened this issue May 1, 2023 · 24 comments · Fixed by #9415 or #9423
Assignees
Labels
A: data-sync Related to dvc get/fetch/import/pull/push p1-important Important, aka current backlog of things to do regression Ohh, we broke something :-(

Comments

@johnyaku
Copy link

johnyaku commented May 1, 2023

Bug Report

dvc import ignores cache and downloads from remote

Description

I'm using dvc import to import some files that I know are already in an external shared cache (which is also the cache directory for the current project).

dvc downloads the imported files rather than simply sym-linking to the cache

Reproduce

  1. create external cache directory
  2. create project that uses this external directory as its cache directory
  3. git push the config
  4. dvc import some data from a registry but don't git push the resulting .dvc files
  5. delete the project directory but not the external cache
  6. recreate the project directory by recloning a new instances (this will include the config from 3 above)
  7. dvc import the same data again

Expected

Expected dvc to create symlinks to the external cache when the files were already there.

Environment information

Observed this bug with dvc 2.53 and 2.55.
dvc 2.39 works as expected

@dberenbaum dberenbaum added the p1-important Important, aka current backlog of things to do label May 5, 2023
@dberenbaum
Copy link
Collaborator

Related to #8808. Thanks for pointing out that it worked in 2.39. I had missed that this was a regression, so going to bump the priority of this one.

@dberenbaum dberenbaum added regression Ohh, we broke something :-( A: data-sync Related to dvc get/fetch/import/pull/push labels May 5, 2023
@efiop
Copy link
Contributor

efiop commented May 6, 2023

Definitely my fault and caused by #9246 I think the issue is that dvcfs that we use in imports doesn't set local cache and thus always streams from remote (oneliner fix if confirmed). I'll take a closer look.

@efiop efiop self-assigned this May 6, 2023
@shcheklein
Copy link
Member

I was looking a bit into this yesterday. I can confirm that it's first got broken with 2.53.0 (and yes, it's related to the #9246 most likely):

DVC version: 2.53.0
-------------------
Platform: Python 3.9.16 on macOS-13.3.1-arm64-arm-64bit
Subprojects:
	dvc_data = 0.47.3
	dvc_objects = 0.21.2
	dvc_render = 0.3.1
	dvc_task = 0.2.1
	scmrepo = 0.2.1
Supports:
	http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
	s3 (s3fs = 2023.4.0, boto3 = 1.26.76)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: s3, s3
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
Repo.site_cache_dir: /Library/Caches/dvc/repo/f2f169a5407da9f721ec62974a7ca573

I hit this with a simple example like this, from the get started:

dvc import https://github.com/iterative/dataset-registry.git get-started/data.xml -v

It ignores cache and downloads it again.

Haven't had enough time, but brief look was not enough to see the way to pass cache into the DVCFileSystem's get_file (which goes into DataFileSystem.get_file()). Do we have any examples? Is it managed outside before we even go into download?


It there is not single line fix, should we revert it (seems severe enough since affects multiple commands) for now, until we fix this?

Also, a bit unrelated minor question, while looking into this, I saw that circular imports detection got deleted. Out of curiosity, is it handled by something else (e.g. regular graph circular deps check?). Or did we discuss that it's not critical anymore? (it would be great to have this kind of context in the PR description).

@efiop
Copy link
Contributor

efiop commented May 6, 2023

Let me take a look today and I'll get back with the actions. Sure, reverting is also an option, as always.

Regarding circular imports, the new mechanism should handle that better than before so there wasn't a need a for that check anymore. There wasn't a written discussion about it, but we did discuss it privately with @pmrowla before and agreed that it makes sense to drop that for now.

@shcheklein
Copy link
Member

the new mechanism should handle that better than before so there wasn't a need a for that check anymore. There wasn't a written discussion about it, but we did discuss it privately with @pmrowla before and agreed that it makes sense to drop that for now.

yep, that's fine (would be great still to have it documented). What I guess brought my attention that we also removed two tests- do we have a scenario like this somewhere tested, etc - thoughts that were crossing my mind.

@efiop
Copy link
Contributor

efiop commented May 7, 2023

@shcheklein

Haven't had enough time, but brief look was not enough to see the way to pass cache into the DVCFileSystem's get_file (which goes into DataFileSystem.get_file()). Do we have any examples? Is it managed outside before we even go into download?

You can pass config to the DVCFileSystem, which will pass it to Repo. No documented examples, unfortunately.

yep, that's fine (would be great still to have it documented)

I don't think this was documented before. The scenario is overall complex and I wouldn't document it without first researching it properly. It will require a significant amount of effort to document every edge case and tbh I don't think the effort is worth it. Happy to create an issue if you want to, but realistically I don't think it will be actionable or worthwhile.

What I guess brought my attention that we also removed two tests- do we have a scenario like this somewhere tested, etc - thoughts that were crossing my mind.

Yup, that PR is not my proudest "verbose" PR in regards to comments and descriptions 😅 First test just doesn't apply anymore, and the second one tests that we raise a nonexistent exception in a normal scenario. I don't think we need a new test there at that point, as it is a normal scenario now and should work by design. Happy to discuss it further in that PR (just to avoid polluting this unrelated issue).

@efiop
Copy link
Contributor

efiop commented May 7, 2023

@johnyaku Could you please try installing dvc from upstream https://dvc.org/doc/install/pre-release and let us know if it solves the issue for you?

@johnyaku
Copy link
Author

johnyaku commented May 8, 2023

Still downloading from the remote :(

For reference my dvc-related package versions/builds are as follows:

dvc                       2.56.1.dev22+g8620b6e5          pypi_0    pypi
dvc-data                  0.47.5             pyhd8ed1ab_0    conda-forge
dvc-gs                    2.22.0             pyhd8ed1ab_0    conda-forge
dvc-http                  2.30.2             pyhd8ed1ab_2    conda-forge
dvc-objects               0.21.2             pyhd8ed1ab_0    conda-forge
dvc-render                0.3.1              pyhd8ed1ab_0    conda-forge
dvc-s3                    2.22.0             pyhd8ed1ab_0    conda-forge
dvc-ssh                   2.22.1             pyhd8ed1ab_0    conda-forge
dvc-studio-client         0.8.0              pyhd8ed1ab_0    conda-forge
dvc-task                  0.2.1              pyhd8ed1ab_0    conda-forge

@efiop efiop reopened this May 8, 2023
@efiop
Copy link
Contributor

efiop commented May 8, 2023

@johnyaku Could you elaborate on the data you are importing from a registry? Is it a dvc added artifact or dvc import also?

@johnyaku
Copy link
Author

johnyaku commented May 8, 2023

We do use chained imports sometimes, but the cases where I most recently observed this problem were a simple case of ...

  1. dvc add files to registry
  2. dvc import the files into a project repo

@efiop
Copy link
Contributor

efiop commented May 8, 2023

@johnyaku Anything else specific about the setup? Is registry a monorepo? The data that you are importing is a file or a dir?

@efiop
Copy link
Contributor

efiop commented May 8, 2023

@johnyaku Btw, regarding create external cache directory and git push the config, could you elaborate on that? Are you talking about cache.dir option? Where are you pointing it to and why do you git commit that config option?

@efiop
Copy link
Contributor

efiop commented May 8, 2023

Okay, I'm able to reproduce with a directory 🤦, e.g.

dvc import https://github.com/iterative/dvc-bench data/small/dataset

Looking into it.

@efiop efiop closed this as completed May 8, 2023
@efiop efiop reopened this May 8, 2023
@efiop
Copy link
Contributor

efiop commented May 8, 2023

Ah, okay, so it is hitting the cache, as expected. But the issue is that it is "downloading" by copying files from cache to workspace and not symlinking. I've completely missed this part when reading the issue initially 🙁 Looking into it...

@efiop
Copy link
Contributor

efiop commented May 8, 2023

@johnyaku I've created some drafts that I will have to sleep on today. Would you be so kind to try it out in the meantime, just want to make sure I totally understood your scenario this time around 😅 E.g.

pip install git+https://github.com/efiop/dvc.git@fix-9385

@johnyaku
Copy link
Author

johnyaku commented May 8, 2023

Thanks @efiop! I hadn't appreciated that the "download" was from the cache rather than the remote.

For context, the relevant config settings are all --system level:

dvc config --list --show-origin
$XDG_CONFIG_DIRS[0]/dvc/config       core.autostage=true
$XDG_CONFIG_DIRS[0]/dvc/config       cache.type=hardlink,symlink
$XDG_CONFIG_DIRS[0]/dvc/config       cache.dir=/path/to/external/shared/dvc/cache
$XDG_CONFIG_DIRS[0]/dvc/config       cache.shared=group

I've observed this "download" behaviour with both files and directories.

Tuesdays are a bit crazy here so I'll test the fix tomorrow.

@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label May 9, 2023
efiop added a commit to iterative/dvc-data that referenced this issue May 15, 2023
@johnyaku
Copy link
Author

I had trouble installing the fix above, but I have tested v2.57.1 and the problem still persists.

@efiop
Copy link
Contributor

efiop commented May 18, 2023

@johnyaku could you elaborate? Still downloading from remote or not creating links or both?

@efiop efiop reopened this May 18, 2023
@johnyaku
Copy link
Author

The message says "Downloading" but I suspect that it is actually copying from cache, as before, rather than symlinking.

@efiop
Copy link
Contributor

efiop commented May 19, 2023

@johnyaku The message is the same, yes. But you should see symlinks now. Could you ls -la in a directory that it created? It should show symlinks.

@johnyaku
Copy link
Author

I gave up waiting after two hours. To be fair, it is a large import. But for comparison v2.39 completes the import in less than 30 seconds (with symlinks).

Examining the partially imported directory (after I canceled the import) with ls -al shows regular files, not symlinks (with v.2.57.1). That's probably why it was taking so long (copying or downloading rather than symlinking).

@efiop
Copy link
Contributor

efiop commented May 20, 2023

@johnyaku Thank you for the info! I'll take another look and will get back to you.

@daavoo daavoo removed the awaiting response we are waiting for your reply, please respond! :) label May 31, 2023
@johnyaku
Copy link
Author

johnyaku commented Jun 4, 2023

This appears to be fixed in dvc 2.58.2

@efiop
Copy link
Contributor

efiop commented Jun 4, 2023

@johnyaku Thanks for the feedback! Closing for now then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: data-sync Related to dvc get/fetch/import/pull/push p1-important Important, aka current backlog of things to do regression Ohh, we broke something :-(
Projects
No open projects
Archived in project
5 participants