dvc import ignores cache and downloads from remote #9385
Comments
Related to #8808. Thanks for pointing out that it worked in 2.39. I had missed that this was a regression, so going to bump the priority of this one. |
Definitely my fault, and caused by #9246. I think the issue is that the dvcfs we use in imports doesn't set a local cache and thus always streams from the remote (one-liner fix if confirmed). I'll take a closer look. |
I was looking into this a bit yesterday. I can confirm that it first broke in 2.53.0 (and yes, it's most likely related to #9246):
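To make the suspected failure mode concrete, here is a toy sketch of a read path that consults a local cache only when one is configured. `ToyFS`, `fetch_remote`, and `remote_reads` are hypothetical names made up for illustration; this is not dvcfs or any DVC API, just the general pattern of "no cache set means every read goes to the remote".

```python
from typing import Callable, Dict, Optional


class ToyFS:
    """Hypothetical stand-in for a filesystem that reads objects by key."""

    def __init__(
        self,
        fetch_remote: Callable[[str], bytes],
        cache: Optional[Dict[str, bytes]] = None,
    ) -> None:
        self.fetch_remote = fetch_remote
        self.cache = cache  # None models "no local cache configured"
        self.remote_reads = 0

    def read(self, key: str) -> bytes:
        # Serve from the local cache when one is configured and holds the object.
        if self.cache is not None and key in self.cache:
            return self.cache[key]
        # Otherwise fall back to the remote, which is the slow path.
        self.remote_reads += 1
        data = self.fetch_remote(key)
        if self.cache is not None:
            self.cache[key] = data
        return data


remote = {"abc123": b"payload"}  # pretend remote storage
without_cache = ToyFS(remote.__getitem__)
with_cache = ToyFS(remote.__getitem__, cache={"abc123": b"payload"})

without_cache.read("abc123")
with_cache.read("abc123")
print(without_cache.remote_reads, with_cache.remote_reads)
```

In this sketch the instance built without a cache always hits the remote, even though the same object is available locally to the other instance.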
I hit this with a simple example like this, from the get started:
It ignores the cache and downloads the data again. I haven't had enough time, but a brief look was not enough to see how to pass the cache into the If there is no single-line fix, should we revert it for now, until we fix this? It seems severe enough, since it affects multiple commands. Also, a slightly unrelated minor question: while looking into this, I saw that circular import detection got deleted. Out of curiosity, is it handled by something else (e.g. the regular graph circular deps check)? Or did we discuss that it's not critical anymore? (It would be great to have this kind of context in the PR description.) |
Let me take a look today and I'll get back with the actions. Sure, reverting is also an option, as always. Regarding circular imports, the new mechanism should handle that better than before, so there wasn't a need for that check anymore. There wasn't a written discussion about it, but we did discuss it privately with @pmrowla before and agreed that it makes sense to drop that for now. |
Yep, that's fine (it would still be great to have it documented). I guess what caught my attention was that we also removed two tests. Do we have a scenario like this tested somewhere? Those were the thoughts crossing my mind. |
You can pass
I don't think this was documented before. The scenario is overall complex and I wouldn't document it without first researching it properly. It will require a significant amount of effort to document every edge case and tbh I don't think the effort is worth it. Happy to create an issue if you want to, but realistically I don't think it will be actionable or worthwhile.
Yup, that PR is not my proudest "verbose" PR in regards to comments and descriptions 😅 First test just doesn't apply anymore, and the second one tests that we raise a nonexistent exception in a normal scenario. I don't think we need a new test there at that point, as it is a normal scenario now and should work by design. Happy to discuss it further in that PR (just to avoid polluting this unrelated issue). |
@johnyaku Could you please try installing dvc from upstream https://dvc.org/doc/install/pre-release and let us know if it solves the issue for you? |
Still downloading from the remote :( For reference my dvc-related package versions/builds are as follows:
@johnyaku Could you elaborate on the data you are importing from a registry? Is it a |
We do use chained imports sometimes, but the cases where I most recently observed this problem were a simple case of ...
@johnyaku Anything else specific about the setup? Is registry a monorepo? The data that you are importing is a file or a dir? |
@johnyaku Btw, regarding |
Okay, I'm able to reproduce with a directory 🤦, e.g.
Looking into it. |
Ah, okay, so it is hitting the cache, as expected. But the issue is that it is "downloading" by copying files from cache to workspace and not symlinking. I've completely missed this part when reading the issue initially 🙁 Looking into it... |
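The difference between the two behaviours can be sketched with the standard library alone. The helper `link_from_cache` and its `link_type` argument are hypothetical, made up for this illustration; it is not how DVC implements checkout, and it assumes a platform where creating symlinks is permitted.

```python
import os
import shutil
import tempfile


def link_from_cache(cache_path: str, dest: str, link_type: str = "symlink") -> None:
    """Place a cached file in the workspace by symlink or by copy (illustrative)."""
    if link_type == "symlink":
        # Near-instant regardless of file size: only a link is created.
        os.symlink(os.path.abspath(cache_path), dest)
    elif link_type == "copy":
        # Cost grows with file size: every byte is duplicated.
        shutil.copyfile(cache_path, dest)
    else:
        raise ValueError(f"unsupported link type: {link_type}")


with tempfile.TemporaryDirectory() as tmp:
    cached = os.path.join(tmp, "cache_object")
    with open(cached, "wb") as f:
        f.write(b"x" * 1024)

    linked = os.path.join(tmp, "linked.bin")
    copied = os.path.join(tmp, "copied.bin")
    link_from_cache(cached, linked, "symlink")
    link_from_cache(cached, copied, "copy")

    linked_is_symlink = os.path.islink(linked)
    copied_is_symlink = os.path.islink(copied)
    print(linked_is_symlink, copied_is_symlink)
```

Both results expose the same content in the workspace, which is why the two cases are easy to confuse unless you inspect the files themselves.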
@johnyaku I've created some drafts that I will have to sleep on today. Would you be so kind to try it out in the meantime, just want to make sure I totally understood your scenario this time around 😅 E.g.
Thanks @efiop! I hadn't appreciated that the "download" was from the cache rather than the remote. For context, the relevant config settings are all
I've observed this "download" behaviour with both files and directories. Tuesdays are a bit crazy here so I'll test the fix tomorrow. |
I had trouble installing the fix above, but I have tested v2.57.1 and the problem still persists. |
@johnyaku could you elaborate? Still downloading from remote or not creating links or both? |
The message says "Downloading" but I suspect that it is actually copying from cache, as before, rather than symlinking. |
@johnyaku The message is the same, yes. But you should see symlinks now. Could you |
I gave up waiting after two hours. To be fair, it is a large import. But for comparison v2.39 completes the import in less than 30 seconds (with symlinks). Examining the partially imported directory (after I canceled the import) with |
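For anyone wanting to check a partially imported directory for symlinks versus real copies, one possible approach is a short script like the following. The function name `classify_files` is an assumption for illustration and is not part of DVC; the demo directory is a throwaway temp dir, so point it at your own import directory instead.

```python
import os
import tempfile
from collections import Counter


def classify_files(root: str) -> Counter:
    """Count symlinked vs. regular files under root (directory symlinks are not followed)."""
    counts: Counter = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            counts["symlink" if os.path.islink(path) else "regular"] += 1
    return counts


# Small demo on a throwaway directory with one real file and one symlink to it.
with tempfile.TemporaryDirectory() as tmp:
    real = os.path.join(tmp, "real.txt")
    with open(real, "w") as f:
        f.write("data")
    os.symlink(real, os.path.join(tmp, "link.txt"))
    demo_counts = classify_files(tmp)
    print(dict(demo_counts))
```

A result dominated by regular files would suggest the data was copied out of the cache rather than linked.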
@johnyaku Thank you for the info! I'll take another look and will get back to you. |
This appears to be fixed in dvc 2.58.2 |
@johnyaku Thanks for the feedback! Closing for now then. |
Bug Report
dvc import ignores cache and downloads from remote
Description
I'm using dvc import to import some files that I know are already in an external shared cache (which is also the cache directory for the current project). dvc downloads the imported files rather than simply symlinking to the cache.
Reproduce
git push the config
dvc import some data from a registry but don't git push the resulting .dvc files
dvc import the same data again
Expected
Expected dvc to create symlinks to the external cache when the files were already there.
Environment information
Observed this bug with dvc 2.53 and 2.55.
dvc 2.39 works as expected