-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
get: handle non-DVC repositories #3097
get: handle non-DVC repositories #3097
Conversation
a13df5a
to
afc8886
Compare
DeepSource is complaining about some unused arguments in tests. I suppose I shouldn't clean them up at this point. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Please see some comments below.
dvc/repo/get.py
Outdated
if not output.use_cache: | ||
# Catch this below and go for a plain old fs_copy | ||
raise _DoPlainCopy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very strange flow control, I have never ever seen one like that, but maybe I haven't seen enough python projects, so please correct me if I'm wrong. Regardless, I'm pretty sure you could simply do something like:
try:
repo = Repo(tmp_dir)
output = repo.find_out_by_relpath(path)
if output.use_cache:
...
_get_cached(repo, output, out)
return
except (NotDvcRepoError, OutputNotFoundError):
pass
if os.path.isabs(path):
raise FileNotFoundError
fs_copy(os.path.join(tmp_dir, path), out)
and avoid that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is way nicer!
dvc/repo/get.py
Outdated
# | ||
# Also, we can't use theoretical "move" link type here, because | ||
# the same cache file might be used a few times in a directory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment should stay over repo.cache.local.cache_types = ["reflink", "hardlink", "copy"]
line, otherwise it loses some of its meaning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops
@fabiosantoscode Yep, take it as a heads-up from DS. In this case there are no actions you should take. 🙂 |
873123a
to
d092854
Compare
@efiop I believe I've addressed the comments. WRT the |
Cool @efiop, I've reverted the code to just clone into |
Sorry, not sure I follow. I see that you still pass tmp_path to cached_clone , instead of keeping it as it was before and only changing the cache dir. Is it meant that way? |
@efiop I wrote that before you pointed out that it was enough to just clone the repo to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some language suggestions.
BTW I think this addresses half of #3077! Maybe apply also to dvc import
in this same PR and mark it as closing that issue as well? 🗿 -> 🐦 🐦
@jorgeorpinel thanks for your suggestions, it reads a lot clearer now. As per closing #3077, sure, let's do it :) |
@jorgeorpinel the docs for import are now updated. Check them out and see if they make sense to you. @efiop your comments have all been addressed, I believe. And @restyled-io has closed its PR, since the code style is on par. I ran |
I'll update the docs when this is confirmed to be the final help text. |
dvc/repo/get.py
Outdated
@@ -43,7 +44,10 @@ def get(url, path, out=None, rev=None): | |||
dpath = os.path.dirname(os.path.abspath(out)) | |||
tmp_dir = os.path.join(dpath, "." + str(shortuuid.uuid())) | |||
try: | |||
with external_repo(cache_dir=tmp_dir, url=url, rev=rev) as repo: | |||
cached_clone(url, rev=rev, clone_path=tmp_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I might've not worded it clearly enough, but I meant that we should keep it as it was before this change, meaning that we shouldn't pass tmp_dir to cached_clone and let it clone wherever it needs to. And we need to setup external cache dir pointing to tmp_dir, the same way we do that in external_repo. Also, take a look at external_repo
, it is doing things like setting a default remote, which you've lost here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might be able to flip the logic over and do something like:
try:
with external_repo():
...
if out.use_cache:
_get_cached()
return
# fallthrough
except NotDvcRepoError:
pass
path = cached_clone(...)
# copy from git
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! I didn't fold it like that because I would have to do the cached_clone conditionally, and instead duplicated the file copy call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing #3077 along the way @fabiosantoscode! Here are more small language suggestions, some of them my bad from previous review, and some of them may require reformatting code (new lines) but black
on the pre-commit hook or Restyled can do that for you.
Allows us to `dvc get` from non-DVC source repositories. Fixes iterative#3089
Co-Authored-By: Ruslan Kuprieiev <[email protected]>
This reverts commit 8e49228.
Co-Authored-By: Jorge Orpinel <[email protected]>
Co-Authored-By: Jorge Orpinel <[email protected]>
Co-Authored-By: Jorge Orpinel <[email protected]>
Co-Authored-By: Jorge Orpinel <[email protected]>
802d8ab
to
5fb6640
Compare
My bad @jorgeorpinel I should've made the help texts for get and import uniform myself. Thanks for doing that for me in the end. @efiop now I'm using I'm wondering if a simple file move would be the best for the simple git repository case. After all, the repository is destroyed just after and moves within a single filesystem are very fast. |
# Either an uncached out with absolute path or a user error | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: remove this extra line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
dvc/repo/get.py
Outdated
if output and output.use_cache: | ||
_get_cached(repo, output, out) | ||
return | ||
|
||
# Either an uncached out with absolute path or a user error | ||
if os.path.isabs(path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolute paths are not allowed even for pure-git repos, hence why I've suggested that example that has the same code handling both non-cached and git files. No need to worry much about second cached_clone being redundand, as we cache those anyways, so the overhead is not big.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I tried to fold it like you demonstrated before, but it looks like the files are missing if I try to do the copy outside of the context manager.
So I extracted the part that checks if the path is absolute into a function and called it twice.
dvc/repo/get.py
Outdated
if raw_git_dir: | ||
remove(raw_git_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is any need in removing it, it is a cached copy after all 🙂 And if you check external_repo, it doesn't delete those either. Or am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's good news. I figured there was some automatic cleanup happening but haven't seen how it works yet.
dvc/repo/get.py
Outdated
# Either an uncached out with absolute path or a user error | ||
if os.path.isabs(path): | ||
raise FileNotFoundError | ||
_forbid_absolute_path(path) | ||
|
||
fs_copy(os.path.join(repo.root_dir, path), out) | ||
return | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just pass it through? You are doing the same thing down below anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First time I tried it I got file not found errors and figured the context manager was removing the repository. I was wrong. I got it working now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All good with command output now. Let's just reflect it on iterative/dvc.org/pull/912 please (including in the cmd ref Synopsis sections). Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
@Suor maybe I don't understand what you've raised here but since get now supports non DVC repositories it will never throw an exception to complain about the target repository not being DVC. |
@fabiosantoscode get will never throw, but |
Allows us to
dvc get
from non-DVC source repositories.Fixes #3089
Closes #3077
Will send PR for doc site tomorrow. Just need to change some parts which imply that the remote repository needs to be DVC-enabled.
❗ Have you followed the guidelines in the Contributing to DVC list?
📖 Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.
docs: iterative/dvc.org#911
Thank you for the contribution - we'll try to review it as soon as possible. 🙏