Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get: handle non-DVC repositories #3097

Merged

Conversation

fabiosantoscode
Copy link
Contributor

@fabiosantoscode fabiosantoscode commented Jan 9, 2020

Allows us to dvc get from non-DVC source repositories.

Fixes #3089
Closes #3077

Will send PR for doc site tomorrow. Just need to change some parts which imply that the remote repository needs to be DVC-enabled.


  • ❗ Have you followed the guidelines in the Contributing to DVC list?

  • 📖 Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.

docs: iterative/dvc.org#911

  • ❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@fabiosantoscode fabiosantoscode force-pushed the feature/3089-get-non-dvc-repositories branch from a13df5a to afc8886 Compare January 9, 2020 22:42
@fabiosantoscode
Copy link
Contributor Author

fabiosantoscode commented Jan 9, 2020

DeepSource is complaining about some unused arguments in tests. I suppose I shouldn't clean them up at this point.

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Please see some comments below.

dvc/repo/get.py Outdated Show resolved Hide resolved
dvc/repo/get.py Outdated
Comment on lines 66 to 68
if not output.use_cache:
# Catch this below and go for a plain old fs_copy
raise _DoPlainCopy
Copy link
Contributor

@efiop efiop Jan 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very strange flow control, I have never ever seen one like that, but maybe I haven't seen enough python projects, so please correct me if I'm wrong. Regardless, I'm pretty sure you could simply do something like:

try:                                                                    
    repo = Repo(tmp_dir)                                                
    output = repo.find_out_by_relpath(path)                             
    if output.use_cache:                                                
        ... 
        _get_cached(repo, output, out)                                  
        return                                                          
except (NotDvcRepoError, OutputNotFoundError):                          
    pass                                                                
                                                                        
if os.path.isabs(path):                                                 
    raise FileNotFoundError                                             
                                                                        
fs_copy(os.path.join(tmp_dir, path), out)                               

and avoid that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is way nicer!

dvc/repo/get.py Outdated
Comment on lines 60 to 62
#
# Also, we can't use theoretical "move" link type here, because
# the same cache file might be used a few times in a directory.
Copy link
Contributor

@efiop efiop Jan 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment should stay over repo.cache.local.cache_types = ["reflink", "hardlink", "copy"] line, otherwise it loses some of its meaning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops

@efiop
Copy link
Contributor

efiop commented Jan 9, 2020

DeepSource is complaining about some unused arguments in tests. I suppose I shouldn't clean them up at this point.

@fabiosantoscode Yep, take it as a heads-up from DS. In this case there are no actions you should take. 🙂

@fabiosantoscode
Copy link
Contributor Author

@efiop I believe I've addressed the comments.

WRT the clone_dir argument and the lost optimization, I removed the argument, and instead of changing the clone target, I changed the Repo's cache to tmp_dir which is next to the place the get will drop the file at.

dvc/repo/get.py Outdated Show resolved Hide resolved
dvc/repo/get.py Outdated Show resolved Hide resolved
@fabiosantoscode
Copy link
Contributor Author

Cool @efiop, I've reverted the code to just clone into tmp_dir and addressed the remaining comments 👍

@efiop
Copy link
Contributor

efiop commented Jan 10, 2020

WRT the clone_dir argument and the lost optimization, I removed the argument, and instead of changing the clone target, I changed the Repo's cache to tmp_dir which is next to the place the get will drop the file at.

Sorry, not sure I follow. I see that you still pass tmp_path to cached_clone , instead of keeping it as it was before and only changing the cache dir. Is it meant that way?

@fabiosantoscode
Copy link
Contributor Author

@efiop I wrote that before you pointed out that it was enough to just clone the repo to tmp_dir, do disregard it.

Copy link
Contributor

@jorgeorpinel jorgeorpinel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some language suggestions.

BTW I think this addresses half of #3077! Maybe apply also to dvc import in this same PR and mark it as closing that issue as well? 🗿 -> 🐦 🐦

dvc/command/get.py Outdated Show resolved Hide resolved
dvc/command/get.py Outdated Show resolved Hide resolved
dvc/command/get.py Outdated Show resolved Hide resolved
dvc/command/get.py Outdated Show resolved Hide resolved
@fabiosantoscode
Copy link
Contributor Author

fabiosantoscode commented Jan 10, 2020

@jorgeorpinel thanks for your suggestions, it reads a lot clearer now.

As per closing #3077, sure, let's do it :)

@fabiosantoscode
Copy link
Contributor Author

@jorgeorpinel the docs for import are now updated. Check them out and see if they make sense to you.

@efiop your comments have all been addressed, I believe.

And @restyled-io has closed its PR, since the code style is on par. I ran black dvc locally just to confirm.

@fabiosantoscode
Copy link
Contributor Author

I'll update the docs when this is confirmed to be the final help text.

dvc/repo/get.py Outdated
@@ -43,7 +44,10 @@ def get(url, path, out=None, rev=None):
dpath = os.path.dirname(os.path.abspath(out))
tmp_dir = os.path.join(dpath, "." + str(shortuuid.uuid()))
try:
with external_repo(cache_dir=tmp_dir, url=url, rev=rev) as repo:
cached_clone(url, rev=rev, clone_path=tmp_dir)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I might've not worded it clearly enough, but I meant that we should keep it as it was before this change, meaning that we shouldn't pass tmp_dir to cached_clone and let it clone wherever it needs to. And we need to setup external cache dir pointing to tmp_dir, the same way we do that in external_repo. Also, take a look at external_repo, it is doing things like setting a default remote, which you've lost here.

Copy link
Contributor

@efiop efiop Jan 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be able to flip the logic over and do something like:

try:
    with external_repo():
        ...
        if out.use_cache:
            _get_cached()
           return
        # fallthrough
except NotDvcRepoError:
     pass
path = cached_clone(...)
# copy from git

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! I didn't fold it like that because I would have to do the cached_clone conditionally, and instead duplicated the file copy call.

Copy link
Contributor

@jorgeorpinel jorgeorpinel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing #3077 along the way @fabiosantoscode! Here are more small language suggestions, some of them my bad from previous review, and some of them may require reformatting code (new lines) but black on the pre-commit hook or Restyled can do that for you.

dvc/command/imp.py Outdated Show resolved Hide resolved
dvc/command/get.py Outdated Show resolved Hide resolved
dvc/command/get.py Outdated Show resolved Hide resolved
dvc/command/imp.py Outdated Show resolved Hide resolved
dvc/command/imp.py Outdated Show resolved Hide resolved
dvc/command/get.py Outdated Show resolved Hide resolved
dvc/command/imp.py Outdated Show resolved Hide resolved
dvc/command/imp.py Outdated Show resolved Hide resolved
@fabiosantoscode fabiosantoscode force-pushed the feature/3089-get-non-dvc-repositories branch from 802d8ab to 5fb6640 Compare January 10, 2020 22:29
@fabiosantoscode
Copy link
Contributor Author

My bad @jorgeorpinel I should've made the help texts for get and import uniform myself. Thanks for doing that for me in the end.

@efiop now I'm using external_repo like before, to leverage its DVC-specific logic, and a simple cached_clone followed by a file copy in case it's not a DVC repo.

I'm wondering if a simple file move would be the best for the simple git repository case. After all, the repository is destroyed just after and moves within a single filesystem are very fast.

# Either an uncached out with absolute path or a user error

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: remove this extra line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

dvc/repo/get.py Outdated
if output and output.use_cache:
_get_cached(repo, output, out)
return

# Either an uncached out with absolute path or a user error
if os.path.isabs(path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolute paths are not allowed even for pure-git repos, hence why I've suggested that example that has the same code handling both non-cached and git files. No need to worry much about second cached_clone being redundand, as we cache those anyways, so the overhead is not big.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I tried to fold it like you demonstrated before, but it looks like the files are missing if I try to do the copy outside of the context manager.

So I extracted the part that checks if the path is absolute into a function and called it twice.

dvc/repo/get.py Outdated
Comment on lines 85 to 86
if raw_git_dir:
remove(raw_git_dir)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is any need in removing it, it is a cached copy after all 🙂 And if you check external_repo, it doesn't delete those either. Or am I missing something?

Copy link
Contributor Author

@fabiosantoscode fabiosantoscode Jan 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's good news. I figured there was some automatic cleanup happening but haven't seen how it works yet.

dvc/repo/get.py Outdated
Comment on lines 72 to 77
# Either an uncached out with absolute path or a user error
if os.path.isabs(path):
raise FileNotFoundError
_forbid_absolute_path(path)

fs_copy(os.path.join(repo.root_dir, path), out)
return

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just pass it through? You are doing the same thing down below anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First time I tried it I got file not found errors and figured the context manager was removing the repository. I was wrong. I got it working now.

Copy link
Contributor

@jorgeorpinel jorgeorpinel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good with command output now. Let's just reflect it on iterative/dvc.org/pull/912 please (including in the cmd ref Synopsis sections). Thanks.

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@efiop efiop merged commit b7f9e73 into iterative:master Jan 11, 2020
@fabiosantoscode fabiosantoscode deleted the feature/3089-get-non-dvc-repositories branch January 13, 2020 14:09
dvc/repo/get.py Show resolved Hide resolved
@fabiosantoscode
Copy link
Contributor Author

@Suor maybe I don't understand what you've raised here but since get now supports non DVC repositories it will never throw an exception to complain about the target repository not being DVC.

@Suor
Copy link
Contributor

Suor commented Jan 19, 2020

@fabiosantoscode get will never throw, but external_repo() doesn't know and should not expect all code using it to always handle certain exceptions. Anyway, this won't be an issue after #3190.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

get: handle non-DVC repositories get/import: help output should reflect non-DVC Git repo support
5 participants