Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] repo: use unified RepoTree for erepos #3716

Closed
wants to merge 60 commits into from

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented May 1, 2020

  • ❗ I have followed the Contributing to DVC checklist.

  • πŸ“– If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.

  • ❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

PR implements the following:

  • dvc.repo.tree.RepoTree finished - RepoTree can be used to view a DvcTree and separate WorkingTree or GitTree as a single unified tree instance
    • walk()/walk_files() will walk and merge both trees
    • copytree(src, dest) will copy the contents of the tree (starting from src) into the dest pathname
  • utils: Add equivalent versions of existing copytree/copyfile/file_md5/etc functions that work with trees and file objects instead of fs paths
  • remote: Add support for trees/file objects
    • remote.save_tree(), remote.save_obj() can be used to save trees or file objects directly to local cache
  • erepo: Use RepoTree for external repos
    • cloned external repos do not use git checkout unless for_write=True
    • GitTree is used to read files directly from git index rather than working with working copy files
    • only one clone is kept per erepo (rather than duplicate clones needed for every git revision that has been checked out)
  • dvc ls: Now uses RepoTree instead of DvcTree + separate git file handling
  • dvc get: Now uses ExternalRepo.get_external() for handling all dvc + git files
  • dvc fetch: Now uses ExternalRepo.fetch_external() for handling all dvc + git files
  • RepoDependency updated to use new erepo behavior
  • RecursiveImportError now raised when trying to import/get a git directory which contains dvcfiles (import/get directory with git files and dvc outputsΒ #3087). Support for import -R will be implemented in a future PR.

TODO:

  • Cleanup state handling
  • Fix download issue that breaks dvcx (minor dvcx patch needed after this PR, issue has been filed in the dvcx repo)

Fixes #3611
Partial fix for #3087

dvc/external_repo.py Outdated Show resolved Hide resolved
dvc/external_repo.py Show resolved Hide resolved
dvc/repo/tree.py Outdated Show resolved Hide resolved
dvc/external_repo.py Outdated Show resolved Hide resolved
@pmrowla pmrowla force-pushed the 3087-repotree branch 3 times, most recently from 741879a to 6702399 Compare May 8, 2020 04:59
@pmrowla pmrowla marked this pull request as ready for review May 8, 2020 09:13
@@ -139,7 +139,5 @@ def walk(self, top, topdown=True):

yield root, dirs, files

def walk_files(self, top):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is moved into dvc.scm.tree.BaseTree

Comment on lines 267 to 269
if is_working_tree(self.repo.tree):
self.state.save(path_info, checksum)
self.state.save(new_info, checksum, tree=self.cache.tree)
Copy link
Contributor Author

@pmrowla pmrowla May 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that repo.tree may be a GitTree, RemoteLocal (self.cache here) has it's own WorkingTree instance. When working with GitTree's, repo.tree must be used to look for paths in the git-index, and RemoteLocal.tree must be used to look for local cache paths. state.save also has to know which tree to use - it uses repo.tree by default, but when the repo.tree is a GitTree, state.save must be passed the local cache tree instead

Here, path_info is a git-index blob, and not an actual filesystem path with an inode, so we can't do anything with it in state. But new_info is an actual fs path to the newly created local cache file, so we do want to save it in state.

dvc/remote/base.py Outdated Show resolved Hide resolved
tests/func/test_get.py Show resolved Hide resolved
@pmrowla
Copy link
Contributor Author

pmrowla commented May 8, 2020

There's a still bug that needs to be fixed, where git logger messages are not being suppressed and cause the caplog based tests in plot to fail, but other than that this can be reviewed now.

(but do not merge as long as it's still labeled WIP)

@pmrowla pmrowla requested review from efiop, skshetry and pared May 8, 2020 09:36
@pmrowla pmrowla self-assigned this May 8, 2020
@pmrowla
Copy link
Contributor Author

pmrowla commented May 8, 2020

From discussion w/@efiop: need to clean up state changes with regard to local cache (state should just be a no-op when the parent repo tree is not a working tree)

Comment on lines 76 to 87
def use_cache(self, cache):
"""Use specified cache instead of erepo tmpdir cache."""
self._local_cache = cache
if hasattr(self, "cache"):
save_cache = self.cache.local
self.cache.local = cache
# make cache aware of our repo tree
with cache.erepo_tree(self.tree):
yield
if hasattr(self, "cache"):
self.cache.local = save_cache
self._local_cache = None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This context manager lets us use an existing local cache instead of the default tmpdir erepo cache. Before we would just edit erepo.cache.local.cache_dir to make downloads go where we wanted them to, but we were not using other cache settings like the proper link types.

combined w/the Remote.erepo_tree context manager, we can now use the correct local cache instance, and also make sure that local cache instance is aware of the erepo git tree

filter_info = None
return path_info, filter_info

def get_external(self, path, to_info, recursive=False, **kwargs):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Future PR will be needed to add import/get -R support for #3087

Comment on lines 171 to 174
self._fetch_external(fetch_infos, **kwargs)

def _pull_cached(self, out, path_info, dest):
with self.state:
tmp = PathInfo(tmp_fname(dest))
src = tmp / path_info.relative_to(out.path_info)
self.repo_tree.copytree(path_info, to_info)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_external is now essentially just a fetch and then checkout (checkout is done for dvc outs in RepoTree.copytree)

@pmrowla pmrowla changed the title [WIP] repo: use unified RepoTree for erepos repo: use unified RepoTree for erepos May 12, 2020
dvc/utils/__init__.py Outdated Show resolved Hide resolved
dvc/remote/local.py Outdated Show resolved Hide resolved
@pmrowla
Copy link
Contributor Author

pmrowla commented May 18, 2020

closing this, it will be reworked and split into smaller PRs

@pmrowla pmrowla closed this May 18, 2020
@pmrowla pmrowla deleted the 3087-repotree branch May 18, 2020 09:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

external_repo doesn't respect parent cache and progress bar
4 participants