Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remote: support saving RepoTree objects directly to cache #3825

Merged
merged 17 commits into from
May 21, 2020

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented May 19, 2020

  • ❗ I have followed the Contributing to DVC checklist.

  • πŸ“– If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.

  • ❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Related to #3811.

added support for:

  • Granular support for RepoTree/DvcTree.open() (Will close api: support granularity (files inside dirs)Β #3541)
    • Files inside DVC directory outs can be opened by any function using RepoTree.open() (Repo.open_by_relpath(), api.open())
    • If file is not cached, it will only be opened if tree was initialized with fetch=True or stream=True
  • RepoTree paths can be saved directly to remote (i.e. local cache)

ex:

tree = RepoTree(..., fetch=True)  # stream or fetch must be set to save DVC outs properly

cache = repo.cache.local
cache.save(path_in_tree, None, tree=tree)

For saving tree paths, rather than requiring the call to cache.save_info(...) to get a destination checksum, for tree paths the checksum will be calculated during save(). The reason for this is that for directories, it makes more sense to collect dir cache information as we walk the tree (and save files from the tree), rather than walking the tree once to collect dir cache information, and then walking it a second time to save files.

Comment on lines +536 to +543
for fname in tree.walk_files(path_info):
checksum = tree.get_file_checksum(fname)
file_info = {
self.PARAM_CHECKSUM: checksum,
self.PARAM_RELPATH: fname.relative_to(path_info).as_posix(),
}
self._save_file(fname, checksum, tree=tree)
dir_info.append(file_info)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actual behavior for this walk depends on how tree was initialized.

Git files will always be saved directly from the tree in _save_file() via tree.open()
For DVC outs:

  • If fetch=False and stream=False, walk will not recurse into DVC out directories
  • If fetch=True, DVC outs will be fetched when walk recurses into the out directory
    • _save_file will skip the tree.open()->copy for fetched outs (since they will already exist in cache)
  • If stream=True, DVC outs will be saved by streaming in _save_file() via tree.open()

Basically, for erepo's and import/get, we probably always want the behavior from fetch=True.

@pmrowla pmrowla marked this pull request as ready for review May 20, 2020 08:50
@pmrowla pmrowla changed the title [WIP] remote: support saving RepoTree objects directly to cache remote: support saving RepoTree objects directly to cache May 20, 2020
@pmrowla pmrowla requested review from efiop, pared and skshetry May 20, 2020 08:54
tests/unit/repo/test_repo_tree.py Outdated Show resolved Hide resolved
Comment on lines +512 to +515
if tree:
checksum = self._save_tree(path_info, tree)
else:
dir_info = self.get_dir_cache(checksum)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've discussed this before, but just to clarify: I suppose we need this because we are not ready to make _collect_dir use the tree, just yet, right? Because it uses self.walk_files, which raises another question of abstracting tree-like Remote methods somehow (maybe in RemoteTree's or something)?

Copy link
Contributor Author

@pmrowla pmrowla May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we can add tree support in _collect_dir, but it made the workflow look a bit strange to me, since _collect_dir is used in get_dir_checksum. If you go off of the current behavior for fetching external git files, we would do something along the lines of:

# get a dir checksum for the git dir via `get_dir_checksum()`/`_collect_dir()`/`tree.walk_files()`
dir_cache_info = cache.save_info(path, tree=tree)
# walk the tree again and save git file objects from tree
cache.save(path, dir_cache_info, tree=tree)

But walking the tree twice like this seems unnecessary for this use case, since we can save objects from the tree and collect the dir cache info at the same time during a single walk, and I wasn't sure if saving objects from tree during the walk would be desired behavior for _collect_dir (because saving objects in a get_dir_checksum call definitely seems like non-desired behavior).

Thinking about it now, what we could do is add the tree param to _collect_dir, and if it's called with a tree then we just save the objects from tree during walk_files, and use _collect_dir here in place of _save_tree. And when _collect_dir is used w/o a tree param, it doesn't save any objects during walk_files and behaves the same way as before for get_dir_checksum calls.

Comment on lines +596 to +600
if tree:
isdir = tree.isdir
save_link = False
else:
isdir = self.isdir
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how RemoteTree(or something like that) becomes more and more apparent πŸ˜„

And that the state should probably belong to the tree. So here it would be a noop for git tree.

Not saying we should implement all of that right now, just noticing the things that we've discussed earlier πŸ™‚

Comment on lines +469 to +477
if self.changed_cache(checksum):
self.move(path_info, cache_info, mode=self.CACHE_MODE)
self.link(cache_info, path_info)
elif self.iscopy(path_info) and self._cache_is_copy(path_info):
# Default relink procedure involves unneeded copy
self.unprotect(path_info)
else:
self.remove(path_info)
self.link(cache_info, path_info)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also think about making save() only save without checking out(these links are effectively checkout). This if&else makes it clear that this method tries to do too much. But at the same time move is a very nice optimization for big files that allows us to make it happen instantly if we are within the same fs.

I would be fine with keeping it as is for now, but let's at least keep this in mind.

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Let's merge for now and proceed with the next steps. From the discussions above, it is clear that we will likely abstract away more things when working with trees in remotes.

@efiop efiop merged commit 75e795c into iterative:master May 21, 2020
@pmrowla pmrowla deleted the cache-save-tree branch May 21, 2020 04:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

api: support granularity (files inside dirs)
3 participants