remote: support saving RepoTree objects directly to cache #3825

pmrowla · 2020-05-19T10:22:02Z

❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.
❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

Related to #3811.

added support for:

Granular support for RepoTree/DvcTree.open() (Will close api: support granularity (files inside dirs) #3541)
- Files inside DVC directory outs can be opened by any function using RepoTree.open() (Repo.open_by_relpath(), api.open())
- If file is not cached, it will only be opened if tree was initialized with fetch=True or stream=True
RepoTree paths can be saved directly to remote (i.e. local cache)

ex:

tree = RepoTree(..., fetch=True)  # stream or fetch must be set to save DVC outs properly

cache = repo.cache.local
cache.save(path_in_tree, None, tree=tree)

For saving tree paths, rather than requiring the call to cache.save_info(...) to get a destination checksum, for tree paths the checksum will be calculated during save(). The reason for this is that for directories, it makes more sense to collect dir cache information as we walk the tree (and save files from the tree), rather than walking the tree once to collect dir cache information, and then walking it a second time to save files.

GitTree

pmrowla · 2020-05-20T08:49:58Z

dvc/remote/base.py

+        for fname in tree.walk_files(path_info):
+            checksum = tree.get_file_checksum(fname)
+            file_info = {
+                self.PARAM_CHECKSUM: checksum,
+                self.PARAM_RELPATH: fname.relative_to(path_info).as_posix(),
+            }
+            self._save_file(fname, checksum, tree=tree)
+            dir_info.append(file_info)


The actual behavior for this walk depends on how tree was initialized.

Git files will always be saved directly from the tree in _save_file() via tree.open()
For DVC outs:

If fetch=False and stream=False, walk will not recurse into DVC out directories

If fetch=True, DVC outs will be fetched when walk recurses into the out directory

_save_file will skip the tree.open()->copy for fetched outs (since they will already exist in cache)

If stream=True, DVC outs will be saved by streaming in _save_file() via tree.open()

Basically, for erepo's and import/get, we probably always want the behavior from fetch=True.

tests/unit/repo/test_repo_tree.py

efiop · 2020-05-20T22:08:39Z

dvc/remote/base.py

+        if tree:
+            checksum = self._save_tree(path_info, tree)
+        else:
+            dir_info = self.get_dir_cache(checksum)


We've discussed this before, but just to clarify: I suppose we need this because we are not ready to make _collect_dir use the tree, just yet, right? Because it uses self.walk_files, which raises another question of abstracting tree-like Remote methods somehow (maybe in RemoteTree's or something)?

So we can add tree support in _collect_dir, but it made the workflow look a bit strange to me, since _collect_dir is used in get_dir_checksum. If you go off of the current behavior for fetching external git files, we would do something along the lines of:

# get a dir checksum for the git dir via `get_dir_checksum()`/`_collect_dir()`/`tree.walk_files()` dir_cache_info = cache.save_info(path, tree=tree) # walk the tree again and save git file objects from tree cache.save(path, dir_cache_info, tree=tree)

But walking the tree twice like this seems unnecessary for this use case, since we can save objects from the tree and collect the dir cache info at the same time during a single walk, and I wasn't sure if saving objects from tree during the walk would be desired behavior for _collect_dir (because saving objects in a get_dir_checksum call definitely seems like non-desired behavior).

Thinking about it now, what we could do is add the tree param to _collect_dir, and if it's called with a tree then we just save the objects from tree during walk_files, and use _collect_dir here in place of _save_tree. And when _collect_dir is used w/o a tree param, it doesn't save any objects during walk_files and behaves the same way as before for get_dir_checksum calls.

efiop · 2020-05-20T22:12:07Z

dvc/remote/base.py

+        if tree:
+            isdir = tree.isdir
+            save_link = False
+        else:
+            isdir = self.isdir


I like how RemoteTree(or something like that) becomes more and more apparent 😄

And that the state should probably belong to the tree. So here it would be a noop for git tree.

Not saying we should implement all of that right now, just noticing the things that we've discussed earlier 🙂

efiop · 2020-05-20T22:17:21Z

dvc/remote/base.py

+            if self.changed_cache(checksum):
+                self.move(path_info, cache_info, mode=self.CACHE_MODE)
+                self.link(cache_info, path_info)
+            elif self.iscopy(path_info) and self._cache_is_copy(path_info):
+                # Default relink procedure involves unneeded copy
+                self.unprotect(path_info)
+            else:
+                self.remove(path_info)
+                self.link(cache_info, path_info)


Could also think about making save() only save without checking out(these links are effectively checkout). This if&else makes it clear that this method tries to do too much. But at the same time move is a very nice optimization for big files that allows us to make it happen instantly if we are within the same fs.

I would be fine with keeping it as is for now, but let's at least keep this in mind.

efiop

Thanks! Let's merge for now and proceed with the next steps. From the discussions above, it is clear that we will likely abstract away more things when working with trees in remotes.

pmrowla self-assigned this May 19, 2020

pmrowla mentioned this pull request May 19, 2020

repo: abstract erepo dvc/git handling into RepoTree #3811

Closed

4 tasks

pmrowla added 6 commits May 20, 2020 12:59

DvcTree: use out.dir_cache instead of out.collect_use_dir_cache

6117cf0

RepoTree: support filtering out dvcfiles during walk()

18e611b

RepoTree: support recursive isdvc()

dd2ffa7

RepoTree: pull dir contents on walk() if in fetch mode

076dcfe

remote: support computing checksums for tree objects

f1991b6

tests: add func test for DVC dir pull on RepoTree.walk()

22594e1

pmrowla force-pushed the cache-save-tree branch from 935dca0 to 22594e1 Compare May 20, 2020 03:59

pmrowla added 10 commits May 20, 2020 16:00

DvcTree: support granular open()

de17469

RepoTree: add get_file_checksum()

deb729d

remote: revert checksum changes

5bfd57d

remote: add tree support for cache.save()

ff28ac8

tests: test saving directly to cache from RepoTree

a60f8f5

remote: local cache needs its own WorkingTree instance when repo.tree is

6add867

GitTree

RepoTree: state context should be entered by open()/walk() caller

a6dcaf3

RepoTree: only pull dir contents when cache has changed

2047d85

tree: update tree tests

e79c923

fix ds warnings, update comments

821c507

pmrowla commented May 20, 2020

View reviewed changes

pmrowla marked this pull request as ready for review May 20, 2020 08:50

pmrowla changed the title ~~[WIP] remote: support saving RepoTree objects directly to cache~~ remote: support saving RepoTree objects directly to cache May 20, 2020

pmrowla requested review from efiop, pared and skshetry May 20, 2020 08:54

pared suggested changes May 20, 2020

View reviewed changes

tests/unit/repo/test_repo_tree.py Outdated Show resolved Hide resolved

fix windows tests

4082dd2

pmrowla force-pushed the cache-save-tree branch from 031bbba to 4082dd2 Compare May 20, 2020 15:02

efiop mentioned this pull request May 20, 2020

cmd ref: document target granularity for push/pull/fetch/checkout/status -c et al. iterative/dvc.org#886

Closed

efiop reviewed May 20, 2020

View reviewed changes

efiop approved these changes May 21, 2020

View reviewed changes

efiop merged commit 75e795c into iterative:master May 21, 2020

pmrowla deleted the cache-save-tree branch May 21, 2020 04:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

remote: support saving RepoTree objects directly to cache #3825

remote: support saving RepoTree objects directly to cache #3825

pmrowla commented May 19, 2020 •

edited

Loading

pmrowla May 20, 2020

efiop May 20, 2020

pmrowla May 21, 2020 •

edited

Loading

efiop May 20, 2020

efiop May 20, 2020

efiop left a comment

remote: support saving RepoTree objects directly to cache #3825

remote: support saving RepoTree objects directly to cache #3825

Conversation

pmrowla commented May 19, 2020 • edited Loading

pmrowla May 20, 2020

Choose a reason for hiding this comment

efiop May 20, 2020

Choose a reason for hiding this comment

pmrowla May 21, 2020 • edited Loading

Choose a reason for hiding this comment

efiop May 20, 2020

Choose a reason for hiding this comment

efiop May 20, 2020

Choose a reason for hiding this comment

efiop left a comment

Choose a reason for hiding this comment

pmrowla commented May 19, 2020 •

edited

Loading

pmrowla May 21, 2020 •

edited

Loading