index: add IndexView, brancher: support index #8407

pmrowla · 2022-10-07T09:48:42Z

❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

Prerequisite for #8249
Depends on iterative/dvc-data#195

pmrowla · 2022-10-11T07:34:03Z

dvc/repo/brancher.py

@@ -77,6 +77,7 @@ def brancher(  # noqa: E302
        from dvc.fs import GitFileSystem

        for sha, names in found_revs.items():
+            self.__dict__.pop("index", None)


This at least makes repo.index usable at the same time as brancher. Previously you would just reuse whichever index was built first, because repo.index is a cached_property, so you would usually get the workspace index no matter which brancher rev you were at.
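To illustrate the caching behavior described above, here is a minimal sketch using a toy class. `Repo`, `rev`, and the string "index" are hypothetical stand-ins, not DVC's actual implementation; only `functools.cached_property` and the `__dict__.pop("index", None)` idiom come from the discussion.

```python
from functools import cached_property


class Repo:
    """Toy stand-in for a repo whose index depends on the active revision."""

    def __init__(self):
        self.rev = "workspace"

    @cached_property
    def index(self):
        # Computed on first access, then stored in the instance __dict__
        # under the key "index" and reused on every later access.
        return f"index built at {self.rev}"


repo = Repo()
first = repo.index  # builds and caches the workspace index

repo.rev = "abc123"  # pretend brancher moved to another revision
stale = repo.index  # still the cached workspace index

repo.__dict__.pop("index", None)  # drop the cached value
fresh = repo.index  # rebuilt for the current revision
```

Popping the key from `__dict__` is enough to invalidate a `cached_property`, because the cached value lives in the instance dictionary.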

I think what we really want is a brancher replacement that returns built indexes, and ideally caches them on disk by git SHA. Then, for git SHAs that we have already indexed, we can just load the cached version instead of re-collecting the entire DVC repo for each rev every time we use brancher.
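The on-disk cache idea could look roughly like the sketch below. Everything here is an assumption made for illustration: `load_or_build_index`, the JSON file layout, and `fake_build` are hypothetical and do not exist in DVC, and the sketch ignores the cache invalidation concerns raised later in this thread.

```python
import json
import tempfile
from pathlib import Path


def load_or_build_index(cache_dir, sha, build):
    """Return the index for `sha`, building it only on a cache miss."""
    path = Path(cache_dir) / f"{sha}.json"
    if path.exists():
        # Known SHA: load the previously serialized index.
        return json.loads(path.read_text())
    index = build(sha)
    path.write_text(json.dumps(index))
    return index


calls = []


def fake_build(sha):
    # Hypothetical stand-in for collecting the whole repo at a revision.
    calls.append(sha)
    return {"rev": sha, "stages": ["train", "evaluate"]}


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_build_index(tmp, "abc123", fake_build)
    second = load_or_build_index(tmp, "abc123", fake_build)
```

A git SHA identifies immutable content, which is what makes it attractive as a cache key in this sketch; the workspace has no such identifier and would still need to be collected each time.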

skshetry · 2022-10-11 (edited)

> Previously you would just reuse whatever the first built index was due to it being a cached_property

Not sure I understand this. Index is cleared when the fs changes.

dvc/dvc/repo/__init__.py

Lines 321 to 325 in 5606608

    def fs(self, fs: "FileSystem"):
        self._fs = fs
        # Our graph cache is no longer valid, as it was based on the previous
        # fs.
        self._reset()

dvc/dvc/repo/__init__.py

Lines 576 to 580 in 5606608

    def _reset(self):
        self.state.close()
        self.scm._reset()  # pylint: disable=protected-access
        self.__dict__.pop("index", None)
        self.__dict__.pop("dvcignore", None)
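The interaction between the fs setter and `_reset()` quoted above can be reproduced in isolation. This is a simplified sketch: the class below is a hypothetical stand-in that keeps only the setter, `_reset()`, and a cached `index`, and uses plain strings in place of real filesystem objects.

```python
from functools import cached_property


class Repo:
    """Toy repo: assigning a new fs invalidates the cached index."""

    def __init__(self, fs):
        self._fs = fs

    @property
    def fs(self):
        return self._fs

    @fs.setter
    def fs(self, fs):
        self._fs = fs
        # The cached index was built from the previous fs.
        self._reset()

    def _reset(self):
        self.__dict__.pop("index", None)

    @cached_property
    def index(self):
        return f"index for {self._fs}"


repo = Repo("local")
before = repo.index  # cached against the local fs
repo.fs = "git:abc123"  # setter runs _reset(), dropping the cache
after = repo.index  # rebuilt against the new fs
```

Under this pattern, any code path that swaps the filesystem through the setter gets a fresh index without clearing it explicitly.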

> I think what we really want is a brancher replacement that returns built indexes

That was the original idea, see this gist for example. The problem is the way we swap out fs between revisions; that needs to be fixed, and the fs needs to be per-index.

> ideally caches them on disk by git SHA

That was something that I was thinking of working on next (after introducing index). So we have these bits implemented:

dvc/dvc/repo/index.py

Line 307 in 5606608

    def dumpd(self) -> Dict[str, Dict]:

dvc/dvc/repo/index.py

Line 319 in 5606608

    def identifier(self) -> str:

That is also the reason why the index is immutable. But there were lots of issues, the most important being that the data structure was not appropriate at the time and that thinking in terms of stages is not quite suitable (if data is what we care about). It is definitely complicated, since we also have params, metrics, etc., which could be cached as well.

Since the implementation currently uses stage.dumpd(), the cache is too slow to create automatically. When I last tried, serialization alone took ~5s for a significantly large repo. There is also a lot of cache invalidation that we need to be careful about (e.g. broken dvc.yaml/.dvc files, missing metrics, etc.). I did not find any difference between loading from GitFS and loading from cache that would justify the complicated implementation at the time.

pmrowla · 2022-10-11

I missed the fs setter/_reset behavior. I was seeing some issue with caching during brancher, but it might have been something unrelated. I'll take another look and remove the brancher change if I can't reproduce it.

skshetry · 2022-10-18

Any updates on this? Do we need this?

pmrowla · 2022-10-18

I think it can be removed. I'll take care of it when the import-url changes get merged. (I'm still in the process of testing some potential brancher-related object and import collection changes.)

efiop · 2022-10-11T12:46:06Z

dvc/repo/index.py

@@ -113,6 +114,62 @@ def is_stage_inside_path(stage: "Stage") -> bool:

        return self.filter(is_stage_inside_path)

+    @staticmethod
+    def _hash_targets(


Hm, what is this for?

pmrowla · 2022-10-11

Caching known sets of targets that we've already collected using this index. It's just the Python __hash__() of a frozenset of stages for a given targets string plus the deps/recursive collection flags.

So we can do the following:

    targets_hash = self._hash_targets(targets, **kwargs)
    if targets_hash not in self._collected_targets:
        ...

Maybe it would be easier to read code-wise if it was

    targets = (frozenset(targets), with_deps, recursive)
    collected_targets[targets] = ...

but we don't really need to reuse the entire frozenset from the key at all, so I simplified it here to just use the hash.
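A minimal sketch of the hash-keyed cache described above, assuming the key is derived from the targets and the two collection flags. The names `hash_targets`, `collect`, and `collected`, and the `sorted()` call standing in for real collection, are hypothetical and do not reflect the actual `_hash_targets` signature.

```python
def hash_targets(targets, with_deps=False, recursive=False):
    # frozenset makes the key independent of target order; the flags are
    # included because they change what gets collected.
    return hash((frozenset(targets), with_deps, recursive))


collected = {}


def collect(targets, **kwargs):
    key = hash_targets(targets, **kwargs)
    if key not in collected:
        # Stand-in for the expensive collection step.
        collected[key] = sorted(targets)
    return collected[key]


a = collect(["b.dvc", "a.dvc"], with_deps=True)
b = collect(["a.dvc", "b.dvc"], with_deps=True)  # same set: cache hit
c = collect(["a.dvc", "b.dvc"], with_deps=False)  # different flags: miss
```

Storing only the integer hash keeps the cache keys small, at the cost that two different target sets with colliding hashes would share an entry. Using the tuple itself as the key avoids that, because a dict falls back to an equality check when hashes match.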

pmrowla changed the title ~~[WIP] index: add IndexView~~ [WIP] index: add IndexView, brancher: support index Oct 7, 2022

pmrowla force-pushed the index-views branch 5 times, most recently from 29a2d04 to af888e9 Compare October 11, 2022 05:41

pmrowla added 4 commits October 11, 2022 16:28

output: mypy fixes

9440d1a

index: cache target collection

e93d83d

index: add IndexView and Index.targets_view()

bdb782a

brancher: clear repo.index on each branch

e3c3ec7

pmrowla force-pushed the index-views branch from af888e9 to e3c3ec7 Compare October 11, 2022 07:29

pmrowla commented Oct 11, 2022

pmrowla changed the title ~~[WIP] index: add IndexView, brancher: support index~~ index: add IndexView, brancher: support index Oct 11, 2022

pmrowla marked this pull request as ready for review October 11, 2022 07:35

efiop reviewed Oct 11, 2022

efiop approved these changes Oct 11, 2022

efiop merged commit add28f7 into iterative:main Oct 11, 2022

pmrowla deleted the index-views branch October 11, 2022 13:14

skshetry mentioned this pull request Dec 14, 2022

Reset all indices on the brancher iteration #8679

Merged

2 tasks
