Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

index: add IndexView, brancher: support index #8407

Merged
merged 4 commits into from
Oct 11, 2022

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Oct 7, 2022

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Prerequisite for #8249
Depends on iterative/dvc-data#195

@pmrowla pmrowla changed the title [WIP] index: add IndexView [WIP] index: add IndexView, brancher: support index Oct 7, 2022
@pmrowla pmrowla force-pushed the index-views branch 5 times, most recently from 29a2d04 to af888e9 Compare October 11, 2022 05:41
@@ -77,6 +77,7 @@ def brancher( # noqa: E302
from dvc.fs import GitFileSystem

for sha, names in found_revs.items():
self.__dict__.pop("index", None)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This at least makes repo.index usable at the same time as brancher. Previously you would just reuse whatever the first built index was due to it being a cached_property (so you would usually get the workspace index no matter what brancher rev you were at)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what we really want is a brancher replacement that returns built indexes, and ideally caches them on disk by git SHA. So for known git SHA's that we have already indexed we can just load the cached version instead of re-collecting the entire DVC repo for each rev every time we use brancher.

Copy link
Member

@skshetry skshetry Oct 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously you would just reuse whatever the first built index was due to it being a cached_property

Not sure I understand this. Index is cleared when the fs changes.

dvc/dvc/repo/__init__.py

Lines 321 to 325 in 5606608

def fs(self, fs: "FileSystem"):
self._fs = fs
# Our graph cache is no longer valid, as it was based on the previous
# fs.
self._reset()

dvc/dvc/repo/__init__.py

Lines 576 to 580 in 5606608

def _reset(self):
self.state.close()
self.scm._reset() # pylint: disable=protected-access
self.__dict__.pop("index", None)
self.__dict__.pop("dvcignore", None)

I think what we really want is a brancher replacement that returns built indexes

That was the original idea, see this gist for example. The problem is the way we are swapping out fs in between which needs to be fixed and needs to be per-index.

ideally caches them on disk by git SHA

That was something that I was thinking of working on next (after introducing index). So we have these bits implemented:

def dumpd(self) -> Dict[str, Dict]:

def identifier(self) -> str:

And the reason why index is immutable. But there were lots of issues, the important one being that the data structure was not appropriate at that time and the thinking in terms of stage is not quite suitable (if data is what we cared about). It is definitely complicated since we have params, metrics, etc which could also be cached.

Since the implementation currently uses stage.dumpd(), it is terribly slow to even create automatically. Last time, it would take ~5s just to serialize for a significantly large repo. Also there's a lot of cache invalidation that we need to be careful of (eg: dvc.yaml/.dvc broken, metrics missing, etc.). I did not find any difference between
loading from GitFS to loading from cache to justify the complicated implementation at the time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed the fs setter/_reset behavior. I was seeing some issue with caching during brancher but it might have been something unrelated then. I'll take another look and then remove the brancher change if I can't reproduce it again

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any updates on this? Do we need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it can be removed, I'll take care of it when the import-url changes get merged. (I'm still in the process of testing some potential brancher related object+import collection changes)

@pmrowla pmrowla changed the title [WIP] index: add IndexView, brancher: support index index: add IndexView, brancher: support index Oct 11, 2022
@pmrowla pmrowla marked this pull request as ready for review October 11, 2022 07:35
@@ -113,6 +114,62 @@ def is_stage_inside_path(stage: "Stage") -> bool:

return self.filter(is_stage_inside_path)

@staticmethod
def _hash_targets(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, what is this for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caching known sets of targets that we've already collected using this index. It's just the python __hash__() of a frozen set of stages for a given targets string + deps/recursive collection flags.

So we can do the

targets_hash = self._hash_targets(targets, **kwargs)
if targets_hash not in self._collected_targets:

Maybe it would be easier to read code-wise if it was

targets = (frozenset(targets), with_deps, recursive)
collected_targets[targets] = ...

but we don't really need to re-use the entire frozenset from the key at all so I simplified it here to just use the hash

@efiop efiop merged commit add28f7 into iterative:main Oct 11, 2022
@pmrowla pmrowla deleted the index-views branch October 11, 2022 13:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants