-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
index: add IndexView, brancher: support index #8407
Conversation
29a2d04
to
af888e9
Compare
@@ -77,6 +77,7 @@ def brancher( # noqa: E302 | |||
from dvc.fs import GitFileSystem | |||
|
|||
for sha, names in found_revs.items(): | |||
self.__dict__.pop("index", None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This at least makes repo.index
usable at the same time as brancher. Previously you would just reuse whatever the first built index was due to it being a cached_property (so you would usually get the workspace index no matter what brancher rev you were at)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what we really want is a brancher replacement that returns built indexes, and ideally caches them on disk by git SHA. So for known git SHA's that we have already indexed we can just load the cached version instead of re-collecting the entire DVC repo for each rev every time we use brancher.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously you would just reuse whatever the first built index was due to it being a cached_property
Not sure I understand this. Index is cleared when the fs
changes.
Lines 321 to 325 in 5606608
def fs(self, fs: "FileSystem"): | |
self._fs = fs | |
# Our graph cache is no longer valid, as it was based on the previous | |
# fs. | |
self._reset() |
Lines 576 to 580 in 5606608
def _reset(self): | |
self.state.close() | |
self.scm._reset() # pylint: disable=protected-access | |
self.__dict__.pop("index", None) | |
self.__dict__.pop("dvcignore", None) |
I think what we really want is a brancher replacement that returns built indexes
That was the original idea, see this gist for example. The problem is the way we are swapping out fs
in between which needs to be fixed and needs to be per-index.
ideally caches them on disk by git SHA
That was something that I was thinking of working on next (after introducing index). So we have these bits implemented:
Line 307 in 5606608
def dumpd(self) -> Dict[str, Dict]: |
Line 319 in 5606608
def identifier(self) -> str: |
And the reason why index
is immutable. But there were lots of issues, the important one being that the data structure was not appropriate at that time and the thinking in terms of stage
is not quite suitable (if data
is what we cared about). It is definitely complicated since we have params, metrics, etc which could also be cached.
Since the implementation currently uses stage.dumpd()
, it is terribly slow to even create automatically. Last time, it would take ~5s just to serialize for a significantly large repo. Also there's a lot of cache invalidation that we need to be careful of (eg: dvc.yaml
/.dvc
broken, metrics missing, etc.). I did not find any difference between
loading from GitFS
to loading from cache to justify the complicated implementation at the time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed the fs setter/_reset behavior. I was seeing some issue with caching during brancher but it might have been something unrelated then. I'll take another look and then remove the brancher change if I can't reproduce it again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any updates on this? Do we need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it can be removed, I'll take care of it when the import-url changes get merged. (I'm still in the process of testing some potential brancher related object+import collection changes)
@@ -113,6 +114,62 @@ def is_stage_inside_path(stage: "Stage") -> bool: | |||
|
|||
return self.filter(is_stage_inside_path) | |||
|
|||
@staticmethod | |||
def _hash_targets( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, what is this for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caching known sets of targets that we've already collected using this index. It's just the python __hash__()
of a frozen set of stages for a given targets
string + deps/recursive collection flags.
So we can do the
targets_hash = self._hash_targets(targets, **kwargs)
if targets_hash not in self._collected_targets:
Maybe it would be easier to read code-wise if it was
targets = (frozenset(targets), with_deps, recursive)
collected_targets[targets] = ...
but we don't really need to re-use the entire frozenset from the key at all so I simplified it here to just use the hash
β I have followed the Contributing to DVC checklist.
π If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. π
Prerequisite for #8249
Depends on iterative/dvc-data#195