-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dvc: optimize Git.is_tracked() #3053
Conversation
# There are 4 stages, see BaseIndexEntry.stage | ||
return any((path, i) in self.repo.index.entries for i in (0, 1, 2, 3)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might be a new index version released at some point and this might break in a non-obvious way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you you have any benchmark-ish results on how much problems this is causing us right now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are not index versions. From BaseIndexEntry.stage
docstring:
Stage of the entry, either:
* 0 = default stage
* 1 = stage before a merge or common ancestor entry in case of a 3 way merge
* 2 = stage of entries from the 'left' side of the merge
* 3 = stage of entries from the right side of the merge
:note: For more information, see http://www.kernel.org/pub/software/scm/git/docs/git-read-tree.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Suor How much real-world problems is it causing us? Does it justify this hack at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tbh, glimpsing at it I don't quite understand what is going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Benched it - doesn't help much, the reason is .index
is not cached it's reread on each use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Listing 10k files in 10k git repo took 6:19 with old code and 5:16 with new one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Suor When do we do that? When we have 10K stage outputs? We don't list files inside dirs, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We list when we do dvc add -r dir
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Suor But we don't recomend doing that on thousands of files, and consider that a misuse. Hence why I'm questioning if we even need this change.
Benchmarked old and new code to walk 10k files in 10k repo, with and without index entries caching:
|
For 100k files in repo:
All these tests are about running |
@Suor what is index entries caching? Don't see it mentioned/explained anywhere around here. |
This is how @property
def index(self):
""":return: IndexFile representing this repository's index.
:note: This property can be expensive, as the returned ``IndexFile`` will be
reinitialized. It's recommended to re-use the object."""
return IndexFile(self) Then For benching I cached it on our @cached_property
def index(self):
return self.repo.index and then used |
Summary:
|
Closes #3000.