Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvc: optimize Git.is_tracked() #3053

Merged
merged 1 commit into from
Jan 10, 2020
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion dvc/scm/git/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -229,7 +229,8 @@ def is_tracked(self, path):
# it is equivalent to `bool(self.repo.git.ls_files(path))` by
# functionality, but ls_files fails on unicode filenames
path = relpath(path, self.root_dir)
return path in [i[0] for i in self.repo.index.entries]
# There are 4 stages, see BaseIndexEntry.stage
return any((path, i) in self.repo.index.entries for i in (0, 1, 2, 3))
Comment on lines +232 to +233
Copy link
Contributor

@efiop efiop Jan 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be a new index version released at some point and this might break in a non-obvious way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you you have any benchmark-ish results on how much problems this is causing us right now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not index versions. From BaseIndexEntry.stage docstring:

Stage of the entry, either:

    * 0 = default stage
    * 1 = stage before a merge or common ancestor entry in case of a 3 way merge
    * 2 = stage of entries from the 'left' side of the merge
    * 3 = stage of entries from the right side of the merge

:note: For more information, see http://www.kernel.org/pub/software/scm/git/docs/git-read-tree.html

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Suor How much real-world problems is it causing us? Does it justify this hack at all?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tbh, glimpsing at it I don't quite understand what is going on.

Copy link
Contributor Author

@Suor Suor Jan 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Benched it - doesn't help much, the reason is .index is not cached it's reread on each use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Listing 10k files in 10k git repo took 6:19 with old code and 5:16 with new one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Suor When do we do that? When we have 10K stage outputs? We don't list files inside dirs, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We list when we do dvc add -r dir.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Suor But we don't recomend doing that on thousands of files, and consider that a misuse. Hence why I'm questioning if we even need this change.


def is_dirty(self):
return self.repo.is_dirty()
Expand Down