Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scm: implement gitignore #5243

Merged
merged 4 commits into from
Jan 11, 2021
Merged

scm: implement gitignore #5243

merged 4 commits into from
Jan 11, 2021

Conversation

skshetry
Copy link
Member

@skshetry skshetry commented Jan 11, 2021

[UPDATE]: Went to implement scm._reset() function and use dulwich by caching its IgnoreManager. Regarding pygit2, leaving as-is till the migrations.


If we start making a lot of is_ignored checks, dulwich is a lot slower than pygit2. I'm toying with the idea of not walking through .gitignore-d files/directories for repo.stage and repo.graph collection.

On master version, the performance on dvc status is following:

  Time (mean ± σ):      4.240 s ±  0.192 s    [User: 3.535 s, System: 0.474 s]
  Range (min … max):    4.141 s …  4.781 s    10 runs

On a modified version that does not traverse through git-ignored directory, the performance with the dulwich/gitpython and pygit2 respectively are:

Benchmark #1: dvc status with dulwich
  Time (mean ± σ):      2.260 s ±  0.389 s    [User: 1.779 s, System: 0.074 s]
  Range (min … max):    1.837 s …  2.883 s    10 runs

Benchmark #2: dvc status with gitpython
  Time (mean ± σ):      1.740 s ±  0.317 s    [User: 874.2 ms, System: 719.3 ms]
  Range (min … max):    1.485 s …  2.604 s    10 runs

Benchmark #3: dvc status with pygit2
  Time (mean ± σ):     461.9 ms ±   8.8 ms    [User: 337.5 ms, System: 47.3 ms]
  Range (min … max):   449.1 ms … 476.3 ms    10 runs

Note that dulwich's performance can be made comparable to pygit2 if we cache the IgnoreManager, but by doing so, we won't be able to use it easily in the API (we need to reset it somewhere/somehow).
Just to compare, this is the performance with IgnoreManager cached.

Benchmark #1: dvc status with dulwich cached
  Time (mean ± σ):     512.1 ms ±  73.2 ms    [User: 387.6 ms, System: 43.3 ms]
  Range (min … max):   436.2 ms … 704.0 ms    10 runs

There is deviation in dulwich for some reason though, but it comparable with pygit2. Another reason to choose dulwich is that, we can clear the state of the .gitignore, but pygit2 does not provide an API to do it (the way is just clearing the backends I think).

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@skshetry skshetry added enhancement Enhances DVC performance improvement over resource / time consuming tasks skip-changelog Skips changelog labels Jan 11, 2021
@skshetry skshetry requested a review from pmrowla January 11, 2021 06:24
@skshetry skshetry self-assigned this Jan 11, 2021
Comment on lines 228 to 235
@cached_property
def ignore_manager(self):
from dulwich.ignore import IgnoreFilterManager

return IgnoreFilterManager.from_repo(self.repo)

manager = ignore.IgnoreFilterManager.from_repo(self.repo)
return manager.is_ignored(relpath(path, self.root_dir))
def is_ignored(self, path: str) -> bool:
return self.ignore_manager.is_ignored(relpath(path, self.root_dir))
Copy link
Member Author

@skshetry skshetry Jan 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like I have to drop this change and make it start using pygit2 by default.

@skshetry skshetry changed the title scm: implement gitignore; optimize on dulwich scm: implement gitignore Jan 11, 2021
@skshetry skshetry marked this pull request as draft January 11, 2021 06:54
@skshetry skshetry marked this pull request as ready for review January 11, 2021 07:13
@skshetry
Copy link
Member Author

Okay, decided to implement _reset anyway. :)

@skshetry skshetry force-pushed the gitignore branch 2 times, most recently from 9c66e11 to 283fed8 Compare January 11, 2021 08:40
* Add is_ignored implementation on `pygit2` and `gitpython` backends.
* Also implement `._reset()` on `repo.scm` which is most of the changes.
@skshetry skshetry requested review from efiop and pared January 11, 2021 08:48
@efiop efiop merged commit a2c91d2 into iterative:master Jan 11, 2021
@efiop
Copy link
Contributor

efiop commented Jan 11, 2021

Great stuff, @skshetry , thank you! 🙏

@skshetry skshetry deleted the gitignore branch January 11, 2021 10:58
@skshetry
Copy link
Member Author

skshetry commented Jan 11, 2021

@efiop, the above benchmarks are with #5244. This PR can be considered as a prerequisite for that issue, if we decide to move forward with it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhances DVC performance improvement over resource / time consuming tasks skip-changelog Skips changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants