Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize dvcignore #3869

Closed
1 task
efiop opened this issue May 25, 2020 · 27 comments · Fixed by #3967, #4120 or #4371
Closed
1 task

optimize dvcignore #3869

efiop opened this issue May 25, 2020 · 27 comments · Fixed by #3967, #4120 or #4371
Labels
p1-important Important, aka current backlog of things to do performance improvement over resource / time consuming tasks research

Comments

@efiop
Copy link
Contributor

efiop commented May 25, 2020

From #3867 it is apparent that we need to bench dvcignore and research if anything else there requires optimization. A user (@courentin ) reported that something simple like

data/something
bash_projects
mlruns
notebooks
externals
examples
.circleci
.vscode
.pytest_cache
wandb

takes a significant time.

  • add benchmark for this to dvc-bench project
@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label May 25, 2020
@efiop efiop added the p1-important Important, aka current backlog of things to do label May 25, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label May 25, 2020
@efiop efiop added performance improvement over resource / time consuming tasks research labels May 25, 2020
@Suor
Copy link
Contributor

Suor commented May 29, 2020

Looked at this and I see several issues with code, most of which also affect performance:

  1. DvcIgnore* classes designed around using them in walk(), which leads to excessively complex code handling check for full path.
  2. All the complexity above is added to CleanTree, which makes it go into DvcIgnore* domain. It should really look like:
    def isfile(self, path):
        return self.tree.isfile(path) and not self.dvcignore.match_file(path)
  3. DvcIgnore* hierarchy forces sequential check against all rules and regexes. This might be optimized by constructing single structure, i.e. a big regex or a prefix tree. This is complicated by the fact that dvcignores may contain negations though.
  4. pathspec library also checks against all regexes sequentially, which adds to 3.
  5. High level issue. I suspect that we recheck files found via walk, so we run ignores twice. It needs to be estimated whether this is an issue.

@courentin
Copy link
Contributor

Thank you @Suor

It seems consistent with what I experienced. The more I add lines in .dvcignore, the slowest dvc status is.

@karajan1001
Copy link
Contributor

karajan1001 commented May 29, 2020

Looked at this and I see several issues with code, most of which also affect performance:

  1. DvcIgnore* classes designed around using them in walk(), which leads to excessively complex code handling check for full path.
  2. All the complexity above is added to CleanTree, which makes it go into DvcIgnore* domain. It should really look like:
    def isfile(self, path):
        return self.tree.isfile(path) and not self.dvcignore.match_file(path)
  3. DvcIgnore* hierarchy forces sequential check against all rules and regexes. This might be optimized by constructing single structure, i.e. a big regex or a prefix tree. This is complicated by the fact that dvcignores may contain negations though.
  4. pathspec library also checks against all regexes sequentially, which adds to 3.
  5. High level issue. I suspect that we recheck files found via walk, so we run ignores twice. It needs to be estimated whether this is an issue.

@Suor
I also have an interest in this issue and had looked into the code. And here is my solution to some of the issues above:

  1. For point (1), walk() should use information from DvcIgnore. For example, exiting the .git directory at the beginning of the iteration.
  2. For point (3), according to this article, tried tree or automaton would only perform better if the number of ignored-expressions was greater than several hundred.
  3. For point (4), In pathspec
	matched = False
	for pattern in patterns:
		if pattern.include is not None:
			if file in pattern.match((file,)):
				matched = pattern.include
	return matched

It should stop if any of the patterns matched the file.
@courentin
And I think this is the main reason that It gets slower as the ignore list grows.

I'd like to try to solve these points this weekend.

@Suor
Copy link
Contributor

Suor commented May 29, 2020

@karajan1001

  1. There is no issue with walk, the ignored dir won't we traversed. The issue is when we need to check whether some/path/abc/file.txt is ignored we need to build all of its parents and test them in an unnatural way.

It should stop if any of the patterns matched the file.

So for very common case that file is not ignored it will match it against all of those.

@karajan1001
Copy link
Contributor

karajan1001 commented May 30, 2020

  1. There is no issue with walk, the ignored dir won't we traversed. The issue is when we need to check whether some/path/abc/file.txt is ignored we need to build all of its parents and test them in an unnatural way.

Thank you
Dos Dvcignore support expressions like ../*.csv which influences files outside the current path?

@karajan1001
Copy link
Contributor

karajan1001 commented May 31, 2020

Haha, underrated the difficulty of it. Only written the benchmark (iterative/dvc-bench#30).

5. High level issue. I suspect that we recheck files found via walk, so we run ignores twice. It needs to be estimated whether this is an issue.

@Suor, According to @courentin 's call graph in #3867 it only runs once.

@karajan1001
Copy link
Contributor

karajan1001 commented Jun 7, 2020

@efiop @pared
I have a question how could I test unmerged changes using dvc-bench.

@pared
Copy link
Contributor

pared commented Jun 8, 2020

@karajan1001
Prepared PR explaining this in README, please take a look and review:
iterative/dvc-bench#41

@efiop
Copy link
Contributor Author

efiop commented Jun 8, 2020

Dos Dvcignore support expressions like ../*.csv which influences files outside the current path?

@karajan1001 No, same as gitignore, it cannot look back in the tree.

@karajan1001
Copy link
Contributor

README

Thank you

@efiop
Copy link
Contributor Author

efiop commented Jun 15, 2020

@pared Should we keep this open or are we fully done here?

@pared pared reopened this Jun 15, 2020
@pared
Copy link
Contributor

pared commented Jun 15, 2020

@efiop sorry, autoclose. Seems to me we should leave it open. The issue potentially is still present in the case of multiple dvcignore files. Also, points noted by @Suor (#3869 (comment)) still need to be addressed.

EDIT:
#3967 addresses 2 points out of 5 (addressed points are 3 and 4)

@skshetry skshetry assigned skshetry and unassigned skshetry Jun 16, 2020
efiop added a commit that referenced this issue Jul 14, 2020
* Remove Duplicate ignore Match

* Continue Optimize Dvcignore

fix #3869

* Add a new test with multi ignore files

* Solve merging two dvc files

* Solve Code Climate

* For Windows

* Complete addition of patterns. Add one test

* Systematic test

* Change request

* Change request

* Seperate path sepcification math

* Rename and add comment

* rename change_dirname to private

* Update dvc/pathspec_math.py

list comprehension

Co-authored-by: Alexander Schepanovski <[email protected]>

* Change request

* Update dvc/ignore.py

Co-authored-by: karajan1001 <[email protected]>
Co-authored-by: Alexander Schepanovski <[email protected]>
Co-authored-by: Ruslan Kuprieiev <[email protected]>
@pared
Copy link
Contributor

pared commented Jul 15, 2020

@efiop I think we should keep this issue open. There have been a lot of optimizations done thanks to @karajan1001 hard work.
Still, looking at @Suor original comment, the first 2 issues are still present and we will need to investigate them.

@pared pared reopened this Jul 15, 2020
@efiop
Copy link
Contributor Author

efiop commented Jul 15, 2020

@pared Oops, closed by accident.

@karajan1001
Copy link
Contributor

karajan1001 commented Jul 23, 2020

Benchmark result before and after #4242, it had been closed and not linked to this issue.

This update saves about 5ms.
Screenshot from 2020-07-23 21-24-41

@efiop
Copy link
Contributor Author

efiop commented Aug 4, 2020

Thank you @karajan1001 ! Amazing stuff! 🙏

Just to summarize: the only major thing that is left here is to tell dvcignore not to walk into output directories searching for .dvcignores. So kinda like dvcignore for dvcingores 🙂 Related to #3369 .

@skshetry
Copy link
Member

skshetry commented Aug 4, 2020

@efiop, another simpler optimization could be to not reset dvcignore at all (except on brancher, but on those repo._tree changes which change dvcignore anyway).

@efiop
Copy link
Contributor Author

efiop commented Aug 4, 2020

@skshetry Not sure what resetting has to do with that. Currently, it doesn't know about outputs at all.

@skshetry
Copy link
Member

skshetry commented Aug 4, 2020

Just saying that there's no need to reset dvcignore at all, so we could only collect dvcignores (dynamically once) and use it for the whole DVC session. Right now, we reset it each time we add a new output:

repo._reset() # pylint: disable=protected-access

Or, every time we run a stage:

stage.repo._reset() # pylint: disable=protected-access

But, we don't really need to start clean for dvcignores.

@efiop
Copy link
Contributor Author

efiop commented Aug 4, 2020

@skshetry That's a bit dangerous, and we do that to ensure that there is no hidden unwanted state when we use API (i mean Repo). We also need to reset it when walking the branches, as dvcignore might be different.

@skshetry
Copy link
Member

skshetry commented Aug 4, 2020

@efiop, when walking the branches, the tree will have their own dvcignore. Also, I have to point out that since CleanTree, we did not use to clear dvcignore (but of course, this was an error on our part), so should be safe?

@efiop
Copy link
Contributor Author

efiop commented Aug 4, 2020

@skshetry Ah, totally missed that! Great point! Indeed, seems like we no longer need to reset it and it won't cause us problems.

@karajan1001
Copy link
Contributor

Thank you @karajan1001 ! Amazing stuff! 🙏

Just to summarize: the only major thing that is left here is to tell dvcignore not to walk into output directories searching for .dvcignores. So kinda like dvcignore for dvcingores 🙂 Related to #3369 .
@efiop
Sorry, I had look through #3369, but I can't get this word.
tell dvcignore not to walk into output directories searching for ".dvcignore" s
Can you give me some more details?

@skshetry
Copy link
Member

skshetry commented Aug 5, 2020

@karajan1001, right now, we have to collect stages most of the time. So, even though we have dynamic dvcignore now, we still have to look in every nook and corner inside of the repo to collect those stages. We might even be traversing to the depths of directories that are dvc-tracked and might be searching for ".dvcignore" dynamically.

So, the idea is to either read ".gitignore"s or, @efiop mentioned, as we are already collecting for the stages, we could find out the outputs, and never enter into those directories at all.

Also to extend this idea, we could create a concept of "exhaustive" search inside dvcignore, so that after we are done looking for the stages, we can tell dvcignore to not even try updating dvcignore now (should make later dynamic calls faster), and what we have is what it is.

@karajan1001
Copy link
Contributor

karajan1001 commented Aug 10, 2020

@skshetry , @efiop
Seems that after we collecting stages, all the node of the dvcignore trie tree would be filled.

dvc/dvc/ignore.py

Lines 251 to 254 in 172032d

def _get_trie_pattern(self, dirname):
ignore_pattern = self.ignores_trie_tree.get(dirname)
if ignore_pattern:
return ignore_pattern

DvcIgnoreFilter would never try to update a node that it had looked in before. Actually it will never try to update a node twice and will update once only when a path calls it.

_reset() is, absolutely a big problem. I'm going to open a PR for it with a benchmark result.

@skshetry
Copy link
Member

@karajan1001, yes, I got confused by that too. If there was a complete search, it won't try to look for dvcignore again, but it will still look into every places, even those directories that are dvc-tracked.

But I am not sure if this would make it slower or faster, as there will never be a dvcignore in those directories (so, an .exists() call is only wasted), and os.walk seems cheap enough. Maybe, if the dvc-tracked directories is of too-many levels, it might help.


To get an idea about the impact of this change, I tried to walk() Linux's source tree by converting it to DVC repo (added .dvc folder and renamed all .gitignore to .dvcignore, not quite a dvc repo but good enough for understanding dvcignore's impact).

0.94   -> 16m20s
latest -> 2.76s at first, 2.02s later on
latest (without dvcignore) -> 202ms

Half of those times are spent on .match() for regex patterns. So, that's quite an optimization. Cheers.

@skshetry
Copy link
Member

Okay, I missed another important point: dvcignore has to still apply to the inside of the dvc-tracked directories, it's just the dvcignore that's not needed to be collected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
p1-important Important, aka current backlog of things to do performance improvement over resource / time consuming tasks research
Projects
None yet
6 participants