optimize dvcignore #3869
Comments
Looked at this and I see several issues with code, most of which also affect performance:
Thank you, @Suor. It seems consistent with what I experienced. The more I add lines in |
@Suor
matched = False
for pattern in patterns:
    if pattern.include is not None:
        if file in pattern.match((file,)):
            matched = pattern.include
return matched

It should stop if any of the patterns matched the file. I'd like to try to solve these points this weekend. |
So in the very common case where a file is not ignored, it will be matched against all of those. |
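To make the trade-off above concrete, here is a minimal sketch. It assumes gitignore-style "last matching pattern wins" semantics and uses the standard library's `fnmatch` as a stand-in for the real matcher, so the glob behaviour is not identical to dvcignore. All function names are hypothetical and are not taken from the DVC code base. Walking the patterns in reverse lets the loop stop at the first hit, and a combined regex rejects the common "not ignored" case in a single call.

```python
import fnmatch
import re


def is_ignored_naive(path, patterns):
    """Test every pattern; the last one that matches decides."""
    matched = False
    for glob, include in patterns:
        if fnmatch.fnmatchcase(path, glob):
            matched = include
    return matched


def is_ignored_reversed(path, patterns):
    """Same result, but stop at the last-listed pattern that matches."""
    for glob, include in reversed(patterns):
        if fnmatch.fnmatchcase(path, glob):
            return include
    return False


def compile_combined(patterns):
    """Join all globs into one regex so a non-matching path costs one call."""
    if not patterns:
        return None
    return re.compile("|".join(fnmatch.translate(glob) for glob, _ in patterns))


def is_ignored_fast(path, patterns, combined):
    """Fast path: if no pattern can match at all, skip the per-pattern loop."""
    if combined is None or not combined.match(path):
        return False
    return is_ignored_reversed(path, patterns)


# Each entry is (glob, include); include=False plays the role of a "!" negation.
PATTERNS = [("*.csv", True), ("keep.csv", False), ("data/*", True)]
COMBINED = compile_combined(PATTERNS)
```

The reversed loop alone does not help a file that matches nothing, which is why the combined regex is the part aimed at the "file is not ignored" case.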
Thank you |
Haha, I underestimated the difficulty of it. So far I have only written the benchmark (iterative/dvc-bench#30).
@Suor, according to @courentin's call graph in #3867, it only runs once. |
@karajan1001 |
@karajan1001 No, same as gitignore, it cannot look back in the tree. |
Thank you |
@pared Should we keep this open or are we fully done here? |
@efiop sorry, autoclose. Seems to me we should leave it open. The issue potentially is still present in the case of multiple EDIT: |
* Remove Duplicate ignore Match
* Continue Optimize Dvcignore fix #3869
* Add a new test with multi ignore files
* Solve merging two dvc files
* Solve Code Climate
* For Windows
* Complete addition of patterns. Add one test
* Systematic test
* Change request
* Change request
* Seperate path sepcification math
* Rename and add comment
* rename change_dirname to private
* Update dvc/pathspec_math.py list comprehension Co-authored-by: Alexander Schepanovski <[email protected]>
* Change request
* Update dvc/ignore.py Co-authored-by: karajan1001 <[email protected]>
Co-authored-by: Alexander Schepanovski <[email protected]>
Co-authored-by: Ruslan Kuprieiev <[email protected]>
@efiop I think we should keep this issue open. There have been a lot of optimizations done thanks to @karajan1001's hard work. |
@pared Oops, closed by accident. |
Benchmark results before and after #4242, which was closed without being linked to this issue. |
Thank you @karajan1001 ! Amazing stuff! 🙏 Just to summarize: the only major thing that is left here is to tell dvcignore not to walk into output directories searching for |
@efiop, another simpler optimization could be to not reset |
@skshetry Not sure what resetting has to do with that. Currently, it doesn't know about outputs at all. |
Just saying that there's no need to reset dvcignore at all, so we could only collect dvcignores (dynamically once) and use it for the whole DVC session. Right now, we reset it each time we add a new output: Line 150 in d8b0373
Or, every time we run a stage: Line 43 in d8b0373
But, we don't really need to start clean for dvcignores. |
@skshetry That's a bit dangerous. We do that to ensure that there is no hidden unwanted state when we use the API (I mean Repo). We also need to reset it when walking the branches, as dvcignore might be different. |
@efiop, when walking the branches, the |
@skshetry Ah, totally missed that! Great point! Indeed, seems like we no longer need to reset it and it won't cause us problems. |
@karajan1001, right now, we have to collect stages most of the time. So, even though we have dynamic dvcignore now, we still have to look in every nook and cranny of the repo to collect those stages. We might even be descending deep into directories that are dvc-tracked, searching for ".dvcignore" dynamically. So, the idea is to either read the ".gitignore" files or, as @efiop mentioned, since we are already collecting the stages, find out the outputs and never enter those directories at all. To extend this idea, we could also add a concept of "exhaustive" search to dvcignore: once we are done looking for the stages, we tell dvcignore not to try updating itself any more (which should make later dynamic calls faster), and what has been collected so far is final. |
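The "never enter those directories" idea can be sketched with `os.walk` and in-place pruning. This is only an illustration under assumptions: `find_ignore_files` and its `output_dirs` argument are hypothetical names, outputs are given as paths relative to the repo root, and the sketch only skips the search for ignore files; it says nothing about whether already-collected patterns still apply inside outputs.

```python
import os

IGNORE_FILE = ".dvcignore"


def find_ignore_files(root, output_dirs):
    """Yield ignore files under root without descending into output_dirs."""
    skip = {os.path.normpath(os.path.join(root, d)) for d in output_dirs}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never enters a tracked output directory.
        dirnames[:] = [
            d
            for d in dirnames
            if os.path.normpath(os.path.join(dirpath, d)) not in skip
        ]
        if IGNORE_FILE in filenames:
            yield os.path.join(dirpath, IGNORE_FILE)
```

Assigning to `dirnames[:]` rather than rebinding the name matters here, because `os.walk` only honours pruning done on the list object it handed out.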
@skshetry , @efiop Lines 251 to 254 in 172032d
DvcIgnoreFilter never tries to update a node that it has already looked into. In fact, it never updates a node twice: each node is updated once, and only when a path requests it.
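The "update each node at most once, and only on demand" behaviour described above can be sketched as a small memoizing wrapper. This is not the actual DvcIgnoreFilter code; `LazyIgnoreCache`, `load_patterns`, and the `loads` counter are hypothetical names used only for illustration.

```python
class LazyIgnoreCache:
    """Load ignore patterns for a directory at most once, on first request."""

    def __init__(self, load_patterns):
        # Hypothetical callable: dirname -> list of patterns found there.
        self._load = load_patterns
        # dirname -> patterns that were already loaded.
        self._patterns = {}
        # Counts real loads; kept only to make the behaviour observable.
        self.loads = 0

    def patterns_for(self, dirname):
        """Return cached patterns, loading them only on the first call."""
        if dirname not in self._patterns:
            self.loads += 1
            self._patterns[dirname] = self._load(dirname)
        return self._patterns[dirname]
```

A directory with no ignore file is cached as an empty list, so repeated lookups for it cost a dictionary access rather than another filesystem check.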
@karajan1001, yes, I got confused by that too. If there was a complete search, it won't try to look for dvcignore again, but it will still look into every place, even those directories that are dvc-tracked. But I am not sure if this would make it slower or faster, as there will never be a dvcignore in those directories (so, an To get an idea about the impact of this change, I tried to
Half of those times are spent on |
Okay, I missed another important point: dvcignore has to still apply to the inside of the dvc-tracked directories, it's just the |
From #3867 it is apparent that we need to benchmark dvcignore and research whether anything else there requires optimization. A user (@courentin) reported that something simple like
takes a significant time.