Faster IgnoredError::shouldIgnore()
#737
Conversation
I think at this point, considering the prevalence of generated baselines, we should start considering a non-regex-based ignoreErrors if regex is this much of a bottleneck. Regex-based error filtering made sense when you had to write all error patterns by hand, but now that we're generating baselines so much, it doesn't make much sense anymore.
Agree. For the baseline case we should not need regex matching at all. @ondrejmirtes, do you agree? Would you accept a PR changing baseline generation and matching to use plain strings instead of regexes?
To be clear, regex should still be supported, but using regexes for the baseline makes no sense.
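The plain-string approach being proposed could look roughly like the following. This is a hypothetical Python sketch of the idea, not PHPStan's actual (PHP) implementation; the function names and data layout are assumptions for illustration:

```python
# Sketch: baseline entries matched by exact string lookup instead of
# regex. A dict keyed by file path maps to a Counter of verbatim error
# messages, so each lookup is O(1) instead of a regex match.
from collections import Counter

def build_baseline_index(entries):
    """entries: iterable of (path, message, count) tuples,
    mirroring the fields of a generated baseline entry."""
    index = {}
    for path, message, count in entries:
        index.setdefault(path, Counter())[message] += count
    return index

def should_ignore(index, path, message):
    """Consume one occurrence of an exactly matching baseline entry."""
    messages = index.get(path)
    if messages and messages[message] > 0:
        messages[message] -= 1
        return True
    return False
```

Because baseline entries are generated from the exact reported messages, an exact-match lookup suffices; hand-written regex patterns would still go through the existing regex path.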
I will give it a try.
Hi, how many ignoreErrors do you have? I don't really consider regexes a bottleneck.
~15,000 (as can be seen in the profiles in this PR's description; in my case it's the slowest part of the analysis run).
Wouldn't it be better to have a lower level and a smaller baseline? In this case you have to regenerate it very often, I'd say. In my experience the baseline works really well if you have hundreds of errors in it, not thousands, because it assumes that you are able to write new code cleanly, which is hard on levels 7 and 8 if the rest of the application isn't typed well.
To be precise: we have 15,000 lines of baseline, which means ~3,750 errors. IMO, if the codebase is "big enough" you will have that many errors. You have to start somewhere, and going below level 6 would leave too much room for error in new code, IMO.
Alright, so how much time does the analysis of the whole codebase take (without the result cache), and how much of that is spent filtering ignored errors?
The whole run is 27–28 s; error ignoring is 2 s (on 2e32030). With a populated result cache it is relatively much more: the whole run is 11 s; error ignoring is 2 s (on 2e32030).
FWIW, I didn't see any clear evidence that the regex matching is a bottleneck in the PM case (less than 1% of CPU time, even with a full result-cache hit). But PM has somewhere in the ballpark of 1–2K errors, distributed widely over many files, with usually only a handful of errors per affected file. I'd imagine that, given the per-file segregation, it wouldn't have much impact unless you have a large ratio of errors to files, since it only tries to match error patterns designated for that file specifically. Some numbers for context (these are from Windows, both of them assuming that the …
Less than 1% in either of these cases is pretty insignificant, so I don't really have a dog in this fight.
BTW, it seems like we are not the only ones with such a big baseline: https://twitter.com/Kalle_/status/1455146801632272393
I don't think the size of the baseline matters. What matters more is the number of ignored errors per file. You can see here that it only attempts to check against ignoreErrors patterns with a matching filename: phpstan-src/src/Analyser/IgnoredErrorHelperResult.php, lines 139–150 (at 1e50372).
However, the algorithm used does appear to have O(n^2) complexity. Some off-the-bat maths:
I think at most PM has maybe 20–30 errors in its worst file, so this doesn't really affect PM's analysis time at all.
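The quadratic behaviour described above comes from testing every error in a file against every ignore pattern registered for that file. A Python sketch of that per-file loop (an illustration, not PHPStan's actual code):

```python
import re

def filter_errors(errors, patterns_for_file):
    """Naive per-file filtering: every error message is tested against
    every ignore pattern for the file, so a file with n errors and
    n patterns costs O(n^2) regex matches in the worst case."""
    remaining = []
    for message in errors:
        if not any(re.search(p, message) for p in patterns_for_file):
            remaining.append(message)
    return remaining
```

With only a handful of errors and patterns per file this is cheap; it only becomes noticeable when a single file accumulates hundreds of baseline entries.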
This fix is not very elegant, but as it improves analysis time by 7–8%, I think it's at least worth discussing.
Running PHPStan at 1505153 leads to the following profile:
This means we still see a bottleneck caused by regex matching of ignore rules.
This PR suggests a fast-exit condition based on the error message's first character.
If the first character does not match, and it is not a regex meta-character, the subsequent regex matching logic can never match.
This optimization leads to the following comparison profile:
We can see a ~1 s improvement while also saving ~20% of memory.