-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes for NonBacktracking NFA mode #72199
Conversation
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue DetailsThis PR fixes bugs uncovered by setting the DFA->NFA transition state limit to zero. The main thing was that the non-capturing NFA execution mode was missing logic for backtracking simulation. The logic in taking NFA transitions is amended such that after the transitions for each source state (which are handled in order of priority) have been added to the set of next states, it is checked whether that source state is nullable. If so, the backtracking engine would prefer to end there rather than exploring lower priority paths, and as such any further source states are skipped. Since the backtracking simulation logic must be disabled in the second reverse phase, as indicated by the presence of a An existing bug affecting both the DFA and NFA modes was uncovered by setting the DFA state limit to zero: the first matching phase was not actually limiting the input for the inner loop to 1000 characters at a time, but this was hidden in the relevant unit test by the pattern switching from DFA to NFA mode after ~10000 characters, which also triggered a timeout check, thus letting the test pass. I'm guessing I introduced this in my latest refactoring of the matching loops. This is now fixed. I ran the benchmarks and there are no significant slowdowns. In fact, some tests show a double digit speedup, which if I had to guess was due to the summary:
|
...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/StateFlags.cs
Outdated
Show resolved
Hide resolved
...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/StateFlags.cs
Outdated
Show resolved
Hide resolved
...ies/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/StateFlags.cs
Outdated
Show resolved
Hide resolved
....Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs
Outdated
Show resolved
Hide resolved
....Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs
Outdated
Show resolved
Hide resolved
....Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs
Show resolved
Hide resolved
....Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs
Outdated
Show resolved
Hide resolved
....Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs
Outdated
Show resolved
Hide resolved
Thanks for merging @stephentoub! Performance after the last changes looks good. Tested on a different day than the baseline, so wouldn't read too much into this, but things were actually faster after the fixes from review: summary: No Slower results for the provided threshold = 1% and noise filter = 0.3 ns.
|
This PR fixes bugs uncovered by setting the DFA->NFA transition state limit to zero.
The main thing was that the non-capturing NFA execution mode was missing logic for backtracking simulation. The logic in taking NFA transitions is amended such that after the transitions for each source state (which are handled in order of priority) have been added to the set of next states, it is checked whether that source state is nullable. If so, the backtracking engine would prefer to end there rather than exploring lower priority paths, and as such any further source states are skipped.
Since the backtracking simulation logic must be disabled in the second reverse phase, as indicated by the presence of a
DisableBacktrackingSimulation
node on the top level, I added a new bit to theContextIndependentState
flags to cache the check. That brought up the number of booleans in the "state info" tuples up to five, which started feeling unweildly. I've refactored this into just returning the underlying enum and adding extension methods to make their use concise. The enum is now calledStateFlags
.An existing bug affecting both the DFA and NFA modes was uncovered by setting the DFA state limit to zero: the first matching phase was not actually limiting the input for the inner loop to 1000 characters at a time, but this was hidden in the relevant unit test by the pattern switching from DFA to NFA mode after ~10000 characters, which also triggered a timeout check, thus letting the test pass. I'm guessing I introduced this in my latest refactoring of the matching loops. This is now fixed.
I ran the benchmarks and there are no significant slowdowns. In fact, some tests show a double digit speedup, which if I had to guess was due to the
StateFlags
refactoring.summary:
better: 40, geomean: 1.056
worse: 5, geomean: 1.042
total diff: 46