-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Eliminate backtracking in the interpreter for patterns with .* #51508
Conversation
Tagging subscribers to this area: @eerhardt, @pgovind Issue DetailsI spend the last week looking at some potential optimizations in the RegexInterpreter and found this improvement. This PR doesn't change any behavior and is a straightforward optimization. Here is how it works: Given a pattern such as Some follow up to investigate after this PR: Same optimization for patterns with Fixes
|
26d39ee
to
c8f3778
Compare
Is this an alternative approach to #42408 ? I need to think about it. |
There are no patterns in our regex perf tests that would be impacted by this optimization. You might consider adding one in the perf repo before committing this, so the change before and after is on the record. In fact, given our limited set of perf tests, it might be a good idea for us to always make sure there's a perf test that would benefit before committing any interesting regex optimization. I suggest in this case several variations. |
...raries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexInterpreter.cs
Outdated
Show resolved
Hide resolved
...raries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexInterpreter.cs
Show resolved
Hide resolved
@@ -1217,6 +1233,8 @@ protected override void Go() | |||
if (len > i && _operator == RegexCode.Notoneloop) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether this should also happen for Notoneloopatomic
.
I would need to spend some time refamiliarizing myself with the code. It would probably be good for @stephentoub to look at it as well as Tanner as he touched it last. |
Not really. It's more of a generalization of #42408 I think. #42408 strictly only optimized a pattern starting with |
If you have the time, I suggest working with this pattern and text: Put a breakpoint at the start of the while loop in |
There are multiple somewhat related optimizations that concern
And that is what I don't currently have clearly understood in my head right now. 😀 |
(2) and (4) are related. No matter the expression, we start at pos X, find the next place the expression could possibly start, run the match there, and if it fails, bump pos X to be X+1 and try again. That's the bump-ahead mechanism. #42408 optimizes a case where the (3) would be if you had a pattern like (1) doesn't require In other words, these are all mostly orthogonal:
|
Thanks, that's helpful.
In general, when a match fails, we inevitably bump 1 forward: whereas the bump ahead mechanism as I recall was an optimization (which I believe I proposed, but have paged out) to restart from further than 1 forward. Is this correct: if you have |
If you're matching against xyabcabcabcabc, you actually need to first try to match starting at the last abc rather than the first, and then if the rest of the pattern can't match there, back up to the next to last abc, and then the next to next to last abc, and so on. But regardless, if you can prove that you can't possibly match starting earlier than X, sure, you can jump to X. #42408 is an example of that for the case where the pattern starts with .*, and you can bump to the next \n rather than +1. In your example, with #42408 you don't even have to try again at y or b, but rather look for the next \n, find it doesn't exist, and you're done. |
...raries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexInterpreter.cs
Outdated
Show resolved
Hide resolved
@stephentoub : I fixed the CI issues. Not super urgent to review this right away. Just making sure it doesn't get lost in your notifications :) |
@pgovind you'll need to ping him when he's back May 21st if you want his review. Maybe one of us can review before then so that you can merge though. |
Ok, sounds good to me. I'll wait for your sign off then. It's not urgent whatsoever, but I don't want the PR to get too stale either |
@stephentoub If possible once you're back, it'd be great to get your review of this before the Preview 6 snap. |
ccd6643
to
d8e73cc
Compare
Ok, this is done now and I've addressed the last comment @stephentoub |
...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs
Outdated
Show resolved
Hide resolved
...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs
Outdated
Show resolved
Hide resolved
...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs
Outdated
Show resolved
Hide resolved
...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs
Outdated
Show resolved
Hide resolved
...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs
Outdated
Show resolved
Hide resolved
...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs
Outdated
Show resolved
Hide resolved
...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs
Outdated
Show resolved
Hide resolved
Can you please make sure we have tests that cover various situations here? e.g.
|
Working on the unit tests. Will have them up tomorrow |
Hello @pgovind! Because this pull request has the p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (
|
@danmoseley : Can I get sign off from you to backport this to P7 please? |
@pgovind You can request backport approval by doing the following:
|
/backport to release/6.0-preview7 |
Started backporting to release/6.0-preview7: https://github.com/dotnet/runtime/actions/runs/1046477355 |
dotnet#51508)" This reverts commit 7eb749c.
I spend the last week looking at some potential optimizations in the RegexInterpreter and found this improvement. This PR doesn't change current behavior and is a straightforward optimization. Here is how it works:
Given a pattern such as
.*foo
and a text such asabfoocde
, theRegexInterpreter
currently sees the.*
and zips to the end of the text. Then we start checking forfoo
from the end and backtrack 1 by 1 frome
tof
until we see thefoo
in the text. At this point we stop and return a match. That turns out to be 6 backtracking (and text compare) operations (e, d, c, o, o, f). With this change, after we zip to the end, we useLastIndexOf
to find the first potential match in the text and reset our current position toLastIndexOf
. If LastIndexOf is -1, we reset to our previous position before we zipped to the end and save all that backtracking work.Required follow up to this PR:
Potential follow up to investigate after this PR:
oneloop
andsetloop
nodes.Fixes
Optimize .*
in #1349Perf numbers on my machine:
There's already ~130 tests with various
.*
patterns, so I'm not adding any new ones yet. I'm investigating if there are potentially interesting patterns that are missing from our unit tests, but I'm reasonably confident that we have a good spread already.cc @tannergooding @danmoseley @jeffhandley