Eliminate backtracking in the interpreter for patterns with .* #51508

Merged
17 commits merged into from
Jul 18, 2021

@pgovind pgovind commented Apr 19, 2021

I spent the last week looking at some potential optimizations in the RegexInterpreter and found this improvement. This PR doesn't change current behavior and is a straightforward optimization. Here is how it works:

Given a pattern such as .*foo and a text such as abfoocde, the RegexInterpreter currently sees the .* and zips to the end of the text (technically, it zips to the first \n character). It then checks for foo from the end, backtracking one character at a time from e to f until it finds foo in the text, at which point it stops and returns a match. That takes 6 backtracking (and text compare) operations (e, d, c, o, o, f). With this change, after zipping to the end, we call LastIndexOf to find the last occurrence of foo, which is the first potential match when scanning backward, and reset our current position to that index. If LastIndexOf returns -1, we reset to the position we were at before zipping to the end, which saves all of that backtracking work.
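To make the difference concrete, here is a minimal sketch of the two strategies described above. It is not the interpreter's code: the class and method names (`BacktrackSketch`, `StepBackward`, `JumpBackward`) are hypothetical, and the `start` parameter stands in for the position the interpreter was at before the .* consumed the text.

```csharp
using System;

static class BacktrackSketch
{
    // Old behavior (illustrative): walk back one position at a time from the
    // end of the text consumed by .*, comparing the literal at each position.
    public static (int Position, int Steps) StepBackward(string text, string literal, int start)
    {
        int steps = 0;
        for (int pos = text.Length - 1; pos >= start; pos--)
        {
            steps++;
            if (pos + literal.Length <= text.Length &&
                string.CompareOrdinal(text, pos, literal, 0, literal.Length) == 0)
            {
                return (pos, steps);
            }
        }
        return (-1, steps);
    }

    // New behavior (illustrative): a single LastIndexOf call jumps straight to
    // the last occurrence of the literal, or reports -1 if there is none.
    public static int JumpBackward(string text, string literal, int start)
    {
        int idx = text.LastIndexOf(literal, StringComparison.Ordinal);
        return idx >= start ? idx : -1;
    }
}
```

For abfoocde and foo, `StepBackward` visits e, d, c, o, o, f (6 steps) before landing on index 2, while `JumpBackward` reaches the same index with one call.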

Required follow up to this PR:

  1. Equivalent changes to RegexCompiler
  2. Add a benchmark to dotnet/performance

Potential follow up to investigate after this PR:

  1. Same optimization for patterns with oneloop and setloop nodes.

Fixes Optimize .* in #1349

Perf numbers on my machine:

_backtracking = new Regex(".*(ss)");
[Benchmark] public void Backtracking() => _backtracking.Match("Essential services are provided by regular exprs.");
> dotnet run --base "D:\repos\before_backtracking\" --diff "D:\repos\after_backtracking\" --threshold 0.001%
summary:
better: 3, geomean: 6.500
total diff: 3

No Slower results for the provided threshold = 0.001% and noise filter = 0.3ns.

| Faster                                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| -------- |
| System.Text.RegularExpressions.Tests.Perf_Regex_Common.Backtracking(Options: Non |      6.56 |          1877.06 |           286.19 | several?|

There are already ~130 tests with various .* patterns, so I'm not adding any new ones yet. I'm investigating whether any potentially interesting patterns are missing from our unit tests, but I'm reasonably confident that we have a good spread already.

cc @tannergooding @danmoseley @jeffhandley

@ghost

ghost commented Apr 19, 2021

Tagging subscribers to this area: @eerhardt, @pgovind
See info in area-owners.md if you want to be subscribed.

Author: pgovind
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: -

@pgovind pgovind force-pushed the explore_better_FindFirstChar branch from 26d39ee to c8f3778 Compare April 19, 2021 18:46
@pgovind pgovind marked this pull request as ready for review April 19, 2021 18:54
@danmoseley
Member

Is this an alternative approach to #42408 ? I need to think about it.

@danmoseley
Member

danmoseley commented Apr 20, 2021

There are no patterns in our regex perf tests that would be impacted by this optimization. You might consider adding one in the perf repo before committing this, so the change before and after is on the record. In fact, given our limited set of perf tests, it might be a good idea for us to always make sure there's a perf test that would benefit before committing any interesting regex optimization. I suggest in this case several variations.

@@ -1217,6 +1233,8 @@ protected override void Go()
if (len > i && _operator == RegexCode.Notoneloop)
Member

I wonder whether this should also happen for Notoneloopatomic.

@danmoseley danmoseley closed this Apr 20, 2021
@danmoseley danmoseley reopened this Apr 20, 2021
@danmoseley
Member

I would need to spend some time refamiliarizing myself with the code. It would probably be good for @stephentoub to look at it as well as Tanner as he touched it last.

@pgovind
Author

pgovind commented Apr 20, 2021

Is this an alternative approach to #42408 ? I need to think about it.

Not really. I think it's more of a generalization of #42408. #42408 strictly only optimized a pattern starting with .*, so a pattern such as hi.*foo would still have had backtracking. With this PR, caching runtextpos in _maxBacktrackPosition after processing hi lets the optimization work anywhere a .* is encountered.
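As a rough illustration of why a cached lower bound matters, the sketch below restricts the backward search to the window the .* actually consumed. This is an assumption-laden model, not the PR's code: `BoundedBacktrack`, `LastCandidate`, `lowerBound`, and `upperBound` are hypothetical names, with `lowerBound` playing the role described for the cached position after hi has been matched.

```csharp
using System;

static class BoundedBacktrack
{
    // Search for the literal only inside [lowerBound, upperBound), i.e. the
    // text consumed by .*, so an occurrence before the prefix is never used.
    public static int LastCandidate(string text, string literal, int lowerBound, int upperBound)
    {
        ReadOnlySpan<char> window = text.AsSpan(lowerBound, upperBound - lowerBound);
        int idx = window.LastIndexOf(literal.AsSpan());
        return idx < 0 ? -1 : idx + lowerBound; // translate back to an index in text
    }
}
```

With hi.*foo against hifooabcd, the window starts at index 2 and the candidate is index 2. Against foohi-bar, the only foo sits before hi, so the bounded search correctly finds nothing.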

@pgovind
Author

pgovind commented Apr 20, 2021

I would need to spend some time refamiliarizing myself with the code.

If you have the time, I suggest working with this pattern and text:
Pattern: hi.*foo
Text: hifooabcd

Put a breakpoint at the start of the while loop in Go() and breakpoints in case RegexCode.Notoneloop:, case RegexCode.Multi: and case RegexCode.Notoneloop | RegexCode.Back: to see the backtracking in action.
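A small harness for that debugging session could look like the following. The class name `InterpreterRepro` is hypothetical. The regex is constructed without RegexOptions.Compiled so that matching goes through the interpreter.

```csharp
using System;
using System.Text.RegularExpressions;

static class InterpreterRepro
{
    // Runs the suggested pattern against the suggested text.
    public static Match Run()
    {
        var regex = new Regex("hi.*foo", RegexOptions.None);
        return regex.Match("hifooabcd");
    }
}
```

Calling `InterpreterRepro.Run()` should return a successful match with the value hifoo at index 0. Stepping into Go() itself requires debugging against a local build of the library sources.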

@danmoseley
Member

There are multiple somewhat related optimizations that concern .*

  1. This
  2. Remove implicit anchoring optimization from Regex #42408 proposal
  3. Auto atomification
  4. The bump ahead mechanism.

And that is what I don't currently have clearly understood in my head right now. 😀

@stephentoub
Member

stephentoub commented Apr 20, 2021

There are multiple somewhat related optimizations that concern .*
And that is what I don't currently have clearly understood in my head right now.

(2) and (4) are related. No matter the expression, we start at pos X, find the next place the expression could possibly start, run the match there, and if it fails, bump pos X to be X+1 and try again. That's the bump-ahead mechanism. #42408 optimizes a case where the .* occurs at the beginning of the pattern; if the .* fails to match, we know it can't possibly match until the next newline (as that's where the .* stops), so there's no point in trying again until that point, and rather than bumping by 1, we can bump until the next \n.

(3) would be if you had a pattern like .*\n, in which case the .* could become atomic because there's nothing it could "give up" that would match \n, since .*\n is (by default) the same as [^\n]*\n.

(1) doesn't require .*abc to be at the beginning of the string; it's just vectorizing via LastIndexOf the search for "abc" backwards through the search space carved out by the .*... normally we'd back up by 1 to try to match the remainder of the pattern, and this instead searches for the next backtracking spot faster.

In other words, these are all mostly orthogonal:

  • (2)/(4) reduce the number of places we need to run the whole Go routine
  • (3) avoids having to backtrack into a .* in some cases
  • (1) makes it faster to backtrack into a .*.
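Two of these points can be checked or modeled in a few lines. The sketch below is illustrative only: `DotStarNotes`, `SameMatch`, and `NextStartAfterFailure` are hypothetical names. The second method assumes the retry position is just past the next \n, which is one reading of "bump until the next \n" and not necessarily what the runtime does.

```csharp
using System;
using System.Text.RegularExpressions;

static class DotStarNotes
{
    // (3): with default options '.' excludes '\n', so .*\n and [^\n]*\n
    // should produce the same match on any input.
    public static bool SameMatch(string input)
    {
        Match a = Regex.Match(input, @".*\n");
        Match b = Regex.Match(input, @"[^\n]*\n");
        return a.Success == b.Success && a.Index == b.Index && a.Value == b.Value;
    }

    // (2)/(4): if a pattern starting with .* fails at failedStart, no start
    // position before the next '\n' can succeed, so skip past it (or give up).
    public static int NextStartAfterFailure(string input, int failedStart)
    {
        int newline = input.IndexOf('\n', failedStart);
        return newline < 0 ? -1 : newline + 1;
    }
}
```

For xyabc there is no newline, so `NextStartAfterFailure("xyabc", 0)` returns -1, meaning there is nowhere left to retry.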

@danmoseley
Member

Thanks, that's helpful.

we start at pos X, find the next place the expression could possibly start, run the match there, and if it fails, bump pos X to be X+1 and try again. That's the bump-ahead mechanism

In general, when a match fails, we inevitably bump 1 forward: whereas the bump ahead mechanism as I recall was an optimization (which I believe I proposed, but have paged out) to restart from further than 1 forward. Is this correct: if you have .*abc against xyabc this would run to the first abc and continue, and if that match fails, continue from the b rather than the y ?

@stephentoub
Member

Is this correct: if you have .*abc against xyabc this would run to the first abc and continue, and if that match fails, continue from the b rather than the y ?

If you're matching against xyabcabcabcabc, you actually need to first try to match starting at the last abc rather than the first, and then if the rest of the pattern can't match there, back up to the next to last abc, and then the next to next to last abc, and so on.

But regardless, if you can prove that you can't possibly match starting earlier than X, sure, you can jump to X. #42408 is an example of that for the case where the pattern starts with .*, and you can bump to the next \n rather than +1. In your example, with #42408 you don't even have to try again at y or b, but rather look for the next \n, find it doesn't exist, and you're done.
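The last-to-first order described for xyabcabcabcabc can be sketched by calling LastIndexOf repeatedly over a shrinking window. This is only a model of the candidate order, with hypothetical names (`CandidateOrder`, `FromLastToFirst`). The real engine would try the rest of the pattern at each candidate and stop at the first success.

```csharp
using System;
using System.Collections.Generic;

static class CandidateOrder
{
    // Lists the start indices of the literal from the last occurrence to the
    // first, which is the order greedy backtracking would try them in.
    public static List<int> FromLastToFirst(string text, string literal)
    {
        if (literal.Length == 0) throw new ArgumentException("literal must be non-empty");

        var positions = new List<int>();
        ReadOnlySpan<char> window = text.AsSpan();
        while (true)
        {
            int idx = window.LastIndexOf(literal.AsSpan());
            if (idx < 0) break;
            positions.Add(idx);
            // Shrink the window so the next search can only find an earlier
            // occurrence (overlapping occurrences are still allowed).
            window = window.Slice(0, idx + literal.Length - 1);
        }
        return positions;
    }
}
```

For xyabcabcabcabc and abc, this yields the indices 11, 8, 5, 2.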

@jeffhandley jeffhandley added this to the 6.0.0 milestone Apr 21, 2021
@pgovind
Author

pgovind commented Apr 28, 2021

@stephentoub : I fixed the CI issues. Not super urgent to review this right away. Just making sure it doesn't get lost in your notifications :)

@danmoseley
Member

@pgovind you'll need to ping him when he's back May 21st if you want his review. Maybe one of us can review before then so that you can merge though.

@pgovind
Author

pgovind commented Apr 28, 2021

Maybe one of us can review before then so that you can merge though

Ok, sounds good to me. I'll wait for your sign off then. It's not urgent whatsoever, but I don't want the PR to get too stale either

@jeffhandley
Member

@stephentoub If possible once you're back, it'd be great to get your review of this before the Preview 6 snap.

@pgovind pgovind force-pushed the explore_better_FindFirstChar branch from ccd6643 to d8e73cc Compare July 14, 2021 19:07
@pgovind
Author

pgovind commented Jul 14, 2021

is that feasible?

Ok, this is done now and I've addressed the last comment @stephentoub

@stephentoub
Member

Can you please make sure we have tests that cover various situations here? e.g.

  • .* followed by something other than a string and then a string
  • .* followed by a string run against input that can match that string in multiple locations and where the last occurrence may or may not result in a match for the whole pattern
  • .* followed by a string matched with and without case sensitivity and with and without RTL
    etc.
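As a sketch of what inputs for those situations might look like, the cases below use the public Regex API only. They are illustrative and are not the tests that were added to the PR. `DotStarCases`, `Run`, and the specific patterns and inputs are assumptions chosen for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class DotStarCases
{
    // One entry per situation: pattern, input, options.
    public static readonly (string Pattern, string Input, RegexOptions Options)[] Cases =
    {
        // .* followed by a non-literal (\d+) and then a literal
        (@".*\d+foo", "a1foo b22foo c", RegexOptions.None),
        // literal occurs several times; the last occurrence does not complete the match
        (@".*foo\d", "foo1 foo2 foox", RegexOptions.None),
        // same pattern with and without case sensitivity
        (".*foo", "abFOOcd", RegexOptions.None),
        (".*foo", "abFOOcd", RegexOptions.IgnoreCase),
        // right-to-left matching
        (".*foo", "abfoocde", RegexOptions.RightToLeft),
    };

    // Returns the matched text, or "<no match>" when the pattern does not match.
    public static string Run(string pattern, string input, RegexOptions options = RegexOptions.None)
    {
        Match m = Regex.Match(input, pattern, options);
        return m.Success ? m.Value : "<no match>";
    }
}
```

The second case is the interesting one for this optimization: the search has to skip the trailing foox and settle on the earlier foo2.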

@pgovind
Author

pgovind commented Jul 16, 2021

Working on the unit tests. Will have them up tomorrow

@ghost

ghost commented Jul 16, 2021

Hello @pgovind!

Because this pull request has the auto-merge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started! Learn more here.

@ghost ghost merged commit 7eb749c into dotnet:main Jul 18, 2021
@pgovind
Author

pgovind commented Jul 19, 2021

@danmoseley : Can I get sign off from you to backport this to P7 please?

@jeffhandley
Member

@pgovind You can request backport approval by doing the following:

  • Use the backport bot to create the backport PR to the preview7 branch
  • Fill out the template, especially highlighting the test coverage and risk potential
  • Email tactics with a link to the backport PR and copy the PR description into the email
  • CC Dan, myself, and Stephen

@pgovind
Author

pgovind commented Jul 19, 2021

/backport to release/6.0-preview7

@github-actions
Contributor

Started backporting to release/6.0-preview7: https://github.com/dotnet/runtime/actions/runs/1046477355

stephentoub added a commit that referenced this pull request Jul 20, 2021
pgovind pushed a commit to pgovind/runtime that referenced this pull request Jul 20, 2021
stephentoub added a commit that referenced this pull request Jul 20, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Aug 18, 2021
This pull request was closed.