System.Text.RegularExpressions work planned for .NET 7 #62758

joperezr · 2021-12-14T00:44:17Z

This issue captures the planned work for .NET 7. This list is expected to change throughout the release cycle according to ongoing planning and discussions, with possible additions and subtractions to the scope.

Summary

There are currently several efforts for fixing/improving our RegularExpressions engines. This includes test coverage efforts, perf-enhancing efforts, but a lot of new feature work as well. Even though we have individual issues for most of these efforts, this is meant to serve as an all-up tracking issue which shows all of these efforts and their category. Some issues could probably be placed in more than one category, but I tried to pick whichever category had the highest order bit for each. The list here will be updated with new work as it comes up, as well as with expected shipping milestones when we think some of them will land.

Planned for .NET 7

Main Regex Investments

P0: [NonBacktracking] [API Proposal]: RegexOptions.Constrained #57891
P0: [Source Generator] Developers can precompile their Regex code for faster startup #44676
P0: [All Engines] Augment Regex extensibility point for better perf and span-based matching #59629
P0: [All Engines] Overhaul Regex's handling of RegexOptions.IgnoreCase #61048
P0: [All Engines] [API Proposal] Add cultureName constructors to GeneratedRegex #59492
P1: [All Engines] Improve Regex support for automatically making loops atomic #62449
P1: [All Engines] [API Proposal]: Add Regex.Enumerate(ReadOnlySpan<char>) which is allocation free #65011
P2: [Source Generator with NonBacktracking] Add RegexGenerator support for RegexOptions.NonBacktracking #61903
P1: Analyzer/fixer for converting Regex use to source generator #68147

API Proposals

General Perf Improvements

General Improvements

Improve Test Coverage

Non-Backtracking Engine Fixes and Improvements

Source Generator Fixes and Style Improvements

Performance Regressions That Need Fixing

P0: Regressions in System.Text.RegularExpressions.Tests.Perf_Regex_Common #61972

Performance Investigations

cc: @stephentoub @danmoseley @olsaarik @veanes

ghost · 2021-12-14T00:44:22Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Summary

There are currently several efforts for fixing/improving our RegularExpressions engines. This includes test coverage efforts, perf-enhancing efforts, but a lot of new feature work as well. Even though we have individual issues for most of these efforts, this is meant to serve as an all-up tracking issue which shows all of these efforts and their category. Some issues could probably be placed in more than one category, but I tried to pick whichever category had the highest order bit for each. The list here will be updated with new work as it comes up, as well as with expected shipping milestones when we think some of them will land.

Main Regex Investments

[Non-backtracking Engine] Investigate adding capture support for RegexOptions.NonBacktracking
[All Engines] Augment Regex extensibility point for better perf and span-based matching
[All Engines] Improve Regex support for automatically making loops atomic
[Source Generator with NonBacktracking] Add RegexGenerator support for RegexOptions.NonBacktracking
[All Engines] Overhaul Regex's handling of RegexOptions.IgnoreCase
[All Engines] Determine if/how to handle CultureInfo better in Regex (and RegexGenerator)
[Source Generator] Developers can precompile their Regex code for faster startup

API Proposals

General Perf Improvements

General Improvements

Improve Test Coverage

Non-Backtracking Engine Fixes and Improvements

Source Generator Fixes and Style Improvements

Performance Regressions That Need Fixing

Regressions in System.Text.RegularExpressions.Tests.Perf_Regex_Common

Performance Investigations

cc: @stephentoub @danmoseley @olsaarik @veanes

Author:	joperezr
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`, `tracking`
Milestone:	7.0.0

danmoseley · 2021-12-14T01:11:12Z

thanks for pulling this together.

A user story is essentially a significant unit of user visible value that can be completed independently.

So this feels like it can probably actually be several issues each labeled with "user story" eg

Predictable regular expression performance, using a non-backtracking engine
Faster, no-JIT regular expressions, using a source generator
Higher performance regular expression matching // random perf stuff
More robust regular expression engines // for stuff like test coverage improvements, general fixes and robustness?

All the stories would ultimately be parented by a single Theme, not sure what that will be yet.

huan086 · 2022-06-16T17:31:18Z

Feature request: can a more helpful message be shown when an OutOfMemoeyException is thrown in the regex source generator? Currently it shows OutOfMemoeyException and the generic message that source generation has failed and the code may not work correctly. It would be helpful to link to best practices to simplify the regex, and another link to show how to increase memory limit

Background: I have a 1 million character regex which is a word list in different languages, each word separated by |. It works (slowly) in regex interpreted mode. Hoping to increase performance using source generator but ran into OOM

stephentoub · 2022-06-20T22:04:12Z

I have a 1 million character regex which is a word list in different languages

Are you able to share your regex? It'd be interesting to see.

huan086 · 2022-06-21T03:39:17Z

I have a 1 million character regex which is a word list in different languages

Are you able to share your regex? It'd be interesting to see.

This regex labels each location name it finds with the Geoname ID. The interpreted regex typically runs within 5 seconds for a sentence to 60 seconds for paragraph of about 500 words.

Using a regex instead of a loop and IndexOf to take advantage of matching the longest name. E.g. finding the location "New York City" and "New York" will only match "New York City".

Cutting down the number of locations to 1/100 of the list below allows the RegexGenerator to run successfully, creating a source file that is 14MB

https://1drv.ms/t/s!AoUkMG56ao1eq_A4tUTHghOmnndp8A?e=zFPSgH

joperezr · 2022-06-22T22:03:59Z

How common is it for names of Geoname Id's to be substrings of other geoname Id's? I want to try to understand why you think Regex is the best solution to your problem.

huan086 · 2022-06-23T07:47:17Z

How common is it for names of Geoname Id's to be substrings of other geoname Id's? I want to try to understand why you think Regex is the best solution to your problem.

You are right, turns out Regex isn't the correct solution. A loop on every location name, finding all positions of the location name in text repeatedly using IndexOf, and handling the few cases where locations are substrings of other locations by looking at the positions matched before, runs faster. The Aho–Corasick algorithm mention in https://stackoverflow.com/a/498349/263003 would be the optimization I could use.

Originally I thought finding a set of strings within an input text is most straightforward with a Regex.

Back to the original comment I made, it would be helpful to offer some advice when bad regex pattern creates OOM.

joperezr · 2022-06-27T15:53:03Z

it would be helpful to offer some advice when bad regex pattern creates OOM.

That's good feedback, and we can think of something to solve that. I do think it is very rare to have 1 million character RegEx patterns, so I do think that OOM (at least in most of our engines) should be rare and more of an edge case. For .NET 7, we are introducing a new engine: The Non-Backtracking engine, which uses a DFA and NFA based algorithms underneath, which works mainly as a way to ensure linear processing times in terms of the input. This new engine does have a mechanism similar to what you suggest, and will in fact throw an exception when you call the constructor if your pattern is too large and could cause an OOM exception. We can probably think about sharing this concept of catching potential problematic patterns at construction time for the other engines. For more info on the new non-backtracking engine, and for knowledge on the rest of the improvements we've been working on for .NET 7, I would suggest checking out @stephentoub's regex blog post.

jzabroski · 2022-07-18T18:20:57Z

@stephentoub I read through your blog post, as @joperezr suggested, and I think there's a feature missing here that I am somewhat surprised didn't get even punted to a .NET 8 backlog somewhere, or at least deliberately listed as "out of scope" for time constraints:

Using Regex named groups, create structured DTOs to save the match data:

// Define DTO that holds match results
public partial class RegexDto
{
}

// Define regex
[RegexGenerator("(?<DayOfWeek>Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)", nameof(RegexDto))]
private static partial Regex MyCoolRegex();

// Would generate partial class equal to...
public partial class RegexDto
{
  public Group DayOfWeek { get; set; } 
}

// Perform match operation 
var regexDto = MyCoolRegex().Match(someText);

// New school approach to reading groups
var dayOfWeek = regexDto.DayOfWeek.Success
    ? Enum.Parse(typeof(DayOfWeek), regexDto.DayOfWeek.Value)
    : (DayOfWeek?)null);
// Old school approach to reading groups
var dayOfWeek = regexDto.Groups["DayOfWeek"].Success
    ? Enum.Parse(typeof(DayOfWeek), regexDto.Groups["DayOfWeek"].Value)
    : (DayOfWeek?)null);

The Visual Studio IntelliSense algorithm already has some basic knowledge around this, but it's currently strings-only. Here is a screenshot of some stupid regex I wrote, to munge some odd SQL tool data format:

Additional benefit would be possible eventual removal of intermediary allocations to GroupCollection and CaptureCollection.

Additional benefit would be that IntelliSense doesnt catch you if you copy-paste Regex code from somewhere else, and have a group name that doesnt exist in the regex itself.

Additionally, something I have always found rather silly about how Visual Studio pretty prints regexes, is that it doesn't provide an UI to visually aid in seeing each group as a standalone value. It is thus very hard to see whether parens are balanced or not. I've never understood why it just doesn't work like this, with underlines showing the named group zones:

Having a slightly different background color for groups would also be pretty helpful. - like Benjamin Moore Yellow Sorbet or Benjamin Moore Sun Dance.

The only reason I can think of, is analysis paralysis, where developers wondered how to visualize nested named groups. But I'd be willing to bet that most regex's in the wild are relatively flat in nature, as most people tend to think of "Named Groups as Parser Tokens", including myself. - Anything not in a named group is just "epsilon" or non-material "whitespace". For example, by far the most common use I have for regex's is a simple way to match filenames. I'm guessing this is one of the main uses people have for regexes: matching things like file name extension plus some filename prefix plus some formatted date stamp/datetime stamp munged somewhere before the extension, or some kind of monotonically increasing sequence id.

jzabroski · 2022-07-18T18:42:36Z

The Aho–Corasick algorithm mention in https://stackoverflow.com/a/498349/263003 would be the optimization I could use.

This is actually mentioned above: #62447

In general, it would be nice if Aho-Corasick itself were a utility class. I had written a post somewhere on reddit on my experiments creating very fast needle in haystack string search using a parallel, C#-based, Aho-Corasick. The actual use case was for something like this. Except, I rather prefer an interface that models the data structure first, and therefore it built a Trie object, vs. generated a switch statement that sliced through the string as though it were a flattened trie. Having a Trie-centric approach would be more useful for things like IDEs where you'd want to be able to redo an algorithm after some inserts.

stephentoub · 2022-07-25T15:55:33Z

@joperezr, @jeffhandley, I've never been clear on the purpose of these epic issues. Does this need to remain open? As these are just links to other issues, they don't track anything themselves that isn't tracked elsewhere.

joperezr · 2022-07-25T16:34:49Z

Yes this can be closed. This was more for my personal tracking and planning, but I agree that the remaining work is tracked by individual issues so no point any longer on keeping this open.

joperezr added area-System.Text.RegularExpressions tracking This issue is tracking the completion of other related issues. labels Dec 14, 2021

joperezr added this to the 7.0.0 milestone Dec 14, 2021

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Dec 14, 2021

joperezr removed the untriaged New issue has not been triaged by the area owner label Dec 14, 2021

joperezr self-assigned this Dec 14, 2021

jeffhandley added Team:Libraries Priority:1 Work that is critical for the release, but we could probably ship without Epic Groups multiple user stories. Can be grouped under a theme. and removed tracking This issue is tracking the completion of other related issues. labels Jan 10, 2022

ericstj mentioned this issue Jan 12, 2022

Libraries continuous improvement work planned for .NET 7 #57209

Closed

19 tasks

jeffhandley changed the title ~~Planned work for System.Text.RegularExpressions for 7.0.0~~ System.Text.RegularExpressions work planned for .NET 7 Jan 14, 2022

acasorran mentioned this issue Apr 22, 2022

Take advantage of performance improvement using .NET 7 regex source generator nopSolutions/nopCommerce#6241

Closed

joperezr closed this as completed Jul 25, 2022

ghost locked as resolved and limited conversation to collaborators Aug 27, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

System.Text.RegularExpressions work planned for .NET 7 #62758

System.Text.RegularExpressions work planned for .NET 7 #62758

joperezr commented Dec 14, 2021 •

edited by danmoseley

Loading

ghost commented Dec 14, 2021

Summary

Main Regex Investments

API Proposals

General Perf Improvements

General Improvements

Improve Test Coverage

Non-Backtracking Engine Fixes and Improvements

Source Generator Fixes and Style Improvements

Performance Regressions That Need Fixing

Performance Investigations

danmoseley commented Dec 14, 2021

huan086 commented Jun 16, 2022

stephentoub commented Jun 20, 2022

huan086 commented Jun 21, 2022

joperezr commented Jun 22, 2022

huan086 commented Jun 23, 2022

joperezr commented Jun 27, 2022 •

edited

Loading

jzabroski commented Jul 18, 2022 •

edited

Loading

jzabroski commented Jul 18, 2022

stephentoub commented Jul 25, 2022

joperezr commented Jul 25, 2022

System.Text.RegularExpressions work planned for .NET 7 #62758

System.Text.RegularExpressions work planned for .NET 7 #62758

Comments

joperezr commented Dec 14, 2021 • edited by danmoseley Loading

Summary

Planned for .NET 7

Main Regex Investments

API Proposals

General Perf Improvements

General Improvements

Improve Test Coverage

Non-Backtracking Engine Fixes and Improvements

Source Generator Fixes and Style Improvements

Performance Regressions That Need Fixing

Performance Investigations

ghost commented Dec 14, 2021

Summary

Main Regex Investments

API Proposals

General Perf Improvements

General Improvements

Improve Test Coverage

Non-Backtracking Engine Fixes and Improvements

Source Generator Fixes and Style Improvements

Performance Regressions That Need Fixing

Performance Investigations

danmoseley commented Dec 14, 2021

huan086 commented Jun 16, 2022

stephentoub commented Jun 20, 2022

huan086 commented Jun 21, 2022

joperezr commented Jun 22, 2022

huan086 commented Jun 23, 2022

joperezr commented Jun 27, 2022 • edited Loading

jzabroski commented Jul 18, 2022 • edited Loading

jzabroski commented Jul 18, 2022

stephentoub commented Jul 25, 2022

joperezr commented Jul 25, 2022

joperezr commented Dec 14, 2021 •

edited by danmoseley

Loading

joperezr commented Jun 27, 2022 •

edited

Loading

jzabroski commented Jul 18, 2022 •

edited

Loading