-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
System.Text.RegularExpressions work planned for .NET 7 #62758
Comments
thanks for pulling this together. A user story is essentially a significant unit of user visible value that can be completed independently. So this feels like it can probably actually be several issues each labeled with "user story" eg
All the stories would ultimately be parented by a single Theme, not sure what that will be yet. |
Feature request: can a more helpful message be shown when an OutOfMemoeyException is thrown in the regex source generator? Currently it shows OutOfMemoeyException and the generic message that source generation has failed and the code may not work correctly. It would be helpful to link to best practices to simplify the regex, and another link to show how to increase memory limit Background: I have a 1 million character regex which is a word list in different languages, each word separated by |. It works (slowly) in regex interpreted mode. Hoping to increase performance using source generator but ran into OOM |
Are you able to share your regex? It'd be interesting to see. |
This regex labels each location name it finds with the Geoname ID. The interpreted regex typically runs within 5 seconds for a sentence to 60 seconds for paragraph of about 500 words. Using a regex instead of a loop and IndexOf to take advantage of matching the longest name. E.g. finding the location "New York City" and "New York" will only match "New York City". Cutting down the number of locations to 1/100 of the list below allows the RegexGenerator to run successfully, creating a source file that is 14MB |
How common is it for names of Geoname Id's to be substrings of other geoname Id's? I want to try to understand why you think Regex is the best solution to your problem. |
You are right, turns out Regex isn't the correct solution. A loop on every location name, finding all positions of the location name in text repeatedly using IndexOf, and handling the few cases where locations are substrings of other locations by looking at the positions matched before, runs faster. The Aho–Corasick algorithm mention in https://stackoverflow.com/a/498349/263003 would be the optimization I could use. Originally I thought finding a set of strings within an input text is most straightforward with a Regex. Back to the original comment I made, it would be helpful to offer some advice when bad regex pattern creates OOM. |
That's good feedback, and we can think of something to solve that. I do think it is very rare to have 1 million character RegEx patterns, so I do think that OOM (at least in most of our engines) should be rare and more of an edge case. For .NET 7, we are introducing a new engine: The Non-Backtracking engine, which uses a DFA and NFA based algorithms underneath, which works mainly as a way to ensure linear processing times in terms of the input. This new engine does have a mechanism similar to what you suggest, and will in fact throw an exception when you call the constructor if your pattern is too large and could cause an OOM exception. We can probably think about sharing this concept of catching potential problematic patterns at construction time for the other engines. For more info on the new non-backtracking engine, and for knowledge on the rest of the improvements we've been working on for .NET 7, I would suggest checking out @stephentoub's regex blog post. |
@stephentoub I read through your blog post, as @joperezr suggested, and I think there's a feature missing here that I am somewhat surprised didn't get even punted to a .NET 8 backlog somewhere, or at least deliberately listed as "out of scope" for time constraints: Using Regex named groups, create structured DTOs to save the match data: // Define DTO that holds match results
public partial class RegexDto
{
}
// Define regex
[RegexGenerator("(?<DayOfWeek>Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)", nameof(RegexDto))]
private static partial Regex MyCoolRegex();
// Would generate partial class equal to...
public partial class RegexDto
{
public Group DayOfWeek { get; set; }
}
// Perform match operation
var regexDto = MyCoolRegex().Match(someText);
// New school approach to reading groups
var dayOfWeek = regexDto.DayOfWeek.Success
? Enum.Parse(typeof(DayOfWeek), regexDto.DayOfWeek.Value)
: (DayOfWeek?)null);
// Old school approach to reading groups
var dayOfWeek = regexDto.Groups["DayOfWeek"].Success
? Enum.Parse(typeof(DayOfWeek), regexDto.Groups["DayOfWeek"].Value)
: (DayOfWeek?)null); The Visual Studio IntelliSense algorithm already has some basic knowledge around this, but it's currently strings-only. Here is a screenshot of some stupid regex I wrote, to munge some odd SQL tool data format: Additional benefit would be possible eventual removal of intermediary allocations to Additional benefit would be that IntelliSense doesnt catch you if you copy-paste Regex code from somewhere else, and have a group name that doesnt exist in the regex itself. Additionally, something I have always found rather silly about how Visual Studio pretty prints regexes, is that it doesn't provide an UI to visually aid in seeing each group as a standalone value. It is thus very hard to see whether parens are balanced or not. I've never understood why it just doesn't work like this, with underlines showing the named group zones: Having a slightly different background color for groups would also be pretty helpful. - like Benjamin Moore Yellow Sorbet or Benjamin Moore Sun Dance. The only reason I can think of, is analysis paralysis, where developers wondered how to visualize nested named groups. But I'd be willing to bet that most regex's in the wild are relatively flat in nature, as most people tend to think of "Named Groups as Parser Tokens", including myself. - Anything not in a named group is just "epsilon" or non-material "whitespace". For example, by far the most common use I have for regex's is a simple way to match filenames. I'm guessing this is one of the main uses people have for regexes: matching things like file name extension plus some filename prefix plus some formatted date stamp/datetime stamp munged somewhere before the extension, or some kind of monotonically increasing sequence id. |
This is actually mentioned above: #62447 In general, it would be nice if Aho-Corasick itself were a utility class. I had written a post somewhere on reddit on my experiments creating very fast needle in haystack string search using a parallel, C#-based, Aho-Corasick. The actual use case was for something like this. Except, I rather prefer an interface that models the data structure first, and therefore it built a Trie object, vs. generated a switch statement that sliced through the string as though it were a flattened trie. Having a Trie-centric approach would be more useful for things like IDEs where you'd want to be able to redo an algorithm after some inserts. |
@joperezr, @jeffhandley, I've never been clear on the purpose of these epic issues. Does this need to remain open? As these are just links to other issues, they don't track anything themselves that isn't tracked elsewhere. |
Yes this can be closed. This was more for my personal tracking and planning, but I agree that the remaining work is tracked by individual issues so no point any longer on keeping this open. |
This issue captures the planned work for .NET 7. This list is expected to change throughout the release cycle according to ongoing planning and discussions, with possible additions and subtractions to the scope.
Summary
There are currently several efforts for fixing/improving our RegularExpressions engines. This includes test coverage efforts, perf-enhancing efforts, but a lot of new feature work as well. Even though we have individual issues for most of these efforts, this is meant to serve as an all-up tracking issue which shows all of these efforts and their category. Some issues could probably be placed in more than one category, but I tried to pick whichever category had the highest order bit for each. The list here will be updated with new work as it comes up, as well as with expected shipping milestones when we think some of them will land.
Planned for .NET 7
Main Regex Investments
API Proposals
General Perf Improvements
General Improvements
Improve Test Coverage
Non-Backtracking Engine Fixes and Improvements
Source Generator Fixes and Style Improvements
Performance Regressions That Need Fixing
Performance Investigations
cc: @stephentoub @danmoseley @olsaarik @veanes
The text was updated successfully, but these errors were encountered: