Developers can precompile their Regex code for faster startup #44676

danmoseley · 2020-11-14T01:19:46Z

AB#1255830
Regex has an interpreted mode and a compiled mode. The compiled mode takes longer to start, but is generally faster. Some users want both startup time and performance; other users want to run on a platform where JITted code is not allowed, and also want performance. For those users, .NET Framework allowed the generated code to be saved out but .NET Core/.NET 5 does not plan to support saving emitted IL.

Instead, we could develop a Roslyn Source Generator that would generate the necessary code (as C#) at compile time.

Essentially API should be need on Regex, as the generated code would hook into it in the regular way, by deriving from RegexRunner and implementing FindFirstChar() and Go(). This API is all public/protected today (indeed it happens that .NET Core can load regexes written out by .NET Framework.)

TBD the user experience - eg., how one must annotate regexes in order to trigger the generator - assuming it doesn't attempt to read all the code to try to infer them - and how the generated code is wired up.

For example, one could imagine requiring an annotation like this (we could also expose API on the source generator so it could consume a list of regexes, for example)

[RegexSourceGenerator(“(?:[0-9]{1,3}\.){3}[0-9]{1,3}”, RegexOptions.IgnoreCase)]
public partial Regex CreateIpAddressRegex();

This new attribute would need to be exposed somewhere, probably on the Regex assembly since the code being compiled would be referencing that anyway.

Also TBD the implementation details necessary to wire into the existing Regex implementation - this is probably most of the work.

cc @pgovind @eerhardt @stephentoub this is just a summary of our email thread. Anything to add?

In terms of customers, we have at least one major 1st party service that heavily uses regex and highly values both throughput and startup time. This is probably P1 for them.

@danroth27 in a Blazor app, can I assume that most users in the .NET 6 timeframe will be content with either using Javascript regexes, or interpreted mode .NET regexes? In my mind this is P2/P3.

@marek-safar similar question for Xamarin, on platforms that don't JIT, Xamarin apps presumably currently interpret the ref-emitted code, or use the regex interpreted mode - and this would not be a big win for them - we had talked about P2.

The text was updated successfully, but these errors were encountered:

ghost · 2020-11-14T01:19:49Z

Tagging @eerhardt, @pgovind, @jeffhandley as subscribed to this area.
See info in area-owners.md if you want to be subscribed.

Issue Details

Description:	Regex has an interpreted mode and a compiled mode. The compiled mode takes longer to start, but is generally faster. Some users want both startup time and performance; other users want to run on a platform where JITted code is not allowed, and also want performance. For those users, .NET Framework allowed the generated code to be saved out but .NET Core/.NET 5 does not plan to support saving emitted IL. Instead, we could develop a Roslyn Source Generator that would generate the necessary code (as C#) at compile time. Essentially API should be need on Regex, as the generated code would hook into it in the regular way, by deriving from RegexRunner and implementing FindFirstChar() and Go(). This API is all public/protected today (indeed it happens that .NET Core can load regexes written out by .NET Framework.) TBD the user experience - eg., how one must annotate regexes in order to trigger the generator - assuming it doesn't attempt to read all the code to try to infer them. For example, one could imagine requiring an annotation like this (we could also expose API on the source generator so it could consume a list of regexes, for example) `[RegexSourceGenerator(“[a-f0-9]*”, RegexOptions.IgnoreCase)] public partial Regex CreateExpression();` This new attribute would need to be exposed somewhere, probably on the Regex assembly since the code being compiled would be referencing that anyway. Also TBD the implementation details necessary to wire into the existing Regex implementation - this is probably most of the work. cc @pgovind @eerhardt @stephentoub this is just a summary of our email thread. Anything to add? In terms of customers, we have at least one major 1st party service that heavily uses regex and highly values both throughput and startup time. This is probablyP1 for them. @danroth27 in a Blazor app, can I assume that most users in the .NET 6 timeframe will be content with either using Javascript regexes, or interpreted mode .NET regexes? In my mind this is P2/P3. @marek-safar similar question for Xamarin, on platforms that don't JIT, Xamarin apps presumably currently interpret the ref-emitted code, or use the regex interpreted mode - and this would not be a big win for them - we had talked about P2.
Author:	danmosemsft
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`, `untriaged`
Milestone:	-

danmoseley · 2020-11-14T01:20:58Z

cc @mjsabby

am11 · 2020-11-14T05:34:55Z

With source generator based implementation, will the difference between compiled and interpreted regex modes disappear; i.e. they both will endup using the same 'generated managed code' version?

danmoseley · 2020-11-14T15:55:20Z

@am11 if you enable it, and presumably annotate appropriately.

There should already be no observable difference, other than perf.

am11 · 2020-11-14T16:59:48Z

Great! In addition to annotations (or instead of, if low granularity is not required?), AppContext setting approach could also be useful here, for the end-users to specify project-wide choice.

<!-- enduser-lib.csrpoj -->
<ItemGroup>
  <!-- existing options, where we also do not care about low granularity by means of annotations -->
  <RuntimeHostConfigurationOption Include="System.Globalization.Invariant" Value="true" />
  <RuntimeHostConfigurationOption Include="System.Globalization.UseNls" Value="true" />
  <RuntimeHostConfigurationOption Include="System.Threading.ThreadPool.MaxThreads" Value="true" />
  <!-- etc. -->

  <!-- new, could be annotations and this option, or just this option -->
  <RuntimeHostConfigurationOption Include="System.Text.RegularExpressions.UseSourceGenerated" Value="true" />
</ItemGroup>

svick · 2020-11-15T06:21:37Z

@am11 I'm not sure it's feasible to make code that uses existing Regex members use source generated implementation.

stephentoub · 2020-11-15T14:58:06Z

I'm not sure it's feasible to make code that uses existing Regex members use source generated implementation.

There are technically ways, but I currently don't think we should pursue them.

For example, imagine we added a static method like:

public abstract class RegexRunnerFactory
{
    public static void Register(string pattern, RegexOptions options, Func<RegexRunnerFactory> createFactory);
}

and then modified the Regex implementation to consult upon Regex construction a dictionary of registered runner factories populated by Register. That would enable any Regex usage to instead be backed by something provided via a Register call. Then in addition to the regex source generator spitting out the relevant types, it would also spit out a module initializer that called Register, e.g.

var r = new Regex("[a-f0-9]*", RegexOptions.IgnoreCase);

would be recognized by the source generator and would spit out:

[ModuleInitializer]
internal static void Initialize1() => RegexRunnerFactory.Register("[a-f0-9]*", RegexOptions.IgnoreCase, () => new RegexRunner1Factory());

where RegexRunner1Factory contains/references all the generated regex code.

But there are a variety of downsides to an approach like this, including but not limited to it bloating the IL with the regex code for all regexes and not just those a developer opted-in to having be source generated. I'm currently of the opinion it's better to opt-in on a case-by-case basis and have it be clear in the source what's resulting in a potentially huge amount of IL being spit into the binary.

danmoseley · 2020-11-16T01:36:14Z

Note that the original proposal would require existing code to be refactored

Places where commonly used regexes are newed up need to call the factory method instead
Places that use the static regex methods (Match, Matches, Replace, IsMatch, Split) need to be modified to call the factory method and call instance methods on the result instead.

In the latter case, there is currently an internal cache of Regex objects (by default 15) that is consulted by the static methods; by changing to the factory method you lose the benefit of that cache, but of course most of the purpose of that cache is to help avoid re-generating the IL, which would no longer happen anyway.

If it was important to avoid the refactorings above (eg., so that you could "warm up" the key regexes you plan to use in your startup code, say, without changing existing points of use) it would be easy to expose API for the generated code to add its Regex object to the existing cache (for static methods) or add its runner to a new RegexRunnerCache (for the instance methods, or possibly both). But I assume anyone opting into this is fine changing their existing points of use.

marek-safar · 2020-11-16T12:17:20Z

@danmosemsft I don't have any customer evidence that this is an important scenario. You are correct that by default we run in non-SRE mode on all existing mobile configurations

danmoseley · 2020-11-16T23:25:35Z

As for existing code that uses the .NET Framework ability to write out compiled regexes (like @mjsabby ) -- using CompileToAssembly -- that would have to be modified; instead of newing up the type derived from Regex, it would have to be refactored to use the factory method. I think that's fine -- I don't see any particular advantage of a type over a factory method, and an annotated partial method is more convenient way to hook into compilation.

stephentoub · 2020-11-16T23:45:36Z

it would have to be refactored to use the factory method

Which, for a scenario involving dynamically generating lots of regexes, might mean generating source that in turn relies on a source generator.

svick · 2020-11-17T07:50:57Z

@stephentoub Except one source generator can't depend on another source generator: dotnet/roslyn#48358.

stephentoub · 2020-11-17T10:13:18Z

Except one source generator can't depend on another source generator

Needn't be an actual source generator, just a program that spits out C# source that uses the regex source generator. I was careful in my wording above ;-)

nbevans · 2020-11-17T13:23:35Z

How will this work for F#?

stephentoub · 2020-11-17T13:32:00Z

How will this work for F#?

F# continues to support Regex the same ways it always has. The feature here is an additional mechanism relying on and specific to Roslyn.

7sharp9 · 2020-11-17T14:07:57Z

In other words F# is completly excluded from this optimisation.

jaredpar · 2020-11-17T16:03:28Z

This new attribute would need to be exposed somewhere, probably on the Regex assembly since the code being compiled would be referencing that anyway.

This doesn't need to be defined ahead of time. It could be one of the outputs of the source generator.

nbevans · 2020-11-17T16:19:02Z

How will this work for F#?

F# continues to support Regex the same ways it always has. The feature here is an additional mechanism relying on and specific to Roslyn.

Will the dotnet/fsharp project be involved at all about these ideas? Why is the dotnet/runtime project doing work for dotnet/roslyn project?

So many questions...

stephentoub · 2020-11-17T16:23:45Z

This doesn't need to be defined ahead of time. It could be one of the outputs of the source generator.

But if the attribute is the thing the source generator keys off of, you wouldn't get IntelliSense when authoring use of the attribute, would you?

cartermp · 2020-11-17T17:01:38Z

@nbevans F# and source generators are tracked here: fsharp/fslang-suggestions#864

I encourage questions like this (what about F#?) to be routed there, because there are several options and we'll take the one that is the best for F# when we consider the state of the .NET ecosystem.

This issue tracks adding a way to push regex compilation into compile-time, which is inherently going to depend on how a language can support that kind of scenario. For C#, it is via a C# Source Generator that the Roslyn C# compiler supports. That's what this means:

The feature here is an additional mechanism relying on and specific to Roslyn.

As mentioned, this doesn't somehow exclude F# from using the Regex API. It is an additional mechanism that optimizes an existing code path. That's the general model for source generators, at least certainly for something like the runtime where there is an extremely high bar for compatibility.

If you're concerned about F# support for source generators then the issue I linked is the best place to engage.

mjsabby · 2020-11-17T21:35:15Z

@stephentoub Correct we'd write a tool to spit out C#, although you could imagine we could just as well spit out what the source generator would, but perhaps it's easier to leverage the work being done here. I imagine this issue covers changes in the public surface area and not the specific way of invocation?

I'm sure sure you know that and I'm probably imagining reticence in your comment that isn't actually there, but I'm chiming in per the tag that this will be quite useful to us.

stephentoub · 2020-11-17T21:42:10Z

although you could imagine we could just as well spit out what the source generator would

I assume you mean if there were a programmatic API for invoking the generator you'd use it, not that you'd implement your own equivalent generator (if you actually mean the latter, please first contribute that implementation 😉)

stephentoub · 2021-09-09T22:40:25Z

How is this more of an issue than with Regex.CompileToAssembly on .NET Framework?

GrabYourPitchforks · 2021-09-09T22:48:44Z

How is this more of an issue than with Regex.CompileToAssembly on .NET Framework?

It's not. But I believe with what we know now, with the various MSRCs which have afflicted Regex over the years (and the fact that we even implemented big-hammer timeout functionality!), we would have done the same check in Regex.CompileToAssembly's compiled code had we been able to go back in time.

GrabYourPitchforks · 2021-09-09T23:06:01Z

A concrete example: dotnet/corefx#38091, which was a fix for CVE-2019-0820.

If regex source generators were in common use, would we have been able to provide a centralized patch within the runtime, or would we have required all code bases which used source generators to recompile?

jkotas · 2021-09-09T23:27:01Z

// The runtime is signaling that there's a bug in the current version of the generated code.
// We should use 'new Regex' as a slower (but correct!) fallback.

If the regex is used on performance critical path, going down this fallback can lead to missing the performance SLA and effectively killing the app after servicing. Is it a concern?

stephentoub · 2021-09-09T23:30:21Z

Is it a concern?

It is for me, one of the reasons I'm questioning the suggestion.

I understand the goal. At the same time, I don't see this as being any different from, say, static linking, or even building your app referencing a nuget package and thus shipping your app with a specific binary. If there's a vulnerability in the component you use, you need to rebuild.

eerhardt · 2021-09-09T23:33:04Z

But the reason we didn't do it for JsonSerializer is that the logic for JSON parsing still exists within the System.Text.Json library itself. The source generation for JSON is more like a coordinator. If we moved the actual parsing logic ("How do I read a string from the JSON payload? How do I know if this is a key or a value?") into the generated source itself, I would have pushed for an emergency fallback similar to the one I requested for Regex above.

With the "fast path" logic, this is no longer the case. The Json source generator generates a "Serialize" function that looks like:

        private static void CustomerSerialize(global::System.Text.Json.Utf8JsonWriter writer, global::Foo.Customer value)
        {
            if (value == null)
            {
                writer.WriteNullValue();
                return;
            }
        
            writer.WriteStartObject();
            writer.WriteString(PropName_Name, value.Name);
            writer.WriteString(PropName_Address, value.Address);
        
            writer.WriteEndObject();
        }

Granted, in 6.0 we only did Serialize, and not Deserialize. But we are on the path being described above as dangerous.

cc @layomia @steveharter

GrabYourPitchforks · 2021-09-09T23:39:48Z

If the regex is used on performance critical path, going down this fallback can lead to missing the performance SLA and effectively killing the app after servicing. Is it a concern?

It can be. But we make perf-vs.-correctness tradeoffs all the time, usually siding with correctness over performance. If there's a serious enough bug that would warrant us triggering this safety net, then it means we've weighed the facts and deemed the tradeoff worth it. I don't anticipate that we would trigger the safety net for minor bugs that are unlikely to impact real-world apps.

I don't see this as being any different from, say, static linking, or even building your app referencing a nuget package and thus shipping your app with a specific binary. If there's a vulnerability in the component you use, you need to rebuild.

It's not significantly different from static linking. But we still have dynamic linking scenarios within .NET, including publishing an app and asking it to use a runtime installed in a location separate from the application. And in those scenarios, we absolutely do not ask the app developers to rebuild & redeploy their apps. Similarly, NuGet packages are dynamically linked dependencies, and we don't ask NuGet package authors to publish a new version of their package every time any of their dependencies (direct or indirect) has a vulnerability. We leave it up to the app's build system to handle this scenario properly.

But we are on the path being described above as dangerous.

I don't think this is quite comparable. What you're describing is generated code which acts as a coordinator around library APIs. If there's a serious bug, it's almost certainly going to be in the implementation of those APIs (encoding, protocol parsing, memory management) - not in the coordination routine which wraps these APIs. The types of bugs we see in coordinators tend to be "you failed to serialize or deserialize this specific type correctly," which manifest themselves clearly at development time and tend not to have severe issues hidden beneath the surface.

stephentoub · 2021-09-09T23:41:35Z

We leave it up to the app's build system to handle this scenario properly.

Exactly. So you rebuild, and you get the updated nuget. You rebuild, and you get the updated code gen from the compiler.

GrabYourPitchforks · 2021-09-09T23:48:13Z

Exactly. So you rebuild, and you get the updated nuget. You rebuild, and you get the updated code gen from the compiler.

That sentence was very specifically referring to NuGet dependencies. The remainder of that paragraph discussed app developers who rely on shared runtime installs and who would not get source generator patches without recompiling & redeploying.

If you're making the argument that the runtime should be treated as a NuGet-style dependency and that applications should always recompile & redeploy when such a "dependency" has a vulnerability, I'm ok with that argument. But you'd need to square it with the fact that we do indeed allow (and sometimes promote!) shared installs, runtime redirection, and similar facilities.

stephentoub · 2021-09-09T23:55:56Z

If you're making the argument that the runtime should be treated as a NuGet-style dependency and that applications should always recompile & redeploy when such a "dependency" has a vulnerability, I'm ok with that argument. But you'd need to square it with the fact that we do indeed allow (and sometimes promote!) shared installs, runtime redirection, and similar facilities.

I'm making the argument that source generators are all about promoting run-time logic to build-time logic, and in doing so, you get all the benefits of that precomputation along with all the ramifications build-time decisions bring, whether it's code you manually wrote, code the C# compiler itself generates, code from a dependency selected at build time, code built into your binary from static linking, or code built into your binary from a source generator. The code from this or any other source generator is no longer part of the runtime.

GrabYourPitchforks · 2021-09-10T00:12:40Z

I'm making the argument that source generators are all about promoting run-time logic to build-time logic, and in doing so, you get all the benefits of that precomputation along with all the ramifications build-time decisions bring...

Sure, that's a fair point. And perhaps this is primarily a discussion more geared for a proper threat modeling session.

In a nutshell, I guess it comes down to expectations. Say that we advertise this nifty new attribute as a performance win, that it gets wide acclaim and documented in tutorials and magazine articles. What expectation would our larger developer audience have with respect to how it works? If they're just slapping an attribute on a method, do they know that they're in effect performing static linking of runtime code?

If devs who apply this attribute properly understand all the ramifications it brings, then I think we've done due diligence and there's no need to implement my original request above. But judging from how our product is used, I think it would be a leap to say that the majority of our users who would apply these attributes really appreciate these ramifications. Or that they understand how it interplays with other systems (like shared runtime).

The logic I shared above is just one mechanism for solving this. And perhaps I'm unfairly pushing a little too hard on it. An alternative could be education and documentation. (Though docs are always a bit meh because it's often too easy to skip over them.) You clearly have given source generators considerable thought and may be forming some preliminary ideas for improving user experience. As this feature matures I'd be interested to see what proposals emerge.

bartonjs · 2021-09-10T19:37:34Z

I don't know that I, personally, see the need for a runtime check for the generator version.

There's probably some room for developer education that, in general, running a source generator is somewhere between calling a library routine (bugs get fixed in dot-releases) and copying and pasting some code off of (e.g.) StackOverflow (where it's yours forever, no matter what problems it has), and maybe Regex warrants special notes because of how complex the output might be (since regular expressions as essentially a programming language).

Issues like the timeout bypass tend to happen only on certain shapes of regex; the version check is a very heavy handed approach that is probably overreaching 99+% of the time.
We already have precedence with "you need to update the SDK and then rebuild to get this security fix" in apps using a standalone deployment mode.
I believe that Blazor is always in standalone mode, so it's not necessarily a fringe model.
Other build-time tools exist, like sgen, and they have the same "the fix doesn't happen for free"
- It's not like we invented the concept of generating source code as part of an overall build process... random generation scripts and IL rewriters and such all were here before us. The only thing that's new is we've increased the in-your-face-ness as well as maybe increasing the complexity of generated code (because regex is a programming language).

For me, the followups would be

If we have a source generators concept topic somewhere, say something like "unlike calling into library routines, which can receive fixes as part of an update to the runtime, the generated code is part of your library/application and will only receive fixes after you install a newer version of the SDK and recompile". Then some extra words about making sure that the build machine has the newer SDK, and loosening the wording so that it doesn't imply all library code is part of the shared framework and that all generators come from the .NET SDK.
Presumably there'll be some docs somewhere that tout the virtues of precompiled regex FSMs. They might warrant some additional words like "Generally speaking, the complexity of the generated code increases with the complexity of the regular expression. Unlike using instances of the Regex class, regular expressions precompiled with this mechanism will only receive performance, functionality, or security fixes when you update the version of the .NET SDK on the build machine and recompile your code."

joperezr · 2021-12-13T23:25:51Z

Regex source generator has now been part of the shared framework for a while now, and it's feature set is very likely on par compared to Regex.Compiled. @danmoseley is there any specific work that we want to additionally track with this issue, or can we go ahead and close this now?

danmoseley · 2021-12-13T23:53:39Z

@joperezr I think it's fine to close this one. I assume any remaining work is tracked elsewhere. We should parent it under some common user story if it isn't already.

thanks for doing housekeeping.

joperezr · 2021-12-14T00:06:55Z

We should parent it under some common user story if it isn't already.

Yup, I'm preparing that issue now, which is how I came to this one 😄

Dotnet-GitSync-Bot added area-System.Text.RegularExpressions untriaged New issue has not been triaged by the area owner labels Nov 14, 2020

danmoseley mentioned this issue Nov 14, 2020

User Story: Source generators are added to common libraries to improve performance #43545

Closed

danmoseley added the User Story A single user-facing feature. Can be grouped under an epic. label Nov 14, 2020

danmoseley changed the title ~~Create a source generator for Regex~~ Source generator for Regex for faster startup Nov 14, 2020

am11 mentioned this issue Nov 17, 2020

Support Source Generators fsharp/fslang-suggestions#864

Open

5 tasks

stephentoub assigned stephentoub and unassigned jeffhandley and pgovind Sep 10, 2021

stephentoub mentioned this issue Sep 15, 2021

Add initial regex source generator #59186

Merged

danmoseley closed this as completed Dec 13, 2021

joperezr mentioned this issue Dec 14, 2021

System.Text.RegularExpressions work planned for .NET 7 #62758

Closed

48 tasks

jeffhandley mentioned this issue Jan 9, 2022

Libraries continuous improvement work planned for .NET 7 #57209

Closed

19 tasks

ghost locked as resolved and limited conversation to collaborators Jan 13, 2022

Developers can precompile their Regex code for faster startup #44676

Developers can precompile their Regex code for faster startup #44676

Comments

danmoseley commented Nov 14, 2020 • edited by StephenBonikowsky Loading

ghost commented Nov 14, 2020

danmoseley commented Nov 14, 2020

am11 commented Nov 14, 2020

danmoseley commented Nov 14, 2020

am11 commented Nov 14, 2020

svick commented Nov 15, 2020 • edited Loading

stephentoub commented Nov 15, 2020 • edited Loading

danmoseley commented Nov 16, 2020

marek-safar commented Nov 16, 2020

danmoseley commented Nov 16, 2020

stephentoub commented Nov 16, 2020

svick commented Nov 17, 2020

stephentoub commented Nov 17, 2020 • edited Loading

nbevans commented Nov 17, 2020

stephentoub commented Nov 17, 2020

7sharp9 commented Nov 17, 2020

jaredpar commented Nov 17, 2020

nbevans commented Nov 17, 2020 • edited Loading

stephentoub commented Nov 17, 2020

cartermp commented Nov 17, 2020

mjsabby commented Nov 17, 2020

stephentoub commented Nov 17, 2020

stephentoub commented Sep 9, 2021

GrabYourPitchforks commented Sep 9, 2021

GrabYourPitchforks commented Sep 9, 2021

jkotas commented Sep 9, 2021

stephentoub commented Sep 9, 2021 • edited Loading

eerhardt commented Sep 9, 2021

GrabYourPitchforks commented Sep 9, 2021

stephentoub commented Sep 9, 2021

GrabYourPitchforks commented Sep 9, 2021 • edited Loading

stephentoub commented Sep 9, 2021 • edited Loading

GrabYourPitchforks commented Sep 10, 2021

bartonjs commented Sep 10, 2021

joperezr commented Dec 13, 2021

danmoseley commented Dec 13, 2021

joperezr commented Dec 14, 2021

danmoseley commented Nov 14, 2020 •

edited by StephenBonikowsky

Loading

svick commented Nov 15, 2020 •

edited

Loading

stephentoub commented Nov 15, 2020 •

edited

Loading

stephentoub commented Nov 17, 2020 •

edited

Loading

nbevans commented Nov 17, 2020 •

edited

Loading

stephentoub commented Sep 9, 2021 •

edited

Loading

GrabYourPitchforks commented Sep 9, 2021 •

edited

Loading

stephentoub commented Sep 9, 2021 •

edited

Loading