Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Developers can precompile their Regex code for faster startup #44676

Closed
danmoseley opened this issue Nov 14, 2020 · 54 comments
Closed

Developers can precompile their Regex code for faster startup #44676

danmoseley opened this issue Nov 14, 2020 · 54 comments
Assignees
Labels
area-System.Text.RegularExpressions Cost:M Work that requires one engineer up to 2 weeks Priority:2 Work that is important, but not critical for the release Team:Libraries User Story A single user-facing feature. Can be grouped under an epic.
Milestone

Comments

@danmoseley
Copy link
Member

danmoseley commented Nov 14, 2020

AB#1255830
Regex has an interpreted mode and a compiled mode. The compiled mode takes longer to start, but is generally faster. Some users want both startup time and performance; other users want to run on a platform where JITted code is not allowed, and also want performance. For those users, .NET Framework allowed the generated code to be saved out but .NET Core/.NET 5 does not plan to support saving emitted IL.

Instead, we could develop a Roslyn Source Generator that would generate the necessary code (as C#) at compile time.

Essentially API should be need on Regex, as the generated code would hook into it in the regular way, by deriving from RegexRunner and implementing FindFirstChar() and Go(). This API is all public/protected today (indeed it happens that .NET Core can load regexes written out by .NET Framework.)

TBD the user experience - eg., how one must annotate regexes in order to trigger the generator - assuming it doesn't attempt to read all the code to try to infer them - and how the generated code is wired up.

For example, one could imagine requiring an annotation like this (we could also expose API on the source generator so it could consume a list of regexes, for example)

[RegexSourceGenerator(“(?:[0-9]{1,3}\.){3}[0-9]{1,3}”, RegexOptions.IgnoreCase)]
public partial Regex CreateIpAddressRegex(); 

This new attribute would need to be exposed somewhere, probably on the Regex assembly since the code being compiled would be referencing that anyway.

Also TBD the implementation details necessary to wire into the existing Regex implementation - this is probably most of the work.

cc @pgovind @eerhardt @stephentoub this is just a summary of our email thread. Anything to add?

In terms of customers, we have at least one major 1st party service that heavily uses regex and highly values both throughput and startup time. This is probably P1 for them.

@danroth27 in a Blazor app, can I assume that most users in the .NET 6 timeframe will be content with either using Javascript regexes, or interpreted mode .NET regexes? In my mind this is P2/P3.

@marek-safar similar question for Xamarin, on platforms that don't JIT, Xamarin apps presumably currently interpret the ref-emitted code, or use the regex interpreted mode - and this would not be a big win for them - we had talked about P2.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Text.RegularExpressions untriaged New issue has not been triaged by the area owner labels Nov 14, 2020
@ghost
Copy link

ghost commented Nov 14, 2020

Tagging @eerhardt, @pgovind, @jeffhandley as subscribed to this area.
See info in area-owners.md if you want to be subscribed.

Issue Details
Description:

Regex has an interpreted mode and a compiled mode. The compiled mode takes longer to start, but is generally faster. Some users want both startup time and performance; other users want to run on a platform where JITted code is not allowed, and also want performance. For those users, .NET Framework allowed the generated code to be saved out but .NET Core/.NET 5 does not plan to support saving emitted IL.

Instead, we could develop a Roslyn Source Generator that would generate the necessary code (as C#) at compile time.

Essentially API should be need on Regex, as the generated code would hook into it in the regular way, by deriving from RegexRunner and implementing FindFirstChar() and Go(). This API is all public/protected today (indeed it happens that .NET Core can load regexes written out by .NET Framework.)

TBD the user experience - eg., how one must annotate regexes in order to trigger the generator - assuming it doesn't attempt to read all the code to try to infer them.

For example, one could imagine requiring an annotation like this (we could also expose API on the source generator so it could consume a list of regexes, for example)

[RegexSourceGenerator(“[a-f0-9]*”, RegexOptions.IgnoreCase)]
public partial Regex CreateExpression();

This new attribute would need to be exposed somewhere, probably on the Regex assembly since the code being compiled would be referencing that anyway.

Also TBD the implementation details necessary to wire into the existing Regex implementation - this is probably most of the work.

cc @pgovind @eerhardt @stephentoub this is just a summary of our email thread. Anything to add?

In terms of customers, we have at least one major 1st party service that heavily uses regex and highly values both throughput and startup time. This is probablyP1 for them.

@danroth27 in a Blazor app, can I assume that most users in the .NET 6 timeframe will be content with either using Javascript regexes, or interpreted mode .NET regexes? In my mind this is P2/P3.

@marek-safar similar question for Xamarin, on platforms that don't JIT, Xamarin apps presumably currently interpret the ref-emitted code, or use the regex interpreted mode - and this would not be a big win for them - we had talked about P2.

Author: danmosemsft
Assignees: -
Labels:

area-System.Text.RegularExpressions, untriaged

Milestone: -

@danmoseley danmoseley added the User Story A single user-facing feature. Can be grouped under an epic. label Nov 14, 2020
@danmoseley
Copy link
Member Author

cc @mjsabby

@danmoseley danmoseley changed the title Create a source generator for Regex Source generator for Regex for faster startup Nov 14, 2020
@am11
Copy link
Member

am11 commented Nov 14, 2020

With source generator based implementation, will the difference between compiled and interpreted regex modes disappear; i.e. they both will endup using the same 'generated managed code' version?

@danmoseley
Copy link
Member Author

@am11 if you enable it, and presumably annotate appropriately.

There should already be no observable difference, other than perf.

@am11
Copy link
Member

am11 commented Nov 14, 2020

Great! In addition to annotations (or instead of, if low granularity is not required?), AppContext setting approach could also be useful here, for the end-users to specify project-wide choice.

<!-- enduser-lib.csrpoj -->
<ItemGroup>
  <!-- existing options, where we also do not care about low granularity by means of annotations -->
  <RuntimeHostConfigurationOption Include="System.Globalization.Invariant" Value="true" />
  <RuntimeHostConfigurationOption Include="System.Globalization.UseNls" Value="true" />
  <RuntimeHostConfigurationOption Include="System.Threading.ThreadPool.MaxThreads" Value="true" />
  <!-- etc. -->

  <!-- new, could be annotations and this option, or just this option -->
  <RuntimeHostConfigurationOption Include="System.Text.RegularExpressions.UseSourceGenerated" Value="true" />
</ItemGroup>

@svick
Copy link
Contributor

svick commented Nov 15, 2020

@am11 I'm not sure it's feasible to make code that uses existing Regex members use source generated implementation.

@stephentoub
Copy link
Member

stephentoub commented Nov 15, 2020

I'm not sure it's feasible to make code that uses existing Regex members use source generated implementation.

There are technically ways, but I currently don't think we should pursue them.

For example, imagine we added a static method like:

public abstract class RegexRunnerFactory
{
    public static void Register(string pattern, RegexOptions options, Func<RegexRunnerFactory> createFactory);
}

and then modified the Regex implementation to consult upon Regex construction a dictionary of registered runner factories populated by Register. That would enable any Regex usage to instead be backed by something provided via a Register call. Then in addition to the regex source generator spitting out the relevant types, it would also spit out a module initializer that called Register, e.g.

var r = new Regex("[a-f0-9]*", RegexOptions.IgnoreCase);

would be recognized by the source generator and would spit out:

[ModuleInitializer]
internal static void Initialize1() => RegexRunnerFactory.Register("[a-f0-9]*", RegexOptions.IgnoreCase, () => new RegexRunner1Factory());

where RegexRunner1Factory contains/references all the generated regex code.

But there are a variety of downsides to an approach like this, including but not limited to it bloating the IL with the regex code for all regexes and not just those a developer opted-in to having be source generated. I'm currently of the opinion it's better to opt-in on a case-by-case basis and have it be clear in the source what's resulting in a potentially huge amount of IL being spit into the binary.

@danmoseley
Copy link
Member Author

Note that the original proposal would require existing code to be refactored

  • Places where commonly used regexes are newed up need to call the factory method instead
  • Places that use the static regex methods (Match, Matches, Replace, IsMatch, Split) need to be modified to call the factory method and call instance methods on the result instead.

In the latter case, there is currently an internal cache of Regex objects (by default 15) that is consulted by the static methods; by changing to the factory method you lose the benefit of that cache, but of course most of the purpose of that cache is to help avoid re-generating the IL, which would no longer happen anyway.

If it was important to avoid the refactorings above (eg., so that you could "warm up" the key regexes you plan to use in your startup code, say, without changing existing points of use) it would be easy to expose API for the generated code to add its Regex object to the existing cache (for static methods) or add its runner to a new RegexRunnerCache (for the instance methods, or possibly both). But I assume anyone opting into this is fine changing their existing points of use.

@marek-safar
Copy link
Contributor

@danmosemsft I don't have any customer evidence that this is an important scenario. You are correct that by default we run in non-SRE mode on all existing mobile configurations

@danmoseley
Copy link
Member Author

As for existing code that uses the .NET Framework ability to write out compiled regexes (like @mjsabby ) -- using CompileToAssembly -- that would have to be modified; instead of newing up the type derived from Regex, it would have to be refactored to use the factory method. I think that's fine -- I don't see any particular advantage of a type over a factory method, and an annotated partial method is more convenient way to hook into compilation.

@stephentoub
Copy link
Member

it would have to be refactored to use the factory method

Which, for a scenario involving dynamically generating lots of regexes, might mean generating source that in turn relies on a source generator.

@svick
Copy link
Contributor

svick commented Nov 17, 2020

@stephentoub Except one source generator can't depend on another source generator: dotnet/roslyn#48358.

@stephentoub
Copy link
Member

stephentoub commented Nov 17, 2020

Except one source generator can't depend on another source generator

Needn't be an actual source generator, just a program that spits out C# source that uses the regex source generator. I was careful in my wording above ;-)

@nbevans
Copy link

nbevans commented Nov 17, 2020

How will this work for F#?

@stephentoub
Copy link
Member

How will this work for F#?

F# continues to support Regex the same ways it always has. The feature here is an additional mechanism relying on and specific to Roslyn.

@7sharp9
Copy link

7sharp9 commented Nov 17, 2020

In other words F# is completly excluded from this optimisation.

@jaredpar
Copy link
Member

This new attribute would need to be exposed somewhere, probably on the Regex assembly since the code being compiled would be referencing that anyway.

This doesn't need to be defined ahead of time. It could be one of the outputs of the source generator.

@nbevans
Copy link

nbevans commented Nov 17, 2020

How will this work for F#?

F# continues to support Regex the same ways it always has. The feature here is an additional mechanism relying on and specific to Roslyn.

Will the dotnet/fsharp project be involved at all about these ideas? Why is the dotnet/runtime project doing work for dotnet/roslyn project?

So many questions...

@stephentoub
Copy link
Member

This doesn't need to be defined ahead of time. It could be one of the outputs of the source generator.

But if the attribute is the thing the source generator keys off of, you wouldn't get IntelliSense when authoring use of the attribute, would you?

@cartermp
Copy link

@nbevans F# and source generators are tracked here: fsharp/fslang-suggestions#864

I encourage questions like this (what about F#?) to be routed there, because there are several options and we'll take the one that is the best for F# when we consider the state of the .NET ecosystem.

This issue tracks adding a way to push regex compilation into compile-time, which is inherently going to depend on how a language can support that kind of scenario. For C#, it is via a C# Source Generator that the Roslyn C# compiler supports. That's what this means:

The feature here is an additional mechanism relying on and specific to Roslyn.

As mentioned, this doesn't somehow exclude F# from using the Regex API. It is an additional mechanism that optimizes an existing code path. That's the general model for source generators, at least certainly for something like the runtime where there is an extremely high bar for compatibility.

If you're concerned about F# support for source generators then the issue I linked is the best place to engage.

@mjsabby
Copy link
Contributor

mjsabby commented Nov 17, 2020

@stephentoub Correct we'd write a tool to spit out C#, although you could imagine we could just as well spit out what the source generator would, but perhaps it's easier to leverage the work being done here. I imagine this issue covers changes in the public surface area and not the specific way of invocation?

I'm sure sure you know that and I'm probably imagining reticence in your comment that isn't actually there, but I'm chiming in per the tag that this will be quite useful to us.

@stephentoub
Copy link
Member

although you could imagine we could just as well spit out what the source generator would

I assume you mean if there were a programmatic API for invoking the generator you'd use it, not that you'd implement your own equivalent generator (if you actually mean the latter, please first contribute that implementation 😉)

@stephentoub
Copy link
Member

How is this more of an issue than with Regex.CompileToAssembly on .NET Framework?

@GrabYourPitchforks
Copy link
Member

How is this more of an issue than with Regex.CompileToAssembly on .NET Framework?

It's not. But I believe with what we know now, with the various MSRCs which have afflicted Regex over the years (and the fact that we even implemented big-hammer timeout functionality!), we would have done the same check in Regex.CompileToAssembly's compiled code had we been able to go back in time.

@GrabYourPitchforks
Copy link
Member

A concrete example: dotnet/corefx#38091, which was a fix for CVE-2019-0820.

If regex source generators were in common use, would we have been able to provide a centralized patch within the runtime, or would we have required all code bases which used source generators to recompile?

@jkotas
Copy link
Member

jkotas commented Sep 9, 2021

// The runtime is signaling that there's a bug in the current version of the generated code.
// We should use 'new Regex' as a slower (but correct!) fallback.

If the regex is used on performance critical path, going down this fallback can lead to missing the performance SLA and effectively killing the app after servicing. Is it a concern?

@stephentoub
Copy link
Member

stephentoub commented Sep 9, 2021

Is it a concern?

It is for me, one of the reasons I'm questioning the suggestion.

I understand the goal. At the same time, I don't see this as being any different from, say, static linking, or even building your app referencing a nuget package and thus shipping your app with a specific binary. If there's a vulnerability in the component you use, you need to rebuild.

@eerhardt
Copy link
Member

eerhardt commented Sep 9, 2021

But the reason we didn't do it for JsonSerializer is that the logic for JSON parsing still exists within the System.Text.Json library itself. The source generation for JSON is more like a coordinator. If we moved the actual parsing logic ("How do I read a string from the JSON payload? How do I know if this is a key or a value?") into the generated source itself, I would have pushed for an emergency fallback similar to the one I requested for Regex above.

With the "fast path" logic, this is no longer the case. The Json source generator generates a "Serialize" function that looks like:

        private static void CustomerSerialize(global::System.Text.Json.Utf8JsonWriter writer, global::Foo.Customer value)
        {
            if (value == null)
            {
                writer.WriteNullValue();
                return;
            }
        
            writer.WriteStartObject();
            writer.WriteString(PropName_Name, value.Name);
            writer.WriteString(PropName_Address, value.Address);
        
            writer.WriteEndObject();
        }

Granted, in 6.0 we only did Serialize, and not Deserialize. But we are on the path being described above as dangerous.

cc @layomia @steveharter

@GrabYourPitchforks
Copy link
Member

If the regex is used on performance critical path, going down this fallback can lead to missing the performance SLA and effectively killing the app after servicing. Is it a concern?

It can be. But we make perf-vs.-correctness tradeoffs all the time, usually siding with correctness over performance. If there's a serious enough bug that would warrant us triggering this safety net, then it means we've weighed the facts and deemed the tradeoff worth it. I don't anticipate that we would trigger the safety net for minor bugs that are unlikely to impact real-world apps.

I don't see this as being any different from, say, static linking, or even building your app referencing a nuget package and thus shipping your app with a specific binary. If there's a vulnerability in the component you use, you need to rebuild.

It's not significantly different from static linking. But we still have dynamic linking scenarios within .NET, including publishing an app and asking it to use a runtime installed in a location separate from the application. And in those scenarios, we absolutely do not ask the app developers to rebuild & redeploy their apps. Similarly, NuGet packages are dynamically linked dependencies, and we don't ask NuGet package authors to publish a new version of their package every time any of their dependencies (direct or indirect) has a vulnerability. We leave it up to the app's build system to handle this scenario properly.

But we are on the path being described above as dangerous.

I don't think this is quite comparable. What you're describing is generated code which acts as a coordinator around library APIs. If there's a serious bug, it's almost certainly going to be in the implementation of those APIs (encoding, protocol parsing, memory management) - not in the coordination routine which wraps these APIs. The types of bugs we see in coordinators tend to be "you failed to serialize or deserialize this specific type correctly," which manifest themselves clearly at development time and tend not to have severe issues hidden beneath the surface.

@stephentoub
Copy link
Member

We leave it up to the app's build system to handle this scenario properly.

Exactly. So you rebuild, and you get the updated nuget. You rebuild, and you get the updated code gen from the compiler.

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Sep 9, 2021

Exactly. So you rebuild, and you get the updated nuget. You rebuild, and you get the updated code gen from the compiler.

That sentence was very specifically referring to NuGet dependencies. The remainder of that paragraph discussed app developers who rely on shared runtime installs and who would not get source generator patches without recompiling & redeploying.

If you're making the argument that the runtime should be treated as a NuGet-style dependency and that applications should always recompile & redeploy when such a "dependency" has a vulnerability, I'm ok with that argument. But you'd need to square it with the fact that we do indeed allow (and sometimes promote!) shared installs, runtime redirection, and similar facilities.

@stephentoub
Copy link
Member

stephentoub commented Sep 9, 2021

If you're making the argument that the runtime should be treated as a NuGet-style dependency and that applications should always recompile & redeploy when such a "dependency" has a vulnerability, I'm ok with that argument. But you'd need to square it with the fact that we do indeed allow (and sometimes promote!) shared installs, runtime redirection, and similar facilities.

I'm making the argument that source generators are all about promoting run-time logic to build-time logic, and in doing so, you get all the benefits of that precomputation along with all the ramifications build-time decisions bring, whether it's code you manually wrote, code the C# compiler itself generates, code from a dependency selected at build time, code built into your binary from static linking, or code built into your binary from a source generator. The code from this or any other source generator is no longer part of the runtime.

@GrabYourPitchforks
Copy link
Member

I'm making the argument that source generators are all about promoting run-time logic to build-time logic, and in doing so, you get all the benefits of that precomputation along with all the ramifications build-time decisions bring...

Sure, that's a fair point. And perhaps this is primarily a discussion more geared for a proper threat modeling session.

In a nutshell, I guess it comes down to expectations. Say that we advertise this nifty new attribute as a performance win, that it gets wide acclaim and documented in tutorials and magazine articles. What expectation would our larger developer audience have with respect to how it works? If they're just slapping an attribute on a method, do they know that they're in effect performing static linking of runtime code?

If devs who apply this attribute properly understand all the ramifications it brings, then I think we've done due diligence and there's no need to implement my original request above. But judging from how our product is used, I think it would be a leap to say that the majority of our users who would apply these attributes really appreciate these ramifications. Or that they understand how it interplays with other systems (like shared runtime).

The logic I shared above is just one mechanism for solving this. And perhaps I'm unfairly pushing a little too hard on it. An alternative could be education and documentation. (Though docs are always a bit meh because it's often too easy to skip over them.) You clearly have given source generators considerable thought and may be forming some preliminary ideas for improving user experience. As this feature matures I'd be interested to see what proposals emerge.

@bartonjs
Copy link
Member

I don't know that I, personally, see the need for a runtime check for the generator version.

There's probably some room for developer education that, in general, running a source generator is somewhere between calling a library routine (bugs get fixed in dot-releases) and copying and pasting some code off of (e.g.) StackOverflow (where it's yours forever, no matter what problems it has), and maybe Regex warrants special notes because of how complex the output might be (since regular expressions as essentially a programming language).

  • Issues like the timeout bypass tend to happen only on certain shapes of regex; the version check is a very heavy handed approach that is probably overreaching 99+% of the time.
  • We already have precedence with "you need to update the SDK and then rebuild to get this security fix" in apps using a standalone deployment mode.
  • I believe that Blazor is always in standalone mode, so it's not necessarily a fringe model.
  • Other build-time tools exist, like sgen, and they have the same "the fix doesn't happen for free"
    • It's not like we invented the concept of generating source code as part of an overall build process... random generation scripts and IL rewriters and such all were here before us. The only thing that's new is we've increased the in-your-face-ness as well as maybe increasing the complexity of generated code (because regex is a programming language).

For me, the followups would be

  • If we have a source generators concept topic somewhere, say something like "unlike calling into library routines, which can receive fixes as part of an update to the runtime, the generated code is part of your library/application and will only receive fixes after you install a newer version of the SDK and recompile". Then some extra words about making sure that the build machine has the newer SDK, and loosening the wording so that it doesn't imply all library code is part of the shared framework and that all generators come from the .NET SDK.
  • Presumably there'll be some docs somewhere that tout the virtues of precompiled regex FSMs. They might warrant some additional words like "Generally speaking, the complexity of the generated code increases with the complexity of the regular expression. Unlike using instances of the Regex class, regular expressions precompiled with this mechanism will only receive performance, functionality, or security fixes when you update the version of the .NET SDK on the build machine and recompile your code."

@joperezr
Copy link
Member

Regex source generator has now been part of the shared framework for a while now, and it's feature set is very likely on par compared to Regex.Compiled. @danmoseley is there any specific work that we want to additionally track with this issue, or can we go ahead and close this now?

@danmoseley
Copy link
Member Author

@joperezr I think it's fine to close this one. I assume any remaining work is tracked elsewhere. We should parent it under some common user story if it isn't already.

thanks for doing housekeeping.

@joperezr
Copy link
Member

We should parent it under some common user story if it isn't already.

Yup, I'm preparing that issue now, which is how I came to this one 😄

@ghost ghost locked as resolved and limited conversation to collaborators Jan 13, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Text.RegularExpressions Cost:M Work that requires one engineer up to 2 weeks Priority:2 Work that is important, but not critical for the release Team:Libraries User Story A single user-facing feature. Can be grouped under an epic.
Projects
None yet
Development

No branches or pull requests