-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Developers can precompile their Regex code for faster startup #44676
Comments
Tagging @eerhardt, @pgovind, @jeffhandley as subscribed to this area. Issue Details
|
cc @mjsabby |
With source generator based implementation, will the difference between compiled and interpreted regex modes disappear; i.e. they both will endup using the same 'generated managed code' version? |
@am11 if you enable it, and presumably annotate appropriately. There should already be no observable difference, other than perf. |
Great! In addition to annotations (or instead of, if low granularity is not required?), AppContext setting approach could also be useful here, for the end-users to specify project-wide choice. <!-- enduser-lib.csrpoj -->
<ItemGroup>
<!-- existing options, where we also do not care about low granularity by means of annotations -->
<RuntimeHostConfigurationOption Include="System.Globalization.Invariant" Value="true" />
<RuntimeHostConfigurationOption Include="System.Globalization.UseNls" Value="true" />
<RuntimeHostConfigurationOption Include="System.Threading.ThreadPool.MaxThreads" Value="true" />
<!-- etc. -->
<!-- new, could be annotations and this option, or just this option -->
<RuntimeHostConfigurationOption Include="System.Text.RegularExpressions.UseSourceGenerated" Value="true" />
</ItemGroup> |
@am11 I'm not sure it's feasible to make code that uses existing |
There are technically ways, but I currently don't think we should pursue them. For example, imagine we added a static method like: public abstract class RegexRunnerFactory
{
public static void Register(string pattern, RegexOptions options, Func<RegexRunnerFactory> createFactory);
} and then modified the Regex implementation to consult upon Regex construction a dictionary of registered runner factories populated by Register. That would enable any Regex usage to instead be backed by something provided via a Register call. Then in addition to the regex source generator spitting out the relevant types, it would also spit out a module initializer that called Register, e.g. var r = new Regex("[a-f0-9]*", RegexOptions.IgnoreCase); would be recognized by the source generator and would spit out: [ModuleInitializer]
internal static void Initialize1() => RegexRunnerFactory.Register("[a-f0-9]*", RegexOptions.IgnoreCase, () => new RegexRunner1Factory()); where RegexRunner1Factory contains/references all the generated regex code. But there are a variety of downsides to an approach like this, including but not limited to it bloating the IL with the regex code for all regexes and not just those a developer opted-in to having be source generated. I'm currently of the opinion it's better to opt-in on a case-by-case basis and have it be clear in the source what's resulting in a potentially huge amount of IL being spit into the binary. |
Note that the original proposal would require existing code to be refactored
In the latter case, there is currently an internal cache of Regex objects (by default 15) that is consulted by the static methods; by changing to the factory method you lose the benefit of that cache, but of course most of the purpose of that cache is to help avoid re-generating the IL, which would no longer happen anyway. If it was important to avoid the refactorings above (eg., so that you could "warm up" the key regexes you plan to use in your startup code, say, without changing existing points of use) it would be easy to expose API for the generated code to add its Regex object to the existing cache (for static methods) or add its runner to a new RegexRunnerCache (for the instance methods, or possibly both). But I assume anyone opting into this is fine changing their existing points of use. |
@danmosemsft I don't have any customer evidence that this is an important scenario. You are correct that by default we run in non-SRE mode on all existing mobile configurations |
As for existing code that uses the .NET Framework ability to write out compiled regexes (like @mjsabby ) -- using CompileToAssembly -- that would have to be modified; instead of newing up the type derived from Regex, it would have to be refactored to use the factory method. I think that's fine -- I don't see any particular advantage of a type over a factory method, and an annotated partial method is more convenient way to hook into compilation. |
Which, for a scenario involving dynamically generating lots of regexes, might mean generating source that in turn relies on a source generator. |
@stephentoub Except one source generator can't depend on another source generator: dotnet/roslyn#48358. |
Needn't be an actual source generator, just a program that spits out C# source that uses the regex source generator. I was careful in my wording above ;-) |
How will this work for F#? |
F# continues to support Regex the same ways it always has. The feature here is an additional mechanism relying on and specific to Roslyn. |
In other words F# is completly excluded from this optimisation. |
This doesn't need to be defined ahead of time. It could be one of the outputs of the source generator. |
Will the dotnet/fsharp project be involved at all about these ideas? Why is the dotnet/runtime project doing work for dotnet/roslyn project? So many questions... |
But if the attribute is the thing the source generator keys off of, you wouldn't get IntelliSense when authoring use of the attribute, would you? |
@nbevans F# and source generators are tracked here: fsharp/fslang-suggestions#864 I encourage questions like this (what about F#?) to be routed there, because there are several options and we'll take the one that is the best for F# when we consider the state of the .NET ecosystem. This issue tracks adding a way to push regex compilation into compile-time, which is inherently going to depend on how a language can support that kind of scenario. For C#, it is via a C# Source Generator that the Roslyn C# compiler supports. That's what this means:
As mentioned, this doesn't somehow exclude F# from using the Regex API. It is an additional mechanism that optimizes an existing code path. That's the general model for source generators, at least certainly for something like the runtime where there is an extremely high bar for compatibility. If you're concerned about F# support for source generators then the issue I linked is the best place to engage. |
@stephentoub Correct we'd write a tool to spit out C#, although you could imagine we could just as well spit out what the source generator would, but perhaps it's easier to leverage the work being done here. I imagine this issue covers changes in the public surface area and not the specific way of invocation? I'm sure sure you know that and I'm probably imagining reticence in your comment that isn't actually there, but I'm chiming in per the tag that this will be quite useful to us. |
I assume you mean if there were a programmatic API for invoking the generator you'd use it, not that you'd implement your own equivalent generator (if you actually mean the latter, please first contribute that implementation 😉) |
How is this more of an issue than with Regex.CompileToAssembly on .NET Framework? |
It's not. But I believe with what we know now, with the various MSRCs which have afflicted |
A concrete example: dotnet/corefx#38091, which was a fix for CVE-2019-0820. If regex source generators were in common use, would we have been able to provide a centralized patch within the runtime, or would we have required all code bases which used source generators to recompile? |
If the regex is used on performance critical path, going down this fallback can lead to missing the performance SLA and effectively killing the app after servicing. Is it a concern? |
It is for me, one of the reasons I'm questioning the suggestion. I understand the goal. At the same time, I don't see this as being any different from, say, static linking, or even building your app referencing a nuget package and thus shipping your app with a specific binary. If there's a vulnerability in the component you use, you need to rebuild. |
With the "fast path" logic, this is no longer the case. The Json source generator generates a "Serialize" function that looks like: private static void CustomerSerialize(global::System.Text.Json.Utf8JsonWriter writer, global::Foo.Customer value)
{
if (value == null)
{
writer.WriteNullValue();
return;
}
writer.WriteStartObject();
writer.WriteString(PropName_Name, value.Name);
writer.WriteString(PropName_Address, value.Address);
writer.WriteEndObject();
} Granted, in 6.0 we only did Serialize, and not Deserialize. But we are on the path being described above as dangerous. |
It can be. But we make perf-vs.-correctness tradeoffs all the time, usually siding with correctness over performance. If there's a serious enough bug that would warrant us triggering this safety net, then it means we've weighed the facts and deemed the tradeoff worth it. I don't anticipate that we would trigger the safety net for minor bugs that are unlikely to impact real-world apps.
It's not significantly different from static linking. But we still have dynamic linking scenarios within .NET, including publishing an app and asking it to use a runtime installed in a location separate from the application. And in those scenarios, we absolutely do not ask the app developers to rebuild & redeploy their apps. Similarly, NuGet packages are dynamically linked dependencies, and we don't ask NuGet package authors to publish a new version of their package every time any of their dependencies (direct or indirect) has a vulnerability. We leave it up to the app's build system to handle this scenario properly.
I don't think this is quite comparable. What you're describing is generated code which acts as a coordinator around library APIs. If there's a serious bug, it's almost certainly going to be in the implementation of those APIs (encoding, protocol parsing, memory management) - not in the coordination routine which wraps these APIs. The types of bugs we see in coordinators tend to be "you failed to serialize or deserialize this specific type correctly," which manifest themselves clearly at development time and tend not to have severe issues hidden beneath the surface. |
Exactly. So you rebuild, and you get the updated nuget. You rebuild, and you get the updated code gen from the compiler. |
That sentence was very specifically referring to NuGet dependencies. The remainder of that paragraph discussed app developers who rely on shared runtime installs and who would not get source generator patches without recompiling & redeploying. If you're making the argument that the runtime should be treated as a NuGet-style dependency and that applications should always recompile & redeploy when such a "dependency" has a vulnerability, I'm ok with that argument. But you'd need to square it with the fact that we do indeed allow (and sometimes promote!) shared installs, runtime redirection, and similar facilities. |
I'm making the argument that source generators are all about promoting run-time logic to build-time logic, and in doing so, you get all the benefits of that precomputation along with all the ramifications build-time decisions bring, whether it's code you manually wrote, code the C# compiler itself generates, code from a dependency selected at build time, code built into your binary from static linking, or code built into your binary from a source generator. The code from this or any other source generator is no longer part of the runtime. |
Sure, that's a fair point. And perhaps this is primarily a discussion more geared for a proper threat modeling session. In a nutshell, I guess it comes down to expectations. Say that we advertise this nifty new attribute as a performance win, that it gets wide acclaim and documented in tutorials and magazine articles. What expectation would our larger developer audience have with respect to how it works? If they're just slapping an attribute on a method, do they know that they're in effect performing static linking of runtime code? If devs who apply this attribute properly understand all the ramifications it brings, then I think we've done due diligence and there's no need to implement my original request above. But judging from how our product is used, I think it would be a leap to say that the majority of our users who would apply these attributes really appreciate these ramifications. Or that they understand how it interplays with other systems (like shared runtime). The logic I shared above is just one mechanism for solving this. And perhaps I'm unfairly pushing a little too hard on it. An alternative could be education and documentation. (Though docs are always a bit meh because it's often too easy to skip over them.) You clearly have given source generators considerable thought and may be forming some preliminary ideas for improving user experience. As this feature matures I'd be interested to see what proposals emerge. |
I don't know that I, personally, see the need for a runtime check for the generator version. There's probably some room for developer education that, in general, running a source generator is somewhere between calling a library routine (bugs get fixed in dot-releases) and copying and pasting some code off of (e.g.) StackOverflow (where it's yours forever, no matter what problems it has), and maybe Regex warrants special notes because of how complex the output might be (since regular expressions as essentially a programming language).
For me, the followups would be
|
Regex source generator has now been part of the shared framework for a while now, and it's feature set is very likely on par compared to Regex.Compiled. @danmoseley is there any specific work that we want to additionally track with this issue, or can we go ahead and close this now? |
@joperezr I think it's fine to close this one. I assume any remaining work is tracked elsewhere. We should parent it under some common user story if it isn't already. thanks for doing housekeeping. |
Yup, I'm preparing that issue now, which is how I came to this one 😄 |
AB#1255830
Regex has an interpreted mode and a compiled mode. The compiled mode takes longer to start, but is generally faster. Some users want both startup time and performance; other users want to run on a platform where JITted code is not allowed, and also want performance. For those users, .NET Framework allowed the generated code to be saved out but .NET Core/.NET 5 does not plan to support saving emitted IL.
Instead, we could develop a Roslyn Source Generator that would generate the necessary code (as C#) at compile time.
Essentially API should be need on Regex, as the generated code would hook into it in the regular way, by deriving from RegexRunner and implementing FindFirstChar() and Go(). This API is all public/protected today (indeed it happens that .NET Core can load regexes written out by .NET Framework.)
TBD the user experience - eg., how one must annotate regexes in order to trigger the generator - assuming it doesn't attempt to read all the code to try to infer them - and how the generated code is wired up.
For example, one could imagine requiring an annotation like this (we could also expose API on the source generator so it could consume a list of regexes, for example)
This new attribute would need to be exposed somewhere, probably on the Regex assembly since the code being compiled would be referencing that anyway.
Also TBD the implementation details necessary to wire into the existing Regex implementation - this is probably most of the work.
cc @pgovind @eerhardt @stephentoub this is just a summary of our email thread. Anything to add?
In terms of customers, we have at least one major 1st party service that heavily uses regex and highly values both throughput and startup time. This is probably P1 for them.
@danroth27 in a Blazor app, can I assume that most users in the .NET 6 timeframe will be content with either using Javascript regexes, or interpreted mode .NET regexes? In my mind this is P2/P3.
@marek-safar similar question for Xamarin, on platforms that don't JIT, Xamarin apps presumably currently interpret the ref-emitted code, or use the regex interpreted mode - and this would not be a big win for them - we had talked about P2.
The text was updated successfully, but these errors were encountered: