Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider Span<char> overloads on Regex classes #23602

Closed
danmoseley opened this issue Sep 19, 2017 · 35 comments · Fixed by #67794
Closed

Consider Span<char> overloads on Regex classes #23602

danmoseley opened this issue Sep 19, 2017 · 35 comments · Fixed by #67794
Assignees
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.Text.RegularExpressions
Milestone

Comments

@danmoseley
Copy link
Member

edit by @ViktorHofer, moved initial post down.

Spanifying Regex removes (a) unnecessary string allocations that tend to decrease perf and (b) allows different types of Memory to be processed.
API proposal and implementation: ViktorHofer/corefx#1

Proposed APIs

This diff contains the Memory overloads and the MatchEvaluator overloads. See discussion above if we should introduce new ref types for Match, Group & Capture.

namespace System.Text.RegularExpressions
{
    public partial class Capture
    {
        internal Capture() { }
        public int Index { get { throw null; } }
        public int Length { get { throw null; } }
        public string Value { get { throw null; } }
+       public ReadOnlySpan<char> ValueSpan { get { throw null; } }
        public override string ToString() { throw null; }
    }
    public partial class Match : System.Text.RegularExpressions.Group
    {
        internal Match() { }
        public static System.Text.RegularExpressions.Match Empty { get { throw null; } }
        public virtual System.Text.RegularExpressions.GroupCollection Groups { get { throw null; } }
        public System.Text.RegularExpressions.Match NextMatch() { throw null; }
        public virtual string Result(string replacement) { throw null; }
+       public virtual bool TryResult(string replacement, Span<char> destination, out int charsWritten) { throw null; }
        public static System.Text.RegularExpressions.Match Synchronized(System.Text.RegularExpressions.Match inner) { throw null; }
    }
    public partial class Regex : System.Runtime.Serialization.ISerializable
    {
+       public ref struct SplitEnumerator
+       {
+           public ReadOnlySpan<char> Current { get { throw null; } }
+           public SplitEnumerator GetEnumerator() { throw null; }
+           public bool MoveNext() { throw null; }
+       }
        protected internal System.Collections.Hashtable caps;
        protected internal System.Collections.Hashtable capnames;
        protected internal int capsize;
        protected internal string[] capslist;
        protected internal System.Text.RegularExpressions.RegexRunnerFactory factory;
        public static readonly System.TimeSpan InfiniteMatchTimeout;
        protected internal System.TimeSpan internalMatchTimeout;
        protected internal string pattern;
        protected internal System.Text.RegularExpressions.RegexOptions roptions;
        protected Regex() { }
        protected Regex(System.Runtime.Serialization.SerializationInfo info, System.Runtime.Serialization.StreamingContext context) { }
        public Regex(string pattern) { }
        public Regex(string pattern, System.Text.RegularExpressions.RegexOptions options) { }
        public Regex(string pattern, System.Text.RegularExpressions.RegexOptions options, System.TimeSpan matchTimeout) { }
        public static int CacheSize { get { throw null; } set { } }
        [System.CLSCompliant(false)]
        protected System.Collections.IDictionary Caps { get { throw null; } set { } }
        [System.CLSCompliant(false)]
        protected System.Collections.IDictionary CapNames { get { throw null; } set { } }
        public System.TimeSpan MatchTimeout { get { throw null; } }
        public System.Text.RegularExpressions.RegexOptions Options { get { throw null; } }
        public bool RightToLeft { get { throw null; } }
        public static string Escape(string str) { throw null; }
+       public static bool TryEscape(ReadOnlySpan<char> str, Span<char> destination, out int charsWritten) { throw null; }
        public string[] GetGroupNames() { throw null; }
        public int[] GetGroupNumbers() { throw null; }
        public string GroupNameFromNumber(int i) { throw null; }
        public int GroupNumberFromName(string name) { throw null; }
        protected void InitializeReferences() { }
        public bool IsMatch(string input) { throw null; }
+       public bool IsMatch(ReadOnlySpan<char> input) { throw null; }
        public bool IsMatch(string input, int startat) { throw null; }
        public static bool IsMatch(string input, string pattern) { throw null; }
        public static bool IsMatch(string input, string pattern, System.Text.RegularExpressions.RegexOptions options) { throw null; }
        public static bool IsMatch(string input, string pattern, System.Text.RegularExpressions.RegexOptions options, System.TimeSpan matchTimeout) { throw null; }
+       public static bool IsMatch(ReadOnlySpan<char> input, string pattern, System.Text.RegularExpressions.RegexOptions options = System.Text.RegularExpressions.RegexOptions.None, System.TimeSpan? matchTimeout = null) { throw null; }
        public System.Text.RegularExpressions.Match Match(string input) { throw null; }
+       public System.Text.RegularExpressions.Match Match(ReadOnlyMemory<char> input) { throw null; }
        public System.Text.RegularExpressions.Match Match(string input, int startat) { throw null; }
        public System.Text.RegularExpressions.Match Match(string input, int beginning, int length) { throw null; }
        public static System.Text.RegularExpressions.Match Match(string input, string pattern) { throw null; }
        public static System.Text.RegularExpressions.Match Match(string input, string pattern, System.Text.RegularExpressions.RegexOptions options) { throw null; }
        public static System.Text.RegularExpressions.Match Match(string input, string pattern, System.Text.RegularExpressions.RegexOptions options, System.TimeSpan matchTimeout) { throw null; }
+       public static System.Text.RegularExpressions.Match Match(ReadOnlyMemory<char> input, string pattern, System.Text.RegularExpressions.RegexOptions options = RegexOptions.None, System.TimeSpan? matchTimeout = null) { throw null; }
        public System.Text.RegularExpressions.MatchCollection Matches(string input) { throw null; }
+       public System.Text.RegularExpressions.MatchCollection Matches(ReadOnlyMemory<char> input) { throw null; }
        public System.Text.RegularExpressions.MatchCollection Matches(string input, int startat) { throw null; }
        public static System.Text.RegularExpressions.MatchCollection Matches(string input, string pattern) { throw null; }
        public static System.Text.RegularExpressions.MatchCollection Matches(string input, string pattern, System.Text.RegularExpressions.RegexOptions options) { throw null; }
        public static System.Text.RegularExpressions.MatchCollection Matches(string input, string pattern, System.Text.RegularExpressions.RegexOptions options, System.TimeSpan matchTimeout) { throw null; }
+       public static System.Text.RegularExpressions.MatchCollection Matches(ReadOnlyMemory<char> input, string pattern, System.Text.RegularExpressions.RegexOptions options = RegexOptions.None, System.TimeSpan? matchTimeout = null) { throw null; }
        public string Replace(string input, string replacement) { throw null; }
        public string Replace(string input, string replacement, int count) { throw null; }
+       public bool TryReplace(ReadOnlySpan<char> input, string replacement, Span<char> destination, out int charsWritten, int count = -1) { throw null; }
        public string Replace(string input, string replacement, int count, int startat) { throw null; }
        public static string Replace(string input, string pattern, string replacement) { throw null; }
        public static string Replace(string input, string pattern, string replacement, System.Text.RegularExpressions.RegexOptions options) { throw null; }
        public static string Replace(string input, string pattern, string replacement, System.Text.RegularExpressions.RegexOptions options, System.TimeSpan matchTimeout) { throw null; }
+       public static bool TryReplace(ReadOnlySpan<char> input, string pattern, string replacement, Span<char> destination, out int charsWritten, System.Text.RegularExpressions.RegexOptions options = System.Text.RegularExpressions.RegexOptions.None, System.TimeSpan? matchTimeout = null) { throw null; }
        public static string Replace(string input, string pattern, System.Text.RegularExpressions.MatchEvaluator evaluator) { throw null; }
        public static string Replace(string input, string pattern, System.Text.RegularExpressions.MatchEvaluator evaluator, System.Text.RegularExpressions.RegexOptions options) { throw null; }
        public static string Replace(string input, string pattern, System.Text.RegularExpressions.MatchEvaluator evaluator, System.Text.RegularExpressions.RegexOptions options, System.TimeSpan matchTimeout) { throw null; }
+       public static bool TryReplace(ReadOnlySpan<char> input, string pattern, System.Text.RegularExpressions.MatchEvaluator evaluator, Span<char> destination, out int charsWritten, System.Text.RegularExpressions.RegexOptions options = System.Text.RegularExpressions.RegexOptions.None, System.TimeSpan? matchTimeout = null) { throw null; }
        public string Replace(string input, System.Text.RegularExpressions.MatchEvaluator evaluator) { throw null; }
        public string Replace(string input, System.Text.RegularExpressions.MatchEvaluator evaluator, int count) { throw null; }
+       public bool TryReplace(ReadOnlySpan<char> input, System.Text.RegularExpressions.MatchEvaluator evaluator, Span<char> destination, out int charsWritten, int count = -1) { throw null; }
        public string Replace(string input, System.Text.RegularExpressions.MatchEvaluator evaluator, int count, int startat) { throw null; }
        public string[] Split(string input) { throw null; }
        public string[] Split(string input, int count) { throw null; }
        public string[] Split(string input, int count, int startat) { throw null; }
+       public SplitEnumerator Split(ReadOnlySpan<char> input, int count = 0) { throw null; }
        public static string[] Split(string input, string pattern) { throw null; }
        public static string[] Split(string input, string pattern, System.Text.RegularExpressions.RegexOptions options) { throw null; }
        public static string[] Split(string input, string pattern, System.Text.RegularExpressions.RegexOptions options, System.TimeSpan matchTimeout) { throw null; }
+       public static SplitEnumerator Split(ReadOnlySpan<char> input, string pattern, RegexOptions options = RegexOptions.None, TimeSpan? matchTimeout = null) { throw null; }
        void System.Runtime.Serialization.ISerializable.GetObjectData(System.Runtime.Serialization.SerializationInfo si, System.Runtime.Serialization.StreamingContext context) { }
        public override string ToString() { throw null; }
        public static string Unescape(string str) { throw null; }
+       public static bool TryUnescape(ReadOnlySpan<char> str, Span<char> destination, out int charsWritten) { throw null; }
        protected bool UseOptionC() { throw null; }
        protected bool UseOptionR() { throw null; }
        protected internal static void ValidateMatchTimeout(System.TimeSpan matchTimeout) { }
    }
}

Discussion points

Ref struct for Match and siblings (Capture & Group).

I had a discussion with Jan offline and he pointed out that we might want to introduce a ref struct MatchValue type that is returned by APIs that take Span/Memory as an input.

The problem with just using the current class Match is that it gives you unsecure access to the Span. For example, you can send the Match object to other thread and start working on the Span there, while the current thread unwinds and frees the memory.
ref struct MatchValue would avoid this issue
(And also saved the alllocation)

The issues with that is that we currently have the following hiearchy: Match --> Group --> Capture and that Groups and Match contain collections of Captures/Groups.

Yes, the flip from class to valuetype tends to be like this. E.g. when we have introduced ValueTask to CoreFX, a bunch of parallel ValueSomething types went with it.
It is a topic for API review discussion
One option is to just not have Span version of the APIs that returns these collections or have callbacks

startat overload

For things like Regex.Replace, the startat argument means copy everything up to startat to the destination Span, and then run regular Replace that does not take startat method. So this looks like a convenience method to me - it saves you from typing a tiny bit of code in rare cases to achieve the same effect.

Should we add these startat convenience overloads for Span also?
If yes, this commit should be reverted ViktorHofer/corefx@bf7d7f9

RegexSplitEnumerator RTL yield order

If you call the Span version of Regex.Split and pass RegexOptions.RightToLeft to it the yield order of the enumerator will also be right to left as we start looking for matches from right to left. The current implementation (which is not an enumerator!) reverses the captured strings before returning.

RegexSplitEnumerator GetEnumerator (see ref diff)

I'm not aware of any other cases in the BCL where we have a GetEnumerator method like this on a struct enumerator. I understand you want it to be able to directly foreach the results without introducing an enumerable struct to serve as the return type, but I'm not sure this is a pattern we want to introduce. You should be sure to highlight this as part of any API review discussion.

@danmoseley
Copy link
Member Author

danmoseley commented Nov 28, 2017

copied danmosemsft's initial post:

Right now System.Text.RegularExpression.* only works against strings. Matching against buffers requires a string be created from the buffer. That string might not be needed again (especially if there's no match).

Motivating example: a grep tool that enumerates files in a directory tree, and scans their content against a pattern. It should not need to create strings except when a line in the buffer matches the pattern.

@stephentoub mentioned that while Regex.IsMatch could likely have a Span overload, ReadOnlyMemory may be a better fit for Regex.Match, as it returns a Match object. Also worth considering is a Stream overload.

This proposal obviously isn't ready for review.

@ViktorHofer @JeremyKuhne

@ViktorHofer could you take on making this into a reviewable proposal? It seems to be a fit for your regex perf mission..

@danmoseley
Copy link
Member Author

@JeremyKuhne presumably your regex filter for GetFiles could operate more efficiently if IsMatch could accept a span?

@ViktorHofer
Copy link
Member

Yes I already started thinking about it. Will need bit more investigation.

@JeremyKuhne
Copy link
Member

Even if we can do this in phases it would be awesome. Presumably it wouldn't be too hard to change RegexRunner.runtext to ReadOnlySpan and that would open up hitting statics on Regex? Haven't looked too deeply, but the key is not the pattern, but the input.

@magol
Copy link

magol commented Dec 20, 2017

+1

@powercode
Copy link

This would really help Select-String in PowerShell. Tried to parallelize it and ran into GC death on string allocations for the inputs.

So +1

@powercode
Copy link

Just as a summary of the Select-String use case:
Run the Match function on every line of a file, given an encoding.

ReadOnlyMemory seems like a good start, but there needs to be some transform from the readonly memory to a enumeration of the decoded char/rune of the specified encoding.

Maybe also provide

IEnumerable<Match> MatchInFile(string fullname, string pattern, Encoding encoding)
IEnumerable<Match> ReplaceInFile(string fullname, string pattern, string replacement, Encoding encoding)

@danmoseley
Copy link
Member Author

cc @JeremyKuhne

@danmoseley
Copy link
Member Author

Some discussion in https://github.com/dotnet/corefx/issues/27267

@powercode for your MatchInFile proposal, please open a new issue using the regular API propsal format. It would need to add significant value over a simple loop over getting lines from a filestream. I'm not sure that it does. If Regex could consume an actual stream, it might. Some discussion in dotnet/corefx#27267.

@danmoseley
Copy link
Member Author

Design in progress but will miss 2.1

@ViktorHofer
Copy link
Member

Presumably it wouldn't be too hard to change RegexRunner.runtext to ReadOnlySpan and that would open up hitting statics on Regex? Haven't looked too deeply, but the key is not the pattern, but the input.

RegexRunner.runtext is exposed (internal protected) and therefore can't be changed easily without breaking already compiled assemblies.

@terrajobst
Copy link
Member

  • We should only expose these APIs if we can handle both, interpreted as well as compiled regexes, because otherwise we'd make compiled regexes even slower than today (because we need to copy the ReadOnlyMemory<char>/ReadOnlySpan<char> to a string).

  • The current API shape has two modes IsMatch which doesn't return any results and Match which returns class instances that point back to the original input. Ideally, we'd like the former to be able operate on spans and the latter over memories. However, this isn't easily doable without hacks (e.g. MemoryOrPinnedSpan) as they both use the same infratructure.

    • We don't like the MemoryOrPinnedSpan type and having to pass pointers internally as this seems like a farm for future AVs.

    • Jan proposed that we add public ref structs like CaptureValue and MatchValue so that we can return spans, but this wouldn't be straight forward to do with the current API shape we where have to return collections of these. This would become messy or force a persistent regex engine where the caller can enumerate the matches.

  • It seems the most sensbile model for the current API shape of regex is to only accept memories. Yes, it's less safe than spans, but it's more flexible and more suitable for the current API shape without crazy hacks in the implementation.

  • If we can change our infastructure to use ref structs internally, we can add an overload for IsMatch that operates over a span instead of a memory.

  • It might be worthwhile looking into what a brand new ideal API surface for Regex would look like (both in terms of reducing allocations as well as using a better internal engine). If that shape is close to what we have, bolting it on top might be best. If it isn't it seems to be better to offer a brand new API in a new namespace.

@powercode
Copy link

Any progress on this?

@electricessence
Copy link

Just my 2 cents but I'm excited for this to be a reality since it seems so logical to expose .ValueSpan as an option on classes and sub classes of Capture. @terrajobst: I wish I had a better understanding of why this could be bad with compiled Regex.

Isn't the real difference this?

string Value => Text.Substring(Index, Length);

vs

ReadOnlySpan<char> ValueSpan => Text.AsSpan().Slice(Index, Length);

I'm definitely not a pro at Spans, but I get the underlying benefit.

@ViktorHofer
Copy link
Member

Right I agree, the ValueSpan change should go into 3.0 if possible. That said, we need an api-review for it first.

@jyrkive
Copy link

jyrkive commented Nov 29, 2019

I have another use case for this proposal.

I'm writing a compiler for a simple scripting language. Provided how simple the language is, I'm planning to just use a simple regex-based tokenizer (an LL-recursive descent lexer or something would be overkill).

Now, I have exactly 25,3 MB worth of text to process. Having to create substrings for every single token means millions of allocations.

(Still, this isn't exactly a major problem for me. Performance isn't critical. I can easily allow corert to spend a few seconds allocating/deallocating strings. I ended up here mainly because I "wanted to do the right thing" and was curious to see why Regex API doesn't accept spans.)

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@danmoseley danmoseley modified the milestones: 5.0.0, Future Jun 18, 2020
@danmoseley danmoseley removed the untriaged New issue has not been triaged by the area owner label Jul 6, 2020
@ViktorHofer ViktorHofer removed their assignment Dec 14, 2020
@michaelherndon
Copy link

Are there any plans to look at adding ReadOnlySpan overloads or a different implementation for Regex for .NET 6?

@danmoseley
Copy link
Member Author

danmoseley commented May 3, 2021

@michaelherndon there are not plans for .NET 6. If anyone is passionate about this, the next step is to finalize an API proposal. That would probably need a fair bit of thought -- see detailed comments by @terrajobst above.

One other thing to consider is that there are 2 regex engines (the mode that compiles at runtime and the interpreter) and any new API would need implementing in all engines. We are considering adding others -- a source generator one, which would be a variation of compiled but at build time using C# instead of ref-emit, and possibly a DFA based engine which would provide stronger execution time guarantees.. Neither of these will be in .NET 6, but the first one is probably funded for post .NET 6, possibly as an out of band release. So with 2, 3 or 4 engines, each based on quite different mechanisms, implementing new API would be some work.

@michaelherndon
Copy link

@danmoseley thanks for the update. That definitely sounds like a bit of work.

@danmoseley
Copy link
Member Author

@michaelherndon could you say more about your scenario?

@michaelherndon
Copy link

I was debating about trying to rewrite C# code based on rails Inflector logic for pluralization and singularization. Inflector uses a fair amount of regular expressions. Humanizer has similar logic. Using ReadOnlySpan for changing code styles and capitalization is pretty straightforward.

Transforming something like Inflector into a parser to leverage ReadOnlySpan would be a bit of work, so I looking to see if Regex started implementing it to leave it as-is and still benefit from lowering allocations/copies of strings.

@moh-hassan
Copy link

Four years passed and the issue is still open. I hope it's implemented in NET6.

@terrajobst
Copy link
Member

At this point it seems unlikely, given that we still don't have a design.

@electricessence
Copy link

Uhg. Why! Why can't this happen?! It's such a logical step. Why can't we just provide a .ValueSpan (readonly) as part of groups/captures?

@terrajobst
Copy link
Member

terrajobst commented Jul 9, 2021

See my comment above. It's not that it can't, just that it requires design work. And generally speaking the runway for .NET 6 is closing.

@michaelherndon
Copy link

As the rando internet person that bumped the thread: I get the frustration of a 4-year-old feature request and obviously I do see the benefits of Regex having lower allocations. Conversely, I also get that only so many things can go into the release of a widely used framework that values backwards compatibility. Even for a company like Microsoft, there are still only so many resources and hours.

Personally, I'd still take the improvements in System.Security, the new Metrics API, or the elusive Assembly Neutral Interfaces over a new Regex API surface.

@electricessence
Copy link

Ok... So I have to admit that I'm a bit frustrated at this one.
I've observed that it would be easy to write extensions for this, BUT Capture.Text is internal.
If it was exposed, then implementing my own .AsSpan() would obviously be easy. Right now, I'm crippled to passing the original source in order to get a ReadOnlySpan<char> or ReadOnlyMemory<char>.
Is there a serious reason the original .Text property is hidden?
Why not just write these simple extensions in .NET now?

@BenjaBobs
Copy link

This is probably not the solution you're looking for, but since the source is available, you could do some surgery:

public static class CaptureSurgery
{
    private static readonly PropertyInfo _textProperty = typeof(Capture).GetProperty("Text", BindingFlags.Instance | BindingFlags.NonPublic);

    public static ReadOnlySpan<char> AsSpan(this Capture capture)
    {
        return ((string)_textProperty.GetValue(capture)).AsSpan();
    }
}

Although since the point is improved performance, this is hardly a solution.

@svick
Copy link
Contributor

svick commented Aug 13, 2021

@BenjaBobs If you're worried about performance, create and cache a delegate to the getter. Also, you need to slice the text to get the value of the capture:

public static class CaptureSurgery
{
    private static readonly Func<Capture, string> _textDelegate =
        typeof(Capture).GetProperty("Text", BindingFlags.Instance | BindingFlags.NonPublic)
            .GetGetMethod(nonPublic: true)
            .CreateDelegate<Func<Capture, string>>();

    public static ReadOnlySpan<char> AsSpan(this Capture capture) =>
        _textDelegate.Invoke(capture).AsSpan(capture.Index, capture.Length);
}

@electricessence
Copy link

@BenjaBobs @svick I didn't know that was possible! That's awesome to know. Thank you!

@electricessence
Copy link

electricessence commented Aug 13, 2021

And just a note:
Performance? YES (obviously that's important).
Superior memory usage? YES (that's the whole point of this).

@electricessence
Copy link

@svick What version of .NET has the generic call for .CreateDelegate<T>()? I'm implementing this in .NET Standard.

@silkfire
Copy link

@electricessence As of right now only .NET, i.e. 5 and 6+.

@electricessence
Copy link

@silkfire Got it to work in .NET Standard without the generic methods. I wonder if there's a boxing cost.

@stephentoub
Copy link
Member

@joperezr, this overlaps with the span work you already did for 7 and with #65011. We should determine if there are any other APIs from this we want to add in 7 as part of the span effort or if it's all covered and can be closed.

@stephentoub stephentoub modified the milestones: Future, 7.0.0 Mar 11, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Apr 9, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Apr 13, 2022
@ghost ghost locked as resolved and limited conversation to collaborators May 13, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.Text.RegularExpressions
Projects
None yet
Development

Successfully merging a pull request may close this issue.