-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regex Match, Split and Matches should support RegexOptions.AnyNewLine as (?=\r\z|\n\z|\r\n\z|\z) #25598
Comments
I don't like that this would depend on Here is a table comparing the options (✓ means processing files with the given line-ending style works well on the given OS, ✗ means it does not):
|
Petr, that is exactly what I meant if you looked at how I defined it:
(?=[\r\n|\z) can match any mix of line endings, even files trashed by mixed
line endings. AnyNewline is a MUCH better name than what I came up with.
Love teamwork!
Adding a portable sigil for Environment.NewLine probably isn't a bad idea,
but the only pragmatic way I could think of doing it would be to export all
Regex sigils and let the consumer of the API reprogram the sigil mapping. I
thought about this, but felt it was too complicated and also ruin the
tooling around current .NET regular expressions. Sites like RegexHero make
writing Regex so easy that I wanted a solution that would be easy for tools
to adopt.
The only gotchas I typically encounter are forgetting to escape a pipe, and
then forgetting about how line ending semantics works. Tools help me with
the former, but the latter always trips me up.
…On Sat, Mar 24, 2018, 8:32 AM Petr Onderka ***@***.***> wrote:
I don't like that this would depend on Environment.NewLine. Transferring
files between different OSes is common, so I think there is a good chance
you will need to process files with different line endings than those that
are native to your OS. Instead, I think a setting that would match any line
ending (independent of the current OS) would be better. It could be called
something like AnyNewLine.
Here is a table comparing the options (✓ means processing files with the
given line-ending style works well on the given OS, ✗ means it does not):
OS Line-ending style Current EnvironmentNewLine AnyNewLine
Windows Windows ✗ ✓ ✓
Windows Unix ✓ ✗ ✓
Unix Windows ✗ ✗ ✓
Unix Unix ✓ ✓ ✓
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<https://github.com/dotnet/corefx/issues/28410#issuecomment-375884537>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAbT_WOjuiqujVh1zPVyNau26LElEe-aks5thjzMgaJpZM4S473p>
.
|
Actually, I see now that in my original description my description of the behavior is inconsistent. Thank you for that wonderful table. I will update the first post with it. |
I like the idea and its cross-platform benefit.
How is any line ending defined? The name suggests that it can handle any line ending, are there more than Unix and Windows line ending chars? We could probably refine the name. |
According to Wikipedia's summary of the Unicode standard, there are 8 character sequences that form a line break (7 single characters, including CR and LF, and the sequence CR+LF). Though I think those other characters are not widely used, so supporting only either LF and CR+LF (Unix and Windows) or CR, LF and CR+LF (old Mac, Unix and Windows) would be fine, especially if it lead to better performance than fully supporting the Unicode definition of line break.
I don't understand this regex, as far as I can tell, it wouldn't match |
I'll write some test cases with XUnit and use that as a specification. We can then tweak the test cases to come up with a final specification/acceptance criteria. Fare could also be used to confirm. |
potential hackathon candidate if we review the API addition in time. cc @karelz |
Looks good. A few comments:
|
I don't think the table is wrong, see the documentation linked in the top post, where it says that you should use You can also try running code like the following: var match = Regex.Match("foo\r\nbar", ".*$", RegexOptions.Multiline);
foreach (char c in match.Value)
Console.WriteLine($"{c} (0x{(int)c:X2})"); For me, running it on .Net Core 2.1 RTM on Windows 10 prints:
While if this was "working well", I would expect to get just the three characters of |
@terrajobst I agree with @svick . Intuitively, all his table is saying is that neither $ nor Environment.NewLine is fully portable solution for the two ubiquitous platforms. AnyNewLine therefore proposes to fill that gap. |
I'll update the original post to modify the enum options to be 0x0400 (1024) |
Taking a stab at this for the Hackaton |
Ok, assigned you. |
Had some fun this afternoon going through the internals of I pushed what I have so far here: Mpdreamz/corefx@4c4d688 I might pick it up for fun at some point again, feel free to unassign me 😄 |
wow that's great progress! it's up to you if you want to continue working on it or pass it on to me / other community members. |
Unassigning for someone else to look. Thanks for the start! |
Hi @danmosemsft, I would like to have a go at this. Can you please get this assigned to me. Thanks. |
@shishirchawla sounds great, go ahead! BTW I sent you a collaborator invite. When you accept LMK and I will assign this formally. |
Assigned, it's all yours @shishirchawla work on any cadence you want and just post here if you have any questions or issues. |
Hi @danmosemsft / @jzabroski , Heres my PR for the new API - dotnet/corefx#41195 . This is my first PR on the project, so please bare with me. Looking forward to your feedback. Shishir |
@danmosemsft @terrajobst @ViktorHofer @svick It seems that during the process of implementing this feature, @shishirchawla discovered an area to improve the API a tiny bit further. If you go to the Anchors documentation, there are two anchors, I like @shishirchawla 's improvement, but I think the final focus should be on how this is communicated in the Docs, such as https://docs.microsoft.com/en-us/dotnet/standard/base-types/anchors-in-regular-expressions and ensuring full test coverage. |
@terrajobst could you please give your take on the |
@danmosemsft So sorry I'm a foot dragger on this. Been trying to keep up with my own open source project, life, work, etc. Relevant examples / test cases: https://docs.microsoft.com/en-us/dotnet/standard/base-types/anchors-in-regular-expressions#end-of-string-or-line- The examples are a bit idiosyncratic in that they reference |
ping @terrajobst |
The name |
@danmosemsft Thanks for putting this together. This issue has been on my mind for some time, but life keeps getting in the way! The main reason I did not put this on my plate is precisely the risk of causing a regression in behavior that you highlight by the thirteen different places that need to be considered. Perhaps I am too detail-oriented, but as we are nearing .NET Core 5, my thought was, "Maybe this should wait for the next minor version?" In terms of your design proposal, I agree that making the code long-term easier to understand and maintain is a big win. Having the parser handle the phrase as a token makes sense, and also cuts down the risk of a major regression due to unforeseen feature combinations. It would also be easier to lift back up to the interpreter / compiler the token. Plus, you could have a compiler that analyzes the parse tree to determine if it is an "extended Regular Expression" or actual "regular language Regular Expression." Similar to how .NET Expression tree API allows preferInterpretation. In most cases, it would be convenient to have the high-level language services just transparently do what's optimal. |
@stephentoub For .NET vNext, can you comment on @danmosemsft ideas to lift this: #25598 (comment) |
I like the simplicity and isolated nature of that approach, and I think it's reasonable to start, at least for a prototype, and see a) whether it's functionally correct with all the various new tests we throw at it, and b) what the performance hit is without further optimization. Assuming it's functionally sound and the perf hit is reasonable, it sounds like a good way to go for now; if in the future the perf of it became more of an issue, we could address that in a variety of ways, either by adding more optimizations for the patterns it employs, or by switching away from doing it purely in the parser as a lowering step. |
Now we just need a volunteer... |
(Just got back from vacation.) I'm interested in doing it, and think I almost got my two open source projects in a state where they're ready for v5. Anyway, I think it would be a lot of fun to do - if I use that word. :) |
Great, I assigned you @jzabroski . As @stephentoub mentions, you'll probably want to start by prototyping and getting perf measurements. Thanks! |
@jzabroski I cleared your assignment. |
@danmoseley Thanks. Sorry about not finishing this. I caught Covid shortly after you assigned this and my time for additional work outside my normal hours diminished (long recovery). |
no problem! sorry to hear about that 😿 you're always welcome to contribute here or to anything else in the repo. |
Returning to this briefly, I wanted to point out that the behavior described in the top post is not what we want. It says that "The match must occur at the end of the string or before \r\n, \n or \r at the end of the string." That is not desirable, because That's why in my table above, I have $ equivalent to (effectively) |
@stephentoub I hacked up what lowering might look like -- see https://github.com/dotnet/runtime/compare/main...danmoseley:runtime:anynewline?expand=1#diff-548465bb4dbdd05024f0676c944e3c2b89677c48257bf28dfb0afccde31ded0aR19 I wrapped the pattern in an iterator with the idea that it could show "real" locations in the pattern in error messages, but not sure whether it's worth that abstraction. Do you think this is a worthwhile direction? Edit: The approach makes those handful of tests pass. I'd need to do more tests/inspection to see whether there's other positional state that would need correction on inserting into the pattern like this, and update that too. I suspect perf with this lowering approach may be more of a worry than the parser. Once the flag exists, folks may update existing patterns to use it, which may then be a perf trap if they are "hot". Possibly adding \R would be less of a problem in this respect as it would require modifying the pattern, so less "easy", and by putting \R in a non capturing group or an assertion as appropriate you can achieve some of the same things as AnyNewLine albeit potentially in a clumsy way (but a standard way). So a possible approach might be to implement \R for .NET 8 then at a future point add AnyNewLine but properly, as it's own node type that the engines accommodate. (In this model \R would match any new line out of the gate, but that's the same as PCRE: by default in PCRE, . $ \Z recognize only \n but \R recognizes all newline flavors.) |
I didn't like the parser approach -- it's clumsy and will run slow. working in https://github.com/dotnet/runtime/compare/main...danmoseley:anynewline2?expand=1 |
You mean someone else would add it for the other engines? |
Potentially, in future, with care to avoid impacting perf in existing cases. As AnyNewLine would be new, I think it's OK if only the new engine supports it (meaning it will only work for static patterns) initially. Just checking for more than \n is easy, the complexity is dealing with \r\n atomically |
What are your thoughts, would it be acceptable to only support for static patterns (generator) |
I'm not a fan of that. At a minimum, when we support it, it should be in all but the NonBacktracking engine, but ideally all of them. There are currently zero differences in supported functionality between the interpreter, compiler, and source generator; I'm not excited to add any. Plus, there are cases where the source generator will fall back to the other engines, eg case-insensitive back references, so whether this new option would work would be very confusing. |
From MSDN:
This is by far one of the biggest gotchas with using .NET Regex class.
I suggest adding a RegexOptions.AnyNewLine which treats $ as matching both Windows' Environment.NewLine and UNIX' Environment.NewLine, regardless of the Environment running corefx.
Portability concerns
According to Wikipedia, there are a ton of different operating systems, all with different line ending settings. The current implementation hardcodes Unix line-ending style. RegexOptions.AnyNewLine, defined as (?=[\r\n]|\z), would add support for Windows line-ending style.
The advise written in the current docs is actually not portable on Unix, which is becoming a more popular option. As it is suggested, \r?$ will capture one or two lines on Unix, and one on Windows. If you try running Windows assemblies with this hack on Linux, you will change the semantics of programs.
Backward compatibility concerns
Fully backward compatible: This RegexOptions enum extension would not be a default, and so it would not break any clients with reasonably written code. The only existing code that might display different behavior would be reflection code that sets every option on RegexOptions enum variable. I really can't envision anyone doing this on purpose.
Here is Petr Onderka (@svick)'s summary:
Api Proposal
edit by @ViktorHofer
API Review Notes
Video
Looks good. A few comments:
We cannot use the proposed value of 128 because it's already taken (see #if DBG in code)Spec updated so thatAnyNewLine = 0x400
(1024).The table looks wrong (Windows on Windows on the Current should work IMHO)The table is correct. The fact this trips up experts just speaks to why this is a profound GOTCHA in the Core SDK.May be AcceptAllLineEndings?Some hallway testing I've done indicates AnyNewLine is a good name. Plus, (argument after the final name was chosen) this enumeration will be transliterated into a checkbox on Regular Expression Visualization tools like Regex Hero, so it is preferable to have a concise explanation for the feature to avoid excessive screen space.PR Review Notes
After work had started on the approved proposal, @danmosemsft asked if the scope of this feature should be changed to also adjust the meaning
\Z
anchor. @jzabroski suggested writing how the end user documentation will look after this change, as good docs will determine if it is a function step improvement in usability and reducing gotchas.Also, during the PR, it seems @shishirchawla also proposed AnyEndZ as a way to use AnyNewLine as an "Anchor Modifier", which will alter the meaning of '\Z' anchor in addition to altering the meaning of '$' anchor. The intent of this improvement appears to be to remove all platform-specific language from the Anchors documentation, which seems like a great improvement.
AnyNewLine as Anchor Modifier to \Z and $ Anchors
(?=\n\z|\z)
\n
at the end of the string.$
with this option.)$
with this option.)RegexOptions.Multiline
(?=\n|\n\z|\z)
\n
anywhere in the string.(?=\n\z|\z)
\n
at the end of the string.RegexOptions.Multiline | RegexOptions.AnyNewLine
(?=\r\n|\r|\n|\r\n\z|\r\z|\n\z|\z)
\r\n
,\n
or\r
anywhere in the string.(?=\r\n\z|\r\z|\n\z|\z)
\r\n
,\n
or\r
at the end of the string.RegexOptions.AnyNewLine
(?=\r\n\z|\r\z|\n\z|\z)
\r\n
,\n
or\r
at the end of the string.$
with this option.)$
with this option.)The text was updated successfully, but these errors were encountered: