Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.
/ corefx Public archive

Adds a new regex option - RegexOptions.AnyNewLine. #41195

Closed

Conversation

shishirchawla
Copy link

@shishirchawla shishirchawla commented Sep 19, 2019

Adds a new regex option - RegexOptions.AnyNewLine, which supports '\r' (old mac), '\n' (unix), and '\r\n' (windows) as line endings for the '$' (EOL) operator.

Fixes #28410

@dnfclas
Copy link

dnfclas commented Sep 19, 2019

CLA assistant check
All CLA requirements met.

@shishirchawla shishirchawla changed the title Dev/shchawl/anynewline Adds a new regex option - RegexOptions.AnyNewLine. Sep 19, 2019
@danmoseley
Copy link
Member

Thanks @shishirchawla! Added @ViktorHofer for review

break;
if (rightChars == 1 && CharAt(Textpos()) != '\r' && CharAt(Textpos()) != '\n')
break;
if (rightChars == 2 && (CharAt(Textpos()) != '\r' || CharAt(Textpos()+1) != '\n'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, spaces around +

@@ -22,5 +22,6 @@ public enum RegexOptions

ECMAScript = 0x0100, // "e"
CultureInvariant = 0x0200,
AnyNewLine = 0x0400, // "a", Treat "$" as (?=[\r\n]|\z)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in the issue (?=[\r\n]|\z) matches "any one of \r, \n, or end of input" whereas what is intended is "any one of \r, \n, \r\n, or end of input". Plus, only end of input if Multiline is not also set (?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it needs a more precise definition, which I'm not clear about so hopefully someone can help clarify it in discussion here.

@danmoseley
Copy link
Member

danmoseley commented Oct 5, 2019

Is this right?

flags $ is treated as
neither (?=\n\z|\z)
RegexOptions.Multiline (?=\n)
RegexOptions.Multiline | RegexOptions.AnyNewLine (?=\r\n|\r|\n)
RegexOptions.AnyNewLine (?=\r\n\z|\r\z|\n\z|\z)

@jzabroski @svick

@danmoseley
Copy link
Member

What about the behavior of \Z (capital Z) -- docs say, "\Z | The match must occur at the end of the string or before \n at the end of the string.".

seems like $ where it ignores RegexOptions.Multiline. Should it respect RegexOptions.AnyNewLine, ie

flags \Z is treated as
RegexOptions.AnyNewLine (?=\r\n\z|\r\z|\n\z|\z)
any case without above flag (?=\n\z|\z)

@jzabroski
Copy link
Contributor

Dan,

The intent of AnyNewLine is to allow engineers a succinct, portable way to match Unix or Windows style line endings.

Your recommendation seems to be in line with the intent, and am thankful for your watchful eye and feedback.

Is it too late to revise #28410 to include your addition? I still want to champion this as overall API improvement and see your remarks as further improvement.

That said, what becomes the specification of \Z anchor?

I think I should write up in plain English how this will end up appearing on https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference - I hate judging whether a specification change makes sense in a PR.

I feel we can all agree the \n in the current doc isn't Linux friendly or portable. The question is how to make the effects of AnyNewLine intuitive and also easy to read.

I'll try to summarize this tomorrow evening EST time.

Copy link
Contributor

@jzabroski jzabroski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shishirchawla Separate from @danmosemsft comments: It looks like you changed the specification to match old Mac style line endings, too. This explodes the number of test cases you have to write, since every | in the Regex increases the base number that is being raised to the power of 2. It also reduces performance. Am I missing discussion somewhere about adding support for that? I re-read through all the comments sans watching the API Review video, which I will do tonight when I have access to audio (machine locked down at work / no speakers), and I don't see any discussion of supporting Mac-style.

Finally, I see what you did that resulted in Dan's comments: You added AnyEndZ. This is probably a good improvement but should be discussed outside the PR - looking at this from the level of code is a distraction - I will update the original post in #28410 with these notes.

Please also make sure you cover the test given by @svick as this was one test that tripped up an API Reviewer. https://github.com/dotnet/corefx/issues/28410#issuecomment-392884975

@danmoseley
Copy link
Member

danmoseley commented Oct 7, 2019

@jzabroski thanks for updating the original proposal. If my two tables above correctly reflect the desired outcome -- perhaps you could include those as well (or a similar table)? It would have made it clearer to me, than trying to parse from the text description.

Wrt process, we do need an ack from at least @terrajobst to deviate from what was approved (although as I say, it wasn't 100% clear to me what that was)

@jzabroski
Copy link
Contributor

@danmosemsft I agree with you. With due respect, I didn't want to just copy the table you gave because I wanted to verify it myself and independently construct it to make sure you didn't introduce a bug. That is also what I meant in the OP about looking at the documentation and making sure we have the user stories right from a docs perspective, and that it'll be clear to users how this works. In the OP, I introduced today the phrase "Anchor Modifier" which to my knowledge isn't used anywhere in the docs previously, despite MultiLine option being an "Anchor Modifier" in that... it modifies the meaning of the anchor.

Hope that's fair. Here is the table I have in mind to fill out:

flags $ is treated as $ documentation \Z is treated as \Z documentation
neither
RegexOptions.Multiline
RegexOptions.Multiline | RegexOptions.AnyNewLine
RegexOptions.AnyNewLine

This puts everything for the reviewer in one place, so they're not flipping back and forth trying to figure out if the spec is right, which I'm imaginging both you and I concentrated hard on recently 👍

@danmoseley
Copy link
Member

@jzabroski that sounds even better. Yes don't rely on anything I wrote without verification 😄

@shishirchawla
Copy link
Author

Hi @jzabroski , I think I saw the following comment - https://github.com/dotnet/corefx/issues/28410#issuecomment-376611805, that mentions the most commonly used line endings ('\n', '\r\n', and '\r') and just went with the assumption that it will be a good idea to deal with all three cases. Please let me know if you’d like me to remove the '\r' option.

Regarding the AnyEndZ option, I think it makes sense to have it in parallel with the "$" modifier as when the MultiLine option is not used, the logic basic defaults to "\Z" so to me it makes sense to alter the definition of "\Z" in the same PR.

I will go over the test case that you mentioned and make sure that everythings working as expected.

@jzabroski
Copy link
Contributor

jzabroski commented Oct 8, 2019

Yes, but I think @terrajobst needs to approve the AnyEndZ opcode and the right place to do that is in the issue, not the PR - think of this from the perspective of the Microsoft Docs team and their job to update documentation for the end-user developers who will use this API. It's not clear from the Video Review of the API if Immo said something about \Z, and his notes didn't communicate such.

The main thing I got from watching the video review is it took people awhile to understand the summary table explaining the current problem and how AnyNewLine should solve it.

Separately, last night I read through about 70 pages of Jeffrey Freidl's Mastering Regular Expressions and his coverage of line anchors to take a step back and make sure I wasn't missing anything, and also thinking how to communicate this to end-user developers. I'm not kidding when I say literally >10% of Jeffrey's book covers how to capture the end of a line correctly. I documented some of that here: https://github.com/jzabroski/Home/blob/master/RegularExpressions/README.md

@danmoseley
Copy link
Member

@terrajobst can you help unblock @shishirchawla here pls?

@danmoseley
Copy link
Member

Per discussion in the issue, we do want \Z to be affected as well. @jzabroski I've filled in your table. Does this look correct to you?

flags $ is treated as $ documentation \Z is treated as \Z documentation
neither (?=\n\z|\z) The match must occur at the end of the string or before \n at the end of the string. (Same as $ with this option.) (Same as $ with this option.)
RegexOptions.Multiline (?=\n|\n\z|\z) The match must occur at the end of the string or before \n anywhere in the string. (?=\n\z|\z) The match must occur at the end of the string or before \n at the end of the string.
RegexOptions.Multiline | RegexOptions.AnyNewLine (?=\r\n|\r|\n|\r\n\z|\r\z|\n\z|\z) The match must occur at the end of the string or before \r\n, \n or \r anywhere in the string. (?=\r\n\z|\r\z|\n\z|\z) The match must occur at the end of the string or before \r\n, \n or \r at the end of the string.
RegexOptions.AnyNewLine (?=\r\n\z|\r\z|\n\z|\z) The match must occur at the end of the string or before \r\n, \n or \r at the end of the string. (Same as $ with this option.) (Same as $ with this option.)

@jzabroski
Copy link
Contributor

Yes, that is fine. It adds the slight mod of accepting old Mac OS style line endings, which I suppose is OK. Otherwise, the table is perfect. Thanks for doing that :) Been feeling overwhelmed running two open source projects and doing my main job / running a business. If you're ever in Boston, I owe you a beer 🥂

@danmoseley
Copy link
Member

@shishirchawla if you haven't lost interest, which would be understandable, I think we have all we need to finish this PR. Mostly just clearer tests I guess.

@shishirchawla
Copy link
Author

Thanks @danmosemsft . I will get started on this soon.

@maryamariyan
Copy link
Member

maryamariyan commented Nov 6, 2019

Thank you for your contribution. As announced in dotnet/coreclr#27549 this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after this date, the easiest way to move the change to dotnet/runtime is:

  1. In your corefx repository clone, create patch by running git format-patch origin
  2. In your runtime repository clone, apply the patch by running git apply --directory src/libraries <path to the patch created in step 1>

@maryamariyan
Copy link
Member

maryamariyan commented Nov 8, 2019

Hey @shishirchawla don't forget to also click on the link posted by dnfclas bot here: #41195 (comment)
to sign contributor license agreement. Needed before the PR can be merged.

@shishirchawla
Copy link
Author

Hi @danmosemsft /@jzabroski , I have added the missing test case and the required modifications for it. Please let me know if youd like me to add some specific tests.

yield return new object[] { @"$", "line1\nline2\nline3\nline4\r\n", RegexOptions.RightToLeft | RegexOptions.AnyNewLine, 0, 25, true, "" };

// AnyNewLine | Multiline ('.' will match everything except \r and \n)
yield return new object[] { @".*$", "foo\r\nbar", RegexOptions.AnyNewLine | RegexOptions.Multiline, 0, 8, true, "foo" };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @shishirchawla . Looks good to me.

@maryamariyan
Copy link
Member

maryamariyan commented Nov 11, 2019

hey @shishirchawla. There's some errors showing up on the CI that need fixing. You can check them out here: https://github.com/dotnet/corefx/pull/41195/checks?check_run_id=297038224

@jzabroski
Copy link
Contributor

@danmosemsft I added your wonderful table to the top of the original issue https://github.com/dotnet/corefx/issues/28410 - thanks again!

@shishirchawla
Copy link
Author

Hey @maryamariyan, I see there are some new compiler errors. Will look at this soon. Thanks for letting know.

@ViktorHofer ViktorHofer added the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Nov 14, 2019
@maryamariyan
Copy link
Member

Thank you for your contribution. As announced in #27549 the dotnet/runtime repository will be used going forward for changes to this code base. Closing this PR as no more changes will be accepted into master for this repository. If you’d like to continue working on this change please move it to dotnet/runtime.

@karelz karelz added this to the 5.0 milestone Dec 19, 2019
@danmoseley
Copy link
Member

@shishirchawla do you plan to move this PR to dotnet/runtime repo so we can get it finished? If not maybe someone else will volunteer.

@jzabroski
Copy link
Contributor

I would be willing to do it if @shishirchawla has lost interest, but I would like to see Shishir get the credit for the work. It's been on my mind.

@shishirchawla
Copy link
Author

Hi @danmosemsft @jzabroski, I just created a PR for this in the run time repo dotnet/runtime#1449

@jzabroski
Copy link
Contributor

AWESOME!! Thank you so much! If you're ever in Boston, 🍹 🍸 at the Mandarin Oriental, on me.

@danmoseley
Copy link
Member

dotnet/runtime#25598

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Text.RegularExpressions * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants