-
Notifications
You must be signed in to change notification settings - Fork 4.9k
Adds a new regex option - RegexOptions.AnyNewLine. #41195
Conversation
…Node.AnyEndZ and RegexNode.AnyEol
… Compiler, and minor other fixes.
Thanks @shishirchawla! Added @ViktorHofer for review |
break; | ||
if (rightChars == 1 && CharAt(Textpos()) != '\r' && CharAt(Textpos()) != '\n') | ||
break; | ||
if (rightChars == 2 && (CharAt(Textpos()) != '\r' || CharAt(Textpos()+1) != '\n')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, spaces around +
@@ -22,5 +22,6 @@ public enum RegexOptions | |||
|
|||
ECMAScript = 0x0100, // "e" | |||
CultureInvariant = 0x0200, | |||
AnyNewLine = 0x0400, // "a", Treat "$" as (?=[\r\n]|\z) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in the issue (?=[\r\n]|\z)
matches "any one of \r, \n, or end of input" whereas what is intended is "any one of \r, \n, \r\n, or end of input". Plus, only end of input if Multiline is not also set (?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it needs a more precise definition, which I'm not clear about so hopefully someone can help clarify it in discussion here.
Is this right?
|
What about the behavior of seems like $ where it ignores RegexOptions.Multiline. Should it respect RegexOptions.AnyNewLine, ie
|
Dan, The intent of AnyNewLine is to allow engineers a succinct, portable way to match Unix or Windows style line endings. Your recommendation seems to be in line with the intent, and am thankful for your watchful eye and feedback. Is it too late to revise #28410 to include your addition? I still want to champion this as overall API improvement and see your remarks as further improvement. That said, what becomes the specification of \Z anchor? I think I should write up in plain English how this will end up appearing on https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference - I hate judging whether a specification change makes sense in a PR. I feel we can all agree the \n in the current doc isn't Linux friendly or portable. The question is how to make the effects of AnyNewLine intuitive and also easy to read. I'll try to summarize this tomorrow evening EST time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shishirchawla Separate from @danmosemsft comments: It looks like you changed the specification to match old Mac style line endings, too. This explodes the number of test cases you have to write, since every |
in the Regex increases the base number that is being raised to the power of 2. It also reduces performance. Am I missing discussion somewhere about adding support for that? I re-read through all the comments sans watching the API Review video, which I will do tonight when I have access to audio (machine locked down at work / no speakers), and I don't see any discussion of supporting Mac-style.
Finally, I see what you did that resulted in Dan's comments: You added AnyEndZ. This is probably a good improvement but should be discussed outside the PR - looking at this from the level of code is a distraction - I will update the original post in #28410 with these notes.
Please also make sure you cover the test given by @svick as this was one test that tripped up an API Reviewer. https://github.com/dotnet/corefx/issues/28410#issuecomment-392884975
@jzabroski thanks for updating the original proposal. If my two tables above correctly reflect the desired outcome -- perhaps you could include those as well (or a similar table)? It would have made it clearer to me, than trying to parse from the text description. Wrt process, we do need an ack from at least @terrajobst to deviate from what was approved (although as I say, it wasn't 100% clear to me what that was) |
@danmosemsft I agree with you. With due respect, I didn't want to just copy the table you gave because I wanted to verify it myself and independently construct it to make sure you didn't introduce a bug. That is also what I meant in the OP about looking at the documentation and making sure we have the user stories right from a docs perspective, and that it'll be clear to users how this works. In the OP, I introduced today the phrase "Anchor Modifier" which to my knowledge isn't used anywhere in the docs previously, despite MultiLine option being an "Anchor Modifier" in that... it modifies the meaning of the anchor. Hope that's fair. Here is the table I have in mind to fill out:
This puts everything for the reviewer in one place, so they're not flipping back and forth trying to figure out if the spec is right, which I'm imaginging both you and I concentrated hard on recently 👍 |
@jzabroski that sounds even better. Yes don't rely on anything I wrote without verification 😄 |
Hi @jzabroski , I think I saw the following comment - https://github.com/dotnet/corefx/issues/28410#issuecomment-376611805, that mentions the most commonly used line endings ('\n', '\r\n', and '\r') and just went with the assumption that it will be a good idea to deal with all three cases. Please let me know if you’d like me to remove the '\r' option. Regarding the AnyEndZ option, I think it makes sense to have it in parallel with the "$" modifier as when the MultiLine option is not used, the logic basic defaults to "\Z" so to me it makes sense to alter the definition of "\Z" in the same PR. I will go over the test case that you mentioned and make sure that everythings working as expected. |
Yes, but I think @terrajobst needs to approve the AnyEndZ opcode and the right place to do that is in the issue, not the PR - think of this from the perspective of the Microsoft Docs team and their job to update documentation for the end-user developers who will use this API. It's not clear from the Video Review of the API if Immo said something about \Z, and his notes didn't communicate such. The main thing I got from watching the video review is it took people awhile to understand the summary table explaining the current problem and how AnyNewLine should solve it. Separately, last night I read through about 70 pages of Jeffrey Freidl's Mastering Regular Expressions and his coverage of line anchors to take a step back and make sure I wasn't missing anything, and also thinking how to communicate this to end-user developers. I'm not kidding when I say literally >10% of Jeffrey's book covers how to capture the end of a line correctly. I documented some of that here: https://github.com/jzabroski/Home/blob/master/RegularExpressions/README.md |
@terrajobst can you help unblock @shishirchawla here pls? |
Per discussion in the issue, we do want
|
Yes, that is fine. It adds the slight mod of accepting old Mac OS style line endings, which I suppose is OK. Otherwise, the table is perfect. Thanks for doing that :) Been feeling overwhelmed running two open source projects and doing my main job / running a business. If you're ever in Boston, I owe you a beer 🥂 |
@shishirchawla if you haven't lost interest, which would be understandable, I think we have all we need to finish this PR. Mostly just clearer tests I guess. |
Thanks @danmosemsft . I will get started on this soon. |
Thank you for your contribution. As announced in dotnet/coreclr#27549 this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after this date, the easiest way to move the change to dotnet/runtime is:
|
Hey @shishirchawla don't forget to also click on the link posted by dnfclas bot here: #41195 (comment) |
Hi @danmosemsft /@jzabroski , I have added the missing test case and the required modifications for it. Please let me know if youd like me to add some specific tests. |
yield return new object[] { @"$", "line1\nline2\nline3\nline4\r\n", RegexOptions.RightToLeft | RegexOptions.AnyNewLine, 0, 25, true, "" }; | ||
|
||
// AnyNewLine | Multiline ('.' will match everything except \r and \n) | ||
yield return new object[] { @".*$", "foo\r\nbar", RegexOptions.AnyNewLine | RegexOptions.Multiline, 0, 8, true, "foo" }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @shishirchawla . Looks good to me.
hey @shishirchawla. There's some errors showing up on the CI that need fixing. You can check them out here: https://github.com/dotnet/corefx/pull/41195/checks?check_run_id=297038224 |
@danmosemsft I added your wonderful table to the top of the original issue https://github.com/dotnet/corefx/issues/28410 - thanks again! |
Hey @maryamariyan, I see there are some new compiler errors. Will look at this soon. Thanks for letting know. |
Thank you for your contribution. As announced in #27549 the dotnet/runtime repository will be used going forward for changes to this code base. Closing this PR as no more changes will be accepted into master for this repository. If you’d like to continue working on this change please move it to dotnet/runtime. |
@shishirchawla do you plan to move this PR to dotnet/runtime repo so we can get it finished? If not maybe someone else will volunteer. |
I would be willing to do it if @shishirchawla has lost interest, but I would like to see Shishir get the credit for the work. It's been on my mind. |
Hi @danmosemsft @jzabroski, I just created a PR for this in the run time repo dotnet/runtime#1449 |
AWESOME!! Thank you so much! If you're ever in Boston, 🍹 🍸 at the Mandarin Oriental, on me. |
Adds a new regex option - RegexOptions.AnyNewLine, which supports '\r' (old mac), '\n' (unix), and '\r\n' (windows) as line endings for the '$' (EOL) operator.
Fixes #28410