Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clarify newline behavior of regex #35751

Merged
merged 28 commits into from
Jun 25, 2023
Merged

clarify newline behavior of regex #35751

merged 28 commits into from
Jun 25, 2023

Conversation

@danmoseley danmoseley requested a review from stephentoub June 11, 2023 20:00
@danmoseley danmoseley requested a review from adegeo as a code owner June 11, 2023 20:00
@dotnet-bot dotnet-bot added this to the June 2023 milestone Jun 11, 2023
@danmoseley
Copy link
Member Author

cc @Tisit

@danmoseley danmoseley requested a review from a team June 11, 2023 21:01
@danmoseley
Copy link
Member Author

Is anyone available to review..

docs/standard/base-types/anchors-in-regular-expressions.md Outdated Show resolved Hide resolved

Note that `\Z` matches `\n` but does not match `\r\n` (the CR/LF character combination). To match CR/LF, include `\r?\Z` in the regular expression pattern.
Note that `\Z` is satisfied by `\n` but is not satisfied by `\r\n` (the CR/LF character combination). To treat CR/LF as if it were `\n`, include `\r?\Z` in the regular expression pattern. Note that this will make the `\r` part of the match - if you are accessing the match, and do not want `\r` to be part of it, you may replace `\r?\Z` with a positive lookahead assert, `(?=\r\n\z|\z|(?<!\r)(?=\n\z))`. In this expression, `\z` asserts the end of the input string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Negative lookbehind inside of a positive lookahead is a really complicated construct. Is it necessary? If the first branch of the alternation is \r\n\z, why can't the later branch just be \n\z, or why not just have the \r be optional inside of the lookahead, e.g. (?=\r\n\z|\n\z|\z) or (?=(?:\r?\n)?\z) or some such thing?

Copy link
Member Author

@danmoseley danmoseley Jun 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assumption is that we want to recognize \r\n but not treat it like two lines (no editor would treat that as two line breaks)

For example given "abc\n" in SingleLine mode the pattern ".\Z" will match at position 2 (before the newline) and position 3 (before the end). Now take "abc\r\n". If we want \r\n to be treated like a newline, then we want to match at 2 (before newline) and 4 (before end)

(?=\r\n\z|\z|(?<!\r)(?=\n\z)) does this but (?=\r\n\z|\n\z|\z) and (?=(?:\r?\n)?\z) will add a third match at position 3.

Copy link
Member

@stephentoub stephentoub Jun 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommending lookarounds is problematic because lookarounds don't work with non-backtracking, nevermind their being complicated. It'd be better, for example, to recommend other patterns that would match the newlines and then have a capture group for just the piece you want.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be a more extended discussion. I think there is value in including this subexpression for reference and for those that need it. Mind if I leave as is for now? I feel like what we have here is a net improvement and I can do more to the docs another day.

Adding support for AnyNewLine is the most useful thing we can do here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you recommend such patterns then you need to call out that they will prevent NonBacktracking from working / will fail with NonBacktracking.

Copy link
Member

@stephentoub stephentoub Jun 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amongst engines that do support backtracking it seems to be fairly common to support \r\n as an atomic unit

Can you elaborate? What is the syntax someone uses with PCRE for example to achieve the "treat \n or \r\n as a newline" semantic? Is that the \R?

Does PCRE support that with its non-backtracking functions?

We could support it as part of matching, but without changing the matching logic I don't think we can do it as written just during parsing.

Copy link
Member Author

@danmoseley danmoseley Jun 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\R consumes a "generic line break". In PCRE what \R considers that to be can be adjusted. Similarly, what $ and \Z consider a line break can be separately adjusted, at compile time or within the pattern.

https://www.pcre.org/current/doc/html/pcre2pattern.html#newlines https://www.pcre.org/current/doc/html/pcre2pattern.html#newlineseq
https://www.pcre.org/current/doc/html/pcre2api.html#newlines

eg \R might be equivalent to (?>\r\n|\n|\x0b|\f|\r|\x85). And you might have (*ANYCRLF) in your pattern to mean that you want to recognize any of CR, LF, or CRLF.

It makes clear that if CRLF is recognized as a newline, then it will be treated atomically:

When the newline convention (see "Newline conventions" below) recognizes the two-character sequence CRLF as a newline, this is preferred, even if the single characters CR and LF are also recognized as newlines. For example, if the newline convention is "any", a multiline mode circumflex matches before "xyz" in the string "abc\r\nxyz" rather than after CR, even though CR on its own is a valid newline. (It also matches at the very start of the string, of course.)

Copy link
Member Author

@danmoseley danmoseley Jun 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Perl, \R matches generic newline: it's not clear whether you can change what $ recognizes.

In Java, $ and \Z and . as appropriate recognize a variety of line terminators including the atomic \r\n. But there's an option UNIX_LINES to make them only recognize \n. In other words, they have AnyNewLine as proposed, but on by default.
https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html (see Line terminators)

\R in Java seems to always be matching a variety, like Perl.

\R Any Unicode linebreak sequence, is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@veanes / @olsaarik, I assume for the lookarounds support we discussed, we could by default only permit lookarounds with a max length limited by the pattern rather than by the input, eg no unbounded loops. That would enable the patterns discussed here.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. So a straightforward fragment is essentially the one where unbounded loops do not occur, perhaps even stricter -- no loops (either bounded or nonbounded). One other issue is that in general NFA mode might still be needed during lookaraounds, but that should be orthogonal to the rest and happen automatically. My understanding is also that all captures are disabled during lookarounds and not stored.

docs/standard/base-types/regular-expression-options.md Outdated Show resolved Hide resolved
@danmoseley
Copy link
Member Author

@stephentoub I removed the lookahead here so we can merge the rest - does it look OK?

@danmoseley danmoseley enabled auto-merge (squash) June 25, 2023 01:55
@danmoseley danmoseley merged commit b1a1a20 into dotnet:main Jun 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extend regex backtracking topic to link to non backtracking engine
4 participants