Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interaction with backreferences / variable-width escape sequences #17

Closed
nikic opened this issue Jun 18, 2015 · 21 comments
Closed

Interaction with backreferences / variable-width escape sequences #17

nikic opened this issue Jun 18, 2015 · 21 comments

Comments

@nikic
Copy link

nikic commented Jun 18, 2015

Using something like new RegExp("(foo)\\1" + RegExp.escape(input)), if input were to start with a number, this would extend the backreference \1 to something like \11. Does this need to be accounted for?

@benjamingr
Copy link
Collaborator

@nikic good question.

At this moment we are not escaping anything that has any context, similarly ?! isn't escaped, - isn't escaped and so on.

Basically, whenever you need context sensitivity you'd do:

new RegExp("(foo)\1(" + RegExp.escape(input) +")")

Accounting for all context cases sounds problematic.

@nikic
Copy link
Author

nikic commented Jun 18, 2015

@benjamingr I think there is a difference between this case and something like ?! though: It could only become an issue if you insert into a malformed regular expression, i.e. you'd have to write "(?" + RegExp.escape(...) + ")", where the syntax (?...) by itself (i.e. not followed by : or any of the other usual group flavors) is not valid.

@bwoebi
Copy link

bwoebi commented Jun 18, 2015

@benjamingr It also is possible to just add a non-capturing group around the whole; that way it is context-independent. At that point it's then also enough to have the default set of escapes which we have currently. … Except that it will bork with [] character classes.

So, generally, I think that's an unavoidable issue and must be handled by the programmer.

@benjamingr
Copy link
Collaborator

@bwoebi I don't think we have those (non capturing groups).

@nikic yes, this is more like that - case or other cases where it can change the context of the RegExp.

@bwoebi
Copy link

bwoebi commented Jun 19, 2015

@benjamingr We don't have non capturing groups, but that's just the "proper" way to delimit that… one also could use .{0} as prefix, which would work in JS too. But point is, that'd brok again with character classes.

Issue is that it's not really possible to have real context sensitive escaping… the only idea I have for that is how e.g. prepared statements work with mysql… you basically have placeholders in your query/regex and then can context-sensitively escape them.

@benjamingr
Copy link
Collaborator

Can you write a short "proof" illustrating that it is impossible (or hard)
to write escape in a way that would allow escaping in a way that would
preserve context? (Without non-capturing groups)

Sensitive escaping might be our plan for RegExp.tag in the future (name not
final).

On Fri, Jun 19, 2015 at 3:45 PM, Bob Weinand [email protected]
wrote:

@benjamingr https://github.com/benjamingr We don't have non capturing
groups, but that's just the "proper" way to delimit that… one also could
use .{0} as prefix, which would work in JS too. But point is, that'd brok
again with character classes.

Issue is that it's not really possible to have real context sensitive
escaping… the only idea I have for that is how e.g. prepared statements
work with mysql… you basically have placeholders in your query/regex and
then can context-sensitively escape them.


Reply to this email directly or view it on GitHub
#17 (comment)
.

@nikic
Copy link
Author

nikic commented Jun 19, 2015

I don't see what would be hard about it, just escape - and digits additionally (at least if occurring as first character). I don't think the problem here is technical, it's more a question of whether or not such cases should be supported.

(Where "escaping" for digits would mean encoding them.)

@bwoebi
Copy link

bwoebi commented Jun 19, 2015

@benjamingr Point is that we cannot reliably prevent that very case @nikic did mention in his initial post. If you insert any characters before the string (note that we cannot escape numbers, that'd mean a backreference), it won't be compatible with a character class as it counts as a symbol there.

@nikic How do you want to escape numbers? By their octal representation? Then RegExp.escape("0") +"0" will end up as "\300" … character number 192 then. Putting a backslash in front would mean a back-reference… so, what now?

EDIT:

Hmm... okay, we can hex-escape that then… that will work. So, turns out I was wrong a bit and it's indeed possible to properly escape it

@nikic
Copy link
Author

nikic commented Jun 19, 2015

@bwoebi As said, by encoding them. I am not terribly familiar with details of JS regex flavor, but presumably it has some kind of fixed-width escape sequence. Probably \xHH or \uHHHH.

@bwoebi
Copy link

bwoebi commented Jun 19, 2015

@nikic yeah, see my EDIT, noticed that just now.

@benjamingr
Copy link
Collaborator

Numbers like 1 can be escaped like [1] but I don't see a way to reliably escape things.

Adding a capturing group will mess with back-references.

Other solutions seem to have issues with quantifiers and context sensitive inserts. I don't see a way we can solve this from the library level - at least without parsing the RegExp ourselves first.

@domenic
Copy link
Member

domenic commented Jun 19, 2015

I'm not sure I fully understand all the issues here, but the sense I'm getting is that maybe just going for an escape-everything approach (#15) would solve things? (EDIT: I saw the doc about how other languages don't like that idea, interesting.)

@domenic
Copy link
Member

domenic commented Jun 19, 2015

In any case, I think the important thing to distill out of this thread is a list of failure scenarios for the current spec, and put those into the readme. Then we can judge how problematic those are by asking if they're ridiculous edge cases or actually quite believable.

@benjamingr
Copy link
Collaborator

Python is moving to a escape only metacharacters approach in its new
regex module (replacing re) by the way.
https://bitbucket.org/mrabarnett/mrab-regex/src/6193ea4246da272cf18a190c46aa116737067780/regex_3/Python/regex.py?at=default#cl-409

(Courtesy of Martijn Pieters)

What this issue points out is that escaping everything wouldn't work and
that escaping needs to be context aware so it is the programmer who has to
handle escaping in these cases.

The two other participants are PHP core contributors who I invited to
participate in the discussion by the way :)

On Fri, Jun 19, 2015 at 6:13 PM, Domenic Denicola [email protected]
wrote:

In any case, I think the important thing to distill out of this thread is
a list of failure scenarios for the current spec, and put those into the
readme. Then we can judge how problematic those are by asking if they're
ridiculous edge cases or actually quite believable.


Reply to this email directly or view it on GitHub
#17 (comment)
.

@nikic
Copy link
Author

nikic commented Jun 19, 2015

Speaking in formal, the suggestion here is basically this:

REPLACE

  1. If c is matched by SyntaxCharacter then do:
    1. Append code unit 0x005C (REVERSE SOLIDUS) to cuList.
  2. Append the elements of the UTF16Encoding (10.1.1) of c to cuList.

WITH

  1. If c is matched by SyntaxCharacter then do:
    1. Append code unit 0x005C (REVERSE SOLIDUS) to cuList.
    2. Append the elements of the UTF16Encoding (10.1.1) of c to cuList.
  2. Otherwise, if c is matched by HexDigit (*) then do:
    1. Append code unit 0x005C (REVERSE SOLIDUS) to cuList.
    2. Append code unit 0x0078 (LATIN SMALL LETTER X) to cuList.
    3. Append two code units that are the result of converting the numeric code point value of c to a String of two hexadecimal digits. Alphabetic hexadecimal digits are presented as lowercase Latin letters.
  3. Otherwise append the elements of the UTF16Encoding (10.1.1) of c to cuList.

(*) Potential addition: "And c is the first element of cpList"

Btw, just for the record PHP also ignores the issue with backreferences and escapes only meta characters (albeit a larger number). Doing that has a lot of precedent in other languages.

@benjamingr
Copy link
Collaborator

In any case, I think the important thing to distill out of this thread is a list of failure scenarios for the current spec, and put those into the readme.

Definitely @domenic , I'm not sure the discussion has exhausted itself yet though.

@nikic this would help with backreferences but would not help for example in matching sets or in other context sensitive cases - for example "[" + escaped +"]"

It would also have the effect of making the regexp created less readable. Wouldn't it make more sense to just insert an empty capturing group into the RegExp if the first literal matches HexDigit?

@ljharb
Copy link
Member

ljharb commented Jun 19, 2015

@benjamingr @bwoebi Isn't ?: how you denote a non-capturing group in JS regex? See RegExp.prototype.source as an example.

@benjamingr
Copy link
Collaborator

@ljharb oh wow yeah, it is. I have no idea how I totally missed that while I used it last week at least twice or when I read the spec for .source today.

Just prefixing and postfixing the RegExp with (?:) would not solve all context sensitivity issues. For example, it would not work if you're inside a set match ([ ]). See any way around it?

@nikic
Copy link
Author

nikic commented Jun 19, 2015

@nikic this would help with backreferences but would not help for example in matching sets or in other context sensitive cases - for example "[" + escaped +"]"

I'm not sure I see the issue you're referring to here, could you elaborate? Apart from use of - in there, but that can be escaped as well.

@benjamingr benjamingr mentioned this issue Jun 19, 2015
@benjamingr
Copy link
Collaborator

@nikic as an expert, do you think the readability impact of the resulting regular expression is an important factor? If so - do you think this guarantee is worth it?

@benjamingr
Copy link
Collaborator

Superseding this with #29

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants