add PCRE2_ASCII (RFC) #186
As suggested in PCRE2Project#185, and as Perl does with the '/aa' modifier, it is preferable for performance/security[1] reasons to avoid including in \d characters that are outside the commonly expected digits. Add that functionality on the foundations of what was suggested in PCRE2Project#11.
[1] https://perldoc.perl.org/perlre#/a-(and-/aa)
I am working on something unrelated at the moment, but I hope to be able to look at this later in the week. As an aside, I'm pleased that another person is becoming familiar with the code, because I am now an old man and won't be here forever. I now think it's unfortunate that PCRE2_UTF changes the way case-independent matching works, because it mixes up two different things. Ideally PCRE2_UTF should just control the way a stream of code units is turned into a stream of characters, without affecting how those characters are processed. Other options (UCP, ASCII, whatever) should be used to control the processing, independently of PCRE2_UTF. But it is perhaps too late to think of changing things that much.
I don't really like the "add a feature and then add another feature to partially disable that feature" design. Instead I would add a full set of flags (can be set by a separate api call, 0 by default) which allows fine control over these features. E.g.
I'm not sure we need quite that much fine control, but I understand your concern. I have been thinking about this and will post some ideas soon. The problem with using an api call to set these flags is that it can't also be set in the pattern in the way (*UTF) or (*UCP) are now used.
It seems to me that there are three independent behaviours that need to be controllable:
(A) How code units are interpreted as characters:
(B) How \d, \s, etc. are processed:
(C) How case-independent matching works:
Unfortunately, for historical reasons (different features were added at different times), PCRE2 mixes these behaviours. The default is A1, B1, C1. The PCRE2_UTF option sets A2 and C3. The PCRE2_UCP option sets B2 and C3 and is independent of PCRE2_UTF - setting PCRE2_UCP without PCRE2_UTF can be used (for example) with 16-bit non-UTF files.
Although I don't rule out incompatible changes, I would very much like to avoid making any, because they ALWAYS catch somebody out. I have not yet looked at Carlo's code, but I'm guessing that his proposed PCRE2_ASCII option changes to C2 behaviour (which is what one of the Perl options does). However, to allow for all cases, we really need to be able to select any B and C option independently - you might want either C2 or C3 with 16-bit non-UTF data, for example.
There are only 5 bits left in the PCRE2 options word. A compatible scheme that uses only two of them might be as follows:
PCRE2_CASELESS_ASCII sets 01 (of the two bits)
If the two bits are both zero, then, if either UTF or UCP is set, force them to 11. This preserves compatibility with the current behaviour. It is not what I would do if I were starting again from scratch and it has the danger that setting both ASCII and NOMIX is a gotcha. Hmm....
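The compatibility rule in the comment above (two zero bits are forced to 11 when UTF or UCP is set) can be sketched as follows. This is only an illustration of the rule as stated: every name and bit position here (`SK_UTF`, `SK_UCP`, `SK_CASE_BITS_MASK`, `SK_CASELESS_ASCII`, `sk_effective_case_bits`) is hypothetical and does not correspond to real values in pcre2.h, and only the one setting that survives in the comment (01) is named.

```c
#include <stdint.h>

/* Hypothetical bit layout, for illustration only; NOT real PCRE2 values. */
#define SK_UTF             0x01u
#define SK_UCP             0x02u
#define SK_CASE_BITS_SHIFT 2
#define SK_CASE_BITS_MASK  (0x3u << SK_CASE_BITS_SHIFT) /* the two proposed bits */
#define SK_CASELESS_ASCII  (0x1u << SK_CASE_BITS_SHIFT) /* "sets 01" */

/* Return the effective value (0..3) of the two-bit field.
   If the caller left both bits zero but asked for UTF or UCP,
   the field is forced to 3 (binary 11), which keeps the behaviour
   that UTF/UCP users get today. */
unsigned sk_effective_case_bits(uint32_t options)
{
    unsigned bits = (options & SK_CASE_BITS_MASK) >> SK_CASE_BITS_SHIFT;
    if (bits == 0 && (options & (SK_UTF | SK_UCP)) != 0)
        bits = 3;
    return bits;
}
```

With this sketch, `sk_effective_case_bits(SK_UCP)` yields 3, while `sk_effective_case_bits(SK_UCP | SK_CASELESS_ASCII)` yields 1, so an explicit setting always wins over the compatibility default.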
While the new flag was meant to be used to move from C3 to C2, that part wasn't implemented yet. What was implemented though (at least partially), was something missing from the matrix above in the "B" dimension and that matches B2 in the modified list:
B1: Only ASCII characters match (the default)
The result was that
The use case I had in mind was
Outside of that use case, I can also think that limiting
Of course overloading all that meaning in a single flag is difficult, and the fact that it interacts with the other two makes it hard to use and I can see why Zoltan suggested a different design.
[1] https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms
I have continued to think ... luckily there is no rush on this ... I am beginning to think Zoltan's idea might be useful. There are plenty of bits available in the "extra" options and there is an existing api - pcre2_set_compile_extra_options(). Having independent control over \d, \w, and \s would be possible. However, there is also the question of the "in pattern" settings such as (*UTF) that should, if possible, mirror the external settings - but there are also external lockouts such as PCRE2_NEVER_UTF. Will we need more of those? More thought needed.....
I lied above. There are only 2 bits left in the options argument to pcre2_compile(). Here is a proposal that retains compatibility. It isn't very pretty, but we have to work from where we are.
These new options would be needed only if you want to have different treatment for those four things. Otherwise, as now, you either set UCP to get Unicode treatment for all, or don't set UCP, to get ASCII treatment for all, using the new PCRE2_CASELESS_RESTRICT bit if you want that kind of caselessness. Changes would also be needed to the (*xxx) settings that are allowed in patterns. I'm not sure if we would need all the possibilities, though it should be easy to implement. There could be an overall (*ASCII) that sets them all. All this needs changes to pcre2_compile(), both interpretive matching functions, and the JIT code, of course.
I think that even the CASELESS flag might be enough of a niche case that it could be relegated to an EXTRA option.
One useful feature we might copy from Rust's regex is the
Something I am curious about, is why were
Also, it is important to note that this design (and my original one) still does not address Zoltan's valid concern that the use of these flags is not idempotent and is context dependent.
(*UTF) and (*UCP) were invented at user request, for situations where the user controls the pattern, but not the code. The original addition was in 2009, but the ChangeLog doesn't record any details. It's possible that it relates to the use of PCRE from non-C languages. I suggested making PCRE2_CASELESS_RESTRICT a "main" option so that a user could just set that as an alternative to PCRE2_CASELESS, without having to mess with the extra options. But I am willing to be persuaded. Anybody else have a view on this?
I have changed my mind. I now agree that all the new options can be "extra"s. I intend to work on this shortly.
I have just pushed a commit that implements PCRE2_EXTRA_CASELESS_RESTRICT. I also added tracking apparatus for the extra options as for the main options, in order to implement (?r) within a pattern to change this option dynamically. You can also use /caseless_restrict or /r in pcre2test. I have NOT updated the user documentation because I'm planning on doing it all in one go once I have implemented the other EXTRA options mentioned. The code changes in this commit are all in pcre2_compile() so the new option needed no changes to the matching functions or JIT. There are some new tests but I would not be at all surprised to learn that I have overlooked something.
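As a rough illustration of what a "restricted" caseless comparison means, here is a small self-contained model in C. It is not PCRE2 code: `sk_fold` and `sk_caseless_equal` are hypothetical names, and the table covers only ASCII letters plus the two Unicode case-equivalences that cross the ASCII/non-ASCII boundary (U+017F long s with 's', and U+212A Kelvin sign with 'k'), which the PCRE2 documentation names as the cases this option affects.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fold a code point to a representative of its case-equivalence set.
   Only ASCII letters and the two boundary-crossing cases are modelled.
   When restrict_ascii is true, non-ASCII code points are never folded
   onto ASCII letters. */
uint32_t sk_fold(uint32_t c, bool restrict_ascii)
{
    if (c >= 'A' && c <= 'Z')
        return c + ('a' - 'A');
    if (!restrict_ascii) {
        if (c == 0x017F) return 's'; /* LATIN SMALL LETTER LONG S */
        if (c == 0x212A) return 'k'; /* KELVIN SIGN */
    }
    return c;
}

/* Two code points match caselessly if they fold to the same value. */
bool sk_caseless_equal(uint32_t a, uint32_t b, bool restrict_ascii)
{
    return sk_fold(a, restrict_ascii) == sk_fold(b, restrict_ascii);
}
```

In this model `sk_caseless_equal('K', 0x212A, false)` is true and `sk_caseless_equal('K', 0x212A, true)` is false, while `sk_caseless_equal('K', 'k', true)` stays true, so ordinary ASCII case folding is unaffected by the restriction.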
I have now pushed commits that implement PCRE2_EXTRA_ASCII_BSD, _BSS, _BSW, and _POSIX. The BS stands for "backslash", as used in some other option names, and I chose these names to make it clear exactly what was being forced to ASCII. The implementation is entirely in pcre2_compile.c. There is no documentation yet - that's coming next.
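To make the effect on \d concrete, the following is a minimal model, assuming that UCP widens \d to Unicode decimal digits and that the ASCII_BSD option narrows it back to 0-9. It is a sketch, not the PCRE2 implementation: `SK_UCP` and `SK_ASCII_BSD` are hypothetical stand-ins for PCRE2_UCP and PCRE2_EXTRA_ASCII_BSD (their values are not taken from pcre2.h), and the Unicode table is a two-range sample rather than the full Nd category.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_UCP       0x1u /* hypothetical stand-in for PCRE2_UCP */
#define SK_ASCII_BSD 0x2u /* hypothetical stand-in for PCRE2_EXTRA_ASCII_BSD */

bool sk_is_ascii_digit(uint32_t c)
{
    return c >= '0' && c <= '9';
}

/* A tiny sample of Unicode decimal digits, not the full Nd category. */
bool sk_is_sample_unicode_digit(uint32_t c)
{
    return sk_is_ascii_digit(c)
        || (c >= 0x0660 && c <= 0x0669)  /* ARABIC-INDIC DIGIT ZERO..NINE */
        || (c >= 0xFF10 && c <= 0xFF19); /* FULLWIDTH DIGIT ZERO..NINE */
}

/* Model of \d: Unicode digits only when UCP is on and \d has not
   been forced back to ASCII. */
bool sk_backslash_d(uint32_t c, unsigned flags)
{
    if ((flags & SK_UCP) != 0 && (flags & SK_ASCII_BSD) == 0)
        return sk_is_sample_unicode_digit(c);
    return sk_is_ascii_digit(c);
}
```

Here `sk_backslash_d(0xFF11, SK_UCP)` is true (fullwidth digit one), but `sk_backslash_d(0xFF11, SK_UCP | SK_ASCII_BSD)` is false, which is the behaviour the original request was after.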
Implemented already
Bad news: I have realized that there is a major bug. The \b and \B escapes, which are documented as being dependent on \w, don't work correctly. This is because the change in behaviour of \b is implemented at match time, not compile time. Fixing this is inevitably going to involve changes to pcre2_match(), pcre2_dfa_match() and (sorry Zoltan) the JIT. My current plan is to add to the OP_WORDCHAR opcode a new one called OP_ASCII_WORDCHAR, so that all the distinguishing happens at compile time. (And the same for the negative ones, of course.)
Oops, I meant OP_WORD_BOUNDARY, of course.
I have pushed a commit that fixes this bug in the interpreters, but of course JIT is now broken. In the end, I did it the opposite way to what I described above. I invented OP_UCP_WORD_BOUNDARY and OP_NOT_UCP_WORD_BOUNDARY, so the original opcodes, like all the other opcodes for \d etc, are ASCII-only.
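The idea of deciding at compile time which word-boundary opcode to emit might look roughly like this. It is a sketch under assumptions, not the code in pcre2_compile.c: the `SK_` enum and `sk_select_boundary_op` are hypothetical, and the condition (Unicode opcodes only when UCP is set and \w has not been forced to ASCII) is inferred from the statement that \b and \B depend on \w.

```c
#include <stdbool.h>

/* Hypothetical mirror of the four opcodes named in the comment above. */
typedef enum {
    SK_OP_WORD_BOUNDARY,         /* \b, ASCII-only */
    SK_OP_NOT_WORD_BOUNDARY,     /* \B, ASCII-only */
    SK_OP_UCP_WORD_BOUNDARY,     /* \b, Unicode word characters */
    SK_OP_NOT_UCP_WORD_BOUNDARY  /* \B, Unicode word characters */
} sk_opcode;

/* Choose the opcode while compiling, so the matchers never need to
   consult the options: each opcode has exactly one meaning. */
sk_opcode sk_select_boundary_op(bool negated, bool ucp, bool ascii_bsw)
{
    bool unicode_word = ucp && !ascii_bsw;
    if (unicode_word)
        return negated ? SK_OP_NOT_UCP_WORD_BOUNDARY : SK_OP_UCP_WORD_BOUNDARY;
    return negated ? SK_OP_NOT_WORD_BOUNDARY : SK_OP_WORD_BOUNDARY;
}
```

Under this scheme a pattern compiled with UCP and the ASCII \w restriction gets the plain opcode, e.g. `sk_select_boundary_op(false, true, true)` returns `SK_OP_WORD_BOUNDARY`, and the interpreters and JIT only have to implement each opcode once.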
I can fix the jit code later.
I have fixed the jit code, please check
Thanks! The JIT code works fine.
A prototype (functionally implementing only the basic '\d' case of the suggested feature) of a way to disable the expansion of \d even when PCRE2_UCP (or its equivalent verb) is used.
Implemented in a way that minimizes code change so it can be reviewed more easily, with bare-bones documentation and no tests, both of which will obviously be corrected for any non-draft version.