Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add PCRE2_ASCII (RFC) #186

Closed
wants to merge 1 commit into from
Closed

add PCRE2_ASCII (RFC) #186

wants to merge 1 commit into from

Conversation

carenas
Copy link
Contributor

@carenas carenas commented Jan 8, 2023

A prototype (but funtionally implementing the basic expression '\d' of the suggested feature), of ways to disable the expansion of \d even when PCRE2_UCP (or its equivalent verb) might be used.

Implemented in a way that minimizes code change so it could be reviewed more easily, and with a bare bones documentation and no tests, which will obviously be corrected for any (non draft) version.

As suggested in PCRE2Project#185 and as done with Perl with the '/aa' modifier
it is preferably for performance/security[1] reasons to avoid
including in \d characters that are outside the commonly expected
digits.

Add that functionality with the foundations of what was suggested
in PCRE2Project#11

[1] https://perldoc.perl.org/perlre#/a-(and-/aa)
@PhilipHazel
Copy link
Collaborator

I am working on something unrelated at the moment, but I hope to be able to look at this later in the week. As an aside, I'm pleased that another person is becoming familiar with the code, because I am now an old man and won't be here forever.

I now think it's unfortunate that PCRE2_UTF changes the way case-independent matching works, because it mixes up two different things. Ideally PCRE2_UTF should just control the way a stream of code units is turned into a stream of characters, without affecting how those characters are processed. Other options (UCP, ASCII, whatever) should be used to control the processing, independently of PCRE2_UTF. But it is perhaps too late to think of changing things that much.

@zherczeg
Copy link
Collaborator

I don't really like the "add a feature and then add another feature for partially disable that feature" design. Instead I would add a full set of flags (can be set by a separate api call, 0 by default) which allows fine control over these features. E.g. enable_unicode_d enables the operation even if ucp is not set, disable_unicode_d disables it even if ucp is set. Disable could be the strongest. I know this is a bigger scope of work, but needed in the long run. The actual set of features could be computed with binary logical operations easily.

@PhilipHazel
Copy link
Collaborator

I'm not sure we need quite that much fine control, but I understand your concern. I have been thinking about this and will post some ideas soon. The problem with using an api call to set these flags is that it can't also be set in the pattern in the way (*UTF) or (*UCP) are now used.

@PhilipHazel
Copy link
Collaborator

It seems to me that there are three independent behaviours that need to be controllable:

(A) How code units are interpreted as characters:
(A1) One code unit equals one character.
(A2) Code units are interpreted as UTF sequences.

(B) How \d, \s, etc. are processed:
(B1) Only ASCII characters can match.
(B2) Unicode properties are used for matching.

(C) How case-independent matching works:
(C1) Only ASCII characters are considered.
(C2) ASCII characters can have ASCII other cases, and non-ASCII characters can have non-ASCII other cases (possibly more than one), but they don't ever mix.
(C3) Full Unicode case-independence, including the cases of K and S that have both ASCII and non-ASCII other cases.

Unfortunately, for historical reasons (different features were added at different times), PCRE2 mixes these behaviours. The default is A1, B1, C1. The PCRE2_UTF option sets A2 and C3. The PCRE2_UCP option sets B2 and C3 and is independent of PCRE2_UTF - setting PCRE2_UCP without PCRE2_UTF can be used (for example) with 16-bit non-UTF files.

Although I don't rule out incompatible changes, I would very much like to avoid making any, because they ALWAYS catch somebody out. I have not yet looked at Carlo's code, but I'm guessing that his proposed PCRE2_ASCII option changes to C2 behaviour (which is what one of the Perl options does). However, to allow for all cases, we really need to be able to select any B and C option independently - you might want either C2 or C3 with 16-bit non-UTF data, for example.

There are only 5 bits left in the PCRE2 options word. A compatible scheme that uses only two of them might be as follows:

PCRE2_CASELESS_ASCII sets 01 (of the two bits)
PCRE2_CASELESS_NOMIX sets 10 (meaning "don't mix ASCII and Unicode casing")
PCRE2_CASELESS_UNICODE sets 11

If the two bits are both zero, then, if either UTF or UCP is set, force them to 11. This preserves compatibility with the current behaviour. It is not what I would do if I were starting again from scratch and it has the danger that setting both ASCII and NOMIX is a gotcha. Hmm....

@carenas
Copy link
Contributor Author

carenas commented Jan 11, 2023

While the new flag was meant to be used to move from C3 to C2, that part wasn't implemented yet.

What was implemented though (at least partially), was something missing from the matrix above in the "B" dimension and that matches B2 in the modified list:

B1: Only ASCII characters match (the default)
B2: Only relevant "locale" UTF characters are added
B3: All UTF characters are added (what PCRE2_UCP selects)

The result was that \d would only match [0-9] (and maybe the fullwidth digits[1]), instead of the other ~650 numerals that have \p{Nd}

The use case I had on mind was git, where assuming the content is unicode programming languages, you can expect to have identifiers with utf-8 characters and therefore need those characters for a correct identification with \w and \b, but wouldn't need the same for \d

Outside of that use case, I can also think that limiting \w to only expand characters of the script that is identified by the locale might be useful.

Of course overloading all that meaning in a single flag is difficult, and the fact that it interacts with the other two makes it hard to use and I can see why Zoltan suggested a different design.

[1] https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms

@PhilipHazel
Copy link
Collaborator

I have continued to think ... luckily there is no rush on this ... I am beginning to think Zoltan's idea might be useful. There are plenty of bits available in the "extra" options and there is an existing api - pcre2_set_compile_extra_options(). Having independent control over \d, \w, and \s would be possible. However, there is also the question of the "in pattern" settings such as (*UTF) that should, if possible, mirror the external settings - but there are also external lockouts such as PCRE2_NEVER_UTF. Will we need more of those? More thought needed.....

@PhilipHazel
Copy link
Collaborator

I lied above. There are only 2 bits left in the options argument to pcre2_compile(). Here is a proposal that retains compatibility. It isn't very pretty, but we have to work from where we are.

  1. Use one of the two available bits for a new option PCRE2_CASELESS_RESTRICT. This behaves like PCRE2_CASELESS, except that it restricts ASCII characters to have only ASCII other cases, and non-ASCII characters to have non-ASCII other cases.

  2. Add a number of new PCRE2_EXTRA_xxx options to control the behaviour of \d etc independently:
    PCRE2_EXTRA_ASCII_D \d matches ASCII only
    PCRE2_EXTRA_ASCII_S \s matches ASCII only
    PCRE2_EXTRA_ASCII_W \w matches ASCII only - and this also affects \b which is defined in terms of \w
    PCRE2_EXTRA_ASCII_POSIX the POSIX [,,,] classes match ASCII only

These new options would be needed only if you want to have different treatment for those four things. Otherwise, as now, you either set UCP to get Unicode treatment for all, or don't set UCP, to get ASCII treatment for all, using the new PCRE2_CASELESS_RESTRICT bit if you want that kind of caselessness.

Changes would also be needed to the (*xxx) settings that are allowed in patterns. I'm not sure if we would need all the possibilities, though is should be easy to implement. There could be an overall (*ASCII) that sets them all.

All this need changes to pcre2_compile(), both interpretive matching functions, and the JIT code, of course.

@carenas
Copy link
Contributor Author

carenas commented Jan 24, 2023

I think that even the CASELESS flag might be enough of a niche case, that it could be relegated to an EXTRA option.

One useful feature we might copy from Rust's regex is the (?u) option to disable/enable UCP in parts of the expression.

Something I am curious about, is why were (*UTF) and (*UCP) invented, specially since if we are to invent (*ASCII) as well, we might end up burning that bit in the main option anyway.

Also, important to notice that this (and my original) design are still not addressing Zoltan's valid concern that the use of these flags is not idempotent and are context dependent.

@PhilipHazel
Copy link
Collaborator

(*UTF) and (*UCP) were invented at user request, for situations where the user controls the pattern, but not the code. The original addition was in 2009, but the ChangeLog doesn't record any details. It's possible that it relates to the use of PCRE from non-C languages.

I suggested making PCRE2_CASELESS_RESTRICT a "main" option so that a user could just set that as an alternative to PCRE2_CASELESS, without having to mess with the extra options. But I willing to be persuaded. Anybody else have a view on this?

@PhilipHazel
Copy link
Collaborator

I have changed my mind. I now agree the all the new options can be "extra"s. I intend to work on this shortly.

@PhilipHazel
Copy link
Collaborator

I have just pushed a commit that implements PCRE2_EXTRA_CASELESS_RESTRICT. I also added tracking apparatus for the extra options as for the main options, in order to implement (?r) within a pattern to change this option dynamically. You can also use /caseless_restrict or /r in pcre2test. I have NOT updated the user documentation because I'm planning on doing it all in one go once I have implemented the other EXTRA options mentioned. The code changes in this commit are all in pcre2_compile() so the new option needed no changes to the matching functions or JIT. There are some new tests but I would not be at all surprised to learn that I have overlooked something,

@PhilipHazel
Copy link
Collaborator

I have now pushed commits that implement PCRE2_EXTRA_ASCII_BSD, _BSS, _BSW, and _POSIX. The BS stands for "backslash", as used in some other option names, and I chose these names to make it clear exactly what was being forced to ASCII. The implementation is entirely in pcre2_compile.c. There is no documentation yet - that's coming next.

@carenas
Copy link
Contributor Author

carenas commented Feb 4, 2023

Implemented already

@carenas carenas closed this Feb 4, 2023
@PhilipHazel
Copy link
Collaborator

Bad news: I have realized that there is a major bug. The \b and \B escapes, which are documented as being dependent on \w, don't work correctly. This is because the change in behaviour of \b is implemented at match time, not compile time. Fixing this is inevitably going to involve changes to pcre2_match(), pcre2_dfa_match() and (sorry Zoltan) the JIT. My current plan is to add to the OP_WORDCHAR opcode a new one called OP_ASCII_WORDCHAR, so that all the distinguishing happens at compile time. (And the same for the negative ones, of course.)

@PhilipHazel
Copy link
Collaborator

Oops, I meant OP_WORD_BOUNDARY, of course.

@carenas carenas deleted the ascii branch February 5, 2023 20:24
@PhilipHazel
Copy link
Collaborator

I have pushed a commit that fixes this bug in the interpreters, but of course JIT is now broken. In the end, I did it the opposite way to what I described above. I invented OP_UCP_WORD_BOUNDARY and OP_NOT_UCP_WORD_BOUNDARY, so the original opcodes, like all the other opcodes for \d etc, are ASCII-only.

@zherczeg
Copy link
Collaborator

zherczeg commented Feb 7, 2023

I can fix the jit code later.

@zherczeg
Copy link
Collaborator

zherczeg commented Feb 7, 2023

I have fixed the jit code, please check

@PhilipHazel
Copy link
Collaborator

Thanks! The JIT code works fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants