add PCRE2_ASCII (RFC) #186

carenas · 2023-01-08T23:57:08Z

A prototype (but funtionally implementing the basic expression '\d' of the suggested feature), of ways to disable the expansion of \d even when PCRE2_UCP (or its equivalent verb) might be used.

Implemented in a way that minimizes code change so it could be reviewed more easily, and with a bare bones documentation and no tests, which will obviously be corrected for any (non draft) version.

As suggested in PCRE2Project#185 and as done with Perl with the '/aa' modifier it is preferably for performance/security[1] reasons to avoid including in \d characters that are outside the commonly expected digits. Add that functionality with the foundations of what was suggested in PCRE2Project#11 [1] https://perldoc.perl.org/perlre#/a-(and-/aa)

PhilipHazel · 2023-01-09T09:26:10Z

I am working on something unrelated at the moment, but I hope to be able to look at this later in the week. As an aside, I'm pleased that another person is becoming familiar with the code, because I am now an old man and won't be here forever.

I now think it's unfortunate that PCRE2_UTF changes the way case-independent matching works, because it mixes up two different things. Ideally PCRE2_UTF should just control the way a stream of code units is turned into a stream of characters, without affecting how those characters are processed. Other options (UCP, ASCII, whatever) should be used to control the processing, independently of PCRE2_UTF. But it is perhaps too late to think of changing things that much.

zherczeg · 2023-01-10T06:01:11Z

I don't really like the "add a feature and then add another feature for partially disable that feature" design. Instead I would add a full set of flags (can be set by a separate api call, 0 by default) which allows fine control over these features. E.g. enable_unicode_d enables the operation even if ucp is not set, disable_unicode_d disables it even if ucp is set. Disable could be the strongest. I know this is a bigger scope of work, but needed in the long run. The actual set of features could be computed with binary logical operations easily.

PhilipHazel · 2023-01-10T08:34:06Z

I'm not sure we need quite that much fine control, but I understand your concern. I have been thinking about this and will post some ideas soon. The problem with using an api call to set these flags is that it can't also be set in the pattern in the way (*UTF) or (*UCP) are now used.

PhilipHazel · 2023-01-10T15:14:06Z

It seems to me that there are three independent behaviours that need to be controllable:

(A) How code units are interpreted as characters:
(A1) One code unit equals one character.
(A2) Code units are interpreted as UTF sequences.

(B) How \d, \s, etc. are processed:
(B1) Only ASCII characters can match.
(B2) Unicode properties are used for matching.

(C) How case-independent matching works:
(C1) Only ASCII characters are considered.
(C2) ASCII characters can have ASCII other cases, and non-ASCII characters can have non-ASCII other cases (possibly more than one), but they don't ever mix.
(C3) Full Unicode case-independence, including the cases of K and S that have both ASCII and non-ASCII other cases.

Unfortunately, for historical reasons (different features were added at different times), PCRE2 mixes these behaviours. The default is A1, B1, C1. The PCRE2_UTF option sets A2 and C3. The PCRE2_UCP option sets B2 and C3 and is independent of PCRE2_UTF - setting PCRE2_UCP without PCRE2_UTF can be used (for example) with 16-bit non-UTF files.

Although I don't rule out incompatible changes, I would very much like to avoid making any, because they ALWAYS catch somebody out. I have not yet looked at Carlo's code, but I'm guessing that his proposed PCRE2_ASCII option changes to C2 behaviour (which is what one of the Perl options does). However, to allow for all cases, we really need to be able to select any B and C option independently - you might want either C2 or C3 with 16-bit non-UTF data, for example.

There are only 5 bits left in the PCRE2 options word. A compatible scheme that uses only two of them might be as follows:

PCRE2_CASELESS_ASCII sets 01 (of the two bits)
PCRE2_CASELESS_NOMIX sets 10 (meaning "don't mix ASCII and Unicode casing")
PCRE2_CASELESS_UNICODE sets 11

If the two bits are both zero, then, if either UTF or UCP is set, force them to 11. This preserves compatibility with the current behaviour. It is not what I would do if I were starting again from scratch and it has the danger that setting both ASCII and NOMIX is a gotcha. Hmm....

carenas · 2023-01-11T07:14:30Z

While the new flag was meant to be used to move from C3 to C2, that part wasn't implemented yet.

What was implemented though (at least partially), was something missing from the matrix above in the "B" dimension and that matches B2 in the modified list:

B1: Only ASCII characters match (the default)
B2: Only relevant "locale" UTF characters are added
B3: All UTF characters are added (what PCRE2_UCP selects)

The result was that \d would only match [0-9] (and maybe the fullwidth digits[1]), instead of the other ~650 numerals that have \p{Nd}

The use case I had on mind was git, where assuming the content is unicode programming languages, you can expect to have identifiers with utf-8 characters and therefore need those characters for a correct identification with \w and \b, but wouldn't need the same for \d

Outside of that use case, I can also think that limiting \w to only expand characters of the script that is identified by the locale might be useful.

Of course overloading all that meaning in a single flag is difficult, and the fact that it interacts with the other two makes it hard to use and I can see why Zoltan suggested a different design.

[1] https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms

PhilipHazel · 2023-01-11T09:47:59Z

I have continued to think ... luckily there is no rush on this ... I am beginning to think Zoltan's idea might be useful. There are plenty of bits available in the "extra" options and there is an existing api - pcre2_set_compile_extra_options(). Having independent control over \d, \w, and \s would be possible. However, there is also the question of the "in pattern" settings such as (*UTF) that should, if possible, mirror the external settings - but there are also external lockouts such as PCRE2_NEVER_UTF. Will we need more of those? More thought needed.....

PhilipHazel · 2023-01-23T16:30:28Z

I lied above. There are only 2 bits left in the options argument to pcre2_compile(). Here is a proposal that retains compatibility. It isn't very pretty, but we have to work from where we are.

Use one of the two available bits for a new option PCRE2_CASELESS_RESTRICT. This behaves like PCRE2_CASELESS, except that it restricts ASCII characters to have only ASCII other cases, and non-ASCII characters to have non-ASCII other cases.
Add a number of new PCRE2_EXTRA_xxx options to control the behaviour of \d etc independently:
PCRE2_EXTRA_ASCII_D \d matches ASCII only
PCRE2_EXTRA_ASCII_S \s matches ASCII only
PCRE2_EXTRA_ASCII_W \w matches ASCII only - and this also affects \b which is defined in terms of \w
PCRE2_EXTRA_ASCII_POSIX the POSIX [,,,] classes match ASCII only

These new options would be needed only if you want to have different treatment for those four things. Otherwise, as now, you either set UCP to get Unicode treatment for all, or don't set UCP, to get ASCII treatment for all, using the new PCRE2_CASELESS_RESTRICT bit if you want that kind of caselessness.

Changes would also be needed to the (*xxx) settings that are allowed in patterns. I'm not sure if we would need all the possibilities, though is should be easy to implement. There could be an overall (*ASCII) that sets them all.

All this need changes to pcre2_compile(), both interpretive matching functions, and the JIT code, of course.

carenas · 2023-01-24T23:57:26Z

I think that even the CASELESS flag might be enough of a niche case, that it could be relegated to an EXTRA option.

One useful feature we might copy from Rust's regex is the (?u) option to disable/enable UCP in parts of the expression.

Something I am curious about, is why were (*UTF) and (*UCP) invented, specially since if we are to invent (*ASCII) as well, we might end up burning that bit in the main option anyway.

Also, important to notice that this (and my original) design are still not addressing Zoltan's valid concern that the use of these flags is not idempotent and are context dependent.

PhilipHazel · 2023-01-25T10:01:11Z

(*UTF) and (*UCP) were invented at user request, for situations where the user controls the pattern, but not the code. The original addition was in 2009, but the ChangeLog doesn't record any details. It's possible that it relates to the use of PCRE from non-C languages.

I suggested making PCRE2_CASELESS_RESTRICT a "main" option so that a user could just set that as an alternative to PCRE2_CASELESS, without having to mess with the extra options. But I willing to be persuaded. Anybody else have a view on this?

PhilipHazel · 2023-01-27T08:57:45Z

I have changed my mind. I now agree the all the new options can be "extra"s. I intend to work on this shortly.

PhilipHazel · 2023-01-29T16:54:37Z

I have just pushed a commit that implements PCRE2_EXTRA_CASELESS_RESTRICT. I also added tracking apparatus for the extra options as for the main options, in order to implement (?r) within a pattern to change this option dynamically. You can also use /caseless_restrict or /r in pcre2test. I have NOT updated the user documentation because I'm planning on doing it all in one go once I have implemented the other EXTRA options mentioned. The code changes in this commit are all in pcre2_compile() so the new option needed no changes to the matching functions or JIT. There are some new tests but I would not be at all surprised to learn that I have overlooked something,

PhilipHazel · 2023-02-02T17:25:26Z

I have now pushed commits that implement PCRE2_EXTRA_ASCII_BSD, _BSS, _BSW, and _POSIX. The BS stands for "backslash", as used in some other option names, and I chose these names to make it clear exactly what was being forced to ASCII. The implementation is entirely in pcre2_compile.c. There is no documentation yet - that's coming next.

carenas · 2023-02-04T17:52:28Z

Implemented already

PhilipHazel · 2023-02-05T15:56:40Z

Bad news: I have realized that there is a major bug. The \b and \B escapes, which are documented as being dependent on \w, don't work correctly. This is because the change in behaviour of \b is implemented at match time, not compile time. Fixing this is inevitably going to involve changes to pcre2_match(), pcre2_dfa_match() and (sorry Zoltan) the JIT. My current plan is to add to the OP_WORDCHAR opcode a new one called OP_ASCII_WORDCHAR, so that all the distinguishing happens at compile time. (And the same for the negative ones, of course.)

PhilipHazel · 2023-02-05T16:27:07Z

Oops, I meant OP_WORD_BOUNDARY, of course.

PhilipHazel · 2023-02-06T17:11:58Z

I have pushed a commit that fixes this bug in the interpreters, but of course JIT is now broken. In the end, I did it the opposite way to what I described above. I invented OP_UCP_WORD_BOUNDARY and OP_NOT_UCP_WORD_BOUNDARY, so the original opcodes, like all the other opcodes for \d etc, are ASCII-only.

zherczeg · 2023-02-07T05:31:20Z

I can fix the jit code later.

zherczeg · 2023-02-07T07:06:53Z

I have fixed the jit code, please check

PhilipHazel · 2023-02-07T09:10:39Z

Thanks! The JIT code works fine.

carenas force-pushed the ascii branch from 2f43049 to 94adf14 Compare January 9, 2023 00:29

PhilipHazel mentioned this pull request Jan 27, 2023

add a C23 inspired checked integer multiplication helper #198

Merged

carenas closed this Feb 4, 2023

carenas deleted the ascii branch February 5, 2023 20:24

SolitaryGrass mentioned this pull request May 31, 2023

internal_dfa_match, a stack overflow occurred due to recursive calls. #258

Closed

carenas mentioned this pull request Sep 26, 2024

Simplify range data construction. #496

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add PCRE2_ASCII (RFC) #186

add PCRE2_ASCII (RFC) #186

carenas commented Jan 8, 2023 •

edited

Loading

PhilipHazel commented Jan 9, 2023

zherczeg commented Jan 10, 2023

PhilipHazel commented Jan 10, 2023

PhilipHazel commented Jan 10, 2023

carenas commented Jan 11, 2023

PhilipHazel commented Jan 11, 2023

PhilipHazel commented Jan 23, 2023

carenas commented Jan 24, 2023 •

edited

Loading

PhilipHazel commented Jan 25, 2023

PhilipHazel commented Jan 27, 2023

PhilipHazel commented Jan 29, 2023

PhilipHazel commented Feb 2, 2023

carenas commented Feb 4, 2023

PhilipHazel commented Feb 5, 2023

PhilipHazel commented Feb 5, 2023

PhilipHazel commented Feb 6, 2023

zherczeg commented Feb 7, 2023

zherczeg commented Feb 7, 2023

PhilipHazel commented Feb 7, 2023

add PCRE2_ASCII (RFC) #186

add PCRE2_ASCII (RFC) #186

Conversation

carenas commented Jan 8, 2023 • edited Loading

PhilipHazel commented Jan 9, 2023

zherczeg commented Jan 10, 2023

PhilipHazel commented Jan 10, 2023

PhilipHazel commented Jan 10, 2023

carenas commented Jan 11, 2023

PhilipHazel commented Jan 11, 2023

PhilipHazel commented Jan 23, 2023

carenas commented Jan 24, 2023 • edited Loading

PhilipHazel commented Jan 25, 2023

PhilipHazel commented Jan 27, 2023

PhilipHazel commented Jan 29, 2023

PhilipHazel commented Feb 2, 2023

carenas commented Feb 4, 2023

PhilipHazel commented Feb 5, 2023

PhilipHazel commented Feb 5, 2023

PhilipHazel commented Feb 6, 2023

zherczeg commented Feb 7, 2023

zherczeg commented Feb 7, 2023

PhilipHazel commented Feb 7, 2023

carenas commented Jan 8, 2023 •

edited

Loading

carenas commented Jan 24, 2023 •

edited

Loading