Support for more Unicode properties? #39

Closed
mjambon opened this issue Nov 8, 2021 · 5 comments
Labels
enhancement New feature or request

Comments

@mjambon

mjambon commented Nov 8, 2021

Regarding which Unicode properties are supported, the manual says:

The property names represented by xx above are limited to the Unicode script names, the general category properties, "Any", which matches any character (including newline), and some special PCRE properties (described in the next section). Other Perl properties such as "InMusicalSymbols" are not currently supported by PCRE. Note that \P{Any} does not match any characters, so always causes a match failure.

We have users who want support for the Bidi_Control property (semgrep/semgrep#3974), which is supported by Perl (also, by Go's regexp library). I'm not familiar with any of these implementations and I'm wondering why PCRE doesn't support all Unicode properties. Is it because they were added late and PCRE needs to catch up or for a technical reason?

Note that we're using PCRE from OCaml for which there hasn't been an effort to migrate to pcre2. So if we extend PCRE2 with support for more Unicode properties, we'll be unable to use it from OCaml unless we also port these changes to the old PCRE or we change the OCaml bindings to support the new API. It's really a separate issue but I thought I should mention it.
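Until an engine supports `\p{Bidi_Control}` directly, the property can be approximated with an explicit character class. The sketch below uses Python's standard `re` module purely for illustration and assumes the Bidi_Control set as listed in Unicode's PropList.txt (U+061C, U+200E-U+200F, U+202A-U+202E, U+2066-U+2069). The helper name `find_bidi_controls` is hypothetical and not part of PCRE or semgrep.

```python
import re

# Code points with Bidi_Control=Yes per Unicode PropList.txt (assumed stable):
# U+061C, U+200E-U+200F, U+202A-U+202E, U+2066-U+2069.
BIDI_CONTROL_CLASS = "[\u061C\u200E\u200F\u202A-\u202E\u2066-\u2069]"
BIDI_CONTROL_RE = re.compile(BIDI_CONTROL_CLASS)


def find_bidi_controls(text):
    """Return (offset, 'U+XXXX') pairs for every bidi control character in text."""
    return [
        (m.start(), f"U+{ord(m.group()):04X}")
        for m in BIDI_CONTROL_RE.finditer(text)
    ]
```

The same character class should carry over to other engines that accept `\x{...}`-style or literal code points, at the cost of having to update it by hand if Unicode ever extends the property.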

@zherczeg
Collaborator

zherczeg commented Nov 8, 2021

It is mostly a space problem. For example, a Node.js build without libicu (full feature set, ARM binary) is 28 MB; with libicu it is 63 MB. PCRE2 is less than 500 KB, as far as I remember.

Jokes aside, I have played with libicu in the repan project (https://github.com/zherczeg/repan) because I wanted to bring generic Unicode property and codepoint name support to any regex engine by rewriting character classes. It turned out that the majority of the properties are tied to a set of control types, plus or minus some codepoints or codepoint ranges, and can be compressed effectively. This can be done at the engine level as well, but it is not a small amount of work.
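To make the compression idea concrete, here is a rough sketch, not repan's actual data format. It assumes the "set of types" can be read as Unicode general categories, and it describes White_Space as the categories Zs, Zl and Zp plus two extra ranges (U+0009-U+000D and U+0085). The dictionary layout and the names `WHITE_SPACE`, `in_ranges` and `has_property` are hypothetical.

```python
import unicodedata

# Hypothetical compact description of a property:
# a set of general categories, plus extra ranges, minus excluded ranges.
WHITE_SPACE = {
    "categories": {"Zs", "Zl", "Zp"},
    "plus": [(0x0009, 0x000D), (0x0085, 0x0085)],
    "minus": [],
}


def in_ranges(cp, ranges):
    """True if code point cp falls inside any inclusive (lo, hi) range."""
    return any(lo <= cp <= hi for lo, hi in ranges)


def has_property(ch, prop):
    """Evaluate the compact description for a single character."""
    cp = ord(ch)
    if in_ranges(cp, prop["minus"]):
        return False
    if in_ranges(cp, prop["plus"]):
        return True
    return unicodedata.category(ch) in prop["categories"]
```

With this layout, a property costs a few category names and a handful of ranges rather than a full per-codepoint table, which is the kind of saving the comment above points at.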

PhilipHazel added the enhancement (New feature or request) label Nov 9, 2021
@PhilipHazel
Collaborator

I will look at this when I get some time (not for a while), but no promises.

@PhilipHazel
Collaborator

I have just pushed a patch that implements the Bidi_Control and Bidi_Class properties in the PCRE2 compiler and interpreters. NOTE: this is not yet available for JIT, because I've only just made it available for Zoltan to work on. Space is indeed the issue with supporting more Unicode properties.
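For readers unfamiliar with the two properties: Bidi_Control is a binary property (a character either has it or not), while Bidi_Class is an enumerated property with values such as L, R, AL or RLO. The sketch below illustrates the difference using Python's `unicodedata` module rather than PCRE2 itself; the helper names are hypothetical, and the Bidi_Control set is assumed to be the one listed in Unicode's PropList.txt.

```python
import unicodedata

# Bidi_Control=Yes code points, assumed from Unicode PropList.txt.
BIDI_CONTROLS = frozenset(
    [0x061C, 0x200E, 0x200F]
    + list(range(0x202A, 0x202F))   # U+202A-U+202E
    + list(range(0x2066, 0x206A))   # U+2066-U+2069
)


def bidi_class(ch):
    """Enumerated property: the short Bidi_Class value, e.g. 'L' or 'AL'."""
    return unicodedata.bidirectional(ch)


def is_bidi_control(ch):
    """Binary property: True only for the explicit bidi control characters."""
    return ord(ch) in BIDI_CONTROLS
```

For example, the Hebrew letter U+05D0 has Bidi_Class R but is not a bidi control, whereas U+202E (RIGHT-TO-LEFT OVERRIDE) has Bidi_Class RLO and is one.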

@PhilipHazel
Collaborator

JIT support for Bidi_Control and Bidi_Class is now implemented.

@PhilipHazel
Collaborator

Recent work in the development code has added support for a number of binary (yes/no) properties in addition to Bidi_Control, for example Changes_When_Casefolded. These are taken from various Unicode files. I'm going to close this issue now, but feel free to re-open it if there are other properties that are wanted.

kopecs added a commit to semgrep/semgrep-docs that referenced this issue Apr 3, 2024
The warning we have here is not accurate for PCRE2, since it supports
more Unicode properties than PCRE.

See also <PCRE2Project/pcre2#39> and
<https://www.pcre.org/current/doc/html/pcre2pattern.html>.
mjambon pushed a commit to semgrep/semgrep-docs that referenced this issue Apr 3, 2024
* Update PCRE -> PCRE2

This reflects the migration we made for user-facing regex in semgrep/semgrep#9919

* Update warning to be correct for library update

The warning we have here is not accurate for PCRE2, since it supports
more Unicode properties than PCRE.

See also <PCRE2Project/pcre2#39> and
<https://www.pcre.org/current/doc/html/pcre2pattern.html>.