Support for more Unicode properties? #39
Comments
It is mostly a space problem. For example, a Node.js build without libicu (full feature set, ARM binary) is 28 MB, with libicu it is 63 MB, and PCRE2 is less than 500 KB as far as I remember. Jokes aside, I have played with libicu in the repan project (https://github.com/zherczeg/repan) because I wanted to bring generic Unicode property and codepoint name support to any regex engine by rewriting character classes. It turned out that the majority of the properties are tied to a set of control types, plus or minus some codepoints or codepoint ranges, and can be compressed effectively. It can be done at the engine level as well, but it is not a small amount of work.
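To make the "rewriting character classes" idea concrete, here is a minimal Python sketch. It is not repan's code or algorithm. It assumes a hypothetical `PROPERTY_RANGES` table holding only Bidi_Control (code points as listed in Unicode's PropList.txt) and a hypothetical `expand_properties` helper that turns `\p{Name}` into an explicit class that Python's `re` module, which has no `\p{...}` support, can compile.

```python
import re

# Bidi_Control code point ranges, as listed in Unicode's PropList.txt.
BIDI_CONTROL_RANGES = [
    (0x061C, 0x061C),  # ARABIC LETTER MARK
    (0x200E, 0x200F),  # LRM, RLM
    (0x202A, 0x202E),  # LRE, RLE, PDF, LRO, RLO
    (0x2066, 0x2069),  # LRI, RLI, FSI, PDI
]

# Hypothetical lookup table for this sketch: property name -> inclusive ranges.
PROPERTY_RANGES = {"Bidi_Control": BIDI_CONTROL_RANGES}

_PROPERTY = re.compile(r"\\p\{(\w+)\}")


def ranges_to_class_body(ranges):
    """Render inclusive (first, last) ranges as the inside of a [...] class."""
    parts = []
    for first, last in ranges:
        if first == last:
            parts.append(f"\\u{first:04X}")
        else:
            parts.append(f"\\u{first:04X}-\\u{last:04X}")
    return "".join(parts)


def expand_properties(pattern):
    """Replace each \\p{Name} in pattern with an explicit character class.

    Simplification: \\p{...} nested inside an existing [...] class and
    escaped backslashes in front of 'p' are not handled.
    """
    def replace(match):
        name = match.group(1)
        if name not in PROPERTY_RANGES:
            raise ValueError(f"unsupported property: {name}")
        return "[" + ranges_to_class_body(PROPERTY_RANGES[name]) + "]"

    return _PROPERTY.sub(replace, pattern)


rewritten = expand_properties(r"\p{Bidi_Control}")
print(rewritten)
print(bool(re.search(rewritten, "abc\u202edef")))
```

Four ranges cover the whole property here, which is the kind of compactness the comment describes. A real rewriter would also need to handle negation (`\P{...}`) and properties that appear inside existing classes.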
I will look at this when I get some time (not for a while), but no promises.
I have just pushed a patch that implements the Bidi_Control and Bidi_Class properties in the PCRE2 compiler and interpreters. NOTE: this is not yet available for JIT, because I have only just made it available for Zoltan to work on. Space is indeed the issue with supporting more Unicode properties.
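For readers unfamiliar with Bidi_Class: every code point carries one bidirectional class value (for example L, R, AL, EN, RLO). The sketch below uses Python's standard `unicodedata` module, not PCRE2's own tables, to list the code point ranges that share a given class. The `bidi_class_ranges` helper is a hypothetical name chosen for this illustration, and results for most classes depend on the Unicode version bundled with the Python build.

```python
import sys
import unicodedata


def bidi_class_ranges(bidi_class):
    """Return inclusive (first, last) ranges of code points whose
    Bidi_Class, per Python's bundled Unicode database, equals bidi_class."""
    ranges = []
    start = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.bidirectional(chr(cp)) == bidi_class:
            if start is None:
                start = cp  # open a new run
        elif start is not None:
            ranges.append((start, cp - 1))  # close the current run
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))
    return ranges


# RLO (right-to-left override) is a class with a single member, U+202E.
for first, last in bidi_class_ranges("RLO"):
    print(f"U+{first:04X}..U+{last:04X}")
```

Classes such as L or R expand to many ranges, which gives a feel for why each additional property costs table space in an engine.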
JIT support for Bidi_Control and Bidi_Class is now implemented.
Recent work in the development code has added support for a number of additional binary (yes/no) properties in addition to Bidi_Control, for example Changes_When_Casefolded. These are taken from various Unicode files. I'm going to close this issue now, but feel free to re-open if there are other properties that are wanted.
* Update PCRE -> PCRE2
  This reflects the migration we made for user-facing regex in semgrep/semgrep#9919
* Update warning to be correct for library update
  The warning we have here is not accurate for PCRE2, since it supports more Unicode properties than PCRE. See also <PCRE2Project/pcre2#39> and <https://www.pcre.org/current/doc/html/pcre2pattern.html>.
Regarding which Unicode properties are supported, the manual says:
We have users who want support for the Bidi_Control property (semgrep/semgrep#3974), which is supported by Perl (also, by Go's regexp library). I'm not familiar with any of these implementations and I'm wondering why PCRE doesn't support all Unicode properties. Is it because they were added late and PCRE needs to catch up, or for a technical reason?

Note that we're using PCRE from OCaml, for which there hasn't been an effort to migrate to PCRE2. So if we extend PCRE2 with support for more Unicode properties, we'll be unable to use it from OCaml unless we also port these changes to the old PCRE or we change the OCaml bindings to support the new API. It's really a separate issue but I thought I should mention it.