Rework script extension handling #64

zherczeg · 2021-12-28T06:24:40Z

This is a patch which reworks script extension handling. First, it organizes the scripts into two groups, as seen in pcre2_ucp.h. The first group contains those scripts which have characters assigned to other scripts in UnicodeData.txt. The second group, starting with ucp_Unknown, has no such characters. Furthermore, PT_SCX is replaced with PT_SC in PRIV(utt) for the latter group, and PT_SC is forced even if the user asks for script extensions. The special handling of scriptx is removed; it is now a uint8_t which contains an index into the PRIV(ucd_script_sets) bitset. The items in this bitset are shortened, since the highest non-zero bit is < (int)ucp_Unknown.
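The layout described above can be pictured with a small sketch. It is not the patch itself: every `sk_`/`SK_` name, the four-entry script list, and the table contents are made up for illustration, standing in for the real enum in pcre2_ucp.h and for PRIV(ucd_script_sets).

```c
#include <stdint.h>

/* Hypothetical miniature script list (not the real pcre2_ucp.h values).
   Group 1 comes first; everything from SK_UNKNOWN onwards is group 2. */
enum {
  SK_ARABIC,    /* group 1 */
  SK_SYRIAC,    /* group 1 */
  SK_UNKNOWN,   /* first script of group 2 */
  SK_OTHER      /* group 2 */
};

/* A row only needs one bit per group-1 script, so its width is derived
   from SK_UNKNOWN rather than from the total number of scripts. */
#define SK_ROW_WORDS ((SK_UNKNOWN + 31) / 32)

/* Stand-in for the bitset table. Row 0 is deliberately all zero. */
static const uint32_t sk_script_sets[][SK_ROW_WORDS] = {
  { 0 },
  { (UINT32_C(1) << SK_ARABIC) | (UINT32_C(1) << SK_SYRIAC) }
};

/* Stand-in for the per-character record: scriptx is a plain row index. */
typedef struct {
  uint8_t script;
  uint8_t scriptx;
} sk_record;

/* Tests one bit of the record's row. The caller must pass a group-1
   script, i.e. script < SK_UNKNOWN; larger values have no bit here. */
static int sk_in_extensions(const sk_record *rec, unsigned script)
{
  const uint32_t *row = sk_script_sets[rec->scriptx];
  return (row[script / 32] & (UINT32_C(1) << (script % 32))) != 0;
}
```

With this shape, a record whose scriptx is 0 needs no special case: the lookup simply finds no bit set.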

There is still a lot of work ahead, but first I am curious whether this change is worth doing. @PhilipHazel what do you think?

PhilipHazel · 2021-12-28T17:20:05Z

I am not sure it is worth making this change. It will save a bit of memory in the UCD table of bitsets, but this is an insignificant amount in the overall UCD tables. It will also free up one byte in the UCD records, which perhaps could be more important, but is not an issue at present. I may have missed something, but I have a problem with the matching code such as this:

        ok = (prop->script == Lpropvalue ||
              MAPBIT(PRIV(ucd_script_sets) + prop->scriptx, Lpropvalue) != 0);

The value of Lpropvalue is a script number, which can be any of the 163 scripts. However, you have only got space in the bitset for the 66 (I think) scripts that appear in other scripts' extensions. I did not see any change to the MAPBIT macro, so it looks to me as if that call to MAPBIT could look at invalid data. Should there not also be a check that Lpropvalue is less than ucp_Unknown? Maybe I missed something.... Also, what is the value of scriptx for characters that have no script extensions? Is there an entry full of zeros for that? All in all, my guess is that this would not make much difference to interpreter performance.
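To make the concern concrete, here is a minimal sketch of a bit test with the range check spelled out. The `demo_` names are hypothetical, and DEMO_FIRST_PLAIN is only a placeholder for (unsigned)ucp_Unknown; the value 66 is the guess from the paragraph above, not a number taken from the generated tables.

```c
#include <stdint.h>

/* Placeholder for (unsigned)ucp_Unknown; 66 is an assumed value. */
#define DEMO_FIRST_PLAIN 66u

/* Number of 32-bit words in one shortened row. */
#define DEMO_ROW_WORDS ((DEMO_FIRST_PLAIN + 31u) / 32u)

/* Returns nonzero when `script` is set in `row`. Script numbers at or
   above DEMO_FIRST_PLAIN have no bit in the shortened row, so they are
   rejected before the row is indexed. */
static int demo_row_has(const uint32_t row[DEMO_ROW_WORDS], unsigned script)
{
  if (script >= DEMO_FIRST_PLAIN) return 0;
  return (row[script / 32u] & (UINT32_C(1) << (script % 32u))) != 0;
}
```

Without the first test, a script number such as 162 would index past the end of a three-word row. Whether the check belongs at match time or can be guaranteed earlier, at compile time, is a separate design choice.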

However, I am always open to persuasion - but it is a lot of work, as you say. (I don't think you looked at dfa_match(), and there is also the ucptest program, and unfortunately I've been working in the same area to add script abbreviation support, so there is now a conflict in the patch.)

zherczeg · 2021-12-28T17:36:43Z

In pcre2_ucptables.c, all scripts that have no extended characters are converted to PT_SC, so when PT_SCX is encountered, the index is (should be) always within range (that is, < ucp_Unknown).
Your guess was right, the first record of PRIV(ucd_script_sets) is intentionally zero.

The performance gain is twofold:

- PT_SC is now used for those scripts that have no extended characters (no time wasted on extra PT_SCX checks)
- PT_SCX checks are simpler (no negative check)

This should make the life of the JIT compiler easier.
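The two gains listed above can be sketched roughly as follows. This is only an illustration under assumptions: the `demo_`/`DEMO_` names are invented, DEMO_FIRST_PLAIN stands in for ucp_Unknown with a placeholder value, and the real interpreter and table code are organized differently.

```c
#include <stdint.h>

/* Invented stand-ins for PT_SC / PT_SCX. */
enum demo_ptype { DEMO_PT_SC, DEMO_PT_SCX };

/* Placeholder for (unsigned)ucp_Unknown; the value is an assumption. */
#define DEMO_FIRST_PLAIN 66u

/* Decided once, when the property is resolved: a script that never
   occurs in an extension list is matched as a plain script test. */
static enum demo_ptype demo_effective_type(enum demo_ptype requested,
                                           unsigned wanted)
{
  if (requested == DEMO_PT_SCX && wanted >= DEMO_FIRST_PLAIN)
    return DEMO_PT_SC;
  return requested;
}

/* Match-time test. For DEMO_PT_SCX, `wanted` is already known to be
   below DEMO_FIRST_PLAIN, so the row can be indexed directly, and an
   all-zero row covers characters that have no extensions. */
static int demo_match(enum demo_ptype type, unsigned wanted,
                      unsigned char_script, const uint32_t *row)
{
  if (char_script == wanted) return 1;
  if (type == DEMO_PT_SC) return 0;
  return (row[wanted / 32u] & (UINT32_C(1) << (wanted % 32u))) != 0;
}
```

The point of the split is that the range decision is paid once per pattern rather than once per character tested.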

zherczeg · 2021-12-28T17:37:44Z

The patch is not ready, I have encountered errors, but I hope that code simplification always has a runtime benefit.

PhilipHazel · 2021-12-28T18:03:47Z

OK, I do see the benefits. Can you easily update the patch to fit with the changes I recently made? I will not make any more changes while you are working on this. I am happy to do the updates to dfa_match and ucptest afterwards - and also the various bits of documentation.

zherczeg · 2021-12-28T18:51:49Z

Updating was easy, I just merged the generator and regenerated the pcre2_ucptables.c file. The major missing thing is script run support. I think I get the general concept, but the data is different now, so the code should be reworked as well. Could you help me with that?

On testoutput5 I also get this:

 /^[\p{Arabic}]/utf
 \= Expect no match
     \x{650}
-No match
+ 0: \x{650}
     \x{651}
-No match
+ 0: \x{651}
     \x{652}

I am not sure what the problem is here.

zherczeg · 2021-12-29T06:25:27Z

I have investigated the /^[\p{Arabic}]/utf failure in testoutput5. It is the first pattern of the "More differences from Perl" section. It checks that the characters in the range \x{650} - \x{655} do not match. However, it seems these characters are part of the Arabic script extensions:

# Script_Extensions=Arab Syrc

064B..0655    ; Arab Syrc # Mn  [11] ARABIC FATHATAN..ARABIC HAMZA BELOW
0670          ; Arab Syrc # Mn       ARABIC LETTER SUPERSCRIPT ALEF

Is this an expected failure since we support script extensions now?

zherczeg · 2021-12-29T06:30:32Z

I also have a question about script runs. A comment says "Any string containing fewer than 2 characters is a valid script run.", but any character with the ucp_Unknown script is rejected later. Does this not apply to strings that are 1 character long?

PhilipHazel · 2021-12-29T09:34:59Z

I realized last night that script runs would have to be reworked, and I will do that.

zherczeg · 2021-12-29T09:44:13Z

Thank you. I will do the JIT support next. And as you mentioned, the various failures need to be fixed as well.

PhilipHazel · 2021-12-29T09:47:11Z

The Arabic issue looks like a bug in my previous code. I am looking at script runs - strings of only 1 character return TRUE at the start, before the Unknown test, so that test only applies to longer strings.
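The ordering of those two tests can be shown with a deliberately simplified sketch. It is not the PCRE2 script run code: the `demo_` names are hypothetical, DEMO_UNKNOWN is a placeholder value, the input is an array of already-resolved script numbers, and the final "all the same script" rule is a stand-in for the real, more involved logic.

```c
#include <stddef.h>

/* Placeholder for ucp_Unknown; the value is an assumption. */
#define DEMO_UNKNOWN 66u

/* Simplified script run check over per-character script numbers. */
static int demo_is_script_run(const unsigned *scripts, size_t n)
{
  size_t i;

  /* Strings of fewer than 2 characters are accepted first, before any
     script is examined, so a single Unknown character still passes. */
  if (n < 2) return 1;

  /* Only longer strings reach the Unknown test. */
  for (i = 0; i < n; i++)
    if (scripts[i] == DEMO_UNKNOWN) return 0;

  /* Stand-in rule: every character must have the same script. */
  for (i = 1; i < n; i++)
    if (scripts[i] != scripts[0]) return 0;

  return 1;
}
```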

zherczeg marked this pull request as draft December 28, 2021 06:38

zherczeg force-pushed the script_improve branch from 12fd2c2 to 19e8e52 Compare December 28, 2021 11:00

Rework script extension handling

c27854a

zherczeg force-pushed the script_improve branch from 19e8e52 to c27854a Compare December 28, 2021 18:47

PhilipHazel marked this pull request as ready for review December 29, 2021 09:27

PhilipHazel merged commit afa4756 into master Dec 29, 2021

zherczeg deleted the script_improve branch December 29, 2021 12:59

SolitaryGrass mentioned this pull request May 31, 2023

internal_dfa_match, a stack overflow occurred due to recursive calls. #258

Closed
