Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework script extension handling #64

Merged
merged 1 commit into from
Dec 29, 2021
Merged

Rework script extension handling #64

merged 1 commit into from
Dec 29, 2021

Conversation

zherczeg
Copy link
Collaborator

This is a patch which reworks script extension handling. First it organizes the scripts in two groups as seen in pcre2_ucp.h. The first group has those scripts, which has characters assigned to other scripts in UnicodeData.txt. The second group, starting with ucp_Unknown has no such characters. Furthermore PT_SCX is replaced to PT_SC in PRIV(utt) for the latter group, and PT_SC is forced even if the user expects script extension. The special handling of scriptx is removed, now it is an uint8_t type which contains in index in the PRIV(ucd_script_sets) bitset. The items in this bitset is shortened since the highest non-zero bit is < (int)ucp_Unknown.

There is still a lot of work ahead, but first I am curious whether it is worth to do this change. @PhilipHazel what do you think?

@zherczeg zherczeg marked this pull request as draft December 28, 2021 06:38
@PhilipHazel
Copy link
Collaborator

I am not sure it is worth making this change. It will save a bit of memory in the UCD table of bitsets, but this is an insignificant amount in the overall UCD tables. It will also free up one byte in the UCD records, which perhaps could be more important, but is not an issue at present. I may have missed something, but I have a problem with the matching code such as this:

        ok = (prop->script == Lpropvalue ||
              MAPBIT(PRIV(ucd_script_sets) + prop->scriptx, Lpropvalue) != 0);

The value of Lpropvalue is a script number, which can be any of the 163 scripts. However, you have only got space in the bitset for the 66 (I think) scripts that appear in other script's extensions. I did not see any change to the MAPBIT macro, so it looks to me as if that call to to MAPBIT could look at invalid data. Should there not also be a check that Lpropvalue is less than ucp_Unknown? Maybe I missed something.... Also, what is the value of scriptx for characters that have no script extensions? Is there a entry full of zeros for that? All in all, my guess is that this would not make much difference to interpreter performance.

However, I am always open to persuasion - but it is a lot of work, as you say. (I don't think you looked at dfa_match(), and there is also the ucptest program, and unfortunately I've been working in the same area to add script abbreviation support, so there is now conflict in the patch.)

@zherczeg
Copy link
Collaborator Author

In pcre2_ucptables.c, all scripts, which has no extended characters are converted to PT_SC, so when PT_SCX is encountered, the index is (should be) always within range (that is < ucp_Unknown).
Your guess was right, the first record of PRIV(ucd_script_sets) is intentionally zero.

The performance gain is twofold:

  • PT_SC is not used for those scripts, which has no extended characters (no time wasted on PC_SCX extra checks)
  • PT_SCX checks are simpler (no negative check)

This should make the life of the jiot compiler easier.

@zherczeg
Copy link
Collaborator Author

Patch is not ready, I have encountered errors, but I hope the code simplification always has runtime benefit.

@PhilipHazel
Copy link
Collaborator

OK, I do see the benefits. Can you easily update the patch to fit with the changes I recently made? I will not make any more changes while you are working on this. I am happy to do the updates to dfa_match and ucptest afterwards - and also the various bits of documentation.

@zherczeg
Copy link
Collaborator Author

Updating was easy, I just merged the generator and regenerated the pcre2_ucptables.c file. The major missing thing is script run support. I think I get the general concept, but the data is different now, so the code should be reworked as well. Could you help me in doing that?

On testoutput5 I also get this:

 /^[\p{Arabic}]/utf
 \= Expect no match
     \x{650}
-No match
+ 0: \x{650}
     \x{651}
-No match
+ 0: \x{651}
     \x{652}

Not sure what is the problem here.

@zherczeg
Copy link
Collaborator Author

I have investigated the /^[\p{Arabic}]/utf failure in testoutput5. It is the first pattern of the More differences from Perl section. It checks the \x{650} - \x{655} character range for no match. However, it seems these characters are part of Arabic script extension:

# Script_Extensions=Arab Syrc

064B..0655    ; Arab Syrc # Mn  [11] ARABIC FATHATAN..ARABIC HAMZA BELOW
0670          ; Arab Syrc # Mn       ARABIC LETTER SUPERSCRIPT ALEF

Is this an expected failure since we support script extensions now?

@zherczeg
Copy link
Collaborator Author

I also have a question about script runs. A comment says: Any string containing fewer than 2 characters is a valid script run., but any character with ucp_Unknown script is rejected later. Is this not true for 1 character long strings?

@PhilipHazel PhilipHazel marked this pull request as ready for review December 29, 2021 09:27
@PhilipHazel
Copy link
Collaborator

I realized last night that script runs would have to be reworked, and I will do that.

@PhilipHazel PhilipHazel merged commit afa4756 into master Dec 29, 2021
@zherczeg
Copy link
Collaborator Author

Thank you. I will do the jit support next. And as you mentioned, the various fails needs to be fixed as well.

@PhilipHazel
Copy link
Collaborator

The Arabic issue looks like a bug in my previous code. I am looking at script runs - strings of only 1 character return TRUE at the start, before the Unknown test, so that test only applies to longer strings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants