Rework script extension handling #64
12fd2c2 to 19e8e52
I am not sure it is worth making this change. It will save a bit of memory in the UCD table of bitsets, but this is an insignificant amount in the overall UCD tables. It will also free up one byte in the UCD records, which perhaps could be more important, but is not an issue at present. I may have missed something, but I have a problem with the matching code such as this:
The value of Lpropvalue is a script number, which can be any of the 163 scripts. However, you only have space in the bitset for the 66 (I think) scripts that appear in other scripts' extensions. I did not see any change to the MAPBIT macro, so it looks to me as if that call to MAPBIT could look at invalid data. Should there not also be a check that Lpropvalue is less than ucp_Unknown? Maybe I missed something. Also, what is the value of scriptx for characters that have no script extensions? Is there an entry full of zeros for that?
All in all, my guess is that this would not make much difference to interpreter performance. However, I am always open to persuasion, but it is a lot of work, as you say. (I don't think you looked at dfa_match(), and there is also the ucptest program. Unfortunately, I have been working in the same area to add script abbreviation support, so there is now a conflict in the patch.)
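The bounds concern raised here can be illustrated with a small, self-contained sketch. Every name in it (`SKETCH_MAPBIT`, `SKETCH_FIRST_GROUP`, `sketch_set_add`, `sketch_script_in_set`) is hypothetical, and the value 66 is simply the estimate quoted in the comment; none of this is taken from the patch. It assumes 32-bit bitset words and shows why a script number at or above the boundary has to be rejected before the shortened bitset is indexed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical values for illustration only; real numbers would come from
   the generated tables. SKETCH_FIRST_GROUP plays the role of ucp_Unknown. */
enum { SKETCH_FIRST_GROUP = 66 };
enum { SKETCH_SET_WORDS = (SKETCH_FIRST_GROUP + 31) / 32 };

/* Bit test over 32-bit words: non-zero if bit n of the map is set. */
#define SKETCH_MAPBIT(map, n) ((map)[(n) / 32] & (1u << ((n) % 32)))

/* Add a first-group script to a set. */
static void sketch_set_add(uint32_t *set, unsigned script)
{
  assert(script < SKETCH_FIRST_GROUP);
  set[script / 32] |= 1u << (script % 32);
}

/* Returns 1 if `script` is in the set, 0 otherwise. Scripts at or above
   SKETCH_FIRST_GROUP never occur in any set, so they are rejected before
   the array is indexed; without the guard the read would go past the end
   of a SKETCH_SET_WORDS-long set. */
static int sketch_script_in_set(const uint32_t *set, unsigned script)
{
  if (script >= SKETCH_FIRST_GROUP) return 0;
  return SKETCH_MAPBIT(set, script) != 0;
}
```

With a set of `SKETCH_SET_WORDS` words, a query for script number 162 would index word 5 without the guard, which is outside the array.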
In The performance gain is twofold:
This should make the life of the JIT compiler easier.
The patch is not ready yet, and I have run into errors, but I believe a code simplification always has a runtime benefit.
OK, I do see the benefits. Can you easily update the patch to fit with the changes I recently made? I will not make any more changes while you are working on this. I am happy to do the updates to dfa_match and ucptest afterwards, and also the various bits of documentation.
19e8e52 to c27854a
Updating was easy, I just merged the generator and regenerated the On
I am not sure what the problem is here.
I have investigated the
Is this an expected failure since we support script extensions now?
I also have a question about script runs. A comment says:
I realized last night that script runs would have to be reworked, and I will do that.
Thank you. I will do the JIT support next. And as you mentioned, the various failures need to be fixed as well.
The Arabic issue looks like a bug in my previous code. I am looking at script runs: strings of only 1 character return TRUE at the start, before the Unknown test, so that test only applies to longer strings.
This is a patch which reworks script extension handling. First, it organizes the scripts into two groups, as seen in pcre2_ucp.h. The first group contains those scripts which have characters assigned to other scripts in UnicodeData.txt. The second group, starting with ucp_Unknown, has no such characters. Furthermore, PT_SCX is replaced by PT_SC in PRIV(utt) for the latter group, and PT_SC is forced even if the user expects script extension. The special handling of scriptx is removed; it is now a uint8_t value which contains an index into the PRIV(ucd_script_sets) bitset. The items in this bitset are shortened, since the highest non-zero bit is < (int)ucp_Unknown.
There is still a lot of work ahead, but first I am curious whether this change is worth making. @PhilipHazel what do you think?
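As a rough illustration of the layout described in this comment, the sketch below uses a made-up five-script list and made-up names (`SK_*`, `sk_record`, `sk_script_sets`, `sk_property_type`, `sk_match`); it is not the patch's code, and the real tables are generated. It assumes that entry 0 of the set table is an empty set for characters without extensions, and that a script-extension test succeeds when either the character's own script or its extension set matches.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical miniature script list. Scripts below SK_UNKNOWN may occur
   in extension sets; scripts from SK_UNKNOWN onward never do. */
enum { SK_A, SK_B, SK_C, SK_UNKNOWN, SK_D, SK_COUNT };

enum { SK_PT_SC, SK_PT_SCX };  /* stand-ins for PT_SC and PT_SCX */

/* One word per set is enough here because every bit is < SK_UNKNOWN.
   Index 0 is assumed to be the empty set (no script extensions). */
static const uint32_t sk_script_sets[] = {
  0u,
  (1u << SK_A) | (1u << SK_C),
};

/* Per-character record: own script plus a one-byte index into the sets. */
typedef struct { uint8_t script; uint8_t scriptx; } sk_record;

/* Second-group scripts can never be found in an extension set, so an
   extension request for them is downgraded to a plain script test. */
static int sk_property_type(unsigned script, int requested)
{
  assert(script < SK_COUNT);
  return (script >= SK_UNKNOWN) ? SK_PT_SC : requested;
}

/* Returns 1 if the character described by r matches `script`. */
static int sk_match(const sk_record *r, int type, unsigned script)
{
  if (r->script == script) return 1;
  if (type != SK_PT_SCX) return 0;
  return (sk_script_sets[r->scriptx] & (1u << script)) != 0;
}
```

Because the downgrade happens when the property is looked up, the extension branch of the matcher only ever sees first-group script numbers in this sketch.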