-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Even in 8-bit mode, perform range computation for char classes if UCP flag is set #527
base: master
Are you sure you want to change the base?
Conversation
@zherczeg, I've been going through your char class code and have figured out how some parts of it work, though there is still a lot which I don't understand well. If I understood your comment on #526 well, you suggested that this is what is needed to fix the unexpected behavior change from #474 which I discovered. Did I understand you correctly? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worry that this is not enough. The caseless 's' and 'k' (and turkish 'i' soon) might force an xclass, we need to exclude them in some way.
A check around here could help probably, to terminate processing ranges >256 in 8 bit mode.
https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_compile.c#L6383
Btw quite good patch for first attempt. |
Without |
In 8 bit mode, they cannot, but in 16 bit mode, they can. For me 8 bit + ucp is the same as 16 bit + ucp, except characters > 255 are simply not present, as characters > 65535 are simply not present for 16 bit mode. It looks like perl has a different approach for this. |
The main problem I have with our approach is that we are just grossly misinterpreting characters with UCP and not UTF. It is just not the same character. UCP without UTF makes sense in the 16/32bit libraries were it could represent UCS, but in the 8 bit library we are reading 1 byte and assigning it properties that belong to a different character, just because the ord was the same. |
Carlo, if UCP mode should be 'banned' for the 8-bit library because it doesn't make sense (which I tend to agree with), I suggest a separate issue could be opened to discuss that issue. I don't know the policies followed by PCRE2, but for some other OSS projects which I have worked on before, this would require a deprecation first. If UCP was banned for the 8-bit library (i.e. using the PCRE2_UCP compile flag or (*UCP) would cause compilation to fail with an error), then all code to support that use case could be removed. In the meantime, as much as it doesn't make sense, I think that UCP+8bit should be supported, because there is nothing in the documentation saying that it is illegal. The obvious thing for UCP+8bit to do is to treat each byte as a Unicode code point from U+0000 up to U+00FF. There are a lot of things I don't know well here, so I may be completely off base. |
As far as I know this is not happening. We use the pre-compiled tables for many things, and use ucp for other things. The pre-compiled tables depend on locals, and they can be very different from unicode codepoints. We might need to clean up these things at some point. |
Yes, and I think that was behind the thinking when PCRE2_UCP was allowed in 8-bit mode. Note this change from 10.35:
There must have been a reason for this but I can't remember it. It might have related to 16/32-bit. I think this is a relatively minor issue, because there are not likely to be many (any?) 8-bit use cases where PCRE2_UCP is set without PCRE2_UCP. So we shouldn't spend a lot of time on it. Simplest not to introduce any incompatibilities. |
I agree. Fortunately the new code is perfectly capable of generating the full bitmask for any properties, and handling anything caseless ranges. We just make to ensure that XCLASS is never generated in this case. https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_compile_class.c#L531 The 8 bit case simply does not worth any optimizations though. The extra checks should be guarded to 8 bit as well. |
That seems reasonable to me, and it's what I've been assuming. Note that this aligns well with Latin-1, which is a popular 8-bit encoding that was grandfathered into Unicode to fill the codepoints U+0080 to U+00FF. There is quite a lot of text out there where using 8-bit+UCP for Latin-1 interpretation would be accurate. |
… flag is set When testing another patch, I discovered that PCRE2Project#474 caused a small change in the behavior of character classes when caseless mode and UCP were enabled. Thank you to Zoltan Herczeg for suggesting a fix. Closes PCRE2ProjectGH-526.
@zherczeg, I've tried to apply your advice, please have a look and tell me if this looks right or not. |
@@ -6479,6 +6485,9 @@ for (;; pptr++) | |||
#ifdef SUPPORT_WIDE_CHARS /* Defined for 16/32 bits, or 8-bit with Unicode */ | |||
if ((xclass_props & XCLASS_REQUIRED) != 0) | |||
{ | |||
/* We should never generate a (useless) xclass in 8-bit library when UTF flag is false */ | |||
PCRE2_ASSERT(PCRE2_CODE_UNIT_WIDTH != 8 || utf); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is usually an #if
since it looks better, even if this variant is working.
/* If code unit width is 8 bits, and UCP flag is set, but UTF flag is not, we still | ||
* generate cranges, but in that case we should not process any crange > 0xFF, | ||
* because it's impossible to encounter code points > 0xFF in the subject string */ | ||
if (utf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need code indentation below? An if (!utf) range = end;
also works. I have no preference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoiding conditional indentation can only help making the logic easier to follow IMHO
Update: The results of the CI build showed that my added I tried further suppressing this, but right now, when I run the test suite with ASan enabled, it is detecting a heap buffer overflow. I will analyze further and figure out why this is the case. |
It looks like the patch has conflicts. We should not forget about this code since it fixes a bug. |
When testing another patch, I discovered that #474 caused a small change in the behavior of character classes when caseless mode and UCP were enabled.
Thank you to Zoltan Herczeg for suggesting a fix.
Closes GH-526.