PCRE2_UTF flag detects U+202F as Mongolian #118

dAu6jARL · 2022-05-09T23:53:52Z

If pcre2_compile() is called with the PCRE2_UTF option, U+202F (NARROW NO-BREAK SPACE) is detected as Mongolian.
pcre2grep with the -u option shows the same behavior.

sample.text contains the following:

foo bar (blank is U+0020)
foobar (blank is U+200B)
foo bar (blank is U+202F)
foobar (blank is U+FEFF)
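
Because two of these separators are invisible and the other two look alike, the file is hard to reproduce by copy and paste. The following is a minimal sketch for regenerating it. It assumes each line is "foo", the separator, "bar", and then the annotation shown above. The helper names build_sample_lines and write_sample are made up for illustration.

```python
from pathlib import Path

# Separator characters named in the sample above, keyed by their annotation.
SEPARATORS = {
    "U+0020": "\u0020",  # SPACE
    "U+200B": "\u200b",  # ZERO WIDTH SPACE
    "U+202F": "\u202f",  # NARROW NO-BREAK SPACE
    "U+FEFF": "\ufeff",  # ZERO WIDTH NO-BREAK SPACE
}


def build_sample_lines():
    """Return the four sample lines (assumed layout: foo<sep>bar + annotation)."""
    return [f"foo{ch}bar (blank is {name})" for name, ch in SEPARATORS.items()]


def write_sample(path="sample.text"):
    """Write the sample lines to a UTF-8 file, one per line."""
    text = "\n".join(build_sample_lines()) + "\n"
    Path(path).write_text(text, encoding="utf-8")
```

Calling write_sample() creates sample.text in the current directory, ready for the pcre2grep command.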

The command is:

pcre2grep -u '\p{Mongolian}' sample.text

The output is:

foo bar (blank is U+202F)


PhilipHazel · 2022-05-11T16:12:01Z

It appears that Mongolian exists in the list of script extensions for U+202F. Here is output from the ucptest program:

$ ./ucptest 202f
U+202F CS Separator: Space separator, common, Other, [latin, mongolian], [alphabetic, caseignorable, cased, diacritic, graphemebase, idcontinue, idstart, lowercase]

Perl also recognizes U+202F as Mongolian. The Unicode file ScriptExtensions.txt from which PCRE2 gets its data contains this:

202F ; Latn Mong # Zs NARROW NO-BREAK SPACE

So it looks like this is deliberate on the part of Unicode. I am therefore closing this as invalid.
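
To make the quoted ScriptExtensions.txt entry easier to read, here is a small sketch that splits such a line into its code point range, its list of script codes, and the trailing comment. The function name parse_scx_line is hypothetical, and the only data used is the single line quoted above. This is not how PCRE2 itself builds its tables.

```python
def parse_scx_line(line):
    """Parse one data line of the form 'XXXX[..YYYY] ; Scr1 Scr2 # comment'.

    Returns (range of code points, list of script codes, comment text).
    """
    data, _, comment = line.partition("#")
    cp_field, _, scripts_field = data.partition(";")
    first, _, last = cp_field.strip().partition("..")
    start = int(first, 16)
    end = int(last, 16) if last else start
    return range(start, end + 1), scripts_field.split(), comment.strip()


# The line quoted from ScriptExtensions.txt in this thread.
LINE = "202F ; Latn Mong # Zs NARROW NO-BREAK SPACE"
code_points, scripts, comment = parse_scx_line(LINE)
print([f"U+{cp:04X}" for cp in code_points], scripts, comment)
```

The parsed script list contains both Latn and Mong, which is why a script-extension test for Mongolian succeeds on this character.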

dAu6jARL · 2022-05-12T21:53:21Z

Thank you for your reply.
I need to study more.

PhilipHazel · 2022-05-13T07:41:39Z

Note that \p{Mong} works like \p{scx:Mong}, that is, it checks both the script and the script extensions. If you want to test just the script, use \p{sc:Mong}.
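
The difference between the two forms can be modelled roughly as below. This is only a sketch of the behavior as described in this comment, not PCRE2's actual code or tables. The two dictionaries hold just the U+202F data reported by ucptest earlier in the thread, and the names matches_sc and matches_scx are invented for illustration.

```python
# Data for U+202F as reported by ucptest above: script "common",
# script extensions [latin, mongolian]. Nothing else is modelled.
SCRIPT = {0x202F: "Common"}
SCRIPT_EXTENSIONS = {0x202F: {"Latin", "Mongolian"}}


def matches_sc(cp, script):
    """Model of \\p{sc:Xxx}: compare against the character's script only."""
    return SCRIPT.get(cp) == script


def matches_scx(cp, script):
    """Model of \\p{scx:Xxx} and bare \\p{Xxx}: script or script extensions."""
    return matches_sc(cp, script) or script in SCRIPT_EXTENSIONS.get(cp, set())


print(matches_sc(0x202F, "Mongolian"))   # script-only test
print(matches_scx(0x202F, "Mongolian"))  # script plus extensions test
```

Under this model the script-only test rejects U+202F for Mongolian while the extension-aware test accepts it, which matches the pcre2grep result reported at the top of the thread.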

dAu6jARL · 2022-05-14T02:01:24Z

Thank you for your advice. I'll use \p{sc:Xxx} where appropriate.
As it happens, I ran into U+202F being matched by \p{Mongolian} in place names such as the following.

Arrêt Nation – Voltaire [56]
https://foursquare.com/v/4ffc7038e4b07354de08b054

Pasta & Tapas Pietro 池袋店
https://foursquare.com/v/613eee8f90eb43793c2e76fd

PhilipHazel added the invalid label May 11, 2022

PhilipHazel closed this as completed May 11, 2022

SolitaryGrass mentioned this issue May 31, 2023

internal_dfa_match, a stack overflow occurred due to recursive calls. #258

Closed
