`isalpha` should use Unicode property `Alphabetic`; rename to `isletter` #26932
Comments
There are a whole bunch of Unicode character properties that aren't currently in utf8proc, e.g. I suspect that, rather than cramming all of these into utf8proc, it would be better to keep utf8proc focused mainly on normalization and have a separate package of UnicodeProperties with a bunch of optimized 2-stage tables (exposed as e.g. a new |
In the meantime, maybe |
+1 for |
Triage is ok with renaming to |
If someone wants to make a PR doing this rename, that would be good; I don't think it's going to happen otherwise, though. @digital-carver? (or @ararslan if you feel like it)
Right now, it simply checks whether the given character is in one of the L categories (`isalpha(c::AbstractChar) = UTF8PROC_CATEGORY_LU <= category_code(c) <= UTF8PROC_CATEGORY_LO`). This is almost correct, except that the Unicode `Alphabetic` property covers these categories, the `Nl` category (number-like letters, e.g. Roman numerals), and crucially a set of characters defined to be `Other_Alphabetic` that live in `Mc` and `Mn` (spacing and non-spacing marks). A lot of code points in Indic texts, e.g. most occurrences of vowels in Tamil texts, are characters found in this `Other_Alphabetic` list.

Among the few other (programming) languages I tried this check on, Ruby (`\p{Alpha}`) and Java (`Character.isAlphabetic`) get this right (the Java documentation explicitly explains the `Alphabetic` property), while Python 2 and 3 both (`"அதிகாலை".isalpha()`) seem to be getting it wrong. Perl also gets the `Other_Alphabetic` characters correctly identified under `\p{Alpha}` (though it also seems to have additional magic on top).

`Other_Alphabetic` apparently covers 1300 code points according to the Unicode PropList, so there are letters from quite a few language scripts that currently fail `isalpha`.

I'm not sure if `utf8proc` supports querying for either the `Alphabetic` or the `Other_Alphabetic` property (the `utf8proc_property_struct` doesn't seem to have either property), so this might have to be implemented there first. Also, possibly related to #25653 with regards to implementation.
doesn't seem to have either property), so this might have to be implemented there first. Also, possibly related to #25653 with regards to implementation.The text was updated successfully, but these errors were encountered: