-
Notifications
You must be signed in to change notification settings - Fork 135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
collation order vs encoding order in range matching #88
Comments
Hi,
Reply is in-line below.
On Sat, Jun 03, 2023 at 08:33:26AM -0700, Ben Bolker wrote:
The [TRE documentation](https://laurikari.net/tre/documentation/regex-syntax/) defines a range as
> Two characters separated by -. This is shorthand for the full range of characters between those two (inclusive) in the collating sequence.
([here](https://github.com/laurikari/tre/blob/494b1c9c6827a3205f162f047c3d4bd0a681405d/doc/tre-syntax.html#L245-L248) in the repository)
However, testing with the Estonian locale (in R's imported version of TRE) shows that T is incorrectly matched by [A-Z] ... [this comment](https://github.com/r-devel/r-svn/blob/66813d4c830d830a2e0acb0e2f53f522b8e2dc37/src/extra/tre/tre-parse.c#L300) says
This sort of problem shows up in the ordinary unix commands too. There
was a recent mention on one of the FreeBSD lists that "find" with a
range of "[A-Z]" used the collating order so it matched the lower case
letters too, which would have suprised an old unix "find" user like me,
who's used to that range using the ASCII encoding. It gets even more
complicated when you start dealing with data containing multiple languages
because there doesn't seem to be a single collating order for all the
languages at once. If you happen to know one, I need it in another
project of mine :).
> /* XXX - Should use collation order instead of encoding values in character ranges. */
Would it be correct to change the documentation to say
> The characters to include are determined by Unicode code point ordering.
I think that's the best answer for the short term. I'm unfortunately
only literate in one language, so I hesitate to offer too strong an
opinion on any collating or encoding issue.
as in the [ICU documentation](https://unicode-org.github.io/icu/userguide/strings/regexp.html) ... ?
I'll add that to the list of easy to make changes I'm working on. I'm
currently (2023 June) working on the PR's, but if I haven't done enough
damage by the time I've sorted out those :) I'll see what I can make of the
outstanding issues.
--
Reply to this email directly or view it on GitHub:
#88
You are receiving this because you are subscribed to this thread.
Message ID: ***@***.***>
Cheers,
--
Tom Rushworth
|
For what it's worth there's an extensive Stack Overflow answer demonstrating that many tested regex implementations use locale-/collating-sequence-independent ranges (the only exceptions were some versions of |
Hi,
On Sun, Jun 04, 2023 at 02:03:02PM -0700, Ben Bolker wrote:
For what it's worth there's an [extensive Stack Overflow answer](https://stackoverflow.com/questions/11925537/should-we-consider-using-range-a-z-as-a-bug/44228789#44228789) demonstrating that many tested regex implementations use locale-/collating-sequence-independent ranges (the only exceptions were *some* versions of `grep` and `awk`)
Ah - interesting. I think my unease with all this is that it makes
regex matching locale dependent, and not in a nice way. Consider the
following:
1) You have some crowd sourced data, such as tags for photos or
music.
2) Because it is crowd sourced, it can easily be in different
languages, even if it is all stored in UTF-8 (or whatever),
so it would have to have some sort of locale information
stored with the value.
3) Matching character classes like [:alpha:] correctly in that
data means defining the classes differently for different
pieces of data. You would have to provide a locale for the
data at _match_ time, not for the regex at compile time.
I suppose from an implementation viewpoint, the classes could be handled
by adding a locale parameter to the regex match function and some sort
of plugin accessed by the match function code. The problem is that then
you need the plugins, and your nice clean regex library suddenly has a
very large and somewhat loosely defined dependency.
I didn't look at the details of the language testing in that stack
overflow discussion, but I didn't see any of them passing a locale at
match time, it all seems to be passing or picking a default locale at
compile time, so none of them are going to work properly on multligual
data :(.
As an alternative, I suppose we could hope that all Unicode codepoints
have language independent classification. I don't know enough about
Unicode to know if that is realistic, but if it is then a character
class definition could be implemented that checked by codepoint value
independent of language. Then all we have to do is persuade the
Unicode people to stop adding codepoints :).
Just to add one final complication, how should these
not-yet-perfectly-defined character classes work with approximate
matching?
I do not think there really is an answer right now (2023), but I can
hope for more multilingual programmers than I to come up with one :).
Until then, I think documenting the issue (as you suggested) is the
best we can do.
--
Reply to this email directly or view it on GitHub:
#88 (comment)
You are receiving this because you commented.
Message ID: ***@***.***>
Cheers,
--
Tom Rushworth
|
Honestly, I'm not too worried about this. The practical advice that's always given (which I agree with) is "don't rely on ranges to identify alphabetic characters, use
|
The TRE documentation defines a range as
(here in the repository)
However, testing with the Estonian locale (in R's imported version of TRE) shows that T is incorrectly matched by [A-Z] ... this comment says
Would it be correct to change the documentation to say
as in the ICU documentation ... ?
The text was updated successfully, but these errors were encountered: