collation order vs encoding order in range matching #88

Open
bbolker opened this issue Jun 3, 2023 · 4 comments

bbolker commented Jun 3, 2023

The TRE documentation defines a range as

Two characters separated by -. This is shorthand for the full range of characters between those two (inclusive) in the collating sequence.

(here in the repository)

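However, testing with the Estonian locale (in R's imported version of TRE) shows that T is matched by [A-Z], which contradicts that definition: in the Estonian collating sequence Z sorts before T, so T should fall outside the range. TRE appears to compare encoding values instead; this comment says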

/* XXX - Should use collation order instead of encoding values in character ranges. */

Would it be correct to change the documentation to say

The characters to include are determined by Unicode code point ordering.

as in the ICU documentation ... ?
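
To make the difference concrete, here is a minimal Python sketch contrasting the two readings of [A-Z]. It does not use TRE; Python's `re` is used only because its ranges are known to follow code point order. The `ESTONIAN_ORDER` string and the helper `in_collation_range` are hypothetical, a hand-written simplification of the Estonian alphabet rather than a real locale-aware collator.

```python
import re

# Assumption: simplified uppercase Estonian alphabet, written out by hand.
# Z and Ž sort between Š and T. This is NOT a real locale collator.
ESTONIAN_ORDER = "ABCDEFGHIJKLMNOPQRSŠZŽTUVWÕÄÖÜXY"


def in_codepoint_range(ch, lo, hi):
    """Range membership by Unicode code point (encoding order)."""
    return ord(lo) <= ord(ch) <= ord(hi)


def in_collation_range(ch, lo, hi, order=ESTONIAN_ORDER):
    """Range membership by position in a given collating sequence."""
    pos = order.find(ch)
    return pos != -1 and order.index(lo) <= pos <= order.index(hi)


if __name__ == "__main__":
    for ch in "STŠZ":
        print(
            ch,
            "re:", re.fullmatch("[A-Z]", ch) is not None,
            "code point:", in_codepoint_range(ch, "A", "Z"),
            "collation:", in_collation_range(ch, "A", "Z"),
        )
```

Under these assumptions, T is inside the code point range but outside the collation range, while Š is the reverse. The first behaviour is what the Estonian test above observed.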

Collaborator

trushworth commented Jun 4, 2023 via email

Author

bbolker commented Jun 4, 2023

For what it's worth, there's an extensive Stack Overflow answer demonstrating that most of the regex implementations it tested use ranges that are independent of locale and collating sequence (the only exceptions were some versions of grep and awk).

Collaborator

trushworth commented Jun 4, 2023 via email

Author

bbolker commented Jun 4, 2023

Honestly, I'm not too worried about this. The practical advice that's always given (which I agree with) is "don't rely on ranges to identify alphabetic characters; use [:alpha:] instead". I think if you wanted to handle the crazy edge case of "find alphabetic characters that are between A and Z in my locale", you'd have to hack it yourself. (The general advice is that if you want to match only uppercase ASCII letters reliably, you have to enumerate the class as [ABCDEFGHIJKLMNOPQRSTUVWXYZ].)

  • [:alpha:] is arguably better anyway because it handles accented characters nicely
  • I think that the definition of 'alphabetic' is effectively already handled in Unicode: I'd assume that any character whose Unicode general category is a 'letter' category would count as alphabetic ...
  • I take your point about multilingual input, which is yet another reason to define ranges in Unicode and not try to make ranges dependent on collating sequences
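
As a rough illustration of the two pieces of advice above, here is a Python sketch. Python's `re` does not support [:alpha:], so `is_letter` is a hypothetical stand-in that treats any character whose Unicode general category starts with "L" as alphabetic. That is an assumption for illustration, not TRE's definition of [:alpha:]. `is_ascii_upper` uses the enumerated class, so it depends on neither locale nor range ordering.

```python
import re
import unicodedata

# Enumerated class: matches exactly the 26 uppercase ASCII letters.
ASCII_UPPER = re.compile(r"[ABCDEFGHIJKLMNOPQRSTUVWXYZ]")


def is_letter(ch):
    """Approximate 'alphabetic' as Unicode general category L* (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")


def is_ascii_upper(ch):
    """True only for a single uppercase ASCII letter."""
    return ASCII_UPPER.fullmatch(ch) is not None


if __name__ == "__main__":
    for ch in ["T", "t", "\u0160", "\u00e9", "1", "_"]:
        print(repr(ch), "letter:", is_letter(ch), "ASCII upper:", is_ascii_upper(ch))
```

With this approach accented letters such as Š and é count as letters, but only A through Z pass the enumerated ASCII check.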
