Add minimal Unicode support to lang.string #1129

Open
jclark opened this issue Jun 23, 2022 · 1 comment
Labels
Area/LangLib Relates to lang.* libraries Type/Improvement Enhancement to language design

@jclark
Collaborator

jclark commented Jun 23, 2022

lang.string has so far excluded all Unicode functionality, to avoid needing information about Unicode character properties, which change with each Unicode version. However, regular expressions (#130) need this information, so the exclusion no longer makes sense.

@jclark
Collaborator Author

jclark commented Jun 23, 2022

General category

Regexps allow testing the Unicode general category of a character using \p, so we should make this information available via function calls.

Go has a tasteful design here. Here are the Unicode general categories for which it provides functions:

  • L - IsLetter
  • Ll - IsLower
  • Lu - IsUpper
  • Lt - IsTitle
  • Cc - IsControl
  • N - IsNumber
  • Nd - IsDigit
  • M - IsMark
  • P - IsPunct
  • S - IsSymbol
  • L, M, N, P, S, Zs - IsGraphic

I think we should also have:

  • C - IsOther
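
For reference, a minimal sketch of how the Go predicates listed above behave on a few sample characters. The helpers `isOther` and `categoryFlags` are hypothetical names invented for this illustration. Go has no `IsOther`, so the sketch approximates it with the `unicode.C` range table, which is an assumption about what the proposed function would cover.

```go
package main

import (
	"fmt"
	"unicode"
)

// isOther is a hypothetical stand-in for the proposed IsOther. It tests
// membership in Go's unicode.C table (general category C as tabulated by Go).
func isOther(r rune) bool {
	return unicode.Is(unicode.C, r)
}

// categoryFlags returns the names of the predicates that hold for r,
// in the same order as the list in the issue.
func categoryFlags(r rune) []string {
	checks := []struct {
		name string
		ok   bool
	}{
		{"IsLetter", unicode.IsLetter(r)},
		{"IsLower", unicode.IsLower(r)},
		{"IsUpper", unicode.IsUpper(r)},
		{"IsTitle", unicode.IsTitle(r)},
		{"IsControl", unicode.IsControl(r)},
		{"IsNumber", unicode.IsNumber(r)},
		{"IsDigit", unicode.IsDigit(r)},
		{"IsMark", unicode.IsMark(r)},
		{"IsPunct", unicode.IsPunct(r)},
		{"IsSymbol", unicode.IsSymbol(r)},
		{"IsGraphic", unicode.IsGraphic(r)},
		{"IsOther", isOther(r)},
	}
	var out []string
	for _, c := range checks {
		if c.ok {
			out = append(out, c.name)
		}
	}
	return out
}

func main() {
	// Sample characters: lowercase, uppercase, titlecase digraph, digit,
	// control, punctuation, symbol, combining mark.
	for _, r := range []rune{'a', 'A', '\u01C5', '5', '\t', '!', '+', '\u0301'} {
		fmt.Printf("%U %v\n", r, categoryFlags(r))
	}
}
```

Note that the predicates overlap: a lowercase letter satisfies IsLetter, IsLower and IsGraphic at once, so an API modelled on this design is a set of independent tests rather than a single classification.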

We also need something for white space. Relevant characters are:

  1. CR, LF (normal line terminators)
  2. TAB
  3. SPACE
  4. VT, FF (more obscure ASCII characters often treated as space)
  5. NEL (0x85 - EBCDIC legacy)
  6. Unicode LineSep, ParaSep (0x2028, 0x2029) category Zl, Zp
  7. Unicode Space separator (category Zs) other than SPACE

Currently

  • Ballerina line terminator is 1
  • Ballerina white space is 1, 2 or 3
  • Unicode Zs is 3 or 7
  • Unicode White_Space property is 1 through 7
  • ECMAScript whitespace (which is used for ECMAScript \s) excludes 5, but includes 0xFEFF (ZWNBSP)
  • ECMAScript line terminator is 1 or 6
  • Java regex \s is 1 through 4
  • Java line terminator is 1, 5 or 6
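
The comparison above can be made concrete by numbering the character groups and expressing each definition as a set of group numbers. This is only a sketch: `spaceGroup`, `inGroups` and the `is...` predicates are hypothetical names, group 4 is taken to be VT and FF, and the definitions encode the summary in this comment rather than being checked against each language's specification.

```go
package main

import (
	"fmt"
	"unicode"
)

// spaceGroup returns the group number (1-7) from the list in the issue,
// or 0 if r belongs to none of the groups.
func spaceGroup(r rune) int {
	switch {
	case r == '\r' || r == '\n':
		return 1 // CR, LF
	case r == '\t':
		return 2 // TAB
	case r == ' ':
		return 3 // SPACE
	case r == '\v' || r == '\f':
		return 4 // VT, FF
	case r == 0x85:
		return 5 // NEL
	case r == 0x2028 || r == 0x2029:
		return 6 // LineSep, ParaSep
	case unicode.Is(unicode.Zs, r):
		return 7 // Zs other than SPACE (SPACE was matched above)
	}
	return 0
}

// inGroups reports whether r falls in any of the given groups.
func inGroups(r rune, groups ...int) bool {
	g := spaceGroup(r)
	if g == 0 {
		return false
	}
	for _, want := range groups {
		if g == want {
			return true
		}
	}
	return false
}

func isBallerinaLineTerminator(r rune) bool { return inGroups(r, 1) }
func isBallerinaWhiteSpace(r rune) bool     { return inGroups(r, 1, 2, 3) }
func isUnicodeWhiteSpace(r rune) bool       { return spaceGroup(r) != 0 }
func isJavaRegexSpace(r rune) bool          { return inGroups(r, 1, 2, 3, 4) }
func isJavaLineTerminator(r rune) bool      { return inGroups(r, 1, 5, 6) }

// ECMAScript \s: everything except NEL, plus U+FEFF (ZWNBSP).
func isECMAScriptSpace(r rune) bool {
	return r == 0xFEFF || inGroups(r, 1, 2, 3, 4, 6, 7)
}

func main() {
	for _, r := range []rune{'\n', '\t', ' ', '\f', 0x85, 0x2028, 0xA0, 0xFEFF, 'x'} {
		fmt.Printf("%U group=%d ballerina=%v ecma=%v java=%v unicode=%v\n",
			r, spaceGroup(r), isBallerinaWhiteSpace(r),
			isECMAScriptSpace(r), isJavaRegexSpace(r), isUnicodeWhiteSpace(r))
	}
}
```

Writing the definitions this way makes the disagreements easy to see: NEL is white space for Unicode and a line terminator for Java, but is excluded by ECMAScript, while U+FEFF is included only by ECMAScript.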

Case

Regexps do case folding, so we should make this data accessible via functions.

  • equalsIgnoreCase (we already have equalsIgnoreCaseAscii)
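
To illustrate the difference between the existing ASCII-only comparison and a Unicode-aware one, here is a sketch in Go. The function names mirror the ones above but are hypothetical Go helpers, not the lang.string API. The Unicode version delegates to `strings.EqualFold`, which uses simple case folding; whether lang.string should use simple or full folding is not settled here.

```go
package main

import (
	"fmt"
	"strings"
)

// lowerASCII maps A-Z to a-z and leaves every other byte unchanged.
func lowerASCII(c byte) byte {
	if 'A' <= c && c <= 'Z' {
		return c + ('a' - 'A')
	}
	return c
}

// equalsIgnoreCaseASCII compares byte by byte, folding only ASCII letters.
func equalsIgnoreCaseASCII(a, b string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := 0; i < len(a); i++ {
		if lowerASCII(a[i]) != lowerASCII(b[i]) {
			return false
		}
	}
	return true
}

// equalsIgnoreCase compares under Unicode simple case folding.
func equalsIgnoreCase(a, b string) bool {
	return strings.EqualFold(a, b)
}

func main() {
	pairs := [][2]string{
		{"Hello", "hELLO"},
		{"ÉCOLE", "école"},
		{"Straße", "STRASSE"},
	}
	for _, p := range pairs {
		fmt.Printf("%q vs %q: ascii=%v unicode=%v\n",
			p[0], p[1], equalsIgnoreCaseASCII(p[0], p[1]), equalsIgnoreCase(p[0], p[1]))
	}
}
```

The last pair shows the limit of simple folding: matching "ß" against "SS" requires full case folding, which maps one character to several.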

Script

XXX We should have something here. Regexps provide this using \p.
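
As one possible point of comparison, Go exposes scripts as named range tables in `unicode.Scripts`. The sketch below shows a lookup by script name and a reverse lookup from a character to its script. `inScript` and `scriptOf` are hypothetical helper names, and nothing here is a proposal for the lang.string API.

```go
package main

import (
	"fmt"
	"unicode"
)

// inScript reports whether r belongs to the named script, using the
// script names that Go's unicode.Scripts map provides (e.g. "Greek").
func inScript(r rune, name string) bool {
	table, ok := unicode.Scripts[name]
	return ok && unicode.Is(table, r)
}

// scriptOf returns the name of the script table containing r, or "" if
// no table contains it. Each code point appears in at most one table,
// so the map's iteration order does not affect the result.
func scriptOf(r rune) string {
	for name, table := range unicode.Scripts {
		if unicode.Is(table, r) {
			return name
		}
	}
	return ""
}

func main() {
	for _, r := range []rune{'a', 'α', 'Ж', '中', '1'} {
		fmt.Printf("%U %q script=%s\n", r, r, scriptOf(r))
	}
}
```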

Unicode versioning

We should have a way to detect the Unicode version.
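
Go handles this with a single string constant, `unicode.Version`, naming the Unicode version its tables were derived from. A minimal sketch of reading it and extracting the major version; `unicodeVersion` and `unicodeMajor` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
)

// unicodeVersion returns the Unicode version of the tables compiled into
// the Go standard library, as a dotted string.
func unicodeVersion() string {
	return unicode.Version
}

// unicodeMajor parses the leading component of the version string.
func unicodeMajor() (int, error) {
	major := strings.SplitN(unicode.Version, ".", 2)[0]
	return strconv.Atoi(major)
}

func main() {
	major, err := unicodeMajor()
	if err != nil {
		fmt.Println("could not parse version:", err)
		return
	}
	fmt.Printf("Unicode version %s (major %d)\n", unicodeVersion(), major)
}
```

Exposing the version lets programs whose behaviour depends on character properties detect when those properties may have changed between releases.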

@jclark jclark added this to the Swan Lake Update 4 milestone Sep 20, 2022
@jclark jclark modified the milestones: 2023R1, 2013R2 Apr 25, 2023
@anupama-pathirage anupama-pathirage added Type/Improvement Enhancement to language design Area/LangLib Relates to lang.* libraries labels Nov 20, 2024