Add minimal Unicode support to lang.string #1129

jclark · 2022-06-23T10:01:31Z

lang.string has so far excluded all Unicode functionality, to avoid depending on Unicode character properties, which change with each Unicode version. However, regular expressions (#130) need this information, so this exclusion no longer makes sense.

jclark · 2022-06-23T10:08:36Z

General category

Regexps allow testing the Unicode general category of a character using \p, so we should make this information available via function calls.

Go has a tasteful design here. Here are the Unicode general categories for which it provides functions:

L - IsLetter
Ll - IsLower
Lu - IsUpper
Lt - IsTitle
Cc - IsControl
N - IsNumber
Nd - IsDigit
M - IsMark
P - IsPunct
S - IsSymbol
L, M, N, P, S, Zs - IsGraphic

I think we should also have:

C - IsOther

We also need something for white space. Relevant characters are:

1. CR, LF (normal line terminators)
2. TAB
3. SPACE
4. VT, FF (more obscure ASCII characters often treated as space)
5. NEL (0x85 - EBCDIC legacy)
6. Unicode LineSep, ParaSep (0x2028, 0x2029), categories Zl, Zp
7. Unicode Space separator (category Zs) other than SPACE

Currently

Ballerina line terminator is 1
Ballerina white space is 1, 2 or 3
Unicode Zs is 3 or 7
Unicode White_Space property is 1 through 7
ECMAScript whitespace (which is used for ECMAScript \s) excludes 5, but includes 0xFEFF (ZWNBSP)
ECMAScript line terminator is 1 or 6
Java regex \s is 1 through 4
Java line terminator is 1, 5 or 6

Case

Regexps do case-folding. So we should make this data accessible via functions.

equalsIgnoreCase (we already have equalsIgnoreCaseAscii)

Script

XXX Should have something here. Regexps are providing this using \p.

Unicode versioning

Should have a way to detect the Unicode version.

jclark mentioned this issue Jun 23, 2022

Add regular expression support to language #130

Closed

jclark added this to the Swan Lake Update 4 milestone Sep 20, 2022

jclark modified the milestones: 2023R1, 2013R2 Apr 25, 2023

anupama-pathirage added the Type/Improvement (Enhancement to language design) and Area/LangLib (Relates to lang.* libraries) labels Nov 20, 2024
