Unicode 16 first cut security data #777

markusicu · 2024-04-12T22:33:54Z

generate files
basic Identifier_Type invariant tests
bug fix: ID_Type Limited_Use trumps Exclusion
ID_Type(A9CF)=Limited_Use Uncommon_Use (UTC-179-C39)
- [179-A124] Action Item for Markus Scherer, PAG: Change the Identifier_Type of U+A9CF JAVANESE PANGRANGKEP to only Limited_Use Uncommon_Use, removing Exclusion. For Unicode Version 16.0. See document L2/24-064 item 8.1.
set new characters to Uncommon_Use if they were initially generated as Recommended
add to removals.txt criteria for Recommended vs. Uncommon_Use

Best reviewed one commit at a time.
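
The "Limited_Use trumps Exclusion" fix and the U+A9CF change listed above can be illustrated with a small sketch. This is not the unicodetools code: the function name `resolve_identifier_types` is hypothetical, and the rule is assumed from the commit message alone (when both values are candidates, Exclusion is dropped).

```python
def resolve_identifier_types(candidates):
    """Return the Identifier_Type values left after applying the assumed precedence."""
    types = set(candidates)
    # Assumption taken from the commit message: Limited_Use trumps Exclusion,
    # so Exclusion is removed whenever both are present.
    if "Limited_Use" in types and "Exclusion" in types:
        types.discard("Exclusion")
    return types


# U+A9CF JAVANESE PANGRANGKEP, per UTC-179-C39
a9cf = resolve_identifier_types({"Exclusion", "Limited_Use", "Uncommon_Use"})
print(sorted(a9cf))  # ['Limited_Use', 'Uncommon_Use']
```

A character that is only Exclusion is left unchanged by this sketch; only the combination is affected.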

Ken-Whistler

What I could understand looks fine. I'm not familiar with the invariants code or some of the other internal data files here.

macchiati · 2024-04-23T17:41:38Z

I assume this will be modified further as a result of the conversation about the default IDNA type/status

markusicu · 2024-05-03T04:13:25Z

@dwanders-A & SAH/SEW: should any of the following not be Uncommon_Use? I am sending more details via email.

1C89..1C8A    ; Uncommon_Use                   # 16.0   [2] CYRILLIC CAPITAL LETTER TJE..CYRILLIC SMALL LETTER TJE

A7CB..A7CD    ; Uncommon_Use                   # 16.0   [3] LATIN CAPITAL LETTER RAMS HORN..LATIN SMALL LETTER S WITH DIAGONAL STROKE
A7DA..A7DC    ; Uncommon_Use                   # 16.0   [3] LATIN CAPITAL LETTER LAMBDA..LATIN CAPITAL LETTER LAMBDA WITH STROKE

10EC2..10EC4  ; Uncommon_Use                   # 16.0   [3] ARABIC LETTER DAL WITH TWO DOTS VERTICALLY BELOW..ARABIC LETTER KAF WITH TWO DOTS VERTICALLY BELOW

116D0..116E3  ; Uncommon_Use                   # 16.0  [20] MYANMAR PAO DIGIT ZERO..MYANMAR EASTERN PWO KAREN DIGIT NINE

unicodetools/data/security/dev/data/source/removals.txt
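
For anyone who wants to check the ranges quoted above, here is a rough sketch of a parser for lines in that `range ; value # comment` layout. The names `TypeRange`, `LINE_RE`, and `parse_line` are made up for this example, and the regular expression is only an assumption based on the five lines shown; it is not the parser used by unicodetools.

```python
import re
from typing import NamedTuple


class TypeRange(NamedTuple):
    start: int        # first code point in the range
    end: int          # last code point in the range (inclusive)
    id_types: tuple   # one or more Identifier_Type values
    comment: str      # text after '#', if any


# Assumed layout: "XXXX[..YYYY] ; Value [Value ...] # comment"
LINE_RE = re.compile(
    r"^\s*([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*([^#]*?)\s*(?:#\s*(.*))?$"
)


def parse_line(line):
    """Parse one data line; return None for blank or comment-only lines."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    start = int(m.group(1), 16)
    end = int(m.group(2), 16) if m.group(2) else start
    return TypeRange(start, end, tuple(m.group(3).split()), m.group(4) or "")


row = parse_line(
    "116D0..116E3  ; Uncommon_Use                   # 16.0  [20] "
    "MYANMAR PAO DIGIT ZERO..MYANMAR EASTERN PWO KAREN DIGIT NINE"
)
print(row.end - row.start + 1, row.id_types)  # 20 ('Uncommon_Use',)
```

The range length computed from the code points matches the `[20]` count in the line's comment, which is a cheap consistency check on such data.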

markusicu · 2024-05-03T17:25:44Z

@dwanders-A & SAH/SEW: should any of the following not be Uncommon_Use? I am sending more details via email.

Debbie replied:

The casepair for Cyrillic TJE is in modern use, as are the three Latin lambda characters.

LATIN CAPITAL LETTER RAMS HORN is apparently in modern use and the casepair for LATIN LETTER S WITH DIAGONAL STROKE is also in modern use. The Myanmar digits are also in modern use, based on the proposal.

Are the characters above in customary widespread use? Hmmm, well, once they are in fonts, they will / may be.

The following are found in manuscripts, apparently, so they are not modern: [3] ARABIC LETTER DAL WITH TWO DOTS VERTICALLY BELOW..ARABIC LETTER KAF WITH TWO DOTS VERTICALLY BELOW

macchiati · 2024-05-03T18:08:49Z

@dwanders-A

The question should have been: Are they in common/customary modern use? And whenever we are in doubt, the default should be Uncommon_Use.

If it is in modern use, but very infrequently, such as in technical documents, then we need to know that too. For example, just from https://en.wikipedia.org/wiki/Latin_gamma we have:

A lowercase Latin gamma that lies above the [baseline](https://en.wikipedia.org/wiki/Baseline_(typography)) rather than crossing it (ɤ, called "ram's horns"), represents the [close-mid back unrounded vowel](https://en.wikipedia.org/wiki/Close-mid_back_unrounded_vowel). In certain [nonstandard variations of the IPA](https://en.wikipedia.org/wiki/Obsolete_and_nonstandard_symbols_in_the_International_Phonetic_Alphabet) the uppercase form is used.

What that indicates is that it should be marked as either Uncommon_Use or Technical.

It is important to note that whenever there is some doubt, we should default to Uncommon_Use unless we have reasonable evidence that the character is in common use. This data does not affect the use of the character for normal purposes (writing books, articles, text messages, and so on); it is specifically designed for identifiers and similar constructs.

We can always "upgrade" it later on, whenever someone presents a reasonable case for it being in common/customary use. For example, if someone finds that the uppercase form is used in the normal orthography of a modern language X that has a significant, active population using it, and makes a proposal to that effect, that would be justification for dropping the Identifier_Type of Uncommon_Use or Technical.
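
The "default to Uncommon_Use when in doubt" policy described above could be expressed roughly as follows. This is only a sketch under stated assumptions: `first_cut_type` and its parameters (including the `reviewed_common_use` flag standing in for an accepted proposal) are hypothetical, and the actual generator in unicodetools may work differently.

```python
def first_cut_type(generated_type, age, new_version="16.0", reviewed_common_use=False):
    """Pick the Identifier_Type for a character in a first-cut data file.

    Assumed policy: a character new in `new_version` that was generated as
    Recommended is demoted to Uncommon_Use unless someone has presented
    reasonable evidence of common/customary modern use.
    """
    if age == new_version and generated_type == "Recommended" and not reviewed_common_use:
        return "Uncommon_Use"
    return generated_type


# New in 16.0, no evidence reviewed yet: falls back to Uncommon_Use.
print(first_cut_type("Recommended", "16.0"))  # Uncommon_Use
# Older character: the generated value is kept.
print(first_cut_type("Recommended", "15.1"))  # Recommended
# New character with an accepted case for common use: "upgraded".
print(first_cut_type("Recommended", "16.0", reviewed_common_use=True))  # Recommended
```

Keeping the evidence as an explicit input makes the later "upgrade" a one-flag change rather than a change to the default.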

markusicu · 2024-05-03T21:48:25Z

@macchiati @asmusf et al. -- no new Recommended characters; could you (or someone) please approve this PR?

macchiati

Looks good as a first cut

asmusf

Seems fine

markusicu added 4 commits April 12, 2024 13:28

gen security files for 16 first cut

d9f5465

basic Identifier_Type invariant tests

1ac969d

bug fix: ID_Type Limited_Use trumps Exclusion

e8a45e6

ID_Type(A9CF)=Limited_Use Uncommon_Use

75c4921

markusicu requested review from echeran, eggrobin, macchiati, asmusf, josh-hadley and Ken-Whistler April 12, 2024 22:33

TestInvariants ID_Type exceptions

89594ed

Ken-Whistler reviewed Apr 12, 2024

markusicu added 3 commits May 2, 2024 20:23

Merge branch 'main' into security-16-first

9cc916a

regen security files

cdd8528

no new Recommended chars for now

81a4488

markusicu mentioned this pull request May 3, 2024

why does security/.../removals.txt not work with Age? #800

Open

markusicu requested a review from Ken-Whistler May 3, 2024 04:21

eggrobin reviewed May 3, 2024

unicodetools/data/security/dev/data/source/removals.txt

UTC-179-C39 for the Identifier_Type of A9CF

6fa71f0

criteria for Recommended vs. Uncommon_Use

7ee1292

markusicu requested a review from eggrobin May 3, 2024 21:47

macchiati approved these changes May 3, 2024

markusicu merged commit 6b2b8ff into unicode-org:main May 3, 2024
23 checks passed

markusicu deleted the security-16-first branch May 3, 2024 22:07

asmusf reviewed May 3, 2024
