Unicode 16 first cut security data #777

Merged: 10 commits merged into unicode-org:main on May 3, 2024

Member

@markusicu markusicu commented Apr 12, 2024

  • generate files
  • basic Identifier_Type invariant tests
  • bug fix: ID_Type Limited_Use trumps Exclusion
  • ID_Type(A9CF)=Limited_Use Uncommon_Use (UTC-179-C39)
    • [179-A124] Action Item for Markus Scherer, PAG: Change the Identifier_Type of U+A9CF JAVANESE PANGRANGKEP to only Limited_Use Uncommon_Use, removing Exclusion. For Unicode Version 16.0. See document L2/24-064 item 8.1.
  • set new characters to Uncommon_Use if they were initially generated as Recommended
  • add to removals.txt criteria for Recommended vs. Uncommon_Use

Best reviewed one commit at a time.
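
The two rule changes in the list above (Limited_Use trumping Exclusion, and demoting newly added characters that were initially generated as Recommended) can be sketched as a small post-processing step. This is only an illustration in Python, not the generator code touched by this PR. The function name `adjust_identifier_types` and the `is_new_in_version` flag are hypothetical, and the input set in the usage line is assumed for the sake of the example.

```python
from typing import FrozenSet, Iterable


def adjust_identifier_types(generated: Iterable[str],
                            is_new_in_version: bool) -> FrozenSet[str]:
    """Apply the two adjustments described above to one character's type set."""
    types = set(generated)

    # Limited_Use trumps Exclusion: if both were generated, keep Limited_Use only.
    if "Limited_Use" in types and "Exclusion" in types:
        types.discard("Exclusion")

    # A character that is new in this version and came out as Recommended
    # is set to Uncommon_Use until it has been reviewed.
    if is_new_in_version and types == {"Recommended"}:
        types = {"Uncommon_Use"}

    return frozenset(types)


# Hypothetical input resembling the U+A9CF case: Exclusion is dropped.
print(sorted(adjust_identifier_types(
    {"Limited_Use", "Exclusion", "Uncommon_Use"}, is_new_in_version=False)))
```

Keeping the adjustment separate from the generation step is a design choice of this sketch only; it makes each rule easy to test in isolation.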

Contributor

@Ken-Whistler Ken-Whistler left a comment

What I could understand looks fine. I'm not familiar with the invariants code or some of the other internal data files here.

@macchiati
Member

I assume this will be modified further as a result of the conversation about the default IDNA type/status

@markusicu
Member Author

@dwanders-A & SAH/SEW: should any of the following not be Uncommon_Use? I am sending more details via email.

1C89..1C8A    ; Uncommon_Use                   # 16.0   [2] CYRILLIC CAPITAL LETTER TJE..CYRILLIC SMALL LETTER TJE

A7CB..A7CD    ; Uncommon_Use                   # 16.0   [3] LATIN CAPITAL LETTER RAMS HORN..LATIN SMALL LETTER S WITH DIAGONAL STROKE
A7DA..A7DC    ; Uncommon_Use                   # 16.0   [3] LATIN CAPITAL LETTER LAMBDA..LATIN CAPITAL LETTER LAMBDA WITH STROKE

10EC2..10EC4  ; Uncommon_Use                   # 16.0   [3] ARABIC LETTER DAL WITH TWO DOTS VERTICALLY BELOW..ARABIC LETTER KAF WITH TWO DOTS VERTICALLY BELOW

116D0..116E3  ; Uncommon_Use                   # 16.0  [20] MYANMAR PAO DIGIT ZERO..MYANMAR EASTERN PWO KAREN DIGIT NINE
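
For anyone checking these lines mechanically, here is a minimal Python sketch of a parser for the layout quoted above (`START..END ; TYPE # AGE [COUNT] NAMES`). The field layout is inferred only from the five lines in this comment. The names `LINE_RE`, `Entry`, and `parse_line` are hypothetical, and the real data file may contain forms this pattern does not cover.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Pattern assumed from the lines quoted above:
#   START[..END] ; TYPE [TYPE ...] # AGE [COUNT] NAME[..NAME]
LINE_RE = re.compile(
    r"^(?P<start>[0-9A-F]{4,6})(?:\.\.(?P<end>[0-9A-F]{4,6}))?"
    r"\s*;\s*(?P<types>[A-Za-z_]+(?:\s+[A-Za-z_]+)*)"
    r"\s*#\s*(?P<age>\d+\.\d+)"
    r"\s*(?:\[(?P<count>\d+)\])?\s*(?P<names>.*)$"
)


class Entry(NamedTuple):
    start: int
    end: int
    types: Tuple[str, ...]
    age: str
    names: str


def parse_line(line: str) -> Optional[Entry]:
    """Parse one data line; return None for blank lines and pure comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")
    start = int(m["start"], 16)
    end = int(m["end"], 16) if m["end"] else start
    # When a [count] annotation is present, it must equal the range size.
    if m["count"] is not None and int(m["count"]) != end - start + 1:
        raise ValueError(f"count mismatch in: {line!r}")
    return Entry(start, end, tuple(m["types"].split()), m["age"], m["names"])


# Two of the lines quoted above, used as sample input.
SAMPLE = """\
1C89..1C8A    ; Uncommon_Use                   # 16.0   [2] CYRILLIC CAPITAL LETTER TJE..CYRILLIC SMALL LETTER TJE
116D0..116E3  ; Uncommon_Use                   # 16.0  [20] MYANMAR PAO DIGIT ZERO..MYANMAR EASTERN PWO KAREN DIGIT NINE
"""

for entry in filter(None, map(parse_line, SAMPLE.splitlines())):
    print(f"{entry.start:04X}..{entry.end:04X}", entry.types, entry.end - entry.start + 1)
```

The count check is a cheap sanity test: it catches a range whose endpoints were edited without updating the bracketed count.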

@markusicu
Member Author

> @dwanders-A & SAH/SEW: should any of the following not be Uncommon_Use? I am sending more details via email.

Debbie replied:

The casepair for Cyrillic TJE is in modern use, as are the three Latin lambda characters.

LATIN CAPITAL LETTER RAMS HORN is apparently in modern use and the casepair for LATIN LETTER S WITH DIAGONAL STROKE is also in modern use. The Myanmar digits are also in modern use, based on the proposal.

Are the characters above in customary widespread use? Hmmm, well, once they are in fonts, they will / may be.

The following are found in manuscripts, apparently, so they are not modern: [3] ARABIC LETTER DAL WITH TWO DOTS VERTICALLY BELOW..ARABIC LETTER KAF WITH TWO DOTS VERTICALLY BELOW

@macchiati
Member

@dwanders-A

The question should have been: Are they in common/customary modern use? And whenever we are in doubt, the default should be Uncommon_Use.

If it is in modern use, but very infrequently, such as in technical documents, then we need to know that too. For example, from https://en.wikipedia.org/wiki/Latin_gamma alone we have:

> A lowercase Latin gamma that lies above the [baseline](https://en.wikipedia.org/wiki/Baseline_(typography)) rather than crossing it (ɤ, called "ram's horns"), represents the [close-mid back unrounded vowel](https://en.wikipedia.org/wiki/Close-mid_back_unrounded_vowel). In certain [nonstandard variations of the IPA](https://en.wikipedia.org/wiki/Obsolete_and_nonstandard_symbols_in_the_International_Phonetic_Alphabet) the uppercase form is used.

What that indicates is that it should be marked as either Uncommon_Use or Technical.

It is important to note that whenever there is some doubt, we should default to Uncommon_Use unless we have reasonable evidence that the character is in common use. This data does not affect the use of the character for normal purposes (writing books, articles, text messages, and so on); it is designed specifically for identifiers and similar constructs.

We can always "upgrade" it later, whenever someone presents a reasonable case for it being in common/customary use. For example, if someone finds that the uppercase form is used in the normal orthography of a modern language X, which has a significant, active population using it, and makes a proposal to that effect, that would be justification for dropping the Identifier_Type of Uncommon_Use or Technical.
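
The defaulting policy described in this comment can be written down as a tiny decision function. This is a sketch in Python under stated assumptions, not code from the repository: the `Evidence` categories and the name `initial_identifier_type` are hypothetical, and mapping technical-only use to Technical is just one of the two options mentioned above (Uncommon_Use being the other).

```python
from enum import Enum, auto


class Evidence(Enum):
    """Hypothetical summary of what is known about a character's usage."""
    COMMON_MODERN_USE = auto()   # normal orthography of an actively used language
    TECHNICAL_USE_ONLY = auto()  # e.g. phonetic notation in technical documents
    UNCLEAR = auto()             # no solid evidence either way


def initial_identifier_type(evidence: Evidence) -> str:
    """Pick a starting Identifier_Type; doubt always falls back to Uncommon_Use."""
    if evidence is Evidence.COMMON_MODERN_USE:
        return "Recommended"
    if evidence is Evidence.TECHNICAL_USE_ONLY:
        return "Technical"
    return "Uncommon_Use"


print(initial_identifier_type(Evidence.UNCLEAR))
```

A later "upgrade" then amounts to re-running the decision with better evidence, which matches the idea that the conservative default is cheap to revise.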

@markusicu markusicu requested a review from eggrobin May 3, 2024 21:47
@markusicu
Member Author

@macchiati @asmusf et al. -- no new Recommended characters; could you (or someone) please approve this PR?

Member

@macchiati macchiati left a comment

Looks good as a first cut

@markusicu markusicu merged commit 6b2b8ff into unicode-org:main May 3, 2024
23 checks passed
@markusicu markusicu deleted the security-16-first branch May 3, 2024 22:07

@asmusf asmusf left a comment

Seems fine
