criteria for Recommended vs. Uncommon_Use
markusicu committed May 3, 2024
1 parent 6fa71f0 commit 7ee1292
Showing 1 changed file with 47 additions and 18 deletions.
65 changes: 47 additions & 18 deletions unicodetools/data/security/dev/data/source/removals.txt
@@ -5,7 +5,46 @@
# High-Level exclusions
[:^xid-continue:]; not-xid

# remove combining marks that are not used in normal languages
# Remove combining marks that are not used in normal languages.

# PAG meeting 2024-04-18 before Unicode 16 beta:
# [Mark]: Policy is that by default
# new characters in scripts that are not Excluded or Limited Use,
# are marked as Uncommon_Use & communicate to SEW
# to ask if there are any exceptions (needed in customary modern widespread use).
# ----
# https://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
# ----
# TODO: We should work our way backwards to
# review Recommended characters added at least in Unicode 13, 14, 15, 15.1.

# Possible data sources:
# - character encoding proposal docs
# - EGIDS = https://en.wikipedia.org/wiki/Expanded_Graded_Intergenerational_Disruption_Scale
# For ICANN work, according to Asmus:
# We found level 4, which has some institutional support,
# a good cutoff for assuming that the language (and therefore its writing system)
# is in everyday use in the community.
# However, for any language at that boundary, we always look for additional info,
# sometimes making exceptions for level 5.
# (Sometimes, research shows a language, while vigorous, is only used orally,
# so then we downgrade it for domain names).
# - Data from icann.org/idn under Root Zone LGR (look for "proposal documents").
# Each proposal evaluates which languages written in the script are common enough to
# support top-level domain names.
# A machine readable version is found in the XML files for the current version of the RZ-LGR
# (each character is annotated with a reference identifying the language that requires it).
# - ethnologue.com
#
# Asmus recommends for characters to be Recommended to look for positive evidence of
# - large population
# - stable, well supported language
# - evidence it's (commonly) written in that script
# - digitally supported
# - not a specialized use in the writing system
# One or the other factor, except 5, may be offset by other factors.
# Consider whether the community conducts its business in writing in that language,
# and if so, in that script.

035C..0362 ; technical # subhead=Double diacritics

@@ -732,24 +771,14 @@ AB63; uncommon-use # LATIN SMALL LETTER UO
# Question: should be default for anything new; add exceptions otherwise
# \p{Age=13} ; uncommon_use

# PAG meeting 2024-04-18 before Unicode 16 beta:
# [Mark]: Policy is that by default
# new characters in scripts that are not Excluded or Limited Use,
# are marked as Uncommon_Use & communicate to SEW
# to ask if there are any exceptions (needed in customary modern widespread use).
# ----
# https://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
# ----
# TODO: This should work with the following set pattern but doesn't;
# and neither with \p{Age=16}. Why?
# [\P{Age=15.1}&[\p{script=Zyyy}\p{script=Zinh}\p{script=Arab}\p{script=Armn}\p{script=Beng}\p{script=Bopo}\p{script=Cyrl}\p{script=Deva}\p{script=Ethi}\p{script=Geor}\p{script=Grek}\p{script=Gujr}\p{script=Guru}\p{script=Hang}\p{script=Hani}\p{script=Hebr}\p{script=Hira}\p{script=Kana}\p{script=Knda}\p{script=Khmr}\p{script=Laoo}\p{script=Latn}\p{script=Mlym}\p{script=Mymr}\p{script=Orya}\p{script=Sinh}\p{script=Taml}\p{script=Telu}\p{script=Thaa}\p{script=Thai}\p{script=Tibt}]] ; uncommon_use
# ----
# TODO: We should work our way backwards to
# review Recommended characters added at least since Unicode 13 (inclusive).

# For now, hardcode the set of characters that would otherwise become Recommended in Unicode 16.
# For Unicode 16, the following characters would become Recommended without these overrides.
# They are all used in languages with EGIDS level 5 or higher.
# 1C89..1C8A [2] CYRILLIC CAPITAL LETTER TJE..CYRILLIC SMALL LETTER TJE
# A7CB..A7CD [3] LATIN CAPITAL LETTER RAMS HORN..LATIN SMALL LETTER S WITH DIAGONAL STROKE
# A7DA..A7DC [3] LATIN CAPITAL LETTER LAMBDA..LATIN CAPITAL LETTER LAMBDA WITH STROKE
# 10EC2..10EC4 [3] ARABIC LETTER DAL WITH TWO DOTS VERTICALLY BELOW..ARABIC LETTER KAF WITH TWO DOTS VERTICALLY BELOW
# 116D0..116E3 [20] MYANMAR PAO DIGIT ZERO..MYANMAR EASTERN PWO KAREN DIGIT NINE
[\u1C89-\u1C8A \uA7CB-\uA7CD \uA7DA-\uA7DC \U00010EC2-\U00010EC4 \U000116D0-\U000116E3] ; uncommon_use
# End hardcoded set.

# 19-329 Section 4
0192 ; uncommon_use
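
The per-range counts in the comments above the hardcoded set ([2], [3], [3], [3], [20]) can be cross-checked against the set pattern itself. The following is a minimal sketch, not part of unicodetools: the helper names `parse_ranges` and `count_code_points` are hypothetical, and the regular expression handles only the simple `\uXXXX-\uXXXX` / `\UXXXXXXXX-\UXXXXXXXX` ranges used in this one pattern, not general UnicodeSet syntax.

```python
import re

# The hardcoded set from removals.txt, kept as a raw string so that the
# \u and \U escapes stay literal text rather than being decoded by Python.
PATTERN = (
    r"[\u1C89-\u1C8A \uA7CB-\uA7CD \uA7DA-\uA7DC "
    r"\U00010EC2-\U00010EC4 \U000116D0-\U000116E3]"
)

# Assumption: every item in the set is an escaped start-end range.
RANGE_RE = re.compile(r"\\[uU]([0-9A-Fa-f]{4,8})-\\[uU]([0-9A-Fa-f]{4,8})")


def parse_ranges(pattern: str) -> list[tuple[int, int]]:
    """Return (start, end) code point pairs for each escaped range."""
    return [(int(lo, 16), int(hi, 16)) for lo, hi in RANGE_RE.findall(pattern)]


def count_code_points(ranges: list[tuple[int, int]]) -> int:
    """Count code points covered by inclusive ranges."""
    return sum(hi - lo + 1 for lo, hi in ranges)


if __name__ == "__main__":
    ranges = parse_ranges(PATTERN)
    for lo, hi in ranges:
        print(f"{lo:04X}..{hi:04X} [{hi - lo + 1}]")
    print("total:", count_code_points(ranges))
```

Running the script prints one line per range in the same `start..end [count]` shape as the comments, which makes a mismatch between the comments and the pattern easy to spot.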
Expand Down

2 comments on commit 7ee1292

@asmusf commented on 7ee1292 May 3, 2024

A lower EGIDS value is "better". Should be noted somewhere?

"without these overrides" is slightly unclear to me, do you mean "in the absence of any positive decision to override based on the factors above"?

Otherwise, I like.

@markusicu (Member, Author)

> A lower EGIDS value is "better". Should be noted somewhere?

Leaving that up to the Wikipedia page for now.

> "without these overrides" is slightly unclear to me, do you mean "in the absence of any positive decision to override based on the factors above"?

no, without the overrides just below the comments, on line 781
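
The EGIDS cutoff described in the file comments, together with the point raised in this thread that a lower EGIDS value is "better", can be sketched as a small screening function. This is only an illustration under stated assumptions: levels are plain integers 0 to 10 (sublevels such as 6a/6b are collapsed), and the function name `egids_screen` and the returned labels are hypothetical, not part of unicodetools or of any ICANN tooling.

```python
def egids_screen(level: int) -> str:
    """Map an EGIDS level to a rough screening outcome.

    Lower levels mean a more vigorous, better supported language.
    Assumption: integer levels only; the labels are illustrative.
    """
    if not 0 <= level <= 10:
        raise ValueError(f"EGIDS level out of range: {level}")
    if level < 4:
        # Better than the cutoff: assume everyday written use.
        return "assume-everyday-use"
    if level == 4:
        # The cutoff itself: assumed in everyday use, but boundary
        # cases call for additional information.
        return "assume-everyday-use-but-verify"
    if level == 5:
        # Exceptions are sometimes made for level 5.
        return "possible-exception"
    # Worse than level 5: keep the default for new characters.
    return "default-uncommon-use"


if __name__ == "__main__":
    for lvl in (2, 4, 5, 7):
        print(lvl, egids_screen(lvl))
```

The outcome is only a first filter. The file comments also ask for positive evidence such as a large population, common written use in that script, and digital support, and they note that a language used only orally may be downgraded.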
