Update to unicode 16 #33

Merged
merged 1 commit into from
Sep 18, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -15,6 +15,7 @@ gen/casing
gen/segmentation
gen/collation
gen/blocks
+gen/UCD.zip

src/unicodedb/compositions
src/unicodedb/decompositions
6 changes: 3 additions & 3 deletions README.md
@@ -193,15 +193,15 @@ something that's missing, please open an issue or PR
require just temporarily commenting out
all checks for missing unicode points.
* Overwrite `./gen/UCD` data with
-[latest unicode UCD](http://unicode.org/Public/UCD/latest/ucd/UCD.zip).
+[latest unicode UCD](https://unicode.org/Public/UCD/latest/ucd/UCD.zip).
* Run `nimble gen` to generate the new data.
-* Run tests. Add checks for missing unicode points back.
+* Run `nimble test`. Add checks for missing unicode points back.
A handful of unicode points may have changed their data; check
the unicode changelog page, make sure they are correct, and skip them.
* Note: starting with Unicode 15 they added multiple @missing lines,
which break the assumption of a default prop for missing CPs,
and these lines need to be parsed (see DerivedBidiClass for example).
-So if they add this to more files, the data gen need fixing.
+So if they add this to more files, the data gen needs fixing.
Look for lines containing `# @missing` with a range other than `0000..10FFFF`. See [Missing_Conventions](https://www.unicode.org/reports/tr44/tr44-30.html#Missing_Conventions)
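
The check described in the note above could be automated with something like the following. This is only a sketch in Python, not part of this repo's Nim generator; the function name `partial_missing_lines` and the regular expression are assumptions made for illustration, based on the `# @missing: <first>..<last>; <value>` shape that UAX #44 describes.

```python
import re
from typing import Iterable, Iterator, Tuple

# Matches lines such as "# @missing: 0590..05FF; Right_To_Left".
# Assumption: the range is always written as two hex code points.
MISSING_RE = re.compile(
    r"^#\s*@missing:\s*([0-9A-Fa-f]{4,6})\.\.([0-9A-Fa-f]{4,6})\s*;\s*(.*?)\s*$"
)

# The range that means "default for every code point".
FULL_RANGE = (0x0000, 0x10FFFF)


def partial_missing_lines(lines: Iterable[str]) -> Iterator[Tuple[int, int, str]]:
    """Yield (first, last, rest) for @missing lines that do not cover 0000..10FFFF."""
    for line in lines:
        match = MISSING_RE.match(line)
        if match is None:
            continue  # data line, ordinary comment, or blank line
        first = int(match.group(1), 16)
        last = int(match.group(2), 16)
        if (first, last) != FULL_RANGE:
            yield first, last, match.group(3)
```

Running it over each file in `./gen/UCD` (for example with `open(path, encoding="utf-8")`) should list the files whose defaults are no longer a single value, which are the ones the data gen would need to handle explicitly.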

## Tests
20 changes: 16 additions & 4 deletions gen/UCD/Blocks.txt
@@ -1,7 +1,8 @@
-# Blocks-15.0.0.txt
-# Date: 2022-01-28, 20:58:00 GMT [KW]
-# © 2022 Unicode®, Inc.
-# For terms of use, see https://www.unicode.org/terms_of_use.html
+# Blocks-16.0.0.txt
+# Date: 2024-02-02
+# © 2024 Unicode®, Inc.
+# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
+# For terms of use and license, see https://www.unicode.org/terms_of_use.html
#
# Unicode Character Database
# For documentation, see https://www.unicode.org/reports/tr44/
@@ -217,6 +218,7 @@ FFF0..FFFF; Specials
10500..1052F; Elbasan
10530..1056F; Caucasian Albanian
10570..105BF; Vithkuqi
+105C0..105FF; Todhri
10600..1077F; Linear A
10780..107BF; Latin Extended-F
10800..1083F; Cypriot Syllabary
@@ -239,6 +241,7 @@ FFF0..FFFF; Specials
10C00..10C4F; Old Turkic
10C80..10CFF; Old Hungarian
10D00..10D3F; Hanifi Rohingya
+10D40..10D8F; Garay
10E60..10E7F; Rumi Numeral Symbols
10E80..10EBF; Yezidi
10EC0..10EFF; Arabic Extended-C
@@ -258,12 +261,14 @@ FFF0..FFFF; Specials
11280..112AF; Multani
112B0..112FF; Khudawadi
11300..1137F; Grantha
+11380..113FF; Tulu-Tigalari
11400..1147F; Newa
11480..114DF; Tirhuta
11580..115FF; Siddham
11600..1165F; Modi
11660..1167F; Mongolian Supplement
11680..116CF; Takri
+116D0..116FF; Myanmar Extended-C
11700..1174F; Ahom
11800..1184F; Dogra
118A0..118FF; Warang Citi
@@ -274,6 +279,7 @@ FFF0..FFFF; Specials
11AB0..11ABF; Unified Canadian Aboriginal Syllabics Extended-A
11AC0..11AFF; Pau Cin Hau
11B00..11B5F; Devanagari Extended-A
+11BC0..11BFF; Sunuwar
11C00..11C6F; Bhaiksuki
11C70..11CBF; Marchen
11D00..11D5F; Masaram Gondi
@@ -288,12 +294,15 @@ FFF0..FFFF; Specials
12F90..12FFF; Cypro-Minoan
13000..1342F; Egyptian Hieroglyphs
13430..1345F; Egyptian Hieroglyph Format Controls
+13460..143FF; Egyptian Hieroglyphs Extended-A
14400..1467F; Anatolian Hieroglyphs
+16100..1613F; Gurung Khema
16800..16A3F; Bamum Supplement
16A40..16A6F; Mro
16A70..16ACF; Tangsa
16AD0..16AFF; Bassa Vah
16B00..16B8F; Pahawh Hmong
+16D40..16D7F; Kirat Rai
16E40..16E9F; Medefaidrin
16F00..16F9F; Miao
16FE0..16FFF; Ideographic Symbols and Punctuation
@@ -308,6 +317,7 @@ FFF0..FFFF; Specials
1B170..1B2FF; Nushu
1BC00..1BC9F; Duployan
1BCA0..1BCAF; Shorthand Format Controls
+1CC00..1CEBF; Symbols for Legacy Computing Supplement
1CF00..1CFCF; Znamenny Musical Notation
1D000..1D0FF; Byzantine Musical Symbols
1D100..1D1FF; Musical Symbols
@@ -325,6 +335,7 @@ FFF0..FFFF; Specials
1E290..1E2BF; Toto
1E2C0..1E2FF; Wancho
1E4D0..1E4FF; Nag Mundari
+1E5D0..1E5FF; Ol Onal
1E7E0..1E7FF; Ethiopic Extended-B
1E800..1E8DF; Mende Kikakui
1E900..1E95F; Adlam
@@ -352,6 +363,7 @@ FFF0..FFFF; Specials
2B740..2B81F; CJK Unified Ideographs Extension D
2B820..2CEAF; CJK Unified Ideographs Extension E
2CEB0..2EBEF; CJK Unified Ideographs Extension F
+2EBF0..2EE5F; CJK Unified Ideographs Extension I
2F800..2FA1F; CJK Compatibility Ideographs Supplement
30000..3134F; CJK Unified Ideographs Extension G
31350..323AF; CJK Unified Ideographs Extension H
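
For reviewers who want to sanity-check the new ranges, the `first..last; name` lines in `Blocks.txt` can be parsed and queried with a few lines of Python. This is a sketch only and is not how this repo's Nim generator does it; `parse_blocks` and `block_of` are hypothetical helper names.

```python
from bisect import bisect_right
from typing import Iterable, List, Optional, Tuple

Block = Tuple[int, int, str]  # (first code point, last code point, block name)


def parse_blocks(lines: Iterable[str]) -> List[Block]:
    """Parse 'XXXX..YYYY; Name' lines, skipping comments and blank lines."""
    blocks: List[Block] = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop trailing or full-line comments
        if not line:
            continue
        span, name = line.split(";", 1)
        first, last = span.strip().split("..")
        blocks.append((int(first, 16), int(last, 16), name.strip()))
    blocks.sort()
    return blocks


def block_of(blocks: List[Block], cp: int) -> Optional[str]:
    """Return the block containing cp, or None if no parsed range covers it."""
    starts = [b[0] for b in blocks]
    i = bisect_right(starts, cp) - 1  # last block starting at or before cp
    if i >= 0 and blocks[i][0] <= cp <= blocks[i][1]:
        return blocks[i][2]
    return None
```

For example, after parsing the updated file, `block_of(blocks, 0x10D50)` should report `Garay`, while a code point outside every listed range yields `None` (the UCD treats those as `No_Block`).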
38 changes: 34 additions & 4 deletions gen/UCD/CaseFolding.txt
@@ -1,8 +1,8 @@
-# CaseFolding-15.0.0.txt
-# Date: 2022-02-02, 23:35:35 GMT
-# © 2022 Unicode®, Inc.
+# CaseFolding-16.0.0.txt
+# Date: 2024-04-30, 21:48:11 GMT
+# © 2024 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
-# For terms of use, see https://www.unicode.org/terms_of_use.html
+# For terms of use and license, see https://www.unicode.org/terms_of_use.html
#
# Unicode Character Database
# For documentation, see https://www.unicode.org/reports/tr44/
@@ -603,6 +603,7 @@
1C86; C; 044A; # CYRILLIC SMALL LETTER TALL HARD SIGN
1C87; C; 0463; # CYRILLIC SMALL LETTER TALL YAT
1C88; C; A64B; # CYRILLIC SMALL LETTER UNBLENDED UK
+1C89; C; 1C8A; # CYRILLIC CAPITAL LETTER TJE
1C90; C; 10D0; # GEORGIAN MTAVRULI CAPITAL LETTER AN
1C91; C; 10D1; # GEORGIAN MTAVRULI CAPITAL LETTER BAN
1C92; C; 10D2; # GEORGIAN MTAVRULI CAPITAL LETTER GAN
@@ -929,6 +930,7 @@
1FCC; S; 1FC3; # GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI
1FD2; F; 03B9 0308 0300; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND VARIA
1FD3; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
+1FD3; S; 0390; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
1FD6; F; 03B9 0342; # GREEK SMALL LETTER IOTA WITH PERISPOMENI
1FD7; F; 03B9 0308 0342; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND PERISPOMENI
1FD8; C; 1FD0; # GREEK CAPITAL LETTER IOTA WITH VRACHY
@@ -937,6 +939,7 @@
1FDB; C; 1F77; # GREEK CAPITAL LETTER IOTA WITH OXIA
1FE2; F; 03C5 0308 0300; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND VARIA
1FE3; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
+1FE3; S; 03B0; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
1FE4; F; 03C1 0313; # GREEK SMALL LETTER RHO WITH PSILI
1FE6; F; 03C5 0342; # GREEK SMALL LETTER UPSILON WITH PERISPOMENI
1FE7; F; 03C5 0308 0342; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND PERISPOMENI
@@ -1238,9 +1241,13 @@ A7C5; C; 0282; # LATIN CAPITAL LETTER S WITH HOOK
A7C6; C; 1D8E; # LATIN CAPITAL LETTER Z WITH PALATAL HOOK
A7C7; C; A7C8; # LATIN CAPITAL LETTER D WITH SHORT STROKE OVERLAY
A7C9; C; A7CA; # LATIN CAPITAL LETTER S WITH SHORT STROKE OVERLAY
+A7CB; C; 0264; # LATIN CAPITAL LETTER RAMS HORN
+A7CC; C; A7CD; # LATIN CAPITAL LETTER S WITH DIAGONAL STROKE
A7D0; C; A7D1; # LATIN CAPITAL LETTER CLOSED INSULAR G
A7D6; C; A7D7; # LATIN CAPITAL LETTER MIDDLE SCOTS S
A7D8; C; A7D9; # LATIN CAPITAL LETTER SIGMOID S
+A7DA; C; A7DB; # LATIN CAPITAL LETTER LAMBDA
+A7DC; C; 019B; # LATIN CAPITAL LETTER LAMBDA WITH STROKE
A7F5; C; A7F6; # LATIN CAPITAL LETTER REVERSED HALF H
AB70; C; 13A0; # CHEROKEE SMALL LETTER A
AB71; C; 13A1; # CHEROKEE SMALL LETTER E
@@ -1328,6 +1335,7 @@ FB02; F; 0066 006C; # LATIN SMALL LIGATURE FL
FB03; F; 0066 0066 0069; # LATIN SMALL LIGATURE FFI
FB04; F; 0066 0066 006C; # LATIN SMALL LIGATURE FFL
FB05; F; 0073 0074; # LATIN SMALL LIGATURE LONG S T
+FB05; S; FB06; # LATIN SMALL LIGATURE LONG S T
FB06; F; 0073 0074; # LATIN SMALL LIGATURE ST
FB13; F; 0574 0576; # ARMENIAN SMALL LIGATURE MEN NOW
FB14; F; 0574 0565; # ARMENIAN SMALL LIGATURE MEN ECH
@@ -1522,6 +1530,28 @@ FF3A; C; FF5A; # FULLWIDTH LATIN CAPITAL LETTER Z
10CB0; C; 10CF0; # OLD HUNGARIAN CAPITAL LETTER EZS
10CB1; C; 10CF1; # OLD HUNGARIAN CAPITAL LETTER ENT-SHAPED SIGN
10CB2; C; 10CF2; # OLD HUNGARIAN CAPITAL LETTER US
+10D50; C; 10D70; # GARAY CAPITAL LETTER A
+10D51; C; 10D71; # GARAY CAPITAL LETTER CA
+10D52; C; 10D72; # GARAY CAPITAL LETTER MA
+10D53; C; 10D73; # GARAY CAPITAL LETTER KA
+10D54; C; 10D74; # GARAY CAPITAL LETTER BA
+10D55; C; 10D75; # GARAY CAPITAL LETTER JA
+10D56; C; 10D76; # GARAY CAPITAL LETTER SA
+10D57; C; 10D77; # GARAY CAPITAL LETTER WA
+10D58; C; 10D78; # GARAY CAPITAL LETTER LA
+10D59; C; 10D79; # GARAY CAPITAL LETTER GA
+10D5A; C; 10D7A; # GARAY CAPITAL LETTER DA
+10D5B; C; 10D7B; # GARAY CAPITAL LETTER XA
+10D5C; C; 10D7C; # GARAY CAPITAL LETTER YA
+10D5D; C; 10D7D; # GARAY CAPITAL LETTER TA
+10D5E; C; 10D7E; # GARAY CAPITAL LETTER RA
+10D5F; C; 10D7F; # GARAY CAPITAL LETTER NYA
+10D60; C; 10D80; # GARAY CAPITAL LETTER FA
+10D61; C; 10D81; # GARAY CAPITAL LETTER NA
+10D62; C; 10D82; # GARAY CAPITAL LETTER PA
+10D63; C; 10D83; # GARAY CAPITAL LETTER HA
+10D64; C; 10D84; # GARAY CAPITAL LETTER OLD KA
+10D65; C; 10D85; # GARAY CAPITAL LETTER OLD NA
118A0; C; 118C0; # WARANG CITI CAPITAL LETTER NGAA
118A1; C; 118C1; # WARANG CITI CAPITAL LETTER A
118A2; C; 118C2; # WARANG CITI CAPITAL LETTER WI
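
Several of the added lines give a code point a second entry with status `S` next to an existing `F` entry (for example `1FD3` and `FB05`). Per the `CaseFolding.txt` header, simple folding uses the `C` and `S` mappings and full folding uses `C` and `F`. The Python sketch below illustrates that split on a few lines from this diff; it is not the repo's Nim generator, the helper names are hypothetical, and `T` (Turkic) entries are simply ignored here.

```python
from typing import Dict, Iterable, Tuple

# A few lines taken from the diff above.
SAMPLE = [
    "1FD3; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA",
    "1FD3; S; 0390; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA",
    "FB05; F; 0073 0074; # LATIN SMALL LIGATURE LONG S T",
    "FB05; S; FB06; # LATIN SMALL LIGATURE LONG S T",
    "10D50; C; 10D70; # GARAY CAPITAL LETTER A",
]


def parse_case_folding(
    lines: Iterable[str],
) -> Tuple[Dict[int, int], Dict[int, Tuple[int, ...]]]:
    """Return (simple, full) folding tables built from C+S and C+F entries."""
    simple: Dict[int, int] = {}
    full: Dict[int, Tuple[int, ...]] = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # strip the trailing name comment
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        cp = int(fields[0], 16)
        status = fields[1]
        targets = tuple(int(h, 16) for h in fields[2].split())
        if status in ("C", "S"):
            simple[cp] = targets[0]  # simple mappings are always one code point
        if status in ("C", "F"):
            full[cp] = targets  # full mappings may expand to several code points
        # Status "T" (Turkic-specific) is deliberately skipped in this sketch.
    return simple, full


def fold_simple(text: str, simple: Dict[int, int]) -> str:
    """Apply simple folding: each character maps to exactly one character."""
    return "".join(chr(simple.get(ord(ch), ord(ch))) for ch in text)


def fold_full(text: str, full: Dict[int, Tuple[int, ...]]) -> str:
    """Apply full folding: a character may map to a sequence of characters."""
    out = []
    for ch in text:
        for cp in full.get(ord(ch), (ord(ch),)):
            out.append(chr(cp))
    return "".join(out)
```

With the sample lines, U+FB05 folds to U+FB06 under simple folding and to the two letters `st` under full folding, which is the distinction the new `S` entries introduce.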