Adopt GB18030-2022
This implements the Unicode Technical Committee recommendation around GB18030-2022 in a manner suitable for this standard, taking into account existing practice and the closeness between GBK and gb18030.

In particular, using the text file attached to https://www.unicode.org/L2/L2023/23003r-gb18030-recommendations.pdf this does the following:

1. Merges the first set of 18 mappings, which are bidirectional, directly into index gb18030, replacing existing PUA entries. This ends up impacting GBK and gb18030.
2. Encodes the second set of 18 mappings (from PUA to bytes) as an encoder-only table, for both GBK and gb18030.
3. Ignores the third set of 18 mappings (from bytes to code points), as they are already covered by index gb18030 ranges. (Presumably they are included because the recommendation covers the transition from "Previous Mappings" to "Current Mappings" to "Recommended Mappings", whereas we are going directly from "Previous Mappings" to "Recommended Mappings".)
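
Steps 1 and 2 can be illustrated with a minimal Python sketch. It assumes the two-byte pointer arithmetic of the gb18030 encoder in the Encoding Standard (190 trail positions per lead byte, lead bytes starting at 0x81) and copies the code points, bytes, and pointers from the encoding.bs and index-gb18030.txt changes in this commit. The names `ENCODER_ONLY`, `MERGED_POINTERS`, and `pointer_to_bytes` are made up for this illustration and do not appear in the standard.

```python
# Encoder-only side table (step 2): Private Use code point -> two bytes.
# Values copied from the table this commit adds to encoding.bs.
ENCODER_ONLY = {
    0xE78D: (0xA6, 0xD9), 0xE78E: (0xA6, 0xDA), 0xE78F: (0xA6, 0xDB),
    0xE790: (0xA6, 0xDC), 0xE791: (0xA6, 0xDD), 0xE792: (0xA6, 0xDE),
    0xE793: (0xA6, 0xDF), 0xE794: (0xA6, 0xEC), 0xE795: (0xA6, 0xED),
    0xE796: (0xA6, 0xF3), 0xE81E: (0xFE, 0x59), 0xE826: (0xFE, 0x61),
    0xE82B: (0xFE, 0x66), 0xE82C: (0xFE, 0x67), 0xE832: (0xFE, 0x6D),
    0xE843: (0xFE, 0x7E), 0xE854: (0xFE, 0x90), 0xE864: (0xFE, 0xA0),
}

# Pointers in index gb18030 whose Private Use entries step 1 replaced,
# mapped to the Private Use code point they held before this commit.
MERGED_POINTERS = {
    7182: 0xE78D, 7183: 0xE78E, 7184: 0xE78F, 7185: 0xE790,
    7186: 0xE791, 7187: 0xE792, 7188: 0xE793, 7201: 0xE794,
    7202: 0xE795, 7208: 0xE796, 23775: 0xE81E, 23783: 0xE826,
    23788: 0xE82B, 23789: 0xE82C, 23795: 0xE832, 23812: 0xE843,
    23829: 0xE854, 23845: 0xE864,
}


def pointer_to_bytes(pointer):
    """Return the two gb18030 bytes for a pointer into index gb18030."""
    lead = pointer // 190 + 0x81
    trail = pointer % 190
    # Trail bytes skip 0x7F, hence the two different offsets.
    offset = 0x40 if trail < 0x3F else 0x41
    return (lead, trail + offset)


# Each Private Use code point still encodes to the bytes of the index slot
# it occupied before the merge, so previously encodable text stays encodable.
for pointer, private_use in MERGED_POINTERS.items():
    assert pointer_to_bytes(pointer) == ENCODER_ONLY[private_use]
```

Under these assumptions the encoder is asymmetric: U+FE10 and U+E78D both encode to 0xA6 0xD9, while 0xA6 0xD9 decodes only to U+FE10.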

GBK is changed as well because Chromium and WebKit already have code in the wild that impacts GBK to some degree, and no fallout has been recorded. (Those implementations currently exclude the encoder-only table for GBK only, but including it would make the most sense compatibility-wise.) Additionally, GBK is already positioned as a rough subset of gb18030 in this standard, with the decoder being shared completely.

Tests: encoding/legacy-mb-schinese has some GB18030-2022 coverage already. This is completed with web-platform-tests/wpt#48239 and web-platform-tests/wpt#48240.

This supersedes #335. This fixes #27 and fixes #312.

This also updates the description of index gb18030 ranges to account for #22 (the change from GB18030-2000 to -2005), which it had not done until now.
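
For context, the inline handling of the GB18030-2005 change can be sketched in Python. This is a sketch based on the index gb18030 ranges code point algorithm as I understand it from the Encoding Standard: the limit checks and the pointer 7457 / U+E7C7 special case come from that algorithm, not from this commit. `RANGES_EXCERPT` and `ranges_code_point` are illustrative names, and the excerpt holds only the two index entries visible in this diff, so results are only meaningful for small pointers.

```python
from bisect import bisect_right

# Excerpt only: the first two entries of index-gb18030-ranges.txt as shown
# in this diff (pointer, code point). The real index has 207 entries.
RANGES_EXCERPT = [(0, 0x0080), (36, 0x00A5)]


def ranges_code_point(pointer, ranges=RANGES_EXCERPT):
    """Map a non-negative four-byte pointer to a code point, or None."""
    # Limit checks: pointers in these gaps have no code point.
    if 39419 < pointer < 189000 or pointer > 1237575:
        return None
    # GB18030-2005 change, handled inline instead of in the index data.
    if pointer == 7457:
        return 0xE7C7
    # Find the last range whose starting pointer is <= pointer.
    starts = [start for start, _ in ranges]
    i = bisect_right(starts, pointer) - 1
    offset, code_point_offset = ranges[i]
    return code_point_offset + pointer - offset
```

The special case sits before the range lookup, so the index data itself can stay aligned with GB18030-2000.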
annevk authored Oct 4, 2024
1 parent e20f586 commit 2c3853e
Showing 36 changed files with 129 additions and 58 deletions.
79 changes: 75 additions & 4 deletions encoding.bs
@@ -832,7 +832,7 @@ specification, excluding <a>index single-byte</a>, which have their own table:
<td><a href=index-gb18030.txt>index-gb18030.txt</a>
<td><a href=gb18030.html>index gb18030 visualization</a>
<td><a href=gb18030-bmp.html>index gb18030 BMP coverage</a>
-<td>This matches the GB18030-2005 standard for code points encoded as two bytes, except for
+<td>This matches the GB18030-2022 standard for code points encoded as two bytes, except for
0xA3 0xA0 which maps to U+3000 to be compatible with deployed content. This index covers the
CJK Unified Ideographs block of Unicode in its entirety. Entries from that block that are above or
to the left of (the first) U+3000 in the visualization are in the Unicode order.
@@ -845,9 +845,13 @@ specification, excluding <a>index single-byte</a>, which have their own table:
<td colspan=3><a href=index-gb18030-ranges.txt>index-gb18030-ranges.txt</a>
<td>This <a>index</a> works different from all others. Listing all code points would result
in over a million items whereas they can be represented neatly in 207 ranges combined with trivial
-limit checks. It therefore only superficially matches the GB18030-2005 standard for code points
-encoded as four bytes. See also <a>index gb18030 ranges code point</a> and
-<a>index gb18030 ranges pointer</a> below.
+limit checks. It therefore only superficially matches the GB18030-2000 standard for code points
+encoded as four bytes. The change for the GB18030-2005 revision is handled inline by the
+<a>index gb18030 ranges code point</a> and <a>index gb18030 ranges pointer</a> algorithms below
+that accompany this index. And the changes for the GB18030-2022 revision are handled differently
+again to not further increase the number of byte sequences mapping to Private Use code points. The
+relevant Private Use code points are mapped in the <a>gb18030 encoder</a> directly through a side
+table to preserve compatibility with how they were mapped before.
<tr>
<td><dfn export>index jis0208</dfn>
<td><a href=index-jis0208.txt>index-jis0208.txt</a>
@@ -2434,6 +2438,73 @@ consumers of content generated with <a>GBK</a>'s <a for=/>encoder</a>.
<li><p>If <a>is GBK</a> is true and <var>code point</var> is
U+20AC, return byte 0x80.

+<li>
+<p>If there is a row in the table below whose first column is <var>code point</var>, then return
+the two bytes on the same row listed in the second column:
+
+<table>
+<tr>
+<th>Code point
+<th>Bytes
+<tr>
+<td>U+E78D
+<td>0xA6 0xD9
+<tr>
+<td>U+E78E
+<td>0xA6 0xDA
+<tr>
+<td>U+E78F
+<td>0xA6 0xDB
+<tr>
+<td>U+E790
+<td>0xA6 0xDC
+<tr>
+<td>U+E791
+<td>0xA6 0xDD
+<tr>
+<td>U+E792
+<td>0xA6 0xDE
+<tr>
+<td>U+E793
+<td>0xA6 0xDF
+<tr>
+<td>U+E794
+<td>0xA6 0xEC
+<tr>
+<td>U+E795
+<td>0xA6 0xED
+<tr>
+<td>U+E796
+<td>0xA6 0xF3
+<tr>
+<td>U+E81E
+<td>0xFE 0x59
+<tr>
+<td>U+E826
+<td>0xFE 0x61
+<tr>
+<td>U+E82B
+<td>0xFE 0x66
+<tr>
+<td>U+E82C
+<td>0xFE 0x67
+<tr>
+<td>U+E832
+<td>0xFE 0x6D
+<tr>
+<td>U+E843
+<td>0xFE 0x7E
+<tr>
+<td>U+E854
+<td>0xFE 0x90
+<tr>
+<td>U+E864
+<td>0xFE 0xA0
+</table>
+
+<p class=note>This asymmetric encoder table preserves compatibility with the GB18030-2005
+standard. See also the explanation at <a>index gb18030 ranges</a>.
+
<li><p>Let <var>pointer</var> be the <a>index pointer</a> for
<var>code point</var> in <a>index gb18030</a>.

2 changes: 1 addition & 1 deletion index-big5.txt
@@ -2,7 +2,7 @@
# https://encoding.spec.whatwg.org/
#
# Identifier: 8dfc771062e7be0810919082c2c06baa2236147909e0ecc235b1cb9ad782ac82
-# Date: 2018-01-06
+# Date: 2024-09-18

942 0x43F0 䏰 (<CJK Ideograph Extension A>)
943 0x4C32 䰲 (<CJK Ideograph Extension A>)
2 changes: 1 addition & 1 deletion index-euc-kr.txt
@@ -2,7 +2,7 @@
# https://encoding.spec.whatwg.org/
#
# Identifier: 1d97134cbf187263585bc8f593ca4196654ed4c7a673f5672eaad4f5d9fdc4ba
-# Date: 2018-01-06
+# Date: 2024-09-18

0 0xAC02 갂 (HANGUL SYLLABLE GAGG)
1 0xAC03 갃 (HANGUL SYLLABLE GAGS)
2 changes: 1 addition & 1 deletion index-gb18030-ranges.txt
@@ -2,7 +2,7 @@
# https://encoding.spec.whatwg.org/
#
# Identifier: f963aaa1653f630c523e7b04729fb4e4458f35806c45eb5c179445623138f0c0
-# Date: 2018-01-06
+# Date: 2024-09-18

0 0x0080
36 0x00A5
40 changes: 20 additions & 20 deletions index-gb18030.txt
@@ -1,8 +1,8 @@
# For details on index index-gb18030.txt see the Encoding Standard
# https://encoding.spec.whatwg.org/
#
-# Identifier: 715f084846f5c6fc9dd31046d0a4d604bd2d88bfe3a22833cea048415e413c70
-# Date: 2018-01-06
+# Identifier: ff1c9a923b5d24f9761b3a2de2c0f07b395f9f6f36519508944de4f0415be81c
+# Date: 2024-09-18

0 0x4E02 丂 (<CJK Ideograph>)
1 0x4E04 丄 (<CJK Ideograph>)
@@ -7186,13 +7186,13 @@
7179 0x03C7 χ (GREEK SMALL LETTER CHI)
7180 0x03C8 ψ (GREEK SMALL LETTER PSI)
7181 0x03C9 ω (GREEK SMALL LETTER OMEGA)
-7182 0xE78D  (<Private Use>)
-7183 0xE78E  (<Private Use>)
-7184 0xE78F  (<Private Use>)
-7185 0xE790  (<Private Use>)
-7186 0xE791  (<Private Use>)
-7187 0xE792  (<Private Use>)
-7188 0xE793  (<Private Use>)
+7182 0xFE10 ︐ (PRESENTATION FORM FOR VERTICAL COMMA)
+7183 0xFE12 ︒ (PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP)
+7184 0xFE11 ︑ (PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA)
+7185 0xFE13 ︓ (PRESENTATION FORM FOR VERTICAL COLON)
+7186 0xFE14 ︔ (PRESENTATION FORM FOR VERTICAL SEMICOLON)
+7187 0xFE15 ︕ (PRESENTATION FORM FOR VERTICAL EXCLAMATION MARK)
+7188 0xFE16 ︖ (PRESENTATION FORM FOR VERTICAL QUESTION MARK)
7189 0xFE35 ︵ (PRESENTATION FORM FOR VERTICAL LEFT PARENTHESIS)
7190 0xFE36 ︶ (PRESENTATION FORM FOR VERTICAL RIGHT PARENTHESIS)
7191 0xFE39 ︹ (PRESENTATION FORM FOR VERTICAL LEFT TORTOISE SHELL BRACKET)
@@ -7205,14 +7205,14 @@
7198 0xFE42 ﹂ (PRESENTATION FORM FOR VERTICAL RIGHT CORNER BRACKET)
7199 0xFE43 ﹃ (PRESENTATION FORM FOR VERTICAL LEFT WHITE CORNER BRACKET)
7200 0xFE44 ﹄ (PRESENTATION FORM FOR VERTICAL RIGHT WHITE CORNER BRACKET)
-7201 0xE794  (<Private Use>)
-7202 0xE795  (<Private Use>)
+7201 0xFE17 ︗ (PRESENTATION FORM FOR VERTICAL LEFT WHITE LENTICULAR BRACKET)
+7202 0xFE18 ︘ (PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET)
7203 0xFE3B ︻ (PRESENTATION FORM FOR VERTICAL LEFT BLACK LENTICULAR BRACKET)
7204 0xFE3C ︼ (PRESENTATION FORM FOR VERTICAL RIGHT BLACK LENTICULAR BRACKET)
7205 0xFE37 ︷ (PRESENTATION FORM FOR VERTICAL LEFT CURLY BRACKET)
7206 0xFE38 ︸ (PRESENTATION FORM FOR VERTICAL RIGHT CURLY BRACKET)
7207 0xFE31 ︱ (PRESENTATION FORM FOR VERTICAL EM DASH)
-7208 0xE796  (<Private Use>)
+7208 0xFE19 ︙ (PRESENTATION FORM FOR VERTICAL HORIZONTAL ELLIPSIS)
7209 0xFE33 ︳ (PRESENTATION FORM FOR VERTICAL LOW LINE)
7210 0xFE34 ︴ (PRESENTATION FORM FOR VERTICAL WAVY LOW LINE)
7211 0xE797  (<Private Use>)
@@ -23779,27 +23779,27 @@
23772 0x3447 㑇 (<CJK Ideograph Extension A>)
23773 0x2E88 ⺈ (CJK RADICAL KNIFE ONE)
23774 0x2E8B ⺋ (CJK RADICAL SEAL)
-23775 0xE81E  (<Private Use>)
+23775 0x9FB4 龴 (<CJK Ideograph>)
23776 0x359E 㖞 (<CJK Ideograph Extension A>)
23777 0x361A 㘚 (<CJK Ideograph Extension A>)
23778 0x360E 㘎 (<CJK Ideograph Extension A>)
23779 0x2E8C ⺌ (CJK RADICAL SMALL ONE)
23780 0x2E97 ⺗ (CJK RADICAL HEART TWO)
23781 0x396E 㥮 (<CJK Ideograph Extension A>)
23782 0x3918 㤘 (<CJK Ideograph Extension A>)
-23783 0xE826  (<Private Use>)
+23783 0x9FB5 龵 (<CJK Ideograph>)
23784 0x39CF 㧏 (<CJK Ideograph Extension A>)
23785 0x39DF 㧟 (<CJK Ideograph Extension A>)
23786 0x3A73 㩳 (<CJK Ideograph Extension A>)
23787 0x39D0 㧐 (<CJK Ideograph Extension A>)
-23788 0xE82B  (<Private Use>)
-23789 0xE82C  (<Private Use>)
+23788 0x9FB6 龶 (<CJK Ideograph>)
+23789 0x9FB7 龷 (<CJK Ideograph>)
23790 0x3B4E 㭎 (<CJK Ideograph Extension A>)
23791 0x3C6E 㱮 (<CJK Ideograph Extension A>)
23792 0x3CE0 㳠 (<CJK Ideograph Extension A>)
23793 0x2EA7 ⺧ (CJK RADICAL COW)
23794 0xE831  (<Private Use>)
-23795 0xE832  (<Private Use>)
+23795 0x9FB8 龸 (<CJK Ideograph>)
23796 0x2EAA ⺪ (CJK RADICAL BOLT OF CLOTH)
23797 0x4056 䁖 (<CJK Ideograph Extension A>)
23798 0x415F 䅟 (<CJK Ideograph Extension A>)
@@ -23816,7 +23816,7 @@
23809 0x44D6 䓖 (<CJK Ideograph Extension A>)
23810 0x4661 䙡 (<CJK Ideograph Extension A>)
23811 0x464C 䙌 (<CJK Ideograph Extension A>)
-23812 0xE843  (<Private Use>)
+23812 0x9FB9 龹 (<CJK Ideograph>)
23813 0x4723 䜣 (<CJK Ideograph Extension A>)
23814 0x4729 䜩 (<CJK Ideograph Extension A>)
23815 0x477C 䝼 (<CJK Ideograph Extension A>)
@@ -23833,7 +23833,7 @@
23826 0x499B 䦛 (<CJK Ideograph Extension A>)
23827 0x49B7 䦷 (<CJK Ideograph Extension A>)
23828 0x49B6 䦶 (<CJK Ideograph Extension A>)
-23829 0xE854  (<Private Use>)
+23829 0x9FBA 龺 (<CJK Ideograph>)
23830 0xE855  (<Private Use>)
23831 0x4CA3 䲣 (<CJK Ideograph Extension A>)
23832 0x4C9F 䲟 (<CJK Ideograph Extension A>)
@@ -23849,7 +23849,7 @@
23842 0x4D18 䴘 (<CJK Ideograph Extension A>)
23843 0x4D19 䴙 (<CJK Ideograph Extension A>)
23844 0x4DAE 䶮 (<CJK Ideograph Extension A>)
-23845 0xE864  (<Private Use>)
+23845 0x9FBB 龻 (<CJK Ideograph>)
23846 0xE468  (<Private Use>)
23847 0xE469  (<Private Use>)
23848 0xE46A  (<Private Use>)
2 changes: 1 addition & 1 deletion index-ibm866.txt
@@ -2,7 +2,7 @@
# https://encoding.spec.whatwg.org/
#
# Identifier: db6fe14a559d1601a7667338d83704773d5708dbc641e1ad3c5e21405770f05e
-# Date: 2018-01-06
+# Date: 2024-09-18

0 0x0410 А (CYRILLIC CAPITAL LETTER A)
1 0x0411 Б (CYRILLIC CAPITAL LETTER BE)
2 changes: 1 addition & 1 deletion index-iso-2022-jp-katakana.txt
@@ -2,7 +2,7 @@
# https://encoding.spec.whatwg.org/
#
# Identifier: 6ffc12c11f6eab1ccb3dada740d9b0db096ef0b0783c3bd5ec951dcb4a44b95e
-# Date: 2018-01-06
+# Date: 2024-09-18

0 0x3002 。 (IDEOGRAPHIC FULL STOP)
1 0x300C 「 (LEFT CORNER BRACKET)
2 changes: 1 addition & 1 deletion index-iso-8859-10.txt
@@ -2,7 +2,7 @@
# https://encoding.spec.whatwg.org/
#
# Identifier: 02c2b5590d8ccda9931008c471f6ee2c590b2c8fe5e6ccb3b08638115d778507
-# Date: 2018-01-06
+# Date: 2024-09-18

0 0x0080 € (<control>)
1 0x0081  (<control>)
2 changes: 1 addition & 1 deletion index-iso-8859-13.txt
@@ -2,7 +2,7 @@
# https://encoding.spec.whatwg.org/
#
# Identifier: 40736338e964ab520407cebcb01329f8d450abf6ce12bf88b74b655b60e43300
-# Date: 2018-01-06
+# Date: 2024-09-18

0 0x0080 € (<control>)
1 0x0081  (<control>)
2 changes: 1 addition & 1 deletion index-iso-8859-14.txt
@@ -2,7 +2,7 @@
# https://encoding.spec.whatwg.org/
#
# Identifier: 2c8651cfc08b1f35b17919ee5379f2fa006af3ec809f11b3b7f470785580542b
-# Date: 2018-01-06
+# Date: 2024-09-18

0 0x0080 € (<control>)
1 0x0081  (<control>)
2 changes: 1 addition & 1 deletion index-iso-8859-15.txt
@@ -2,7 +2,7 @@
# https://encoding.spec.whatwg.org/
#
# Identifier: a560aba47bccd7510a6ac77f671fe75dca3800f05cf6d676910c311a8f8ff079
-# Date: 2018-01-06
+# Date: 2024-09-18

0 0x0080 € (<control>)
1 0x0081  (<control>)
2 changes: 1 addition & 1 deletion index-iso-8859-16.txt
@@ -2,7 +2,7 @@
# https://encoding.spec.whatwg.org/
#
# Identifier: 55676320d2d1b6e6909f5b3d741a7cf0cefc84e920aa4474afc091459111c2e3
-# Date: 2018-01-06
+# Date: 2024-09-18

0 0x0080 € (<control>)
1 0x0081  (<control>)
2 changes: 1 addition & 1 deletion index-iso-8859-2.txt
@@ -2,7 +2,7 @@
# https://encoding.spec.whatwg.org/
#
# Identifier: 9569c67f22d0b57790e1c407c6eecf227e4562322dc296de43cdab7a0152ec73
-# Date: 2018-01-06
+# Date: 2024-09-18

0 0x0080 € (<control>)
1 0x0081  (<control>)
2 changes: 1 addition & 1 deletion index-iso-8859-3.txt
@@ -2,7 +2,7 @@
# https://encoding.spec.whatwg.org/
#
# Identifier: af8f1e12df79b768322b5e83613698cdc619438270a2fc359554331c805054a3
-# Date: 2018-01-06
+# Date: 2024-09-18

0 0x0080 € (<control>)
1 0x0081  (<control>)
2 changes: 1 addition & 1 deletion index-iso-8859-4.txt
@@ -2,7 +2,7 @@
# https://encoding.spec.whatwg.org/
#
# Identifier: 72f29c92344d351fe9e74a946e7e0468d76d542c6894ff82982cb652ebe0feb7
-# Date: 2018-01-06
+# Date: 2024-09-18

0 0x0080 € (<control>)
1 0x0081  (<control>)
2 changes: 1 addition & 1 deletion index-iso-8859-5.txt
@@ -2,7 +2,7 @@
# https://encoding.spec.whatwg.org/
#
# Identifier: fa9b1f3f5242df43e2e7bca80e9b6997c67944f20a4af91ee06bacc4e132d9c9
-# Date: 2018-01-06
+# Date: 2024-09-18

0 0x0080 € (<control>)
1 0x0081  (<control>)
2 changes: 1 addition & 1 deletion index-iso-8859-6.txt
@@ -2,7 +2,7 @@
# https://encoding.spec.whatwg.org/
#
# Identifier: 85bb7b5c2dc75975afebe5743935ba4ed5a09c1e9e34e9bfb2ff80293f5d8bbc
-# Date: 2018-01-06
+# Date: 2024-09-18

0 0x0080 € (<control>)
1 0x0081  (<control>)
2 changes: 1 addition & 1 deletion index-iso-8859-7.txt
@@ -2,7 +2,7 @@
# https://encoding.spec.whatwg.org/
#
# Identifier: f53d8aeba36314ef950eef02ffcf11dff540638ce27dfe7a86b6ccc6875afb24
-# Date: 2018-01-06
+# Date: 2024-09-18

0 0x0080 € (<control>)
1 0x0081  (<control>)
2 changes: 1 addition & 1 deletion index-iso-8859-8.txt
@@ -2,7 +2,7 @@
# https://encoding.spec.whatwg.org/
#
# Identifier: 7657a9ca3fa875990da960d3f812eea28dcd0ae6ed55a18d5394303c86f5484b
-# Date: 2018-01-06
+# Date: 2024-09-18

0 0x0080 € (<control>)
1 0x0081  (<control>)
2 changes: 1 addition & 1 deletion index-jis0208.txt
@@ -2,7 +2,7 @@
# https://encoding.spec.whatwg.org/
#
# Identifier: cbaa91f3deb7d0841faf5c33041fc15a285da0e87e64ab802c4bf04b7c4da861
-# Date: 2018-01-06
+# Date: 2024-09-18

0 0x3000   (IDEOGRAPHIC SPACE)
1 0x3001 、 (IDEOGRAPHIC COMMA)
2 changes: 1 addition & 1 deletion index-jis0212.txt
@@ -2,7 +2,7 @@
# https://encoding.spec.whatwg.org/
#
# Identifier: 83bf90dd1c591a4355730d8c4567efc499d74da7490531019ef22a879991cfb7
-# Date: 2018-01-06
+# Date: 2024-09-18

108 0x02D8 ˘ (BREVE)
109 0x02C7 ˇ (CARON)
2 changes: 1 addition & 1 deletion index-koi8-r.txt
@@ -2,7 +2,7 @@
# https://encoding.spec.whatwg.org/
#
# Identifier: c5497cd9071cb352c0e56b219154e539badf63de40b71578f09e2e11fe7d50ae
-# Date: 2018-01-06
+# Date: 2024-09-18

0 0x2500 ─ (BOX DRAWINGS LIGHT HORIZONTAL)
1 0x2502 │ (BOX DRAWINGS LIGHT VERTICAL)
2 changes: 1 addition & 1 deletion index-koi8-u.txt
@@ -2,7 +2,7 @@
# https://encoding.spec.whatwg.org/
#
# Identifier: 19a4da2c3f245118bbc8019326f45a07832949938ff903f03d62ac4da1f61f40
-# Date: 2018-01-06
+# Date: 2024-09-18

0 0x2500 ─ (BOX DRAWINGS LIGHT HORIZONTAL)
1 0x2502 │ (BOX DRAWINGS LIGHT VERTICAL)