Reflect changes in GB 18030-2022 #312
https://bugs.webkit.org/show_bug.cgi?id=257770
rdar://110353061

Reviewed by Myles C. Maxfield.

This was already done internally in ICU in rdar://107702106
This reflects changes published as GB-18030-2022
This was proposed as a change to the standard at whatwg/encoding#312
This fixes an assertion when running encoding tests on macOS Sonoma and iOS 17, and I added test coverage specific to the 18 changed code points.

* LayoutTests/imported/w3c/web-platform-tests/encoding/legacy-mb-schinese/gb18030/gb18030-encoder-expected.txt:
* LayoutTests/imported/w3c/web-platform-tests/encoding/legacy-mb-schinese/gb18030/gb18030-encoder.html:
* Source/WTF/wtf/PlatformHave.h:
* Source/WebCore/PAL/pal/text/EncodingTables.cpp:
(PAL::gb18030):

Canonical link: https://commits.webkit.org/264918@main
We, at least, definitely want to update (in fact, we already did). The definition of this encoding has been updated upstream. There needs to be consistent behavior between the web browser on a platform and native apps on the platform. As the adage goes, "the future is longer than the past" and there will be more content produced with the new mappings than there is existing content. We can't just close our eyes and hope that all authors use UTF-8, especially when there are laws requiring that ~all products sold in certain places must conform.
@litherum Did WebKit implement the Unicode Technical Committee recommendation on this topic?
Yes.
We ended up moving away from that recommendation and going with the exact GB 18030-2022 mappings instead.
We have since moved to the Unicode recommendation. Either way, something related to GB 18030-2022 should probably be reflected in this standard.
These are the Unicode recommendations:
Notably these only impact gb18030, not GBK. So I don't think we want to change index-gb18030 in the Encoding Standard, although we might want to update the note about it reflecting GB18030-2005 in some way. Instead we would have to directly patch the gb18030 encoder and decoder. Although likely somewhat ugly, that does not seem too bad. I think I would also duplicate the code points that can be transcoded in two directions for simplicity, although we could also create a mini index for them.
I created #336 which I hope addresses this.
This implements the Unicode Technical Committee recommendation around GB18030-2022 in a manner suitable for this standard, taking into account existing practice and the closeness between GBK and gb18030. In particular, using the text file attached to https://www.unicode.org/L2/L2023/23003r-gb18030-recommendations.pdf this does the following:

1. Merges the first set of 18 mappings, which are bidirectional, directly into index gb18030, replacing existing PUA entries. This ends up impacting GBK and gb18030.
2. The second set of 18 mappings (from PUA to bytes) are encoded as an encoder-only table, for both GBK and gb18030.
3. The third set of 18 mappings (from bytes to code points) are ignored, as they are already covered by index gb18030 ranges. (Presumably they are included because the recommendation covers the transition from "Previous Mappings" to "Current Mappings" to "Recommended Mappings", whereas we are going directly from "Previous Mappings" to "Recommended Mappings".)

The reason for changing GBK as well is that Chromium and WebKit already have code in the wild that impacts GBK to some degree (although the encoder-only table is excluded for GBK only at the moment, including that would make the most sense compatibility-wise) and no fallout has been recorded. Additionally, GBK is already positioned as a rough subset of gb18030 in this standard, with the decoder being shared completely.

Tests: encoding/legacy-mb-schinese has some GB18030-2022 coverage already. The aim is to complete that with web-platform-tests/wpt#48239 and web-platform-tests/wpt#48240.

This supersedes #335. This fixes #27 and fixes #312.
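As a rough illustration of steps 1 and 2, here is a minimal Python sketch of how a two-byte encoder might consult an encoder-only table before falling back to the index. The names `INDEX`, `ENCODER_ONLY`, `encode_two_byte`, and `decode_two_byte` are hypothetical, and each table holds a single sample entry (0xA6 0xD9, U+FE10, U+E78D) chosen for illustration; the authoritative entries are in the recommendation's text file. The pointer arithmetic follows the usual 190-trail-bytes-per-lead layout of index gb18030.

```python
# Toy stand-ins for the real tables; one illustrative entry each.
# INDEX: pointer -> code point (bidirectional mapping, step 1).
INDEX = {7182: 0xFE10}
# ENCODER_ONLY: PUA code point -> bytes (one-way mapping, step 2).
ENCODER_ONLY = {0xE78D: bytes([0xA6, 0xD9])}


def pointer_to_bytes(pointer):
    """Convert an index pointer to its lead/trail byte pair."""
    lead, trail = divmod(pointer, 190)
    offset = 0x40 if trail < 0x3F else 0x41
    return bytes([lead + 0x81, trail + offset])


def bytes_to_pointer(lead, trail):
    """Convert a lead/trail byte pair back to an index pointer."""
    offset = 0x40 if trail < 0x7F else 0x41
    return (lead - 0x81) * 190 + (trail - offset)


def encode_two_byte(code_point):
    """Return two bytes for code_point, or None if it is not mapped here."""
    # The encoder-only table is checked first, so legacy PUA code points
    # still encode to the bytes they used to round-trip with.
    if code_point in ENCODER_ONLY:
        return ENCODER_ONLY[code_point]
    for pointer, mapped in INDEX.items():
        if mapped == code_point:
            return pointer_to_bytes(pointer)
    return None


def decode_two_byte(lead, trail):
    """Return the code point for a byte pair, or None if it is not mapped here."""
    # The decoder only sees the index, so the PUA code point is never produced.
    return INDEX.get(bytes_to_pointer(lead, trail))
```

With these sample entries, both U+FE10 and U+E78D encode to the same byte pair, while decoding that pair yields only U+FE10. That asymmetry is the point of keeping the second set as an encoder-only table rather than merging it into the index.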
https://encoding.spec.whatwg.org/index-gb18030.txt contains 18 code points that have been changed by GB 18030-2022. We should probably update.