From 26f6e568b20f8505952be611b1b3174b811be5d5 Mon Sep 17 00:00:00 2001
From: Anne van Kesteren
Date: Fri, 26 Feb 2021 12:18:39 +0100
Subject: [PATCH] Document gb18030 decoder end-of-queue oddity

Closes #253.
---
 encoding.bs | 41 +++++++++++++++++++++--------------------
 1 file changed, 21 insertions(+), 20 deletions(-)

diff --git a/encoding.bs b/encoding.bs
index 7a00e95..27a4110 100644
--- a/encoding.bs
+++ b/encoding.bs
@@ -42,34 +42,34 @@ specification does not provide a mechanism for extending any aspect of encodings

Security background

-There is a set of encoding security issues when the producer and consumer do not agree
-on the encoding in use, or on the way a given encoding is to be implemented. For instance,
-an attack was reported in 2011 where a Shift_JIS lead byte 0x82 was used to
-“mask” a 0x22 trail byte in a JSON resource of which an attacker could control some field.
-The producer did not see the problem even though this is an illegal byte combination. The
-consumer decoded it as a single U+FFFD and therefore changed the overall interpretation as
-U+0022 is an important delimiter. Decoders of encodings that use multiple bytes for scalar
-values now require that in case of an illegal byte combination, a scalar value in the
-range U+0000 to U+007F, inclusive, cannot be “masked”. For the aforementioned sequence the
-output would be U+FFFD U+0022.
+There is a set of encoding security issues when the producer and consumer do not agree on the
+encoding in use, or on the way a given encoding is to be implemented. For instance, an attack was
+reported in 2011 where a Shift_JIS lead byte 0x82 was used to “mask” a 0x22 trail byte in a
+JSON resource of which an attacker could control some field. The producer did not see the problem
+even though this is an illegal byte combination. The consumer decoded it as a single U+FFFD and
+therefore changed the overall interpretation as U+0022 is an important delimiter. Decoders of
+encodings that use multiple bytes for scalar values now require that in case of an illegal byte
+combination, a scalar value in the range U+0000 to U+007F, inclusive, cannot be “masked”. For the
+aforementioned sequence the output would be U+FFFD U+0022. (As an unfortunate exception to this, the
+gb18030 decoder will “mask” up to one such byte at end-of-queue.)

This is a larger issue for encodings that map anything that is an ASCII byte to something that is not an ASCII code point, when there is no lead byte present. These are “ASCII-incompatible” encodings and other than ISO-2022-JP and UTF-16BE/LE, which are unfortunately required due to deployed content, they are not supported. (Investigation is ongoing
-whether more labels of other such encodings can be mapped to the replacement
-encoding, rather than the unknown encoding fallback.) An example attack is injecting
-carefully crafted content into a resource and then encouraging the user to override the
-encoding, resulting in e.g. script execution.
+whether more labels of other such encodings can be mapped to the replacement encoding, rather
+than the unknown encoding fallback.) An example attack is injecting carefully crafted content into a
+resource and then encouraging the user to override the encoding, resulting in, e.g., script
+execution.

-Encoders used by URLs found in HTML and HTML's form feature can also result in slight
-information loss when an encoding is used that cannot represent all scalar values. E.g.
-when a resource uses the windows-1252 encoding a server will not be able to
-distinguish between an end user entering “💩” and “&#128169;” into a form.
+Encoders used by URLs found in HTML and HTML's form feature can also result in slight information
+loss when an encoding is used that cannot represent all scalar values. E.g., when a resource uses
+the windows-1252 encoding a server will not be able to distinguish between an end user
+entering “💩” and “&#128169;” into a form.

-The problems outlined here go away when exclusively using UTF-8, which is one of the
-many reasons that is now the mandatory encoding for all things.
+The problems outlined here go away when exclusively using UTF-8, which is one of the many reasons
+that is now the mandatory encoding for all things.

See also the Browser UI chapter.
@@ -3485,6 +3485,7 @@ Shawn Steele,
Simon Montagu,
Simon Pieters,
Simon Sapin,
+Stephen Checkoway,
寺田健 (Takeshi Terada),
Vyacheslav Matva,
Wolf Lammen, and