Document gb18030 decoder end-of-queue oddity

Closes #253.
whatwg · Feb 27, 2021 · 26f6e56
1 parent 4d54adc
Showing 1 changed file with 21 additions and 20 deletions.
diff --git a/encoding.bs b/encoding.bs
@@ -42,34 +42,34 @@ specification does not provide a mechanism for extending any aspect of encodings
 
 <h2 id=security-background>Security background</h2>
 
-<p>There is a set of encoding security issues when the producer and consumer do not agree
-on the encoding in use, or on the way a given encoding is to be implemented. For instance,
-an attack was reported in 2011 where a <a>Shift_JIS</a> lead byte 0x82 was used to
-“mask” a 0x22 trail byte in a JSON resource of which an attacker could control some field.
-The producer did not see the problem even though this is an illegal byte combination. The
-consumer decoded it as a single U+FFFD and therefore changed the overall interpretation as
-U+0022 is an important delimiter. Decoders of encodings that use multiple bytes for scalar
-values now require that in case of an illegal byte combination, a scalar value in the
-range U+0000 to U+007F, inclusive, cannot be “masked”. For the aforementioned sequence the
-output would be U+FFFD U+0022.
+<p>There is a set of encoding security issues when the producer and consumer do not agree on the
+encoding in use, or on the way a given encoding is to be implemented. For instance, an attack was
+reported in 2011 where a <a>Shift_JIS</a> lead byte 0x82 was used to “mask” a 0x22 trail byte in a
+JSON resource of which an attacker could control some field. The producer did not see the problem
+even though this is an illegal byte combination. The consumer decoded it as a single U+FFFD and
+therefore changed the overall interpretation as U+0022 is an important delimiter. Decoders of
+encodings that use multiple bytes for scalar values now require that in case of an illegal byte
+combination, a scalar value in the range U+0000 to U+007F, inclusive, cannot be “masked”. For the
+aforementioned sequence the output would be U+FFFD U+0022. (As an unfortunate exception to this, the
+<a>gb18030 decoder</a> will “mask” up to one such byte at <a>end-of-queue</a>.)
 
 <p>This is a larger issue for encodings that map anything that is an <a>ASCII byte</a> to something
 that is not an <a>ASCII code point</a>, when there is no lead byte present. These are
 “ASCII-incompatible” encodings and other than <a>ISO-2022-JP</a> and <a>UTF-16BE/LE</a>, which are
 unfortunately required due to deployed content, they are not supported. (Investigation is
 <a href=https://github.com/whatwg/encoding/issues/8 lt="Add more labels to the replacement encoding">ongoing</a>
-whether more labels of other such encodings can be mapped to the <a>replacement</a>
-encoding, rather than the unknown encoding fallback.) An example attack is injecting
-carefully crafted content into a resource and then encouraging the user to override the
-encoding, resulting in e.g. script execution.
+whether more labels of other such encodings can be mapped to the <a>replacement</a> encoding, rather
+than the unknown encoding fallback.) An example attack is injecting carefully crafted content into a
+resource and then encouraging the user to override the encoding, resulting in, e.g., script
+execution.
 
-<p>Encoders used by URLs found in HTML and HTML's form feature can also result in slight
-information loss when an encoding is used that cannot represent all scalar values. E.g.
-when a resource uses the <a>windows-1252</a> encoding a server will not be able to
-distinguish between an end user entering “💩” and “&amp;#128169;” into a form.
+<p>Encoders used by URLs found in HTML and HTML's form feature can also result in slight information
+loss when an encoding is used that cannot represent all scalar values. E.g., when a resource uses
+the <a>windows-1252</a> encoding a server will not be able to distinguish between an end user
+entering “💩” and “&amp;#128169;” into a form.
 
-<p>The problems outlined here go away when exclusively using UTF-8, which is one of the
-many reasons that is now the mandatory encoding for all things.
+<p>The problems outlined here go away when exclusively using UTF-8, which is one of the many reasons
+that is now the mandatory encoding for all things.
 
 <p class=note>See also the <a href=#browser-ui>Browser UI</a> chapter.
 
@@ -3485,6 +3485,7 @@ Shawn Steele,
 Simon Montagu,
 Simon Pieters,
 Simon Sapin,
+Stephen Checkoway,
 寺田健 (Takeshi Terada),
 Vyacheslav Matva,
 Wolf Lammen, and
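
Note on the behavior documented above: the sketch below illustrates the "masking" rule and the end-of-queue exception that the new parenthetical describes. It is a structural approximation based on one reading of the gb18030 decoder steps, not the specification's algorithm. The names `gb18030_shape`, `BOX`, and `REPLACEMENT` are hypothetical, and the index tables are omitted, so every sequence whose bytes fall in the valid ranges is assumed to be mapped and is rendered as a placeholder box.

```python
REPLACEMENT = "\ufffd"  # what a decoder error becomes in replacement mode
BOX = "\u25a1"          # placeholder for any mapped multi-byte sequence (assumption)


def gb18030_shape(data: bytes) -> str:
    """Structural sketch of the gb18030 decoder's byte handling.

    Only the byte-range checks and the prepend ("unmask") behavior are modeled.
    No index lookups are performed.
    """
    out = []
    queue = list(data)
    first = second = third = 0  # pending bytes of an unfinished sequence

    while True:
        if not queue:
            # End-of-queue: any pending bytes collapse into a single error.
            # A pending second byte (an ASCII digit) is not prepended again,
            # so it is "masked".
            if first or second or third:
                out.append(REPLACEMENT)
            return "".join(out)

        b = queue.pop(0)
        if third:
            if 0x30 <= b <= 0x39:
                out.append(BOX)  # complete four-byte sequence
            else:
                queue[:0] = [second, third, b]  # reprocess everything after first
                out.append(REPLACEMENT)
            first = second = third = 0
        elif second:
            if 0x81 <= b <= 0xFE:
                third = b
            else:
                queue[:0] = [second, b]  # the ASCII digit is decoded again
                first = second = 0
                out.append(REPLACEMENT)
        elif first:
            if 0x30 <= b <= 0x39:
                second = b
            else:
                first = 0
                if 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFE:
                    out.append(BOX)  # complete two-byte sequence
                else:
                    out.append(REPLACEMENT)
                    if b < 0x80:
                        queue.insert(0, b)  # an ASCII byte cannot be masked
        elif b < 0x80:
            out.append(chr(b))
        elif b == 0x80:
            out.append("\u20ac")
        elif 0x81 <= b <= 0xFE:
            first = b
        else:
            out.append(REPLACEMENT)


# Mid-stream, an ASCII byte after a bad lead byte survives:
print(ascii(gb18030_shape(bytes([0x81, 0x22]))))              # '\ufffd"'
print(ascii(gb18030_shape(bytes([0x61, 0x81, 0x30, 0x62]))))  # 'a\ufffd0b'
# At end-of-queue, the pending digit 0x30 is swallowed by the error:
print(ascii(gb18030_shape(bytes([0x61, 0x81, 0x30]))))        # 'a\ufffd'
```

In this model the only ASCII bytes that can be lost are the digits 0x30 to 0x39, because those are the only ASCII values the decoder holds as a pending second byte. The delimiter 0x22 from the Shift_JIS example is never held that way.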