Commit

Document gb18030 decoder end-of-queue oddity
Closes #253.
annevk committed Feb 27, 2021
1 parent 4d54adc commit 26f6e56
Showing 1 changed file with 21 additions and 20 deletions.
41 changes: 21 additions & 20 deletions encoding.bs
@@ -42,34 +42,34 @@ specification does not provide a mechanism for extending any aspect of encodings

<h2 id=security-background>Security background</h2>

-<p>There is a set of encoding security issues when the producer and consumer do not agree
-on the encoding in use, or on the way a given encoding is to be implemented. For instance,
-an attack was reported in 2011 where a <a>Shift_JIS</a> lead byte 0x82 was used to
-“mask” a 0x22 trail byte in a JSON resource of which an attacker could control some field.
-The producer did not see the problem even though this is an illegal byte combination. The
-consumer decoded it as a single U+FFFD and therefore changed the overall interpretation as
-U+0022 is an important delimiter. Decoders of encodings that use multiple bytes for scalar
-values now require that in case of an illegal byte combination, a scalar value in the
-range U+0000 to U+007F, inclusive, cannot be “masked”. For the aforementioned sequence the
-output would be U+FFFD U+0022.
+<p>There is a set of encoding security issues when the producer and consumer do not agree on the
+encoding in use, or on the way a given encoding is to be implemented. For instance, an attack was
+reported in 2011 where a <a>Shift_JIS</a> lead byte 0x82 was used to “mask” a 0x22 trail byte in a
+JSON resource of which an attacker could control some field. The producer did not see the problem
+even though this is an illegal byte combination. The consumer decoded it as a single U+FFFD and
+therefore changed the overall interpretation as U+0022 is an important delimiter. Decoders of
+encodings that use multiple bytes for scalar values now require that in case of an illegal byte
+combination, a scalar value in the range U+0000 to U+007F, inclusive, cannot be “masked”. For the
+aforementioned sequence the output would be U+FFFD U+0022. (As an unfortunate exception to this, the
+<a>gb18030 decoder</a> will “mask” up to one such byte at <a>end-of-queue</a>.)
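The "cannot be masked" rule in the paragraph above can be illustrated with a minimal Python sketch. It is not the specification's algorithm text: `decode_shift_jis_sketch` and `lookup_jis0208` are hypothetical names, the index lookup is a one-entry stub standing in for the real index jis0208, and the EUDC pointer range is omitted. The point it shows is that an ASCII trail byte following an invalid lead is put back and decoded on its own.

```python
REPLACEMENT = "\ufffd"


def lookup_jis0208(pointer):
    """Hypothetical stand-in for the index jis0208 lookup (one entry only)."""
    # Pointer 283 corresponds to bytes 0x82 0xA0, HIRAGANA LETTER A (U+3042).
    return {283: "\u3042"}.get(pointer)


def decode_shift_jis_sketch(data: bytes) -> str:
    """Simplified Shift_JIS-style decoder that never masks an ASCII byte."""
    out = []
    lead = 0
    i = 0
    while i < len(data):
        byte = data[i]
        i += 1
        if lead:
            first, lead = lead, 0
            pointer = None
            if 0x40 <= byte <= 0x7E or 0x80 <= byte <= 0xFC:
                offset = 0x40 if byte < 0x7F else 0x41
                lead_offset = 0x81 if first < 0xA0 else 0xC1
                pointer = (first - lead_offset) * 188 + byte - offset
            char = lookup_jis0208(pointer) if pointer is not None else None
            if char is not None:
                out.append(char)
                continue
            # Illegal combination: emit U+FFFD for the lead byte ...
            out.append(REPLACEMENT)
            if byte <= 0x7F:
                # ... and put an ASCII trail byte back so it is decoded by itself.
                i -= 1
            continue
        if byte <= 0x80:
            out.append(chr(byte))
        elif 0xA1 <= byte <= 0xDF:
            # Single-byte halfwidth katakana range.
            out.append(chr(0xFF61 - 0xA1 + byte))
        elif 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC:
            lead = byte
        else:
            out.append(REPLACEMENT)
    if lead:
        # Input ended in the middle of a two-byte sequence.
        out.append(REPLACEMENT)
    return "".join(out)


masked_attempt = decode_shift_jis_sketch(bytes([0x82, 0x22]))
print([f"U+{ord(c):04X}" for c in masked_attempt])
```

For the byte sequence 0x82 0x22 the sketch yields U+FFFD followed by U+0022, so the quotation mark survives as a delimiter. A decoder that consumed both bytes as one error would yield only U+FFFD.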

<p>This is a larger issue for encodings that map anything that is an <a>ASCII byte</a> to something
that is not an <a>ASCII code point</a>, when there is no lead byte present. These are
“ASCII-incompatible” encodings and other than <a>ISO-2022-JP</a> and <a>UTF-16BE/LE</a>, which are
unfortunately required due to deployed content, they are not supported. (Investigation is
<a href=https://github.com/whatwg/encoding/issues/8 lt="Add more labels to the replacement encoding">ongoing</a>
-whether more labels of other such encodings can be mapped to the <a>replacement</a>
-encoding, rather than the unknown encoding fallback.) An example attack is injecting
-carefully crafted content into a resource and then encouraging the user to override the
-encoding, resulting in e.g. script execution.
+whether more labels of other such encodings can be mapped to the <a>replacement</a> encoding, rather
+than the unknown encoding fallback.) An example attack is injecting carefully crafted content into a
+resource and then encouraging the user to override the encoding, resulting in, e.g., script
+execution.

-<p>Encoders used by URLs found in HTML and HTML's form feature can also result in slight
-information loss when an encoding is used that cannot represent all scalar values. E.g.
-when a resource uses the <a>windows-1252</a> encoding a server will not be able to
-distinguish between an end user entering “💩” and “&amp;#128169;” into a form.
+<p>Encoders used by URLs found in HTML and HTML's form feature can also result in slight information
+loss when an encoding is used that cannot represent all scalar values. E.g., when a resource uses
+the <a>windows-1252</a> encoding a server will not be able to distinguish between an end user
+entering “💩” and “&amp;#128169;” into a form.
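The information loss described above can be reproduced with a short Python sketch. It makes two assumptions: Python's `cp1252` codec stands in for windows-1252, and the `xmlcharrefreplace` error handler stands in for the form-submission behavior of replacing an unencodable character with a decimal numeric character reference. `encode_form_value` is a hypothetical helper name, and the percent-encoding a real form submission applies afterwards is left out.

```python
def encode_form_value(value: str) -> bytes:
    """Encode a form field as windows-1252, replacing unencodable characters
    with "&#" + decimal code point + ";" (assumed stand-in for form behavior)."""
    return value.encode("cp1252", errors="xmlcharrefreplace")


typed_emoji = encode_form_value("💩")               # user typed the character itself
typed_reference = encode_form_value("&#128169;")    # user typed the literal text

print(typed_emoji)
print(typed_emoji == typed_reference)
```

Both inputs produce the same bytes, so the receiving server has no way to tell which one the user actually entered.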

-<p>The problems outlined here go away when exclusively using UTF-8, which is one of the
-many reasons that is now the mandatory encoding for all things.
+<p>The problems outlined here go away when exclusively using UTF-8, which is one of the many reasons
+that is now the mandatory encoding for all things.

<p class=note>See also the <a href=#browser-ui>Browser UI</a> chapter.

@@ -3485,6 +3485,7 @@ Shawn Steele,
Simon Montagu,
Simon Pieters,
Simon Sapin,
+Stephen Checkoway,
寺田健 (Takeshi Terada),
Vyacheslav Matva,
Wolf Lammen, and