This repository has been archived by the owner on Jul 30, 2019. It is now read-only.

UTF-8 All The Things #1273

Merged
merged 9 commits into from
Mar 29, 2018
Changes from 2 commits
6 changes: 3 additions & 3 deletions sections/attributes.include
@@ -128,13 +128,13 @@
<th><code>charset</code></th>
<td><{meta}></td>
<td><a>Character encoding declaration</a></td>
<td><a>Encoding label</a>*</td>
<td><a>utf-8</a></td>
</tr>
<tr>
<th><code>charset</code></th>
<td><{script}></td>
<td>Character encoding of the external script resource</td>
<td><a>Encoding label</a>*</td>
<td><a>utf-8</a></td>
</tr>
<tr>
<th><code>checked</code></th>
@@ -1378,4 +1378,4 @@ complicated than indicated in the table above.</small></p>
<td><a>Event handler content attribute</a></td>
</tr>
</tbody>
</table>
</table>
10 changes: 6 additions & 4 deletions sections/iana.include
@@ -37,8 +37,8 @@
:: The <code>charset</code> parameter may be provided to specify the
<a>document's character encoding</a>, overriding any
[=character encoding declarations=] in the document other than a Byte Order Mark (BOM).
The parameter's value must be one of the <a lt="character encoding">labels</a> of the <a>character encoding</a>
used to serialize the file. [[!ENCODING]]
The parameter's value must be an <a>ASCII case-insensitive</a> match for the string
"<code>utf-8</code>". [[!ENCODING]]
Review comment (Collaborator):

If we just enforce a string, there is no normative dependency on the encoding spec here.
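
To illustrate the reviewer's point: the new wording can be checked with a plain string comparison, without consulting the label table in the Encoding specification. Below is a minimal Python sketch. The helper names `ascii_lowercase` and `is_utf8_charset_value` are hypothetical and do not come from the spec or from this PR.

```python
def ascii_lowercase(s: str) -> str:
    """Lowercase only the ASCII letters A-Z; every other character is left as-is."""
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in s)


def is_utf8_charset_value(value: str) -> bool:
    """True if value is an ASCII case-insensitive match for the string "utf-8".

    No trimming or label lookup is performed: this sketch assumes the
    requirement is a match against the literal string only.
    """
    return ascii_lowercase(value) == "utf-8"
```

Under this check `utf8` is rejected even though the Encoding Standard lists it as a label for UTF-8, because the revised text asks for the literal string rather than any label.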

: Encoding considerations:
:: 8bit (see the section on [=character encoding declarations=])
: Security considerations:
@@ -264,8 +264,10 @@
<dt>Optional parameters:</dt>
<dd>
<dl>
<dt><code data-x="">charset</code></dt>
<dd>The charset parameter may be provided. The parameter's value must be "<code>utf-8</code>". This parameter serves no purpose; it is only allowed for compatibility with legacy servers.</dd>
<dt><code data-x="">charset</code></dt>
<dd>The charset parameter may be provided. The parameter's value must be "<code>utf-8</code>".
This parameter serves no purpose; it is only allowed for compatibility with legacy servers.
Review comment (Collaborator):

"This parameter is for compatibility with legacy servers", no?

</dd>
</dl>
</dd>

16 changes: 16 additions & 0 deletions sections/obsolete.include
@@ -28,6 +28,10 @@

Authors should not specify a <{img/border}> attribute on an <{img}> element. If the
attribute is present, its value must be the string "<code>0</code>". CSS should be used instead.

Authors should not specify a <code>charset</code> attribute on a <{script}> element. If the
attribute is present, its value must be an [=ASCII case-insensitive=] match for the string
"<code>utf-8</code>". [[!ENCODING]]

Authors should not specify a <{script/language}> attribute on a <{script}> element. If
the attribute is present, its value must be an [=ASCII case-insensitive=] match for the string
@@ -66,6 +70,9 @@
* The presence of a <{img/border}> attribute on an <{img}> element if its value is the string
"<code>0</code>".

* The presence of a <code>charset</code> attribute on a <{script}> element if its value is an
[=ASCII case-insensitive=] match for "<code>utf-8</code>".

* The presence of a <{script/language}> attribute on a <{script}> element if its value is an
[=ASCII case-insensitive=] match for the string "<code>JavaScript</code>" and if there is no
<{script/type}> attribute or there is and its value is an [=ASCII case-insensitive=] match
@@ -174,6 +181,11 @@
: <dfn element-attr for="link"><code>charset</code></dfn> on <{link}> elements
:: Use an HTTP <code>Content-Type</code> header on the linked resource instead.

: <dfn element-attr for="script"><code>charset</code></dfn> on <{script}> elements
(except as noted in the previous section)
:: Omit the attribute. Both documents and scripts are required to use <a>UTF-8</a>. It is
redundant to specify it on the <{script}> element since it inherits from the document.

: <dfn element-attr for="a"><code>coords</code></dfn> on <{a}> elements
: <dfn element-attr for="a"><code>shape</code></dfn> on <{a}> elements
:: Use <code>area</code> instead of <{a}> for image maps.
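
The hunks above give the <code>charset</code> attribute on <{script}> two outcomes: discouraged but allowed when it matches "utf-8", and non-conforming otherwise. The following Python sketch shows how a checker might classify the attribute. It assumes that "obsolete but conforming" maps to a warning and "non-conforming" to an error; the function name `check_script_charset` and the returned labels are hypothetical, not part of the spec text.

```python
import string
from typing import Optional

# Translation table that lowercases ASCII letters only (assumption: this is
# an adequate stand-in for the spec's "ASCII case-insensitive" comparison).
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def check_script_charset(value: Optional[str]) -> str:
    """Classify a <script charset> attribute value.

    value is None when the attribute is absent.
    Returns "ok", "warning", or "error" (hypothetical labels).
    """
    if value is None:
        return "ok"  # attribute omitted, as the diff recommends
    if value.translate(_ASCII_LOWER) == "utf-8":
        return "warning"  # obsolete but conforming
    return "error"  # any other value is non-conforming
```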
@@ -1351,11 +1363,15 @@

<pre class="idl" data-highlight="webidl">
partial interface HTMLScriptElement {
[CEReactions] attribute DOMString charset;
[CEReactions] attribute DOMString event;
[CEReactions] attribute DOMString htmlFor;
};
</pre>

The <dfn attribute for="HTMLScriptElement"><code>charset</code></dfn> IDL attribute of the
<{script}> element must reflect the element's <code>charset</code> content attribute.

The <dfn attribute for="HTMLScriptElement"><code>event</code></dfn> IDL attribute of the
<{script}> element must reflect the element's <{script/event}> content attribute.

55 changes: 22 additions & 33 deletions sections/semantics-document-metadata.include
@@ -829,12 +829,10 @@
The <dfn element-attr for="meta"><code>charset</code></dfn> attribute specifies the character
encoding used by the document. This is a <a>character encoding declaration</a>. If the
attribute is present in an <a>XML document</a>, its value must be an
<a>ASCII case-insensitive</a> match for the string "<code>utf-8</code>" (and the
document is therefore forced to use <a>UTF-8</a> as its encoding).
<a>ASCII case-insensitive</a> match for the string "<code>utf-8</code>".

<p class="note">The <code>charset</code> attribute on the
<{meta}> element has no effect in XML documents, and is only allowed in order to
facilitate migration to and from XHTML.</p>
<p class="note">The <code>charset</code> attribute on the <{meta}> element has no effect in XML
documents. It is allowed in order to facilitate migration to and from XHTML.</p>

There must not be more than one <{meta}> element with a <code>charset</code> attribute
per document.
@@ -1221,10 +1219,9 @@
This state's user agent requirements are all handled by the parsing section of the specification.

For <{meta}> elements with an <code>http-equiv</code> attribute in the <a state for="http-equiv">encoding declaration state</a>, the <code>content</code> attribute must have a value that is an
<a>ASCII case-insensitive</a> match for a string that consists of: the literal string
<a>ASCII case-insensitive</a> match for a string that consists of the literal string
"<code>text/html;</code>", optionally followed by any number of [=space characters=],
followed by the literal string "<code>charset=</code>", followed by one of the <a lt="character encoding">labels</a>
of the <a>character encoding</a> of the <a>character encoding declaration</a>.
followed by the literal string "<code>charset=utf-8</code>".
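
The revised <code>content</code> requirement is simple enough to express as a single pattern. The Python sketch below assumes that "space characters" means U+0020, U+0009, U+000A, U+000C and U+000D, and that nothing may precede or follow the described string. The name `is_valid_encoding_declaration_content` is hypothetical.

```python
import re

# "text/html;" + optional space characters + "charset=utf-8".
# re.ASCII restricts case-insensitive matching to ASCII letters.
_ENCODING_DECLARATION = re.compile(
    r"text/html;[ \t\n\f\r]*charset=utf-8",
    re.IGNORECASE | re.ASCII,
)


def is_valid_encoding_declaration_content(content: str) -> bool:
    """True if the whole content value matches the form described above."""
    return _ENCODING_DECLARATION.fullmatch(content) is not None
```

For example, `text/html; charset=utf-8` and `TEXT/HTML;CHARSET=UTF-8` pass, while `text/html; charset=iso-8859-1` and values with a trailing space do not.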

A document must not contain both a <{meta}> element with an <code>http-equiv</code>
attribute in the <a state for="http-equiv">encoding declaration state</a> and a <{meta}> element with the
@@ -1417,24 +1414,31 @@

A <dfn>character encoding declaration</dfn> is a mechanism by which the <a>character encoding</a>
used to store or transmit a document is specified.

The only acceptable character encoding declaration for the modern web is <a>UTF-8</a>.
Review comment (Collaborator):

Is it possible to rephrase this less didactically?

Reply (Member Author):

No :-)

That is, I'm not sure how else to phrase it in order to get the point across. Suggestions welcome.

Reply (Member Author):

The problems outlined here go away when exclusively using UTF-8, which is one of the many reasons that is now the mandatory encoding for all things.
https://www.w3.org/TR/2018/CR-encoding-20180327/

Reply (Collaborator):

yeah, this is hardly a high-order problem.


This must be identified by the <a>character encoding</a> label's value being an
<a>ASCII case-insensitive</a> match for the string "<code>utf-8</code>".

Regardless of whether a character encoding declaration is present or not, the actual character
encoding used to encode the document must be <a>UTF-8</a>. [[!ENCODING]]

The following restrictions apply to [=character encoding declarations=]:

* The character encoding name given must be an <a>ASCII case-insensitive</a> match for one of the
<a lt="character encoding">labels</a> of the <a>character encoding</a> used to serialize the file. [[!ENCODING]]
* The character encoding declaration must be serialized without the use of
<a>character references</a> or character escapes of any kind.
* The element containing the character encoding declaration must be serialized completely
within <dfn>the first 1024 bytes</dfn> of the document.
* Due to a number of restrictions on <{meta}> elements, there can only be one
<code>meta</code>-based character encoding declaration per document.

In addition, due to a number of restrictions on <{meta}> elements, there can only be one
<code>meta</code>-based character encoding declaration per document.
Authoring tools should default to using <a>UTF-8</a> for newly-created documents. [[!ENCODING]]
Review comment (Collaborator):

If we require utf-8 then this (and similar requirements) have to be a must.
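
Two of the requirements in this hunk lend themselves to a mechanical check: the document bytes must actually be UTF-8, and the element carrying the declaration must be serialized completely within the first 1024 bytes. The Python sketch below is a rough illustration only. The regular expression is a simplification and not the spec's prescan algorithm, and the names `is_utf8` and `declaration_within_first_1024_bytes` are hypothetical.

```python
import re

# Simplified pattern for a <meta> start tag that mentions "charset=".
# It also matches the http-equiv form, where "charset=" sits inside the
# content attribute. With a bytes pattern, IGNORECASE is ASCII-only.
_META_CHARSET = re.compile(rb"<meta\s[^>]*charset\s*=[^>]*>", re.IGNORECASE)


def is_utf8(data: bytes) -> bool:
    """True if data decodes as UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def declaration_within_first_1024_bytes(data: bytes) -> bool:
    """True if the first meta-based declaration ends at or before byte 1024.

    Returns False when no declaration is found at all.
    """
    match = _META_CHARSET.search(data)
    return match is not None and match.end() <= 1024
```

A declaration that begins inside the first 1024 bytes but ends after them fails the second check, which matches the "serialized completely" wording.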


If an <a>HTML document</a> does not start with a BOM, and its <a>encoding</a> is not explicitly
given by <a>Content-Type metadata</a>, and the document is not <a>an `iframe` `srcdoc` document</a>, then the character encoding used must be an
<a>ASCII-compatible encoding</a>, and the encoding must be specified using a <code>meta</code>
element with a <code>charset</code> attribute or a <{meta}> element with an
<code>http-equiv</code> attribute in the <a state for="http-equiv" lt="content-type">encoding declaration state</a>.
given by <a>Content-Type metadata</a>, and the document is not <a>an `iframe` `srcdoc` document</a>,
then the encoding must be specified using a <code>meta</code> element with a <code>charset</code>
attribute or a <{meta}> element with an <code>http-equiv</code> attribute in the
<a state for="http-equiv" lt="content-type">encoding declaration state</a>.

<p class="note">
A character encoding declaration is required (either in the <a>Content-Type metadata</a> or
@@ -1449,23 +1453,8 @@

If an <a>HTML document</a> contains a <{meta}> element with a <code>charset</code>
attribute or a <{meta}> element with an <code>http-equiv</code> attribute in the
<a state for="http-equiv" lt="content-type">encoding declaration state</a>, then the character encoding used must be an
<a>ASCII-compatible encoding</a>.

Authors should use <a>UTF-8</a>. Conformance checkers may advise authors against using legacy encodings.
[[!ENCODING]]

Authoring tools should default to using <a>UTF-8</a> for newly-created documents. [[!ENCODING]]

Authors must not use encodings that are not defined in the WHATWG Encoding specification. Additionally,
authors should not use <a>ISO-2022-JP</a>. [[!ENCODING]]

<p class="note">
Some encodings that are not defined in the WHATWG Encoding specification use bytes in the range 0x20
to 0x7E, inclusive, to encode characters other than the corresponding characters in the range
U+0020 to U+007E, inclusive, and represent a potential security vulnerability: A user agent
might end up interpreting supposedly benign plain text content as HTML tags and JavaScript.
</p>
<a state for="http-equiv" lt="content-type">encoding declaration state</a>, then the character
encoding used must be an <a>ASCII-compatible encoding</a>.
Review comment (Collaborator):

why doesn't this just require utf-8?


<p class="note">
Using non-UTF-8 encodings can have unexpected results on form submission and URL encodings,
10 changes: 0 additions & 10 deletions sections/semantics-scriptings.include
@@ -135,15 +135,6 @@
<{script/async}>, <{script/defer}>, <{script/crossorigin}>, and <{script/integrity}> attributes must
not be specified.

The <dfn element-attr for="script"><code>charset</code></dfn> attribute gives the character
encoding of the external script resource. The attribute must not be specified if the
<{script/src}> attribute is not present, or if the script is not a <a>classic script</a>.
(<a>Module scripts</a> are always interpreted as UTF-8.) If the attribute is set, its value
must be an <a>ASCII case-insensitive</a> match for one of the
<a lt="character encoding">labels</a> of an <a>encoding</a>, and must specify the same
<a>encoding</a> as the <code>charset</code> parameter of the <a>Content-Type metadata</a> of the external
file, if any. [[!ENCODING]]

The <dfn element-attr for="script"><code>async</code></dfn> and
<dfn element-attr for="script"><code>defer</code></dfn> attributes are <a>boolean attributes</a>
that indicate how a script should be loaded and executed.
@@ -204,7 +195,6 @@
The IDL attributes
<dfn attribute for="HTMLScriptElement"><code>src</code></dfn>,
<dfn attribute for="HTMLScriptElement"><code>type</code></dfn>,
<dfn attribute for="HTMLScriptElement"><code>charset</code></dfn>,
<dfn attribute for="HTMLScriptElement"><code>defer</code></dfn>, and
<dfn attribute for="HTMLScriptElement"><code>integrity</code></dfn>, must each <a>reflect</a> the
respective content attributes of the same name.