UTF-8 All The Things #1273
Changes from 2 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -37,8 +37,8 @@ | |
:: The <code>charset</code> parameter may be provided to specify the | ||
<a>document's character encoding</a>, overriding any | ||
[=character encoding declarations=] in the document other than a Byte Order Mark (BOM). | ||
The parameter's value must be one of the <a lt="character encoding">labels</a> of the <a>character encoding</a> | ||
used to serialize the file. [[!ENCODING]] | ||
The parameter's value must be an <a>ASCII case-insensitive</a> match for the string | ||
"<code>utf-8</code>". [[!ENCODING]] | ||
: Encoding considerations: | ||
:: 8bit (see the section on [=character encoding declarations=]) | ||
: Security considerations: | ||
|
@@ -264,8 +264,10 @@ | |
<dt>Optional parameters:</dt> | ||
<dd> | ||
<dl> | ||
<dt><code data-x="">charset</code></dt> | ||
<dd>The charset parameter may be provided. The parameter's value must be "<code>utf-8</code>". This parameter serves no purpose; it is only allowed for compatibility with legacy servers.</dd> | ||
<dt><code data-x="">charset</code></dt> | ||
<dd>The charset parameter may be provided. The parameter's value must be "<code>utf-8</code>". | ||
This parameter serves no purpose; it is only allowed for compatibility with legacy servers. | ||
Review comment: "This parameter is for compatibility with legacy servers", no?
||
</dd> | ||
</dl> | ||
</dd> | ||
|
||
|
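As an illustration of the `charset` parameter rule in the hunks above, the following is a minimal sketch of a check for a `Content-Type` value. The function names `ascii_lower` and `charset_param_conforms` are hypothetical, and treating an absent parameter as conforming is an assumption based on the "may be provided" wording. It compares with ASCII-only case folding, as the first hunk requires.

```python
from email.message import Message


def ascii_lower(text: str) -> str:
    """Lowercase only A-Z, leaving every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def charset_param_conforms(content_type: str) -> bool:
    """Return True if the charset parameter is absent or matches "utf-8"."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset is None:
        return True  # the parameter is optional ("may be provided")
    if not isinstance(charset, str):
        return False  # assumption: RFC 2231 encoded values are rejected here
    return ascii_lower(charset) == "utf-8"
```

For example, `charset_param_conforms("text/html; charset=UTF-8")` is accepted, while a value naming `windows-1252` is rejected.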
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -829,12 +829,10 @@ | |
The <dfn element-attr for="meta"><code>charset</code></dfn> attribute specifies the character | ||
encoding used by the document. This is a <a>character encoding declaration</a>. If the | ||
attribute is present in an <a>XML document</a>, its value must be an | ||
<a>ASCII case-insensitive</a> match for the string "<code>utf-8</code>" (and the | ||
document is therefore forced to use <a>UTF-8</a> as its encoding). | ||
<a>ASCII case-insensitive</a> match for the string "<code>utf-8</code>". | ||
|
||
<p class="note">The <code>charset</code> attribute on the | ||
<{meta}> element has no effect in XML documents, and is only allowed in order to | ||
facilitate migration to and from XHTML.</p> | ||
<p class="note">The <code>charset</code> attribute on the <{meta}> element has no effect in XML | ||
documents. It is allowed in order to facilitate migration to and from XHTML.</p> | ||
|
||
There must not be more than one <{meta}> element with a <code>charset</code> attribute | ||
per document. | ||
|
@@ -1221,10 +1219,9 @@ | |
This state's user agent requirements are all handled by the parsing section of the specification. | ||
|
||
For <{meta}> elements with an <code>http-equiv</code> attribute in the <a state for="http-equiv">encoding declaration state</a>, the <code>content</code> attribute must have a value that is an | ||
<a>ASCII case-insensitive</a> match for a string that consists of: the literal string | ||
<a>ASCII case-insensitive</a> match for a string that consists of the literal string | ||
"<code>text/html;</code>", optionally followed by any number of [=space characters=], | ||
followed by the literal string "<code>charset=</code>", followed by one of the <a lt="character encoding">labels</a> | ||
of the <a>character encoding</a> of the <a>character encoding declaration</a>. | ||
followed by the literal string "<code>charset=utf-8</code>". | ||
|
||
A document must not contain both a <{meta}> element with an <code>http-equiv</code> | ||
attribute in the <a state for="http-equiv">encoding declaration state</a> and a <{meta}> element with the | ||
|
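The revised `content` attribute rule in the hunk above can be sketched as a single regular expression. This is an illustrative check, not part of the pull request; the name `http_equiv_content_conforms` is hypothetical, and the set of space characters (space, tab, line feed, form feed, carriage return) is assumed from the HTML specification's definition.

```python
import re

# "text/html;", any number of space characters, then "charset=utf-8".
# re.ASCII restricts case-insensitive matching to ASCII letters.
_CONTENT_RE = re.compile(
    r"text/html;[ \t\n\f\r]*charset=utf-8",
    re.IGNORECASE | re.ASCII,
)


def http_equiv_content_conforms(value: str) -> bool:
    """Return True if the whole attribute value matches the required string."""
    return _CONTENT_RE.fullmatch(value) is not None
```

Using `fullmatch` means that leading or trailing characters, including trailing whitespace, make the value non-conforming under this sketch.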
@@ -1417,24 +1414,31 @@ | |
|
||
A <dfn>character encoding declaration</dfn> is a mechanism by which the <a>character encoding</a> | ||
used to store or transmit a document is specified. | ||
|
||
The only acceptable character encoding declaration for the modern web is <a>UTF-8</a>. | ||
Review comment: Is it possible to rephrase this less didactically?
Reply: No :-) That is, I'm not sure how else to phrase it in order to get the point across. Suggestions welcome.
Reply: Yeah, this is hardly a high-order problem.
||
|
||
This must be identified by the <a>character encoding</a> label's value being an | ||
<a>ASCII case-insensitive</a> match for the string "<code>utf-8</code>". | ||
|
||
Regardless of whether a character encoding declaration is present or not, the actual character | ||
encoding used to encode the document must be <a>UTF-8</a>. [[!ENCODING]] | ||
|
||
The following restrictions apply to [=character encoding declarations=]: | ||
|
||
* The character encoding name given must be an <a>ASCII case-insensitive</a> match for one of the | ||
<a lt="character encoding">labels</a> of the <a>character encoding</a> used to serialize the file. [[!ENCODING]] | ||
* The character encoding declaration must be serialized without the use of | ||
<a>character references</a> or character escapes of any kind. | ||
* The element containing the character encoding declaration must be serialized completely | ||
within <dfn>the first 1024 bytes</dfn> of the document. | ||
* Due to a number of restrictions on <{meta}> elements, there can only be one | ||
<code>meta</code>-based character encoding declaration per document. | ||
|
||
In addition, due to a number of restrictions on <{meta}> elements, there can only be one | ||
<code>meta</code>-based character encoding declaration per document. | ||
Authoring tools should default to using <a>UTF-8</a> for newly-created documents. [[!ENCODING]] | ||
Review comment: If we require utf-8 then this (and similar requirements) have to be a must.
||
|
||
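The restriction listed above that the declaring element be serialized completely within the first 1024 bytes could be checked roughly as follows. This is a sketch under stated assumptions: the regular expression is a loose, hypothetical approximation of a `<meta>` start tag that mentions `charset`, not the HTML parser's prescan, and the function name is invented for illustration.

```python
import re

# Loose pattern for a <meta> start tag containing the word "charset".
# For bytes patterns, IGNORECASE only folds ASCII letters.
_META_CHARSET_RE = re.compile(rb"<meta\b[^>]*\bcharset\b[^>]*>", re.IGNORECASE)


def declaration_within_first_1024_bytes(document: bytes) -> bool:
    """Return True if the first matching <meta> tag ends by byte 1024."""
    match = _META_CHARSET_RE.search(document)
    return match is not None and match.end() <= 1024
```

A document with no matching `<meta>` element returns `False` here, which only says that no in-document declaration was found early enough; a declaration may still be supplied by a BOM or by Content-Type metadata.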
If an <a>HTML document</a> does not start with a BOM, and its <a>encoding</a> is not explicitly | ||
given by <a>Content-Type metadata</a>, and the document is not <a>an `iframe` `srcdoc` document</a>, then the character encoding used must be an | ||
<a>ASCII-compatible encoding</a>, and the encoding must be specified using a <code>meta</code> | ||
element with a <code>charset</code> attribute or a <{meta}> element with an | ||
<code>http-equiv</code> attribute in the <a state for="http-equiv" lt="content-type">encoding declaration state</a>. | ||
given by <a>Content-Type metadata</a>, and the document is not <a>an `iframe` `srcdoc` document</a>, | ||
then the encoding must be specified using a <code>meta</code> element with a <code>charset</code> | ||
attribute or a <{meta}> element with an <code>http-equiv</code> attribute in the | ||
<a state for="http-equiv" lt="content-type">encoding declaration state</a>. | ||
|
||
<p class="note"> | ||
A character encoding declaration is required (either in the <a>Content-Type metadata</a> or | ||
|
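The requirement in the hunk above that the document's actual encoding be UTF-8, whether or not a declaration is present, can be approximated by a strict decode. This is a minimal sketch; `is_utf8` is a hypothetical helper name, and a successful decode shows only that the bytes are well-formed UTF-8.

```python
def is_utf8(document: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 without errors."""
    try:
        document.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True
```

For instance, `"café".encode("utf-8")` passes, while the same text encoded as windows-1252 fails because the lone byte 0xE9 is not a valid UTF-8 sequence.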
@@ -1449,23 +1453,8 @@ | |
|
||
If an <a>HTML document</a> contains a <{meta}> element with a <code>charset</code> | ||
attribute or a <{meta}> element with an <code>http-equiv</code> attribute in the | ||
<a state for="http-equiv" lt="content-type">encoding declaration state</a>, then the character encoding used must be an | ||
<a>ASCII-compatible encoding</a>. | ||
|
||
Authors should use <a>UTF-8</a>. Conformance checkers may advise authors against using legacy encodings. | ||
[[!ENCODING]] | ||
|
||
Authoring tools should default to using <a>UTF-8</a> for newly-created documents. [[!ENCODING]] | ||
|
||
Authors must not use encodings that are not defined in the WHATWG Encoding specification. Additionally, | ||
authors should not use <a>ISO-2022-JP</a>. [[!ENCODING]] | ||
|
||
<p class="note"> | ||
Some encodings that are not defined in the WHATWG Encoding specification use bytes in the range 0x20 | ||
to 0x7E, inclusive, to encode characters other than the corresponding characters in the range | ||
U+0020 to U+007E, inclusive, and represent a potential security vulnerability: A user agent | ||
might end up interpreting supposedly benign plain text content as HTML tags and JavaScript. | ||
</p> | ||
<a state for="http-equiv" lt="content-type">encoding declaration state</a>, then the character | ||
encoding used must be an <a>ASCII-compatible encoding</a>. | ||
Review comment: Why doesn't this just require utf-8?
||
|
||
<p class="note"> | ||
Using non-UTF-8 encodings can have unexpected results on form submission and URL encodings, | ||
|
Review comment: If we just enforce a string, there is no normative dependency on the encoding spec here.
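To make the trade-off in this last comment concrete, here is a small sketch contrasting the two approaches. The label set is an assumed subset of the labels the Encoding Standard lists for UTF-8, not the authoritative list, and all names are hypothetical.

```python
# Assumed subset of the UTF-8 labels from the Encoding Standard; the
# authoritative list lives in that specification.
ASSUMED_UTF8_LABELS = {"utf-8", "utf8", "unicode-1-1-utf-8"}


def ascii_lower(text: str) -> str:
    """Lowercase only A-Z, leaving every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def matches_label(value: str) -> bool:
    """Old wording: the value matches any label of the encoding."""
    return ascii_lower(value) in ASSUMED_UTF8_LABELS


def matches_string(value: str) -> bool:
    """New wording: the value matches the single string "utf-8"."""
    return ascii_lower(value) == "utf-8"
```

Under this sketch, `"UTF8"` satisfies the label-based rule but not the string-based one. The string-based check needs no label table at all, which is the dependency the comment refers to.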