From ea4115169fc3ab541d4a51bdb6c2d3bd70127be5 Mon Sep 17 00:00:00 2001 From: edent Date: Thu, 1 Mar 2018 22:13:48 +0000 Subject: [PATCH 1/6] UTF-8 All The Things * Update spec to insist on UTF-8 * Fixes #1039 --- sections/attributes.include | 6 +-- sections/iana.include | 12 +++-- sections/obsolete.include | 16 ++++++ sections/semantics-document-metadata.include | 55 ++++++++------------ sections/semantics-scriptings.include | 10 ---- 5 files changed, 48 insertions(+), 51 deletions(-) diff --git a/sections/attributes.include b/sections/attributes.include index 5d3a0a940c..4135d153ff 100644 --- a/sections/attributes.include +++ b/sections/attributes.include @@ -128,13 +128,13 @@ charset <{meta}> Character encoding declaration - Encoding label* + utf-8 charset <{script}> Character encoding of the external script resource - Encoding label* + utf-8 checked @@ -1378,4 +1378,4 @@ complicated than indicated in the table above.

Event handler content attribute - + \ No newline at end of file diff --git a/sections/iana.include b/sections/iana.include index 292f1beb7d..440b5f504f 100644 --- a/sections/iana.include +++ b/sections/iana.include @@ -37,8 +37,8 @@ :: The charset parameter may be provided to specify the document's character encoding, overriding any [=character encoding declarations=] in the document other than a Byte Order Mark (BOM). - The parameter's value must be one of the labels of the character encoding - used to serialize the file. [[!ENCODING]] + The parameter's value must be an ASCII case-insensitive match for the string + "utf-8". [[!ENCODING]] : Encoding considerations: :: 8bit (see the section on [=character encoding declarations=]) : Security considerations: @@ -264,8 +264,10 @@
Optional parameters:
-
charset
-
The charset parameter may be provided. The parameter's value must be "utf-8". This parameter serves no purpose; it is only allowed for compatibility with legacy servers.
+
charset
+
The charset parameter may be provided. The parameter's value must be "utf-8". + This parameter serves no purpose; it is only allowed for compatibility with legacy servers. +
@@ -288,7 +290,7 @@
Magic number(s):
-
text/ping resources always consist of the four bytes 0x50 0x49 0x4E 0x47 (`PING`).
+
text/ping resources always consist of the four bytes 0x50 0x49 0x4E 0x47 (PING).
File extension(s):
No specific file extension is recommended for this type.
diff --git a/sections/obsolete.include b/sections/obsolete.include index b0da204c66..3878d87f5c 100644 --- a/sections/obsolete.include +++ b/sections/obsolete.include @@ -28,6 +28,10 @@ Authors should not specify a <{img/border}> attribute on an <{img}> element. If the attribute is present, its value must be the string "0". CSS should be used instead. + + Authors should not specify a charset attribute on a <{script}> element. If the + attribute is present, its value must be an [=ASCII case-insensitive=] match for the string + "utf-8". [[!ENCODING]] Authors should not specify a <{script/language}> attribute on a <{script}> element. If the attribute is present, its value must be an [=ASCII case-insensitive=] match for the string @@ -66,6 +70,9 @@ * The presence of a <{img/border}> attribute on an <{img}> element if its value is the string "0". + * The presence of a charset attribute on a <{script}> element if its value is an + [=ASCII case-insensitive=] match for "utf-8". + * The presence of a <{script/language}> attribute on a <{script}> element if its value is an [=ASCII case-insensitive=] match for the string "JavaScript" and if there is no <{script/type}> attribute or there is and its value is an [=ASCII case-insensitive=] match @@ -174,6 +181,11 @@ : charset on <{link}> elements :: Use an HTTP Content-Type header on the linked resource instead. + : charset on <{script}> elements + (except as noted in the previous section) + :: Omit the attribute. Both documents and scripts are required to use UTF-8. It is + redundant to specify it on the <{script}> element since it inherits from the document. + : coords on <{a}> elements : shape on <{a}> elements :: Use area instead of <{a}> for image maps. @@ -1351,11 +1363,15 @@
     partial interface HTMLScriptElement {
+      [CEReactions] attribute DOMString charset;
       [CEReactions] attribute DOMString event;
       [CEReactions] attribute DOMString htmlFor;
     };
   
+ The charset IDL attribute of the + <{script}> element must reflect the element's charset content attribute. + The event IDL attribute of the <{script}> element must reflect the element's <{script/event}> content attribute. diff --git a/sections/semantics-document-metadata.include b/sections/semantics-document-metadata.include index 021daf7b7c..26e6ccb369 100644 --- a/sections/semantics-document-metadata.include +++ b/sections/semantics-document-metadata.include @@ -829,12 +829,10 @@ The charset attribute specifies the character encoding used by the document. This is a character encoding declaration. If the attribute is present in an XML document, its value must be an - ASCII case-insensitive match for the string "utf-8" (and the - document is therefore forced to use UTF-8 as its encoding). + ASCII case-insensitive match for the string "utf-8". -

The charset attribute on the - <{meta}> element has no effect in XML documents, and is only allowed in order to - facilitate migration to and from XHTML.

+

The charset attribute on the <{meta}> element has no effect in XML + documents. It is allowed in order to facilitate migration to and from XHTML.

There must not be more than one <{meta}> element with a charset attribute per document. @@ -1221,10 +1219,9 @@ This state's user agent requirements are all handled by the parsing section of the specification. For <{meta}> elements with an http-equiv attribute in the encoding declaration state, the content attribute must have a value that is an - ASCII case-insensitive match for a string that consists of: the literal string + ASCII case-insensitive match for a string that consists of the literal string "text/html;", optionally followed by any number of [=space characters=], - followed by the literal string "charset=", followed by one of the labels - of the character encoding of the character encoding declaration. + followed by the literal string "charset=utf-8". A document must not contain both a <{meta}> element with an http-equiv attribute in the encoding declaration state and a <{meta}> element with the @@ -1417,24 +1414,31 @@ A character encoding declaration is a mechanism by which the character encoding used to store or transmit a document is specified. + + The only acceptable character encoding declaration for the modern web is UTF-8. + + This must be identified by the character encoding label's value being an + ASCII case-insensitive match for the string "utf-8". + + Regardless of whether a character encoding declaration is present or not, the actual character + encoding used to encode the document must be UTF-8. [[!ENCODING]] The following restrictions apply to [=character encoding declarations=]: - * The character encoding name given must be an ASCII case-insensitive match for one of the - labels of the character encoding used to serialize the file. [[!ENCODING]] * The character encoding declaration must be serialized without the use of character references or character escapes of any kind. * The element containing the character encoding declaration must be serialized completely within the first 1024 bytes of the document. 
+ * Due to a number of restrictions on <{meta}> elements, there can only be one + meta-based character encoding declaration per document. - In addition, due to a number of restrictions on <{meta}> elements, there can only be one - meta-based character encoding declaration per document. + Authoring tools should default to using UTF-8 for newly-created documents. [[!ENCODING]] If an HTML document does not start with a BOM, and its encoding is not explicitly - given by Content-Type metadata, and the document is not an `iframe` `srcdoc` document, then the character encoding used must be an - ASCII-compatible encoding, and the encoding must be specified using a meta - element with a charset attribute or a <{meta}> element with an - http-equiv attribute in the encoding declaration state. + given by Content-Type metadata, and the document is not an `iframe` `srcdoc` document, + then the encoding must be specified using a meta element with a charset + attribute or a <{meta}> element with an http-equiv attribute in the + encoding declaration state.

A character encoding declaration is required (either in the Content-Type metadata or @@ -1449,23 +1453,8 @@ If an HTML document contains a <{meta}> element with a charset attribute or a <{meta}> element with an http-equiv attribute in the - encoding declaration state, then the character encoding used must be an - ASCII-compatible encoding. - - Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings. - [[!ENCODING]] - - Authoring tools should default to using UTF-8 for newly-created documents. [[!ENCODING]] - - Authors must not use encodings that are not defined in the WHATWG Encoding specification. Additionally, - authors should not use ISO-2022-JP. [[!ENCODING]] - -

- Some encodings that are not defined in the WHATWG Encoding specification use bytes in the range 0x20 - to 0x7E, inclusive, to encode characters other than the corresponding characters in the range - U+0020 to U+007E, inclusive, and represent a potential security vulnerability: A user agent - might end up interpreting supposedly benign plain text content as HTML tags and JavaScript. -

+ encoding declaration state, then the character + encoding used must be an ASCII-compatible encoding.

Using non-UTF-8 encodings can have unexpected results on form submission and URL encodings, diff --git a/sections/semantics-scriptings.include b/sections/semantics-scriptings.include index 259254f606..93bc0c4fa3 100644 --- a/sections/semantics-scriptings.include +++ b/sections/semantics-scriptings.include @@ -135,15 +135,6 @@ <{script/async}>, <{script/defer}>, <{script/crossorigin}>, and <{script/integrity}> attributes must not be specified. - The charset attribute gives the character - encoding of the external script resource. The attribute must not be specified if the - <{script/src}> attribute is not present, or if the script is not a classic script. - (Module scripts are always interpreted as UTF-8.) If the attribute is set, its value - must be an ASCII case-insensitive match for one of the - labels of an encoding, and must specify the same - encoding as the charset parameter of the Content-Type metadata of the external - file, if any. [[!ENCODING]] - The async and defer attributes are boolean attributes that indicate how a script should be loaded and executed. @@ -204,7 +195,6 @@ The IDL attributes src, type, - charset, defer, and integrity, must each reflect the respective content attributes of the same name. From bde9c6986beb96a4e704c8f66c78c1f02fe026e6 Mon Sep 17 00:00:00 2001 From: edent Date: Wed, 14 Mar 2018 11:20:50 +0000 Subject: [PATCH 2/6] Fixes Fixes https://github.com/w3c/html/pull/1273/files#r173224772 --- sections/iana.include | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sections/iana.include b/sections/iana.include index 440b5f504f..ca9a587bdb 100644 --- a/sections/iana.include +++ b/sections/iana.include @@ -38,7 +38,7 @@ document's character encoding, overriding any [=character encoding declarations=] in the document other than a Byte Order Mark (BOM). The parameter's value must be an ASCII case-insensitive match for the string - "utf-8". [[!ENCODING]] + "utf-8". 
: Encoding considerations: :: 8bit (see the section on [=character encoding declarations=]) : Security considerations: From 466be71052186b237e6a631b300295e065eb051d Mon Sep 17 00:00:00 2001 From: edent Date: Wed, 14 Mar 2018 11:22:04 +0000 Subject: [PATCH 3/6] Fixes Fixes https://github.com/w3c/html/pull/1273/files#r171796008 --- sections/semantics-document-metadata.include | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sections/semantics-document-metadata.include b/sections/semantics-document-metadata.include index 26e6ccb369..5ef3edf5cc 100644 --- a/sections/semantics-document-metadata.include +++ b/sections/semantics-document-metadata.include @@ -1432,7 +1432,7 @@ * Due to a number of restrictions on <{meta}> elements, there can only be one meta-based character encoding declaration per document. - Authoring tools should default to using UTF-8 for newly-created documents. [[!ENCODING]] + Authoring tools must default to using UTF-8 for newly-created documents. [[!ENCODING]] If an HTML document does not start with a BOM, and its encoding is not explicitly given by Content-Type metadata, and the document is not an `iframe` `srcdoc` document, From 42746fbaafa255461d43f491a43c43242daecc10 Mon Sep 17 00:00:00 2001 From: edent Date: Wed, 14 Mar 2018 11:31:09 +0000 Subject: [PATCH 4/6] Fixes Fixes https://github.com/w3c/html/pull/1273/files#r171796372 --- sections/semantics-document-metadata.include | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sections/semantics-document-metadata.include b/sections/semantics-document-metadata.include index 5ef3edf5cc..ac7468b81f 100644 --- a/sections/semantics-document-metadata.include +++ b/sections/semantics-document-metadata.include @@ -1454,7 +1454,7 @@ If an HTML document contains a <{meta}> element with a charset attribute or a <{meta}> element with an http-equiv attribute in the encoding declaration state, then the character - encoding used must be an ASCII-compatible encoding. 
+ encoding used must be UTF-8.

Using non-UTF-8 encodings can have unexpected results on form submission and URL encodings, From 2940df28b5f9bbda631217b01fad0032968b50e9 Mon Sep 17 00:00:00 2001 From: Terence Eden Date: Thu, 29 Mar 2018 09:52:55 +0100 Subject: [PATCH 5/6] Update text Fixes https://github.com/w3c/html/pull/1273#discussion_r177964465 --- sections/iana.include | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sections/iana.include b/sections/iana.include index ca9a587bdb..46a2b06c18 100644 --- a/sections/iana.include +++ b/sections/iana.include @@ -266,7 +266,7 @@

charset
The charset parameter may be provided. The parameter's value must be "utf-8". - This parameter serves no purpose; it is only allowed for compatibility with legacy servers. + This parameter exists only for compatibility with legacy servers.
From ff85cce4831ec563c64d642efc350528cf8c479a Mon Sep 17 00:00:00 2001 From: Terence Eden Date: Thu, 29 Mar 2018 09:58:23 +0100 Subject: [PATCH 6/6] Reflect UTF-8 changes --- sections/changes.include | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/sections/changes.include b/sections/changes.include index e4ab045534..d2cee2f799 100644 --- a/sections/changes.include +++ b/sections/changes.include @@ -24,6 +24,10 @@
Change to match reality. Fixed issue 1212
Caption end tag can be ommitted.
Substantive change to match reality. Fixed issue 1158
+
+ Mandate UTF-8 +
+
Substantive change to match reality. UTF-8 is now mandatory for all new pages. Fixed issue 1039

Changes between the HTML 5.3 Second Public Working Draft
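
Reviewer note: the authoring requirements this series introduces can be sketched as a small conformance check. The sketch is not part of the patch. It only looks at the rules stated in the diff above: the label must be an ASCII case-insensitive match for "utf-8", there may be only one meta-based declaration per document, the declaring element must sit within the first 1024 bytes, and the document bytes must actually be UTF-8. The names `META_CHARSET`, `is_utf8_label` and `check_encoding_declarations` are hypothetical, and the regular expression is a rough approximation rather than the parser's encoding prescan.

```python
import re

# Rough pattern for a <meta> tag that carries a charset declaration, either as
# charset="..." or inside content="text/html; charset=...". Hypothetical helper:
# this is much simpler than the real encoding prescan in the parsing section.
META_CHARSET = re.compile(
    rb"<meta\b[^>]*?\bcharset\s*=\s*[\"']?\s*([^\"'\s/>;]+)[^>]*>",
    re.IGNORECASE,
)


def is_utf8_label(label: bytes) -> bool:
    # bytes.lower() folds only ASCII letters, which gives an
    # ASCII case-insensitive comparison against "utf-8".
    return label.lower() == b"utf-8"


def check_encoding_declarations(document: bytes) -> list:
    """Return a list of human-readable problems; empty means no problem found."""
    problems = []
    matches = list(META_CHARSET.finditer(document))

    if len(matches) > 1:
        problems.append("more than one meta-based character encoding declaration")

    for m in matches:
        if not is_utf8_label(m.group(1)):
            problems.append("label %r is not a match for 'utf-8'" % m.group(1))
        # m.end() is the offset just past the closing '>' of the element.
        if m.end() > 1024:
            problems.append("declaration is not within the first 1024 bytes")

    # Regardless of any declaration, the bytes themselves must be UTF-8.
    try:
        document.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("document bytes are not valid UTF-8")

    return problems
```

For example, `check_encoding_declarations(b'<meta charset="UTF-8">')` reports nothing, while a document declaring `iso-8859-1` gets one problem for the label. The check reports every violation instead of stopping at the first, which seems more useful for an authoring tool than a single pass/fail result.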