[WoT Profile] Unclear character set constraints and non-UTF-8 html #1664

Open
himorin opened this issue Mar 1, 2023 · 8 comments
Labels
needs-resolution i18n expects this item to be resolved to their satisfaction. s:wot-profile missing link t:char_choosing 4.5 Choosing character encodings wg:wot https://www.w3.org/groups/wg/wot

Comments

@himorin
Contributor

himorin commented Mar 1, 2023

This is a tracker issue. Only discuss things here if they are i18n WG internal meta-discussions about the issue. Contribute to the actual discussion at the following link:

§ w3c/wot-profile#386

@himorin himorin added pending Issue not yet sent to WG, or raised by tracker tool & needing labels. s:wot-profile missing link t:char_choosing 4.5 Choosing character encodings labels Mar 1, 2023
@aphillips
Contributor

Note that a number of the media types you mention are already constrained to use UTF-8 and do not require (or, in some cases, do not allow) a charset parameter.

Is your comment:

All non-binary formats shall have constraint of charset as UTF-8.

... meant to be a suggestion to add to the quoted paragraph?

@himorin
Contributor Author

himorin commented Mar 1, 2023

Actually, the table has a Constraint column, and some of its rows specify charset=UTF-8 explicitly.
I understand that the HTML spec (WHATWG) limits the encoding to UTF-8, and that RFC 8259 states that no charset parameter is defined for the JSON media type, but reading 6.6.1, I'm really not sure whether the current wording is appropriate or friendly to readers of the specification (e.g. it could just have a line such as 'UTF-8 is mandatory for all payloads')...

@himorin
Contributor Author

himorin commented Mar 8, 2023

@aphillips thank you for your (and the WG's) comments during the call.
I'm still wondering how to actually write the last line, but how about the edited text?

@aphillips
Contributor

@himorin Thanks for working on this.

For the table I would change this:

| Relation-Type | Constraint | Remarks |
| --- | --- | --- |
| service-doc | human readable documentation, supported formats are Unicode Text, markdown, HTML and PDF. | |

to use the remarks more clearly:

| Relation-Type | Constraint | Remarks |
| --- | --- | --- |
| service-doc | supported media types are: text/plain, text/html, text/markdown and text/pdf | Human readable documentation |

And I would go on to add a paragraph under the table:

The types text/plain, text/html, and text/markdown MUST include a charset parameter (for example, text/plain;charset=utf-8) and the linked files MUST use the UTF-8 character encoding. The type text/pdf uses Unicode in its encoding.

Note well: RFC2854 defines text/html and is not obsolete. When the charset parameter is missing, the default encoding is Latin-1 (and specifically iso-8859-1). In practice browsers treat Latin-1 as windows-1252 and HTML5 sniffs the encoding in various ways (weighted towards trying to find UTF-8). However, it is still a good idea to use charset=UTF-8.
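To illustrate why the fallback matters, here is a minimal Python sketch (not part of the proposed spec text) that decodes the same UTF-8 bytes under the three encodings mentioned above. The euro sign is just a convenient sample character, chosen because its UTF-8 form contains a byte in the 0x80-0x9F range, where iso-8859-1 and windows-1252 disagree.

```python
# UTF-8 bytes for the euro sign, as a server might send them
# without a charset parameter.
data = "€".encode("utf-8")  # b'\xe2\x82\xac'

# Decode the same bytes three ways.
as_utf8 = data.decode("utf-8")         # what the author intended
as_cp1252 = data.decode("cp1252")      # windows-1252, as browsers treat Latin-1
as_latin1 = data.decode("iso-8859-1")  # strict iso-8859-1

# repr() makes the invisible C1 control character in the
# iso-8859-1 result visible.
print(repr(as_utf8))
print(repr(as_cp1252))
print(repr(as_latin1))
```

Only the first decoding round-trips. The other two produce different mojibake from each other, which is one reason an explicit charset=UTF-8 is safer than relying on any default.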

Annoyingly, the definition of the text/markdown type in RFC7763 is actually unhelpful: it requires a charset parameter but does not make UTF-8 (or any other encoding) the default because (and I quote):

[...] its syntax rules operate on characters (specifically, on punctuation) rather than code points. Many Markdown processors will get along just fine by operating on characters in the US-ASCII repertoire (specifically punctuation), blissfully oblivious to other characters or codes.

Therefore, in 6.6.2 I would include the charset=UTF-8 on all three of the first rows. I would then add a similar paragraph to the one in 6.6.1 saying approximately:

The types text/plain, text/html, and text/markdown MUST include a charset parameter (for example, text/plain;charset=utf-8) and the linked files MUST use the UTF-8 character encoding. The types application/json and application/ld+json are already restricted to UTF-8. The type text/pdf uses Unicode in its encoding. Binary types, such as image/jpeg or application/octet-stream, do not have a character encoding associated with them or define the encoding internally.
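As a rough sketch of how a consumer might check a link's type value against the paragraph above, the following Python applies only the charset rule for the three text types. The names `parse_media_type` and `check_charset` are hypothetical, the parameter parsing is deliberately simplistic (no quoted semicolons, no RFC 2231 forms), and nothing here comes from the WoT Profile itself.

```python
# Types that, per the suggested paragraph, MUST carry charset=utf-8.
TEXT_TYPES_REQUIRING_CHARSET = {"text/plain", "text/html", "text/markdown"}


def parse_media_type(value):
    """Split 'type/subtype; k=v; ...' into (type, params). Simplified parser."""
    parts = [p.strip() for p in value.split(";")]
    media_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if "=" in part:
            key, val = part.split("=", 1)
            params[key.strip().lower()] = val.strip().strip('"')
    return media_type, params


def check_charset(type_value):
    """Return a list of problems; an empty list means the value passes."""
    media_type, params = parse_media_type(type_value)
    problems = []
    if media_type in TEXT_TYPES_REQUIRING_CHARSET:
        charset = params.get("charset")
        if charset is None:
            problems.append(f"{media_type}: charset parameter is missing")
        elif charset.lower() != "utf-8":
            problems.append(f"{media_type}: charset must be utf-8, got {charset}")
    # JSON types and binary types are left alone: the former are already
    # UTF-8 by definition, the latter carry no character encoding.
    return problems


print(check_charset("text/plain;charset=utf-8"))
print(check_charset("text/html"))
print(check_charset("text/markdown; charset=iso-8859-1"))
```

This checks only the declared parameter. Verifying that the linked file really is UTF-8 would require fetching and decoding it, which is out of scope for this sketch.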

@himorin
Contributor Author

himorin commented Mar 9, 2023

@aphillips Thank you for the careful consideration.
I had thought about that style of table a bit, but didn't go in that direction since it overlaps with the next table... If we are going to propose adding media types to the link relation table, I'd rather propose merging the two, something like:

| Relation-Type | Supported Media Types | Constraints | Remarks |
| --- | --- | --- | --- |
| icon | image/png, image/jpeg | | |
| service-doc | text/plain, text/html, text/markdown, text/pdf | Linked files MUST use the UTF-8 character encoding. | Human readable documentation |

Keeping two separate tables that both contain similar information (media types) could confuse readers and make it hard to piece the information together. With the last paragraph of @aphillips's comment attached below the merged table, it seems easier to convey everything in one place.

@himorin
Contributor Author

himorin commented Mar 13, 2023

Ahh, in addition to making UTF-8 mandatory, do we need to change hreflang from optional to required for text/plain and text/markdown with service-doc, and leave it blank for anything else?

@himorin
Contributor Author

himorin commented Mar 20, 2023

@aphillips how about this??


Section 6, Links, is unclear and disorganized on several points:

  1. Link relation types are strongly tied to media types as constraints, but these media types carry additional constraints of their own, so the relevant descriptions end up scattered across the specification.
  2. The constraint for the service-doc link relation type is written as

     human readable documentation, supported formats are Unicode Text, markdown, HTML and PDF.

     but the word "Unicode" is unclear. Considering the restrictions placed on the media types, it should state clearly that UTF-8 is mandatory for all applicable types.
  3. hreflang is marked as optional, but should be mandatory for text/plain, text/markdown, and possibly text/html.

We propose rewriting this section as a single table, so that all constraints can be seen in one place, and reorganizing the accompanying descriptive text accordingly, something like:

| Relation-Type | Supported Media Types | Constraints | Remarks |
| --- | --- | --- | --- |
| icon | image/png, image/jpeg | | |
| service-doc | text/plain, text/html, text/markdown, text/pdf | Linked files MUST use the UTF-8 character encoding. hreflang is mandatory for text/plain and text/markdown | Human readable documentation. |
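One possible reading of the merged table, expressed as data plus a small check, is sketched below. `PROPOSED_TABLE` and `check_link` are hypothetical names, the link is assumed to be a plain dict with `rel`, `href`, `type` and `hreflang` keys, and the example URLs are placeholders. The media type list is copied from the table as written in this thread, including text/pdf (the IANA-registered type for PDF is application/pdf).

```python
# The proposed table as a lookup keyed by relation type.
# Values mirror the table in this comment; they are not normative.
PROPOSED_TABLE = {
    "icon": {
        "media_types": {"image/png", "image/jpeg"},
        "hreflang_required_for": set(),
    },
    "service-doc": {
        "media_types": {"text/plain", "text/html", "text/markdown", "text/pdf"},
        "hreflang_required_for": {"text/plain", "text/markdown"},
    },
}


def check_link(link):
    """Return a list of problems for one link dict (empty list if none)."""
    row = PROPOSED_TABLE.get(link.get("rel"))
    if row is None:
        return []  # relation type not covered by the table
    # Drop any parameters such as ';charset=utf-8' before comparing.
    media_type = link.get("type", "").split(";")[0].strip().lower()
    problems = []
    if media_type not in row["media_types"]:
        problems.append(f"{media_type!r} is not supported for {link['rel']}")
    if media_type in row["hreflang_required_for"] and not link.get("hreflang"):
        problems.append(f"hreflang is mandatory for {media_type}")
    return problems


doc_link = {
    "rel": "service-doc",
    "href": "https://example.com/manual.md",  # placeholder URL
    "type": "text/markdown;charset=utf-8",
}
print(check_link(doc_link))                       # flags the missing hreflang
print(check_link({**doc_link, "hreflang": "en"}))  # no problems reported
```

Keeping the hreflang rule per media type, rather than per relation type, leaves room to add text/html later if the "possibly" in point 3 is resolved that way.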

@himorin
Contributor Author

himorin commented Apr 4, 2023

Hi @aphillips, could you kindly take some time to have a look at this?

@xfq xfq added needs-resolution i18n expects this item to be resolved to their satisfaction. and removed pending Issue not yet sent to WG, or raised by tracker tool & needing labels. labels Apr 8, 2023
@w3cbot w3cbot added the wg:wot https://www.w3.org/groups/wg/wot label Feb 9, 2024