separate implicit information from failure handling #48

mattgarrish · 2017-08-24T11:04:12Z

I find the current approach of making publications valid by having user agents fill in holes worrisome. It will make validation for authors that much harder, because there will be no indication that information is missing, or clarity of what it is.

Can we start being more specific about what information an author can intentionally omit, in other words?

Reliance on implicit computation will never be perfect, but as an example here's how the title section could read:

The Web Publication's infoset requires a title.

An author MAY omit the title from the manifest only when the first resource in the default reading order contains a non-empty title element (e.g., an HTML or SVG document). In this case, the user agent MUST use this title as the title of the Web Publication.

In the case of an invalid Web Publication that does not contain a non-empty title, the user agent MUST provide one. This specification does not mandate how the title is computed. The user agent might:

use the non-empty title of a subsequent primary resource in the default reading order;

provide a language-specific placeholder title (e.g., 'Untitled Publication');

use the URL of the manifest;

calculate a title using its own algorithms.

(Personally, I'd rather a title only be implied if there is exactly one resource in the default reading order, but can live with taking the title from the first resource.)
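As a rough illustration of the split proposed above, the sketch below separates the permitted omission (title taken from the first resource) from failure handling (everything else). The `Resource` and `Publication` types, the `source` labels, and the order of the recovery steps are assumptions made for the example; the proposed text deliberately does not mandate how a recovery title is computed.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Resource:
    url: str
    title: Optional[str] = None  # text of the document's title element, if any


@dataclass
class Publication:
    manifest_url: str
    manifest_title: Optional[str] = None
    reading_order: List[Resource] = field(default_factory=list)


def non_empty(value: Optional[str]) -> bool:
    # Assumption: "non-empty" means present and not whitespace-only.
    return value is not None and value.strip() != ""


def resolve_title(pub: Publication) -> Tuple[str, str]:
    """Return (title, source); source is 'manifest', 'implicit' or 'recovery'."""
    if non_empty(pub.manifest_title):
        return pub.manifest_title.strip(), "manifest"
    # Permitted omission: the first resource supplies the title.
    if pub.reading_order and non_empty(pub.reading_order[0].title):
        return pub.reading_order[0].title.strip(), "implicit"
    # Failure handling: no usable title was declared, so the user agent recovers.
    # Hypothetical order: a later resource's title, then the manifest URL.
    for resource in pub.reading_order[1:]:
        if non_empty(resource.title):
            return resource.title.strip(), "recovery"
    return pub.manifest_url, "recovery"
```

A caller (or a validator) could treat the `'recovery'` source as the signal that something was missing, while `'implicit'` would pass silently.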


iherman · 2017-08-24T11:29:28Z

@mattgarrish I am fine with this type of reformulation of, e.g., title.

The TOC approach of using nav elements is a bit different, though, because the TOC is not a required infoset entry...

lrosenthol · 2017-08-24T11:48:02Z

On Thu, Aug 24, 2017 at 7:04 AM, Matt Garrish wrote:

I find the current approach of making publications valid by having user agents fill in holes worrisome.

Why? It's what the rest of the web does today (not to mention many other technologies).

It will make validation for authors that much harder, because there will be no indication that information is missing, or clarity of what it is.

Authors don't validate. And even if they do, then an actual validator can report on such missing items (either as errors or warnings).

Can we start being more specific about what information an author can intentionally omit, in other words?

Why are we assuming intentional omission vs. simply not having the data (those are two different things)? It's omitted - the reason isn't important.

An author MAY omit the title from the manifest only when the first resource in the default reading order contains a non-empty title element (e.g., an HTML or SVG document). In this case, the user agent MUST use this title as the title of the Web Publication.

Is "non-empty" the proper terminology? Do we mean that there is one but it's an empty/null string? Or that it's not provided at all? Or both?

In the case of an invalid Web Publication

You can't talk about invalid publications - invalid can only be determined in the context of validation. Just say that it doesn't have a title.

mattgarrish · 2017-08-24T11:51:12Z

But it's still recommended, which in validation parlance is a warning. It's fine to have the UA go looking for one, but it should still be a clear recovery step with a warning.

The wording could be cribbed like this, for example:

The table of contents is recommended in the infoset.

An author MAY omit the table of contents from the manifest if it is included in an HTML nav element identified by the role 'doc-toc'. This nav element MUST be in a file named index.html or in the first resource in the default reading order. In this case, the user agent MUST use this navigation element as the table of contents for the Web Publication.

If a user agent requires a table of contents and one is not specified, it MAY provide one of its own making. This specification does not mandate how such a table of contents is created. The user agent might:

attempt to locate a table of contents nav element in a subsequent primary resource in the default reading order;

use the titles of the primary resources in the default reading order;

calculate a table of contents using its own algorithms.

The warning kicks in if the toc isn't specified in the manifest or explicitly found in the first resource.
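The lookup and the warning described above might be sketched as follows. This is only an illustration: the `locate_toc` function, its arguments, the exact-name match on `index.html`, and the choice to check `index.html` before the first resource are all assumptions, not anything the proposed text specifies.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple


class TocFinder(HTMLParser):
    """Records whether the document contains a nav element with role 'doc-toc'."""

    def __init__(self) -> None:
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "nav":
            # The role attribute may hold several space-separated tokens.
            role = dict(attrs).get("role") or ""
            if "doc-toc" in role.split():
                self.found = True


def has_doc_toc(html: str) -> bool:
    finder = TocFinder()
    finder.feed(html)
    return finder.found


def locate_toc(
    manifest_toc: Optional[str],
    reading_order: List[str],
    documents: Dict[str, str],  # hypothetical map of URL to HTML source
) -> Tuple[Optional[str], List[str]]:
    """Return (URL of the resource holding the TOC, warnings)."""
    if manifest_toc:
        return manifest_toc, []
    # Only the two permitted implicit locations are examined.
    candidates = ["index.html"] + reading_order[:1]
    for url in candidates:
        if url in documents and has_doc_toc(documents[url]):
            return url, []
    # Anything past this point is recovery, so a warning is reported.
    return None, ["table of contents not found in manifest, index.html, or first resource"]
```

If the returned URL is `None`, the user agent would still be free to build a table of contents by other means; the warning only records that it had to.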

I didn't go back to your PR, so not sure if this exactly matches what you have. I'm also not yet clear if the table of contents requirement is for a reference to an HTML element in all cases, a JSON object in the manifest, a combination, or what, so the above will need reworking.

(Updated proposed text to account for named file.)

mattgarrish · 2017-08-24T12:00:53Z

Why are we assuming intentional omission vs. simply not having the data (those are two different things)?

Because we're talking about implicit collection of the information from another source than the manifest. There's a distinction between knowing it is being left out to be discovered elsewhere v. accidental omission.

Avoiding redundancy is what we appear to be trying to achieve. But that's done with the author's understanding.

But like I said above, it'll never be perfect because author intent also has to be assumed. But it seems that people want auto-collection of some information without it triggering invalid manifests.

The alternative is to say manifests are invalid/warning-filled if any of these requirements are missing. I'm fine with that, too, but the current wording says the UA has to make up for any deficiencies so no manifest ever has issues. That strikes me as a strange state of affairs.

iherman · 2017-08-24T12:04:58Z

@mattgarrish your formulation in #48 (comment) is fine with me, except that, at least in the original proposal of @dauwhe and @BigBlueHat, a well-known name (index.html) takes precedence. Which actually makes sense, because the default order's first resource may be, say, the cover page and not the toc...

I will try to reformulate the text, but only later (I am on a call right now...).

mattgarrish · 2017-08-24T12:23:59Z

a well-known name (index.html) takes precedence

I'm not completely sold on this, but, yes, it was just a quick copy of the above so appreciate the point about it not being the first resource. I suppose between an index.html and first file (for single-resource publications) that's enough implicit harvesting.

My concern is that we not send user agents on fishing expeditions whereby they have to retrieve and parse document after document in the hopes of finding something.

mattgarrish · 2017-08-24T12:44:36Z

I've made a few tweaks to the proposed wording above, if it helps.

iherman · 2017-08-24T13:04:37Z

@mattgarrish, I have just pushed a new commit to #47

lrosenthol · 2017-08-24T13:11:31Z

On Thu, Aug 24, 2017 at 8:00 AM, Matt Garrish wrote:

Why are we assuming intentional omission vs. simply not having the data (those are two different things)?

Because we're talking about implicit collection of the information from another source than the manifest. There's a distinction between knowing it is being left out to be discovered elsewhere v. accidental omission.

There is also the third category, that the information doesn't actually exist. And software can't tell the difference between the three cases - it's simply not there, but we don't know (or care) why.

But like I said above, it'll never be perfect because author intent also has to be assumed. But it seems that people want auto-collection of some information without it triggering invalid manifests.

Yes, that is true about what some of us believe is correct for this standard. However, the reasons (at least for me) have *nothing* to do with author intent.

The alternative is to say manifest are invalid/warning-filled if any of these requirements are missing. I'm fine with that, too, but the current wording says the UA has to make up for any deficiencies so no manifest ever has issues. That strikes me as a strange state of affairs.

That's the model of the web today - where the UA is supposed to handle all error conditions in a graceful manner. Whether we like it or not, that is the world we are living in...

mattgarrish · 2017-08-24T13:26:00Z

That's the model of the web today - where the UA is supposed to handle all error conditions in a graceful manner.

Right, I'm not saying it's wrong to have fallback handling. All I'm suggesting is that we should be careful not to make everything valid with an expectation that the UA handle the problems.

If you validate an HTML file you get a list of everything you did wrong and omitted. If you choose to ignore it, the UA will kick in and correct. That's not the same as what we currently have.

I'm fine with switching title and language back to recommended if we do this, if that's your concern, at least until the debates about their priority are concluded. I put them to required when I refactored the prose because there was no way they weren't part of the infoset under the old rules.

If we only issue warnings when not present or captured by allowed harvesting techniques, the UA can still build a title and the manifest publication isn't invalid.
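A minimal sketch of that warning-only model, assuming title and language are both treated as recommended: a check reports a warning when a property is neither declared in the manifest nor obtained through a permitted harvesting technique, and warnings alone never make the manifest invalid. The `Issue` type, the `harvested` argument, and the idea that language can be harvested at all are assumptions for illustration.

```python
from typing import Dict, List, NamedTuple, Optional


class Issue(NamedTuple):
    severity: str  # "warning" or "error"
    message: str


def check_manifest(
    manifest: Dict[str, str],
    harvested: Dict[str, Optional[str]],  # values found by permitted harvesting
) -> List[Issue]:
    issues = []
    for name in ("title", "language"):
        explicit = (manifest.get(name) or "").strip()
        implicit = (harvested.get(name) or "").strip()
        if not explicit and not implicit:
            # Recommended, not required: report it, but do not fail.
            issues.append(Issue("warning", f"{name} is recommended but was not found"))
    return issues


def is_valid(issues: List[Issue]) -> bool:
    # Only errors invalidate; the user agent can still build a value itself.
    return not any(issue.severity == "error" for issue in issues)
```

With this shape, an author who runs the check sees what was missing, while a user agent that ignores the report still ends up with a usable publication.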

mattgarrish · 2017-08-24T13:30:48Z

Here's how it might be revised:

It is recommended that the Web Publication's infoset include a title.

An author MAY omit the title from the manifest when the first resource in the default reading order contains a non-empty title element (e.g., an HTML or SVG document). In this case, the user agent MUST use this title as the title of the Web Publication.

If a user agent requires a title and one is not specified, it MAY provide one of its own making. This specification does not mandate how such a title is created. The user agent might:

use the non-empty title of a subsequent primary resource in the default reading order;

provide a language-specific placeholder title (e.g., 'Untitled Publication');

use the URL of the manifest;

calculate a title using its own algorithms.

(And, yes, to your comment further up, we can be more specific about what 'non-empty' means. I'll try to clean up items like that in a real PR.)

lrosenthol · 2017-08-24T13:37:48Z

On Thu, Aug 24, 2017 at 9:30 AM, Matt Garrish wrote:

Here's how it might be revised: It is recommended that the Web Publication's infoset include a title.

You don't need to say that - we have a standards word for that: SHOULD.

See https://www.ietf.org/rfc/rfc2119.txt


lrosenthol · 2017-08-24T13:41:35Z

[email sent prematurely - trying again]

On Thu, Aug 24, 2017 at 9:30 AM, Matt Garrish wrote:

Here's how it might be revised: It is recommended that the Web Publication's infoset include a title.

You don't need to say that - we have a standards word for that: SHOULD. See https://www.ietf.org/rfc/rfc2119.txt

A Web Publication's infoset SHOULD have a title.

An author MAY omit the title from the manifest when the first resource in the default reading order contains a non-empty title element (e.g., an HTML or SVG document). In this case, the user agent MUST use this title as the title of the Web Publication.

If the title is not present in the manifest, then the user agent MUST determine the title either from the first resource in the default reading order (if it contains a valid title element) or using one of the suggested methods in the following list: (list follows...)

Don't talk about authors - they aren't important to the standard. All we care about is the file format and the user agent.

…

lrosenthol · 2017-08-24T13:42:58Z

On Thu, Aug 24, 2017 at 9:26 AM, Matt Garrish wrote:

If we only issue warnings when not present or captured by allowed harvesting techniques, the UA can still build a title and the manifest publication isn't invalid.

Stop trying to build a standard based on the behavior of validation...

Just state the requirements and recommendations of the file format and those of the user agent. That's it. No authors. No validators.

mattgarrish · 2017-08-24T13:54:20Z

A Web Publication's infoset SHOULD have a title.

Except that this is already stated in RFC terms in the requirements section. That's what I'm trying to avoid duplicating.

Ideally these repeated conformance statements should be dropped and each section should start with an explanation of the property/structure and only provide requirements for how it is constructed.

I'll put together an actual PR for title/language so we can discuss other wording issues there.

BigBlueHat · 2017-08-24T15:29:22Z

Still catching up with all this, but @iherman it was never the intent of @dauwhe or myself to "mint" a name for the ToC document (re: your comment #48 (comment)).

We used index.html in the examples because GitHub serves those (as do most web servers) as the default resource from a file system directory such as https://dauwhe.github.io/html-first/MobyDickNav/

The confusion may come from this line in the explainer:

Define the URL of a web publication to be the URL of this “index” resource which contains the nav.

The point was that the publication (and its canonical URL) would serve this "index (ToC) resource"--not that it would be named index or index.html. Additionally, we fully expected things like rel="canonical" (or whatever we explore here) to be used (and useful) for additional identifiers (even the canonical one).

The rules section in the explainer also mentions "index" resource--which again was meant to be conceptually an index/ToC.

Hope that's clearer. 😄

iherman · 2017-08-24T15:33:16Z

@BigBlueHat it is... but, nevertheless, I am not sure we can live with 'just' the first entry in the default reading order. Alternatively, we can say that the UA takes the "first entry in the default reading order that has a TOC with the right role". I.e., the UA goes through all the entries until it finds the TOC.
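The "go through all the entries" alternative could look roughly like the sketch below. The `fetch` and `has_toc` callables and the optional `max_fetches` cap are hypothetical; the cap is included only to show one way the earlier concern about fetching document after document might be bounded.

```python
from typing import Callable, List, Optional


def find_toc_resource(
    reading_order: List[str],
    fetch: Callable[[str], str],      # hypothetical: returns the HTML source for a URL
    has_toc: Callable[[str], bool],   # hypothetical: detects a nav with the right role
    max_fetches: Optional[int] = None,
) -> Optional[str]:
    """Return the first resource in reading order that contains a TOC, if any."""
    for count, url in enumerate(reading_order):
        if max_fetches is not None and count >= max_fetches:
            break  # stop scanning rather than fetch every resource
        if has_toc(fetch(url)):
            return url
    return None


# Usage with in-memory documents and a crude substring check as a stand-in detector.
docs = {
    "cover.html": "<h1>Cover</h1>",
    "toc.html": '<nav role="doc-toc"></nav>',
    "c1.html": "<p>Call me Ishmael.</p>",
}
order = ["cover.html", "toc.html", "c1.html"]
found = find_toc_resource(order, docs.__getitem__, lambda html: 'role="doc-toc"' in html)
```

With `max_fetches=1` the same call would give up after the cover page, which is equivalent to the "first resource only" rule.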

mattgarrish · 2017-08-28T16:34:58Z

Closing this issue as my concerns in opening this issue were satisfactorily addressed in PR #51.

mattgarrish added the topic:manifest label Aug 24, 2017

iherman mentioned this issue Aug 24, 2017

Retrieving a TOC from HTML files #47

Closed

mattgarrish mentioned this issue Aug 24, 2017

rewording of title and language #49

Closed

mattgarrish closed this as completed Aug 28, 2017
