cleanup of infoset requirements and fallbacks for title, language, toc and default reading order #51

Merged
merged 4 commits into from
Aug 27, 2017

Conversation

mattgarrish
Member

@mattgarrish mattgarrish commented Aug 26, 2017

Looks like the changes got merged and the old PR got re-issued in my last attempt. I've reverted and this one's showing the right changes. Apologies for the extra email.


This PR is a consolidation of the changes in PRs #46, #47 and #49.

Also includes a few additional reversions/changes to the terminology that arose, stemming from issue #16:

  • reverting the definition of primary resource to a resource in the default reading order
  • changing secondary resource from required for a primary resource to required for the publication
  • moving default reading order definition inline into its section, otherwise it creates duplication/redundancy

Please give this a good look over to make sure I didn't mistranslate the discussions.



@iherman
Member

iherman commented Aug 26, 2017

@mattgarrish bravo:-)

I have some purely editorial comments. I list them here, but they are independent of my "review" comment...

  • "1.3 Terminology": this section should be explicitly set to normative (the enclosing section is set to be non-normative...)
  • First paragraph in 1.3, sentence "In particular, for the following terms: user, user agent, browser, and address.": somehow the English does not sound right. Maybe simply make it part of the previous sentence?
  • "3.1 Overview", second para, second sentence: "It is primarily compiled from a Web Publication's manifest, whose serialization requirements are defined in Manifest." It reads funny with the word 'manifest' repeated. Maybe the second occurrence should be spelled out as "in a separate section", or something similar (or the relevant section heading should be changed to allow for respec to do its work)
  • At some point it is probably better to reorder the subsections in section 3 to follow the list of required items of 3.2
  • "3.4 Language", end of first para: Maybe it is worth emphasizing that the language tag is also used as the default language tag for other information items or metadata where appropriate, like title, DC descriptions, etc. (Unless overwritten like, for example, if the title is extracted from a resource with its own language setting)
  • "3.5 Canonical identifier", second paragraph, the canonical can also be used, as far as I understand, for an HTTP response header; worth mentioning. Also, the link element is an HTML element; just to be precise we may want to add that.

Member

@iherman iherman left a comment


I am a little bit unsure of the usage of "und" for language. I am tempted to say that if no language tag is defined and the user agent does not use its own algorithm or extraction method, then the value is set to "und", ie, undefined. This seems to be the 'spirit' of the "und" in BCP47...

@mattgarrish
Member Author

Thanks, Ivan, I'll see what I can do about your notes.

Note that 1.3 is normative now after the last PR (informative is now applied to subsections instead of the introduction itself). I took a look at HTML and some other specs and that's the approach they've used for these preliminary mixes of informative and normative.

I keep thinking about reshuffling, but we keep tweaking the lists so I've held off. As that's purely editorial, I'll do after we get this PR out of the way.

Also, would it also make sense to clarify under resources that the primary resources are compiled from the default reading order, since that already is the list of primary resources? Or too early?

@mattgarrish
Member Author

Maybe it is worth emphasizing that the language tag is also used as the default language tag for other information items or metadata where appropriate

This is what confused me in the original issue, as EPUB differentiates the xml:lang of the package document from the dc:language declaration(s) of the content, for example.

A resource can only have one active language declaration for its text content, so what do you do in the case of a multilingual publication if this declaration is both the content and the infoset info?

@mattgarrish
Member Author

And one last comment...

Why is the html link element brought into the canonical identifier section at all? Isn't that serialization-specific? I've changed to the following, but as I missed the calls where this was discussed feel free to point out if I'm missing the point:

If the canonical identifier is a URL, it may be used as the href value of a "canonical" link [rfc6596] for the Web Publication in its manifest or in the HTTP response header.
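
As an illustration of the HTTP response header case in the wording above, here is a minimal sketch. The function name `canonicalLinkHeader` is hypothetical, and limiting the result to http(s) URLs is an assumption made for the example; the draft does not define any of this.

```typescript
// Hypothetical helper, not defined by the specification under review.
// Returns the value of a Link header carrying rel="canonical" (RFC 6596),
// or null when the identifier is not an http(s) URL.
function canonicalLinkHeader(identifier: string): string | null {
  let url: URL;
  try {
    url = new URL(identifier);
  } catch {
    return null; // not an absolute URL at all
  }
  // Assumption: only http(s) URLs are usable as a canonical link target.
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return null; // e.g. urn:isbn:...
  }
  return `<${url.href}>; rel="canonical"`;
}

console.log(canonicalLinkHeader("https://example.org/pub/1"));
// prints: <https://example.org/pub/1>; rel="canonical"
```

An identifier such as `urn:isbn:...` yields `null` here, so no canonical link is emitted for it.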

@iherman
Member

iherman commented Aug 26, 2017

@mattgarrish : on the normative vs. non-normative: if the overall section is normative by default, that is fine.

I would not touch the primary resource issue in this PR. Let us see if it is accepted and come back to it.

@iherman
Member

iherman commented Aug 26, 2017

@mattgarrish : the link element is an HTML element; in other syntaxes it may not exist or, worse, can have a different meaning. Hence it is better, imho, to make it precise...

@iherman
Member

iherman commented Aug 26, 2017

@mattgarrish : my feeling is that the language tag is to be meant only for the manifest and the metadata. Ie, it does not have any effect on the individual resources. If we do it otherwise then, although in rare cases, the behaviour of the browser with the resource may be different than when the same resource is 'accessed' via the WP/manifest.

The metadata is affected by the language tag only if it is in the manifest or in a file referred from the manifest. DC entries used as metadata in the individual resources should not; they are treated as if they were used by the User Agent, independently of the WP.

At least this is my current feeling...

@mattgarrish
Member Author

the link element is an HTML element

Yes, but wasn't it proposed that the canonical/self link be in a JSON expression, too?

We say that there must be a canonical identifier, and that this must be part of the infoset, but we don't say how it's expressed or retrieved.

If it can be used as an html link, where is it coming from and where does this html link belong?

Is it the case that it must be expressed as a canonical link element/property/header so that it can be determined?

@mattgarrish
Member Author

my feeling is that the language tag is to be meant only for the manifest and the metadata. Ie, it does not have any effect on the individual resources.

Yes, I agree to an extent. The extent being that it has no bearing on any resources, including the manifest.

That's the impression I got when I asked here: #29 (comment)

This language is not used for processing/rendering, only to describe the publication. It's like the content-language header, where you're just specifying the intended audience.

The language of the manifest file (or any primary resource) will often be the language of the publication, but falls apart with multilingual publications. I can't say I have a bilingual document by putting lang="en fr" in an HTML document, for example.

@iherman
Member

iherman commented Aug 27, 2017

the link element is an HTML element

Yes, but wasn't it proposed that the canonical/self link be in a JSON expression, too?

I do not recall about that.

We say that there must be a canonical identifier, and that this must be part of the infoset, but we don't say how it's expressed or retrieved.

If it can be used as an html link, where is it coming from and where does this html link belong?

Is it the case that it must be expressed as a canonical link element/property/header so that it can be determined?

I believe the whole remark about being able to use the identifier in such a link is a note, rather than part of the core text, actually. We do not talk about how the identifier is expressed or retrieved, just as we do not say anything about the way the TOC is expressed (in the manifest).

@iherman
Member

iherman commented Aug 27, 2017

@mattgarrish,

my feeling is that the language tag is to be meant only for the manifest and the metadata. Ie, it does not have any effect on the individual resources.

Yes, I agree to an extent. The extent being that it has no bearing on any resources, including the manifest.

Hm. You made me realize that we have three different roles that we MUST somehow cover.

  1. General information on the language(s) of the publication used, e.g., to install or access dictionaries and
  2. Language of the manifest/metadata, ie, the language of textual information like the title (in the manifest) or Dublin Core or schema.org items in the attached metadata
  3. Language(s) of the individual resources

I guess we agreed that the language information item has no bearing on No. 3; handling that falls back on how the resources do that per the HTML/SVG/etc. specifications. Maybe we should have, actually, two different language information items for Nos. 1 and 2, with the following extra rules:

  • if no publication language is explicitly stated, it has the single "und" value
  • if the manifest/metadata language is not stated separately, and there is a single publication language, that one is used; otherwise it has the "und" value

In the absence of a publication language, User Agents MAY reuse the language information of the first primary resource.
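
To make the proposed fallback rules concrete, here is a rough sketch. All names (`LanguageInputs`, `resolveLanguages`, `reuseFirstResource`) are hypothetical, and letting a language reused from the first primary resource also feed the manifest/metadata default is one possible reading of the rules, not something settled in this thread.

```typescript
// Hypothetical shapes; none of these names come from the draft.
interface LanguageInputs {
  publicationLanguages?: string[];       // explicitly stated publication language tags
  manifestLanguage?: string;             // explicitly stated manifest/metadata language
  firstPrimaryResourceLanguage?: string; // language found in the first primary resource
}

interface ResolvedLanguages {
  publicationLanguages: string[];
  manifestLanguage: string;
}

const UNDETERMINED = "und"; // BCP 47 tag for an undetermined language

function resolveLanguages(
  input: LanguageInputs,
  reuseFirstResource = false, // models the MAY: opt-in user agent behaviour
): ResolvedLanguages {
  let publication = (input.publicationLanguages ?? []).filter(
    (tag) => tag.trim() !== "",
  );

  // Rule 1: no explicit publication language means the single value "und",
  // unless the user agent chooses to reuse the first primary resource's language.
  if (publication.length === 0) {
    const fallback = input.firstPrimaryResourceLanguage;
    publication = reuseFirstResource && fallback ? [fallback] : [UNDETERMINED];
  }

  // Rule 2: the manifest/metadata language defaults to the publication
  // language only when there is exactly one; otherwise it is "und".
  const manifest =
    input.manifestLanguage ??
    (publication.length === 1 ? publication[0] : UNDETERMINED);

  return { publicationLanguages: publication, manifestLanguage: manifest };
}
```

Under this reading, a bilingual publication declaring `["en", "fr"]` without a separate manifest language resolves its manifest/metadata language to "und", while a publication declaring only `["fr"]` resolves it to "fr".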

Administratively, maybe we should move this thread to issue #29, though, and merge/close the PR (unless there are other objections)

@mattgarrish
Member Author

I believe the whole remark about being able to use the identifier in such a link is a note

Oh, okay, that makes a little more sense. I was trying to figure out where this normatively comes into play. I'll re-adjust.

@mattgarrish
Member Author

Administratively, maybe we should move this thread to issue #29, though, and merge/close the PR

Sounds like a plan.

I'll merge later tonight if nothing else comes up. We're not striving to be complete at this stage so there's still plenty of time for debate on everything we've done in this PR.

@mattgarrish mattgarrish merged commit e4bea6c into w3c:master Aug 27, 2017
@llemeurfr
Contributor

Thanks for the work, @mattgarrish.

Some remaining typos, to be treated in the next PR.

On 2.1, sentence about the manifest, the last word is now missing. Was "Manifest" before. Ivan proposed ... in "a separate section".

3.3 Title, the title is now optional, but the note about issue 20 states it is required, which is a contradiction.

3.4 language: I'm surprised to find mention of "BCP47 or its successors". BCP47 is a version independent identifier, the current RFC being 5646. And we also find the contradiction between the optional language in the infoset and the required language in the note about issue 29.

@mattgarrish
Member Author

Thanks, @llemeurfr

On 2.1, sentence about the manifest, the last word is now missing. Was "Manifest" before. Ivan proposed ... in "a separate section".

Yes, I discovered that this morning while checking for more bad links. Respec should insert the section number/name.

3.3 Title, the title is now optional, but the note about issue 20 states it is required, which is a contradiction.

Yes, good catch.

3.4 language: I'm surprised to find mention of "BCP47 or its successors". BCP47 is a version independent identifier, the current RFC being 5646. And we also find the contradiction between the optional language in the infoset and the required language in the note about issue 29.

Yes, "or its successors" is definitely unnecessary.

I'll have these updated shortly.

@HadrienGardeur
Member

Yes, but wasn't it proposed that the canonical/self link be in a JSON expression, too?

That's what we have in Readium: https://github.com/readium/webpub-manifest

Even if it's just a note, I don't think that we should recommend using "canonical" for what we call the canonical identifier. This is used quite differently on the Web, and the "identifier" proposal would be a better fit.

@HadrienGardeur
Member

Going back to the reading order:

  • let's imagine a manifest with only a TOC specified in its list of primary resources
  • since it has something in the manifest, it's not entirely clear what the UA should do with the current spec language

@mattgarrish
Member Author

Even if it's just a note, I don't think that we should recommend using "canonical" for what we call the canonical identifier.

Yes, I'm still kind of confused by this. It's not completely wrong, so long as the "canonical" identifier is the URL of the manifest or the resource it is included in, but the note isn't saying that clearly.

@llemeurfr
Contributor

the "canonical" identifier is the URL of the manifest or the resource it is included in

Therefore, what is the difference between this canonical id and the address of the Web Publication?

@iherman
Member

iherman commented Aug 28, 2017

Even if it's just a note, I don't think that we should recommend using "canonical" for what we call the canonical identifier.

Yes, I'm still kind of confused by this. It's not completely wrong, so long as the "canonical" identifier is the URL of the manifest or the resource it is included in, but the note isn't saying that clearly.

I am not sure about this at all; the only problem I see with the Note is that it restricts the usage of the "canonical" link to the case when the identifier is a URL. The definition of an identifier says that if it is not a URL per se, it must be possible to make a one-to-one mapping to an address, and, I would think, that address should also be acceptable to be used in a link element (or LINK header).

We do not say whether or not the address would map onto the manifest; as far as I can see this is still an open issue, related to #5 (except that the comments in #5 went all over the place).

@HadrienGardeur
Member

I thought that the canonical identifier was meant for other identifiers, such as DOIs or ISBNs for example.

@iherman
Member

iherman commented Aug 28, 2017

@llemeurfr

the "canonical" identifier is the URL of the manifest or the resource it is included in

Therefore, what is the difference between this canonical id and the address of the Web Publication?

@HadrienGardeur:

I thought that the canonical identifier was meant for other identifiers, such as DOIs or ISBNs for example.

Exactly. The identifier is indeed the DOI and friends, and the URL representation thereof (when necessary) is some sort of canonical "address". That is what should go into the link element, imho, and that is different from the address, which might change.

The definition of the identifier also says that the ID must provide a way to get to the manifest. This is not the same as saying it is the address of the manifest.

@HadrienGardeur
Member

@iherman

This is true for DOIs, not so much for ISBNs which are expressed as URNs. You definitely don't want to use link@rel="canonical" with a URN.

@mattgarrish
Member Author

This is true for DOIs, not so much for ISBNs which are expressed as URNs.

This is where I'm lost. A canonical link provides the preferred address for a resource. It can overlap with a canonical identifier, but does it always?

@iherman
Member

iherman commented Aug 28, 2017 via email

@BigBlueHat
Member

Question. What's the time/space/tech continuum for "canonical identifier" (re: this PR)?

It's currently stated as:

If assigned, this canonical identifier MUST be unique to the Web Publication.

Given the following publication, what would its "canonical identifier" be?
https://www.w3.org/TR/html/

@iherman
Member

iherman commented Aug 28, 2017

@BigBlueHat

Question. What's the time/space/tech continuum for "canonical identifier" (re: this PR)?

It's currently stated as:

If assigned, this canonical identifier MUST be unique to the Web Publication.

Given the following publication, what would its "canonical identifier" be? https://www.w3.org/TR/html/

The W3C considers https://www.w3.org/TR/html/ as THE identifier for the HTML standard, and this approach seems to be fine with its constituent community. However, we have to recognize that other communities may not agree, because the W3C short name refers to the latest HTML standard; this is currently HTML5.2, but it may refer, one day, to HTML6 (if ever there is such a thing). Policies on other identifiers may decide that such a major new version should receive a different identifier instead of sharing the same one.

What this tells me, in my view, is that how identifiers are used by various communities is not to be defined by this Working Group. It goes way beyond our scope. The information set should provide the right slots to store and use identifiers based on the specification we give, but that is where we should stop, and let other organizations and/or communities establish their own rules.

@mattgarrish
Member Author

As a formality, can I ask that we stop using this closed PR to discuss issues? It's confusing to follow at this point.

Please open new issues for any clarifications/changes you think are necessary.
