Role of the HTML document returned by the WPUB URL #103
Hadrien, could you describe how it would work if the HTML document returned by the WP URL is not part of the WP, and doesn't contain a link to the WP's manifest? |
Is there any reason why it wouldn't contain a link to the manifest? E.g. for RSS, sub-pages (like 'about') are generally not a part of what appears in the feed but still contain the link. Was there a specific reason cited in that other thread of why that can't be the case? I have to confess that issue #94 totally lost me about halfway through so I may well have missed this point if it was made later on. |
@dauwhe that's a little more specific than the question that I just raised in this issue. First of all, I think that the document will always return information about the publication, so out of the three things that I've listed, this one is always true. I'll use concrete examples:

Info about the Web Publication, no link to the manifest, not part of the WP

This could apply to a book behind a paywall, for example Hachette Book Group could start selling Web Publications directly from their website. If the user is not logged in and/or hasn't bought that specific publication, he/she gets redirected to a product page with info about the book (use case 1) and a way to buy it (use case 3).

Info about the Web Publication, with a link to the manifest, not part of the WP

This could apply to a publisher distributing strictly Open Access content. The catalog for that publisher has a page per publication to provide a broad presentation (metadata, cover, summary, tags) which would fit use case 1. That page would also include a link to start reading that WP (which would point to a resource that's part of the WP and in reading order) along with a link to the manifest (use case 2) that could trigger specific behaviours in a UA (similar to Chrome Install banner for PWA, we could imagine a similar banner that shows the cover, metadata and interactions such as "read" or "add to your shelf"). |
@baldurbjarnason I've made that point several times already but a few people (including @mattgarrish) find that behaviour "weird". |
My question was what are the implications of a resource initiating the web publication when that resource is not itself a part of the web publication, and done via the linking mechanism defined in the spec? If a page with a link is not part of the publication, but some sort of launcher out to a publication, how does the user agent determine which is which? As I said in #94, it seems like a security hole when there is no specific scope to a web publication and no sure way to verify what is or is not a publication resource, since the resource list is not required to be complete. |
@mattgarrish unlike a Web Application, a Web Publication must list all of the resources that are part of it. If a resource is not listed, it's by definition outside the scope of the Web Publication. This effectively replaces the |
No, there must be a list that must include the resources in the default reading order, but otherwise it doesn't have to be exhaustive. That's as far as we got agreement. That's why I'm not a fan of that section and concerned about how those implications spill outward. We've said the bounds are important, but we aren't strong in defining them. I don't have a real position in favour of MUST/SHOULD/MAY here, only that everything fit together. |
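To make the disagreement above concrete, here is a minimal Python sketch of a "is this URL listed?" check. It assumes a manifest loaded as a dict with a `spine` list for the default reading order (as in the Readium-style examples later in this thread) and a hypothetical `resources` list for anything else the publisher chose to list; neither key name is settled WP syntax.

```python
from urllib.parse import urldefrag


def listed_resources(manifest: dict) -> set:
    """Collect every URL the manifest lists, ignoring fragments."""
    urls = set()
    # "spine" = default reading order; "resources" is a hypothetical key
    # for other listed resources. Both names are assumptions.
    for key in ("spine", "resources"):
        for item in manifest.get(key, []):
            urls.add(urldefrag(item["href"]).url)
    return urls


def is_listed(url: str, manifest: dict) -> bool:
    """True if the URL (minus any fragment) appears in the manifest lists."""
    return urldefrag(url).url in listed_resources(manifest)
```

If the list must be exhaustive, a `False` result means the URL is outside the publication. If only the reading order is required, `False` is inconclusive, which is exactly the boundedness concern raised here.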
Also, we're leaning heavily into the realm of implementation. Each UA will effectively implement WP differently; simply displaying a link in a banner or an icon in the address bar, for example, is IMO not such a big deal. If this is not a security issue for RSS/Atom, I don't see why it would be a security issue for WP. |
Yeah and I don't like that part at all. We can't say that bounds are important and be as weak as we are in defining them. |
For me, it's more what happens after you click that button or icon in this scenario. Typically, I'd expect that page to be initiated (paginated, table of contents link appears, etc.). But it potentially puts any page into the publication with our weak boundedness. Assuming we tighten up our resource list requirements to solve that, do we also need to consider start_url for these cases? I realize some user agents will just make a publication out of the manifest and drop the user at the first page (or whatever), but some additional direction seems useful if the address is not a publication resource. With those bits in place, I'd be fine dropping to either should/may. |
Here's a good example that illustrates a few use cases that we'll have to deal with: http://books.openedition.org/ They have three different models:
They currently provide multiple ways to access content:
With Simple Open Access, you can do all four for free. Let's imagine that they decide to replace their special viewer with a Web Publication instead:
|
This is similar to the 'le Monde' example. The identifier of the Hachette WPUB is behind the paywall, i.e., the surrounding system will force you, as a reader, to go through some HTTP redirections or whatever, before the HTTP response is indeed the content of the entry page you refer to. In other words, the page at that URL can, without any problems, both contain the link to the manifest and be part of the publication. Hachette may of course decide to provide a separate page where customers would go to buy/access the book, but the URL of that page is, for me, not the address/identifier of that book.
I believe we are mixing the address/identifier of the book, and any kind of other page on the vendor's/publisher's site that provides other services and information. In some cases the entry page can play both roles, and in some cases the two roles are played by two different pages. |
@iherman OK so let me turn that into a bullet list to make sure that we're on the same page:
Is that an accurate summary? This doesn't cover whether the document returned by the WP URL is within the boundaries of the WP BTW, that's a different issue. |
I believe we are conflating the WPUB itself, and the various accesses to the WPUB. As far as I am concerned, these are different, and it is up to the vendor to ensure access to the same WPUB. How it happens (via the open access or not) is not the subject of this specification. |
Agree with "The WPUB URL must serve an HTML document that contains info about the publication + a link to the manifest" Not quite sure where you're ( @HadrienGardeur ) going with "... but it's perfectly fine to redirect the user instead to a separate URL that does not contain a link to the manifest or any content from the WP itself". I do think any purchase/paywall stuff should be on the "outside/before" of the WP, and not within our scope of definition. Re "doesn't cover whether the document returned by the WP URL is within the boundaries of the WP BTW" -- I used to be in the "doesn't matter" camp, but if the majority is at "must be", I'm okay with that too. I think we're somewhat going around in circles... this issue is largely a re-open of #94 -- perhaps it's better to resolve on the Monday call... I predict it will be the major agenda item. :-) |
I'm simply rephrasing what @iherman is suggesting. Also, if the WPUB URL itself is behind a paywall, this is definitely relevant here. It means that just knowing the URL won't be enough to discover/read the publication. |
I think any commerce is outside our scope and must happen before one gets to the WPUB URL. We should not be dealing with merchandising, commerce, or shelving at the WP level. |
From an HTTP perspective, this is purely status codes not merchandising/commerce/shelving/whatever. This essentially means that we can't expect a 200 HTTP status code and an HTML document with all the things we've discussed so far. |
If you don't get a 200 from the WPUB URL, it's not one you have access to. I don't quite know if we agree or not. |
Sure, just want to make it very clear that this is not a magical URL that will always do what's been suggested so far. Many times it'll result in being redirected elsewhere or just getting an error (4xx/5xx without any redirect). But the main point that I wanted to raise in this issue is the role of this document, and for now it's very very vague at best. While it's clear that the URL itself identifies the WP, it seems that the only consensus we might have so far about the document that is returned is discovery of the manifest (and I insist on the term discovery, the presence of a link to a manifest is not meant to say that the current document belongs to the WP). |
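As a rough illustration of the status-code point, the following Python sketch classifies what a client gets back from a WPUB URL. The outcome names and the function are hypothetical, not spec terms; the only behaviour taken from the discussion is that a 200 with an HTML document is the hoped-for case, while redirects and 4xx/5xx errors are also to be expected.

```python
from enum import Enum


class Outcome(Enum):
    ENTRY_DOCUMENT = "entry document"  # 200 + HTML: look for a manifest link
    REDIRECTED = "redirected"          # e.g. sent to a login or product page
    NO_ACCESS = "no access"            # 4xx/5xx without any redirect
    UNEXPECTED = "unexpected"          # anything else (e.g. 200 + non-HTML)


def classify_response(status: int, content_type: str = "") -> Outcome:
    """Classify an HTTP response by status code and Content-Type header."""
    if 300 <= status < 400:
        return Outcome.REDIRECTED
    if 400 <= status < 600:
        return Outcome.NO_ACCESS
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    if status == 200 and media_type == "text/html":
        return Outcome.ENTRY_DOCUMENT
    return Outcome.UNEXPECTED
```

Only the `ENTRY_DOCUMENT` case gives a user agent a document in which to discover the manifest; in the other cases knowing the URL alone is not enough to read the publication.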
Until a URL results in an entry page that identifies that URL as the Consequently, the scenarios you provided earlier are not about "a Web Publication URL":
Neither of those result in an "entry page," so neither of those URLs are (at that point) the What's needed next is more clarity around how the "entry page" declares itself to be a Web Publication--it can't simply be a link to a manifest document as we've also defined that as a means for discovering Web Publications. This entry page needs to declare to the UA that it authoritatively provides the infoset (through yet to be determined means) for a specific Web Publication. |
We don't even know what an "entry page" is, we haven't even agreed on the terminology, we're simply using this term because "landing page" was confusing for everyone involved in these discussions.
What's needed is actually quite different:
I'm not convinced at all that we need any additional mechanism for a WP to "declare itself". Why would we require something else?
That's not something we ever agreed on. I know that you have your own agenda (one HTML document to rule them all, and in the |
If the manifest is in JSON, and discovered from anywhere on the Web what thing is responsible for the authority within that manifest is used? Is the expectation that "something else" will read this manifest and create the reading experience? If that's the case, that falls well outside the extensible web manifesto. Where does one provide a shim/polyfill to use that manifest? In any of the resources that reference the manifest? Given a Web Publication URL, we have to return something extensible that can "boot up" the publication and work as an authority of the "bound" resources. The spec does currently say "Linking to a Manifest" and then present That "extensible response format" may indeed use a JSON document (in the end) to provide some or all of the infoset, but the browser can't be extended from a JSON file, so the initial response document becomes the authority over the other bits (JSON, JS, CSS, more HTML, etc). Is that any clearer? |
Well, actually it isn't. You're basically saying that the "start URL" will provide a Web App for handling the Web Publication. That's an option but certainly not a requirement. If my Web Publication already contains sufficient navigation to read the whole publication, why would we require a polyfill as well? To polyfill what exactly? Offline access? Packaging a WP into a PWP? Accessibility features? |
Well, conversely, why would one make it a Web Publication? Which is what folks at browser vendors keep asking--"why don't ya'll just write PWAs?"
Again. We are (and will be) asked what a Web Publication provides to the Web and its existing UAs.
We'll need to prove to browser vendors that there are things we need them to add/change about browser behavior--at least if this is a spec for teaching browsers about publications (which has been my assumption). If we're simply defining a JSON format for publishers to build Web Apps around, then we're doing this all wrong. 😄 |
I think we can both agree that this group has yet to prove to the rest of the W3C what makes Web Publications unique and requires a separate format and dedicated features in a browser. But I don't think that requiring an "entry page"/"start URL" that basically behaves like a Web App is the best argument for that. Anyway, this feels like a different discussion... but my take is that in addition to what I've listed above in this issue, you're also proposing the "entry page" to serve a "WP viewer". |
I don't think anyone is suggesting that the entry page serves as a viewer. We are simply suggesting that it is the way to ENTER the publication. Methods of reading the publication are not up to the UA. We can talk about conformance or preferred behaviors later. Neither the entry page nor manifest should create a reading experience. |
Pretty sure everyone in this group (and on this issue) believes in a future thing called a "Web Publication" that is built entirely from descriptive components--be they HTML, CSS, JSON, etc. You are correct that it's my expectation that the "entry page" would serve as a "WP viewer." The hope being that in the future providing that viewer becomes optional, and that browsers "reader modes" would use the same definitions to take that experience farther, or make it more consistent and accessible, etc. The important part of the whole "entry point" discussion is that given a Web Publication URL, we need to feed the browsers something extensible. That document (and that URL) will ultimately define the authority space of the publication (via ServiceWorker Conversely (word of the day! 😉), if a JSON document is the "authority" of the Web Publication, then we must re-define how all those things (CORS, CSP, SOP, etc) act when rendering the stuff inside that JSON and inside what browsing context that happens, etc. There's a much higher hill to climb here. So. The role of this HTML page becomes the defining authority space for whatever comes after. |
There's no requirement for that, no, but it'd be my expectation that this HTML provides the opportunity to "provide a reading experience"--and that publisher would/do that already (even if it's just next/prev/contents links).
...the way to enter the publication when one only has the publication URL (and not coming in from some sub-portion of the publication).
They will be in the future, though, correct? That may need to be its own issue. 😄
The entry page MAY do that. The manifest can't...at least not on its own. The idea is that for now (who knows how long...) the "entry page" will provide some sort of "reading experience" (however subjective...). The hope of any specification we write here is to set those expectations (by defining requirements). As we do that, we can shim those expectations from an "entry page," but we'd have to provide a separate thing (i.e. a "reading system") to interpret a JSON-only definition of a Web Publication. Which is likely where @HadrienGardeur and I find ourselves on opposite ends of the same rope. 😃 He has a reading system, and I have publications. 😄 The core to all this though is what are we "teaching" browsers to do on behalf of the good people of the Web. If we require an "entry point" response as HTML (and presumably as part of the publication), then we have a way to show them that. |
We can't use Using |
updates per comments in issue #103
A There are additional conditions too for the browser to display an install banner (we'll need to use the same syntax as the WAM, use SSL and also include a Service Worker). |
Based on today's call, we seem to have consensus that what is required of an 'Entry Page' should be relatively minimal - i.e., a link to the Way Bill. My question goes a bit the other direction. Is there any impediment to allowing (optionally, of course) an 'Entry Page' to also be a Start Page for the WP, or possibly even an entire WP (excepting of course the Way Bill, which we have decided must be a separately addressable JSON file)? In other words, is our minimal requirement that all WPs consist of at least 2 resources: a Web Manifest (JSON) and an Entry Page (HTML)? |
As proposed by Ivan in #103 (comment) , the proposal for the fpwd is that the entry page:
This issue will remain open past fpwd to capture input from the broader community. |
The minimal Entry Page is only required to have a link to the JSON waybill of the WP, because it's not feasible to prescribe for all kinds of use cases or types of publications what it should contain - TOC, cover, abstract, etc. Suppose that such an Entry Page would just contain that link and nothing else. A WP-aware user agent would access the waybill and find all the necessary information for giving the user access to the contents of the WP. But a non-WP-aware user agent that can't process the waybill properly, might just display a blank page. Shouldn't we just add a further fallback option for traditional UAs, especially browsers, that the Entry Page should offer a way to access the contents of a WP and list some possible approaches? |
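For what it's worth, the discovery step a WP-aware user agent would perform on such a minimal Entry Page can be sketched in a few lines of Python using only the standard library. The `rel` value is an assumption: the sketch defaults to `manifest`, the value the Web App Manifest uses, since the thread has not settled on one for WP, and `find_manifest_url` is a hypothetical helper name.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ManifestLinkFinder(HTMLParser):
    """Collect href values of <link> elements carrying the given rel token."""

    def __init__(self, rel: str):
        super().__init__()
        self.rel = rel
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel is a space-separated list of tokens, compared case-insensitively.
        rels = (attr.get("rel") or "").lower().split()
        if self.rel in rels and attr.get("href"):
            self.hrefs.append(attr["href"])


def find_manifest_url(html: str, base_url: str, rel: str = "manifest"):
    """Return the absolute URL of the first matching link, or None."""
    finder = ManifestLinkFinder(rel)
    finder.feed(html)
    return urljoin(base_url, finder.hrefs[0]) if finder.hrefs else None
```

A user agent that does not know about WP would simply ignore the `<link>` element, which is why a page containing nothing else could render as blank.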
I'm trying to puzzle out if we can find better consensus by breaking apart what we want to achieve. What started all of this was, I believe, this basic assertion:
This page ensures that we have at least one resource that is compatible with the WAM linking model, and for compatibility with vanilla user agents, search engines, etc. We probably don't need to call it anything special. Where things have gone awry is in also making this the required address. There's no particular need for it to be, as far as I can tell, as vanilla anything isn't going to find the address or a way to this document from the manifest, since they don't understand the manifest. The same is true of any other document you reference. All the webby ways of finding documents will lead people into the publication. Isn't all that we need an optional start url which tells wp-aware user agents what resource to load first? Even that is not required, as loading the first document in the default reading order seems like a natural enough choice. What breaks or is not possible to achieve if the web publication does not have an address in the manifest? I can't answer that, which makes me think we might well be better off without it and the confusion it causes. All I find myself wanting to clarify further is the non-Web Pub resource with a link to the manifest, but maybe all that needs is the following (only quasi-spec prose):
|
You say:
and I do not think that is correct. I believe we start by an assertion and a question:
And that is the question to which we gave an answer on the call and in the PR. In this respect, the "address" seems to play the same role as the "start_url" in the WAM, and I am fine with that. I have the impression that your proposed text would just complicate things further. Unless you want to reopen the discussion which led to the assertion above but, personally, I would not want to do so... |
The new assertion made is that there must be at least one html page somewhere on the web that any browser can render and by which the manifest can be found and (should be) a publication resource. I find this problematic, as what happens when someone creates a publication without any html documents which is perfectly valid right now? They're forced to put up one just for the sake of justifying an address in the manifest? All I'm asking is whether we can find agreement on at least one HTML document as a publication resource without bringing "address" or "entry page" into the discussion, as the rationale is clearer and easier to understand. For fpwd, we can leave the address in with the requirement for it to dereference to an html document that must have a link to the manifest, and leave it at that. It can be a pointer to the required html document in the absence of anything else. That's all I'm really proposing for now. |
(Admin) @mattgarrish, shouldn't that be a separate issue? Or do we want to change the title of the issue? We are getting somewhere else, so to say... |
Yes, it probably would be useful. I think we have two questions swirling around this discussion: 1) what is the address and is it needed, which we can leave here; and 2) do we require an html document as a resource, which is an offshoot. Having separate answers to these questions would be useful, as they aren't bound to each other. I'll open a separate issue for the html question. |
As I mentioned during our last call, I also believe that we're mixing up two different concepts:
I see that @mattgarrish seems to also agree. Since it's a little hard to provide examples given our lack of serialization in WP, I'll use the Readium syntax to illustrate.

Example 1: identifier, no start URL

{
"@context": "http://readium.org/webpub/default.jsonld",
"metadata": {
"@type": "http://schema.org/Book",
"title": "Moby-Dick",
"author": "Herman Melville",
"identifier": "urn:isbn:978031600000X",
"language": "en",
"modified": "2015-09-29T17:00:00Z"
},
"links": [
{"rel": "self", "href": "http://example.org/manifest.json", "type": "application/webpub+json"}
],
"spine": [
{"href": "http://example.org/publication/c001.html", "type": "text/html", "title": "Chapter 1"},
{"href": "http://example.org/publication/c002.html", "type": "text/html", "title": "Chapter 2"}
]
}

Example 2: identifier, start URL outside of the reading order

{
"@context": "http://readium.org/webpub/default.jsonld",
"metadata": {
"@type": "http://schema.org/Book",
"title": "Moby-Dick",
"author": "Herman Melville",
"identifier": "urn:isbn:978031600000X",
"language": "en",
"modified": "2015-09-29T17:00:00Z"
},
"links": [
{"rel": "self", "href": "http://example.org/manifest.json", "type": "application/webpub+json"},
{"rel": "start", "href": "http://example.org/publication/start", "type": "text/html"}
],
"spine": [
{"href": "http://example.org/publication/c001.html", "type": "text/html", "title": "Chapter 1"},
{"href": "http://example.org/publication/c002.html", "type": "text/html", "title": "Chapter 2"}
]
}

Example 3: identifier, start URL part of the reading order

{
"@context": "http://readium.org/webpub/default.jsonld",
"metadata": {
"@type": "http://schema.org/Book",
"title": "Moby-Dick",
"author": "Herman Melville",
"identifier": "urn:isbn:978031600000X",
"language": "en",
"modified": "2015-09-29T17:00:00Z"
},
"links": [
{"rel": "self", "href": "http://example.org/manifest.json", "type": "application/webpub+json"}
],
"spine": [
{"rel": "start", "href": "http://example.org/publication/c001.html", "type": "text/html", "title": "Chapter 1"},
{"href": "http://example.org/publication/c002.html", "type": "text/html", "title": "Chapter 2"}
]
}

Just a few notes to provide additional clarity:
I completely agree that this is where we start from.
👍 for that proposal. |
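One way a user agent could pick the resource to open first from the three manifest examples above is sketched below in Python. It is not Readium's actual behaviour; the fallback to the first item in the reading order follows the suggestion made earlier in this thread, and `start_href` is a hypothetical helper name.

```python
def start_href(manifest: dict):
    """Pick the resource to load first from a Readium-style manifest dict."""
    # Example 2: a dedicated start link outside of the reading order.
    for link in manifest.get("links", []):
        if link.get("rel") == "start":
            return link["href"]
    spine = manifest.get("spine", [])
    # Example 3: an item of the reading order is flagged as the start.
    for item in spine:
        if item.get("rel") == "start":
            return item["href"]
    # Example 1: no start URL at all, so fall back to the first item in the
    # default reading order (an assumption, not something the manifest states).
    return spine[0]["href"] if spine else None
```

Note that the `identifier` in `metadata` plays no part in this choice, which matches the point that the identifier and the start URL are separate concepts.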
@HadrienGardeur, just for my understanding. In all examples, the identifier of your publications is Actually, if I follow your reasoning, there is no standard answer to this, but I am looking at what the reasonable setups are. In examples (2) and (3) it could be the start page (i.e., the one you identify as such) and I expect your answer for alternative (1) is 'whatever the publisher decides it to be'. Which may be a technically reasonable answer. But if we go down this open-ended line, I am really worried that we are building into the spec a major source of bugs. |
@iherman in the Readium Web Publication Manifest, we treat this as purely an identifier, which will be used for example internally by a user agent. There's absolutely no expectation that this will be shared or dereferenced, so we really don't care what's returned. This is also meant to work for PWP/EPUB4, where we know that some publications won't have such a URL. In this situation, they can always use a URN for an ISBN or a UUID. |
@HadrienGardeur forget about that for a moment. Say that the identifier is a URL. What happens then? |
@iherman well it doesn't change much. In our case (Readium) we don't give any particular role to that identifier aside from identifying. It can return whatever the publisher wants, we don't care. Since this is JSON-LD, it might be also ingested by a JSON-LD aware crawler, this is where the identifier may matter a lot more. Try the examples that I provided in the JSON-LD playground: https://json-ld.org/playground/ |
@HadrienGardeur, at this moment I am not really interested in what Readium does, sorry about that. What I am asking is: how would you translate those three alternatives to a WP case? Is your answer that "It can return whatever the publisher wants, we don't care."? If so, that is where I feel that this will be a source of errors in practice. The authors of the publications are not the same as the publishers, and if both can, sort of, pass the buck to the other then we have a possible problem. |
I would say that these three examples already work fine for a WP as long as you use a URL instead of a URN. I'm not sure what you mean by a source of errors in practice. The friction between author/publisher (+ third party content producer) will always exist, I don't see how this makes things worse. |
If we require that 'entry page' (whatever the name is) to be part of the publication (whether start page or not) then the responsibility to produce one is by the publisher. A valid WP must have this. Otherwise... who knows? |
@iherman that's currently a SHOULD, not a MUST. What we're discussing here is:
|
I think this issue has become unnecessarily complicated. This is not the place to discuss identifiers, addressability, serialization, or whether we are using the As @mattgarrish asked, we are dealing with only 2 points here:
|
Sorry @TzviyaSiegman but constantly saying that we should avoid discussions and/or close issues is not exactly helpful. To address your comment:
We've already addressed both in #94, this is not what the current issue is about. By saying that a WP URL (identifier for the WP) returns an HTML document we've opened Pandora's box, from which all these questions are popping up:
BTW, we could just say that we don't care about any of those, and let the publisher decide. That option is on the table, but completely ignoring all these issues is not helpful. |
Proposal for resolution of the initial question:
I think it answers all the sub-questions raised by Hadrien and may be the best consensus we'll get. |
@llemeurfr this works for me as long as we keep this completely separate from the concept of a start URL. I think that the spec language proposed by @mattgarrish in #103 (comment) is also useful and complementary. |
Closing this issue since the changes have been integrated in the draft a while ago. |
We've had long discussions in issue #94 about what the WPUB URL resolves to and there's a consensus that it should return an HTML document.
What's not well defined though is the exact role of that document.
Several proposals have been made so far:
There's also an on-going discussion whether this document MUST belong to the Web Publication itself or remains external to the publication (I'll let @mattgarrish and @GarthConboy repost some of their relevant comments here).
Everyone agrees that the document MAY belong to the publication.
If the document belongs to the Web Publication, it MAY be: