How do we identify a web publication and its components? #5
From @GarthConboy on June 26, 2017 22:25 I think we'll need to point to the "manifest" -- we'll need to be able to download or package the entire publication and its constituent resources (given Brady's correct observation that, with scripting, scanning the markup cannot reliably determine what's really referenced). We also need to know which markup file should initially be displayed and how to progress from there. If your URL is a "directory root," one could say everything under it is inherently part of the publication (which could resolve the scanning-the-markup issue [maybe]), but one will still need to find the manifest to know where to start rendering and the reading order thereafter. |
This reminds me of a concern about progressive enhancement. Say I point my browser at One option would be to give the first document you want displayed a special name, say, |
From @mattgarrish on June 26, 2017 22:56 The case usually given for complexity is open textbooks and course packs, where content is aggregated from different locations without having to actually amass the resources under a single domain/directory. Does "everything" here only refer to html pages? How realistic is it that all the resources are going to be neatly stored together? What if my css is two levels higher up from the publication under a common folder? What if I'm pulling in css or scripts from another domain? I'm all for simplification, don't get me wrong, but I'm not optimistic about a model that requires the user agent to traverse and parse all the documents to figure out what is in scope and needed, if that's where this is leading.
Isn't this where we've considered using link/rel to establish the "belonging"? (And another case of why cross-domain publications get complicated quickly, since their parentage can only be established by starting at an author-controlled location, which then has to be maintained despite what the linked resources might indicate.) |
Do we need to design something that will support content documents ("spine items" in EPUB-speak) hosted on multiple origins?
|
So the URL of the WP would point to the “manifest” rather than a directory? This would then imply (I believe) that the manifest be discoverable from some sort of file. So what sort of file? I would argue that pointing to HTML would be better than the alternatives, given all user agents know what to do with HTML files. But that leaves open the question of whether this HTML file contains the manifest, or just points to the manifest. |
From @mattgarrish on June 27, 2017 1:40
We need to consider it, at least. Intertwined with what I mentioned above is the problem of iframes and bringing in entire chunks of content below the level of the spine. We need to be open to how the web works and not just publications as we're used to making them. The problem doesn't seem confined to content documents but affects their constituent resources, as well, so we need some solution. Taking a publication offline is less of a problem than what happens to references in a packaged web pub. So while we can ignore the problem at this level, we probably do so at our own peril later. Or maybe we add rules farther down the chain that limit what a packaged web pub can reference? (That's kind of a nasty gotcha I'd hate to discover, though.) |
From @GarthConboy on June 27, 2017 2:47 "I would argue that pointing to HTML would be better than the alternatives, given all user agents know what to do with HTML files. But that leaves open the question of whether this HTML file contains the manifest, or just points to the manifest." -- interesting. As long as the manifest was discoverable in a known location, I guess that would be okay -- I think a browser might be interested in a first HTML page, whereas a Reading System would want to start with the manifest. "Do we need to design something that will support content documents ("spine items" in EPUB-speak) hosted on multiple origins?" -- I would think "no". |
From @iherman on June 27, 2017 7:27
This is already how the Web works. We routinely use URLs to a directory, and it is up to the server setup on what this means in practice. It can return the Bottom line: I believe your first statement, whereby |
From @iherman on June 27, 2017 7:29 I think the |
From @baldurbjarnason on June 27, 2017 22:13 The scope notion would play nicely with the proposed packaging spec which IIRC relies on it quite a bit. Outlining how identification for web publications would work if it followed the expectations set by the rest of the web stack (e.g. web app manifests, atom/rss feeds, etc.):
This is the basic pattern used by feeds, web app manifests, service workers, etc: component files link to a central document with metadata, indication of scope, link to self, and an identifying URL. Even AMP uses a variation of this theme. And as I mentioned above sometimes the identifying URL and scope definitions are interrelated. E.g. atom feeds link to the URL whose updates they list (explicit id, implicit scope). This pattern gives us discovery (direct links to chapters let you discover the publication ID, its metadata, and all related assets) as well as a single source of truth for the publication ID, publication-level metadata, and publication assets (the manifest). And this guarantees that the publication id is itself a URL to a human-readable HTML resource that in turn lets you discover the manifest. Of course, this is just going from what you'd expect if you were coming at this from the web development community. I realise that they aren't the only constituency at play here. And this does not necessarily dictate anything about the format of the manifest. Although, if we're going by the principle of least surprise, most web developers would at least expect a JSON file.
On service workers
Service workers achieve this process programmatically, but the pattern is very similar overall. Although a lot of service worker behaviour by necessity violates common developer expectations.
Basically, even though service workers are awesome, they do also have a deserved reputation for being confusing (this is only scratching the surface) so anything we can do to avoid that complexity is a win. That means not letting the publication manifest claim scope over cross-domain resources and not letting it control requests in any way. (Apologies for the brain dump. I didn't have time to edit this down to a concise note 😊) |
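A minimal sketch of the discovery pattern described in this comment, where a component file links back to a central manifest. The `rel="manifest"` value is borrowed from web app manifests and is only an assumption here; the thread does not settle on a link relation for web publications, and the URLs and the `discover_manifest` helper are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ManifestLinkFinder(HTMLParser):
    """Record the href of the first <link> whose rel list contains `rel`."""

    def __init__(self, rel):
        super().__init__()
        self.rel = rel
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attributes = dict(attrs)
        # rel is a space-separated list of tokens, compared case-insensitively.
        rels = (attributes.get("rel") or "").lower().split()
        if self.rel in rels and attributes.get("href"):
            self.href = attributes["href"]


def discover_manifest(page_url, html, rel="manifest"):
    """Return the absolute manifest URL linked from a page, or None."""
    finder = ManifestLinkFinder(rel)
    finder.feed(html)
    if finder.href is None:
        return None
    # Resolve the link against the URL of the page that contained it.
    return urljoin(page_url, finder.href)


# Hypothetical chapter page that links to its publication manifest.
chapter_url = "https://www.example.com/MobyDick/chapter1.html"
chapter_html = '<html><head><link rel="manifest" href="manifest.json"></head></html>'
print(discover_manifest(chapter_url, chapter_html))
```

Starting from any chapter, a user agent could follow this link to reach the manifest, which is the "single source of truth" role the comment gives it.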
From @HadrienGardeur on July 2, 2017 20:9 What you're describing is almost exactly what we do in Readium-2 @baldurbjarnason, there are only minor differences or observations that I need to add.
Ideally yes, but what if a resource is included in multiple Web Publications ? What if you can't change the HTML or HTTP headers for that resource ? IMO, such a link to a publication is an important part of how discovery is handled, but it's not an absolute requirement.
In Readium-2 we list all resources under two separate collections: This has some clear benefits over a simple
That's one of our only requirements. In Readium-2 we always provide a link that points back to the manifest. The other two requirements are:
That's pretty much the only difference between what you're describing and Readium-2/Readium Web Publication Manifest. The "root URL" (a link with One reason for that is tied to the fact that we'd like anyone to create a Web Publication by remixing content already available on the Web.
On Service Workers
I really don't think that Service Workers should in any way influence our design for Web Publications. There are many different ways that content can be cached, and Service Workers are only one method among others. Let's keep our options open and let people use all the possibilities offered. |
From @llemeurfr on July 3, 2017 12:58 So, to come back to the initial question, Readium-2 folks propose:
|
Correction of my previous comment, after discussion in #6: |
I think that it should be possible to reference any resource in a web publication (which is referenced by a URI) by URIs. To treat such resources as first-class citizens of the web, use of fragment identifiers should not be required. Moreover, fragment identifiers defined for resources' media types should be usable. An absolute-path reference (a relative reference beginning "/") in such resources should reference another resource in the web publication. Furthermore, the absolute URI constructed from the base URI (i.e., the absolute URI of the web package) and the absolute-path reference should reference the same resource. When a resource belongs to multiple web publications, depending on which web publication is used as the base URI, relative references in the resource should be resolved differently. EPUBCFI does not satisfy these desiderata. |
I still wonder about issues like this when multiple domains are involved. Not so much at the WP-level, but at the packaging level. How are such references resolved and domains preserved? Do we at some point need to decide whether multiple-domain publications are out of scope for a first release? |
Perhaps there are two questions. Having multiple domains for content documents seems problematic, and perhaps not worth the complexity. But what about things like fonts and scripts that might come from other origins? But then our white paper suggests that the publisher has an obligation to provide an origin:
|
Yes, it's a troubling question. I recall at one point we were discussing URL mapping in the DPIG, but even that's not quite enough, as there have to be rules about domain root independence. If the manifest were to set a scope and all resources had to be below it, then the problem seemingly goes away, but it also greatly reduces what can be called a publication. Maybe that's not a bad thing, but it invalidates many of the possible applications. |
I think that "/" should always reference the web publication, no matter which resource "/" appears in. I updated my comment above for covering multiple web publications. |
But if we're not redefining how the web works today, how can that work for a publication with documents on different domains? It can't even work unless the publication root is the domain root, otherwise we're redefining how to resolve a path that starts with a slash, no? |
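To make the concern concrete, here is a small sketch of how standard URL resolution treats the two kinds of relative reference. The document URL is a hypothetical example of a publication whose root is a directory rather than the domain root.

```python
from urllib.parse import urljoin

# Hypothetical content document inside a publication rooted at /MobyDick/.
doc_url = "https://www.example.com/MobyDick/one.html"

# A relative-path reference resolves against the document's own directory.
relative_path_ref = urljoin(doc_url, "two.html")

# An absolute-path reference (leading "/") resolves against the domain root,
# not against the publication root.
absolute_path_ref = urljoin(doc_url, "/two.html")

print(relative_path_ref)
print(absolute_path_ref)
```

Under today's resolution rules the second reference leaves the `/MobyDick/` directory, so making "/" mean "the publication root" would require either hosting the publication at the domain root or changing how such paths resolve.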
I would like to begin with desiderata. If we reach consensus on desiderata, we can invent a solution. But it is true that the domain root should reference a web package. In other words, the domain root of a resource-in-WP URI should contain the WP URI. |
I strongly object to limiting Web Publications to a single domain, this goes completely against the model of the Web. On a modern website these days it's not uncommon to have:
If we require content to be served from a single domain, we're no better than AMP and require a parallel Web to be built specifically for the constraints that we decide. IMO this is perfectly unacceptable: Web Publications should work with content that exists on the Web today, on as many different domains as the content requires. For Web Publication and its manifest, what's the issue with using absolute URIs? We can perfectly do whatever needs to be done (preload, cache, prerender...) with absolute URIs. If you're talking specifically about the use case of transforming a WP into a PWP, that's a very different problem and the difficulty will be tied to the packaging and manifest formats that we select. The Web Packaging proposal for instance can perfectly support resources across multiple domains: https://github.com/WICG/webpackage#multiple-origins-a-web-page-with-a-resources-from-the-other-origin |
Who suggested "limiting Web Publications to a single domain"? I didn't. It is true that the path component of a resource-in-WP URI should be able to specify a different domain. |
Right, and I was speculating about problems we'd face if '/' refers to the root of the publication even though the domain root is not the publication root. The content won't work when it's on the web. I'm not in favour of limiting publications to a domain, but it seems like the only way that could work. |
I'll quote this thread twice: From @mattgarrish
From @dauwhe
To be fair, @mattgarrish also said:
While @dauwhe also pointed out:
IMO, the potential design for PWP shouldn't affect WP in such a dramatic way. Aside from resources (CSS, JS, fonts, images, audio and video) which are often served from a different origin, being able to create a publication across multiple domains would also open up the possibility to remix content from the Web which I personally find compelling. Even if a single publisher controls a publication, it might want to reuse content across different domains or sub-domains. Let's take an example, publisher A has:
Publisher A decides to remix content about a specific place (let's say Rome) and create a new publication together. It would make a whole lot of sense for this publication to simply point to content documents and resources on food.publisherA.com and travel.publisherA.com instead of being forced to re-publish them somehow. |
I don't think that we need a publication root or the equivalent of the We can either reference resources in the manifest using:
A scope works when you want to be vague about the constituent resources that are part of an app. If we take a different approach, one that's more declarative (spine, resources) than scripted (Service Worker), having a scope is completely redundant. |
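A rough sketch of the declarative approach, assuming a manifest with `spine` and `resources` collections whose entries carry an absolute `href`. The structure, key names, and URLs are illustrative assumptions, not a format the group has agreed on.

```python
from urllib.parse import urlsplit

# Hypothetical manifest: every constituent resource is listed explicitly
# with an absolute URL, so no scope is needed to decide membership.
manifest = {
    "spine": [
        {"href": "https://food.publisherA.com/rome/trattorias.html"},
        {"href": "https://travel.publisherA.com/rome/walks.html"},
    ],
    "resources": [
        {"href": "https://cdn.example.org/fonts/serif.woff"},
    ],
}


def publication_resources(manifest):
    """Return every URL the manifest declares, spine entries first."""
    return [
        item["href"]
        for key in ("spine", "resources")
        for item in manifest.get(key, [])
    ]


def origins(urls):
    """Return the distinct scheme://host origins used by a list of URLs."""
    return sorted({f"{u.scheme}://{u.netloc}" for u in map(urlsplit, urls)})


declared = publication_resources(manifest)
print(len(declared))
print(origins(declared))
```

Membership in the publication is then a simple lookup in the declared list, regardless of how many origins the content spans.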
I think that we need a new URI scheme whose authority component can contain an absolute WP URI and whose path component can contain an absolute resource URI or a relative URI of a resource in the WP. |
Right, I don't disagree. My point above was only that there's a lot of simplicity in not having to deal with the issues of multiple domains. I can see arguments for it. I'm arguing for a decision, not necessarily advocating a position. Are there requirements we can start taking for granted as we wade deeper into the issues, so we know how to judge proposals? Can we drop the idea of a scope and move ahead with an assumption of a declarative file set? Are there objections? It doesn't mean we don't have to revisit our thinking later, but we can't stay open to all options. |
It's also entirely tied to how we package a publication. If we use the Web Packaging draft, URI stability is available by default since the package is designed with the concept of URIs from scratch. There's no need to do anything specific in the manifest. |
First, why is this important? Smooth transition between a PWP and a WP. Second, what do I mean by "URI-stableness"? My second desideratum is
Suppose that a relative reference in a resource A in a WP is resolved to The same applies to unpackaging. Suppose that a relative reference in resource A
How do you unpackage a PWP comprising a manifest, http://example.com/manifest.foo, and two HTML files, /one.html and /two.html? one.html contains both /two.html and two.html as relative references. Can we put the manifest of the WP at a non-root file of a domain? |
I feel that unpackaging as a requirement is a mistake and we shouldn't do it. This WG is not about transporting resources from one server to another, and since URIs can be spread across domains it's impossible to do what you're suggesting anyway. |
This is certainly debatable. I would like to have discussions about such high-level |
@murata0204 <https://github.com/murata0204>
The big problem with your requirement concerning relative resources is that
we don't have any concept yet of what they might be relative to! Today in
EPUB, there is a "root" and everything is defined as relative to that and
the ZIP format's native concept of mapping out a file system is utilized to
help that.
However, as we move to a more web-centric and less file-system centric
model (as mentioned by someone in another issue) - the concept of relative
(or at least relative to 'root') may no longer apply. And as
@HadrienGardeur keeps pointing out, that also assumes a single origin for
all your content (which will probably also not be relevant).
I do agree with you that as we define WP, we certainly need to keep the P
aspect for PWP in mind to make sure we don't do anything to violate that.
|
Actually, it was @lrosenthol, but thanks for thinking of me!
But this is give and take. Can we expect publications to have their own domain? This is no different than the question I asked about using content negotiation and having one directory per publication. The con of such approaches is that we're imposing potentially onerous web architecture requirements on authors. Do we want a domain per article of a journal, and then another domain where all the articles have to be duplicated for the full journal? Whatever decisions we make have to be considered across all the architectures, yes.
I'm not finding this. OCF says that all resources must reference each other through relative paths. I tried an absolute path out of curiosity, and epubcheck couldn't make any sense of the reference and threw errors, so is anyone using them if it's true? The lack of roundtripping from WP->PWP->WP that seems unavoidable with multi-domain resources may be mitigated at the EPUB level, as EPUBs are probably going to continue to be created as a bundle of relatively-located resources and won't face the same unpacking issues. They may flow more easily into a WP environment. If you're drawing sources of cross-domain web-hosted content into an EPUB, what is the likelihood you're doing so with an expectation of another party unpacking it? It's intended for an EPUB reading system to ingest. We might have to make it an advisement not to use absolute paths if you expect roundtripping, to allow for web-born publications, but it doesn't seem problematic with how epubs are constructed today. But maybe I'm missing something. |
I agree that it's much easier to go from EPUB to WP than from:
For EPUB 4, it'll depend on what we end up using in terms of packaging. Going from EPUB to WP is pretty much what the "streamer" component of Readium-2 does:
For WP -> PWP -> WP, we could populate a proxy or a CDN in very specific situations but can't expect to simply unpackage in a folder. |
This is pre-supposing a lax multi-origin decision on WP, right? |
That's the $64,000 philosophical question we keep bumping against. Do publications handle what the web can throw at them, or is only a subset of the web able to be a publication? |
That's not the only restriction. For example:
I see way too many restrictions that shouldn't exist in these discussions. Any resource on the Web should be a potential Web Publication resource. Instead we're trying to build some sort of special snowflake that has nothing to do with how the Web actually works. |
I'm not sure I'll disagree, but that's a big decision we need to get to as a group. If a WP is not some sort of subset of the Web, then it's nothing special (snowflake, or otherwise). :-) |
It's a bounded set of resources; that's what makes it a unique subset. Being a subdomain of resources isn't all that unique, just a limitation. |
First, sorry for my mistake. I fixed it.
Since absolute-path references are relative references, I believe that they are allowed by EPUB3. But it is certainly possible to disallow absolute-path references in our future specs. Note that unzipping will invalidate absolute-path references. |
Yes, that's true, but I didn't think it was technically allowed. As an epub is supposed to work on file systems, a path that starts with a slash is relative to the root of whatever drive the content is in. It's only within the abstract container that it makes sense, or if you handle the epub as web content in its own domain. I think that's the problem epubcheck has with them. Since it's validating on a file system, I believe it expects the '/' to resolve to the current drive root, and then complains that the resource is outside the container. |
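The file-system behaviour described here can be sketched with plain path joining. This only mimics how a file system treats a leading slash; it is not epubcheck's actual logic, and the directory names are hypothetical.

```python
import posixpath

# Hypothetical location where an EPUB container has been unzipped.
container_root = "/home/reader/unzipped/book"
referring_doc_dir = posixpath.join(container_root, "OEBPS")


def resolve_on_disk(doc_dir, reference):
    """Resolve a reference the way a file system would, then normalize it."""
    # posixpath.join discards doc_dir when the reference starts with "/".
    return posixpath.normpath(posixpath.join(doc_dir, reference))


def inside_container(path, root):
    """Check whether a resolved path still lies under the container root."""
    return path == root or path.startswith(root + "/")


relative_target = resolve_on_disk(referring_doc_dir, "../css/style.css")
absolute_target = resolve_on_disk(referring_doc_dir, "/css/style.css")

print(relative_target, inside_container(relative_target, container_root))
print(absolute_target, inside_container(absolute_target, container_root))
```

The relative reference stays inside the unzipped container, while the absolute-path reference lands at the drive root, which matches the "resource is outside the container" complaint.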
Here are some high-level questions raised during recent discussions.
|
Yes, and it's bounded by its manifest. "One manifest to rule them all, and in the JSON bind them" |
Tim Cole will look through this thread (and issues in the PWP repo) for discrete, potentially more up-to-date related identification issues that need to be opened (i.e., that have not yet been opened in one repository or the other). In anticipation of this we should close this issue. |
From @dauwhe on June 26, 2017 22:17
Perhaps the simplest possible answer to these questions is just a URL:
https://www.example.com/MobyDick/
would both identify the publication and mean that everything whose URL starts with this is part of the publication. So I guess that I’m looking for reasons to make this more complicated :)
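As a rough illustration of this "just a URL" idea, membership could be a simple prefix test on the URL. The `in_publication` helper and the sample resource URLs are hypothetical.

```python
from urllib.parse import urlsplit


def in_publication(publication_url, resource_url):
    """Return True if resource_url falls under the publication's URL prefix."""
    pub = urlsplit(publication_url)
    res = urlsplit(resource_url)
    # Treat the publication URL as a directory, with or without a trailing slash.
    prefix = pub.path if pub.path.endswith("/") else pub.path + "/"
    same_origin = (pub.scheme, pub.netloc) == (res.scheme, res.netloc)
    return same_origin and res.path.startswith(prefix)


PUB = "https://www.example.com/MobyDick/"
print(in_publication(PUB, "https://www.example.com/MobyDick/chapter1.html"))
print(in_publication(PUB, "https://www.example.com/css/shared.css"))
print(in_publication(PUB, "https://cdn.example.org/fonts/serif.woff"))
```

Only the first resource passes the test; shared stylesheets higher up the tree and assets on other origins fall outside the publication under this rule.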
Copied from original issue: w3c/publ-wg#10