manifest: requirements for offline #22
Using a script to load resources and scripts would be (and is often now) considered an anti-pattern that limits the ability of the browsers to help developers. Things like It seems safe to say that "deterministic offline caching" will do what it can based on the spec we write, but if developers Do It Wrong their apps (and more importantly their users!) will suffer the consequences. |
Also, Chrom(ium) is working on a new "offliner" system to replace (or rather enhance for limited space/processing scenarios) the current Here's the conversation about the new system. And some of the code they're moving toward. Certainly it would be worth connecting with some of these developers to see where things overlap in our ideal world and what they have on hand now. |
I confess I don't understand how all resources can be cached. If the needed resource is dependent on user interaction, for example, you can't know what it is until the user interacts with the interface. There might also be context-dependent ways of calling such a resource that can't all be represented in a manifest. All you can know in some cases is the script that will be called, not what it will call. And I don't know that we need to optimize web publications in a way browsers don't. If a script has no value offline, it seems like that should probably be a property that HTML defines for the script element, as dropping the file from the cache doesn't remove the element from the DOM. We'd only be "optimizing" one side of the problem. |
Correct me if I'm wrong, but modules only cover JS resources since arbitrary content types for imports haven't been specified yet? Which means that if an author is using a script to arbitrarily load other resources (like images, video, or JSON data) based on user interaction or some other dynamic value neither type="module" nor rel="prerender" can do anything about it. And the offliner code is, from the looks of it, an attempt to dynamically offline cache a page without explicit manifests or resource lists provided by the author.
We can't prevent developers from doing it wrong. We can't police how people will author files. We've been trying that approach for literally decades now and it doesn't work. The specification should be realistic about what it can mandate. If you want full and universal offline support for all web publications, you're better off trying to build something like what the Chrome team is trying to do: pre-render then store offline. Bits of it will break but it's honestly an approach that will work in many, many more scenarios than just demanding that authors always list all resources required for the publication to function or else they'll be scolded vigorously. It's entirely up to the author which resources are listed in the manifest and it's entirely up to the User Agent to try and figure out how to provide a decent user experience with the inevitably insufficient data that the authors will provide as a starting point. That's what it's going to be like in practice and the spec can either reflect reality or not. I'd prefer the spec reflect reality because otherwise we are undermining its usefulness as an implementation guide (implementors will have to deviate from the spec to provide an adequate user experience) and its overall credibility as a standards document. |
Yep. That's what it does now. What I'm leaning toward is a way to tell the browser more explicitly + user interaction (most likely) that the user wants to "keep" a publication offline. It would use the "kit" being developed now plus a distinct user interaction moment (like clicking "Reader Mode" in Firefox these days) and bring the entire publication "into" the browser/reading system/thing. Also, I've never suggested most of what you just said I suggested. 😃 We will develop testable, verifiable MUSTs in our spec that conforming implementations MUST support to get that coveted gold ⭐️ Web Developers who want their stuff to work as intended will follow the spec (and all the content generated later about the spec) to build Web Publications that work in those implementations. When they drive off the map, there will be consequences...same as with any spec + implementation combo. There's no reason to reiterate that people will Do It Wrong. |
What you are describing are SHOULDs, otherwise the whole gold star concept is meaningless. Like I've been trying to say, MUSTs have consequences (they are things that are absolutely required under every circumstance or the sky will fall), and indicating a MUST when a strongly recommended SHOULD would do is pretty darn developer-hostile and absolutely undermines adoption. You can tell developers that if they want a certain feature to work, they need to do X, Y, or Z. That's fine. And you can tell implementations that they need to do X, Y, or Z to be able to reliably offer a specific feature. That's also fine. What you can't do is mandate that all developers accommodate said feature even when they have no desire or economic incentive to do so. A huge number of them won't, even if that results in invalid files. Remember, from the perspective of a web developer, the entirety of the web publication manifest is an optional extra. If we're really unlucky, our complex demands will make devs simply ignore the entire idea and just carry on as they have so far, possibly just relying on unreliable browser offline features instead of web publication specs because just letting the browser do its work is so much simpler in practice. And the publishing industry already has ePub 3 and PDFs. Their adoption isn't a certainty either unless we take pains to ensure that web publications are considerably simpler than, say, just making a subscription website with a service worker, or even continuing to use ePubs and PDFs. The more granular we make the features of web publications, the more likely it is that the concept will see adoption and implementation. I'd even go so far as to suggest that each individual aspect of the manifest be specced individually as either extensions to the web app manifest or to the HTML standard itself (i.e. publication metadata = one spec, spine = another spec, secondary resources = yet another spec) so that they can be adopted and implemented individually.
But it's way way way way too early for that conversation 😄
I strongly believe there is a reason to reiterate that again and again. It should be a running theme throughout the standardisation process. People will Do It Wrong and in large enough numbers for it to be an issue. And because it hasn't been A Wrong Thing To Do for them so far, we aren't really in a position to scold them for it. We can entice—"you get this nice feature if you list all your resources"—but not mandate. Unlike many other specs, we are building on top of a pre-existing ecosystem and platform and have limited standing to just waltz in and lay down all-or-nothing rules. So… I have no problem with any of what you describe as long as they are SHOULDs. We can even hammer on how important they are and that the feature they enable is really nice so people really really should do the work. But we can't just set a hard rule that they MUST do it. I mean, we can try to set that hard rule. They just won't follow it. |
Philosophy of spec writing is a fascinating topic, but let's get back to the issue. What are the requirements for making a WP available offline? |
On Tue, Aug 8, 2017 at 12:52 PM, BigBlueHat ***@***.***> wrote:
What I'm leaning toward is a way to tell the browser more explicitly +
user interaction (most likely) that the user wants to "keep" a publication
offline.
I'm OK with that sort of thing as one option that an author can use to
have their publication go offline - but certainly not the only way. As
long as the author can also use other methods (such as explicit
ServiceWorker development) and ignore the declarative model entirely - all
is good.
|
On Tue, Aug 8, 2017 at 1:51 PM, Tzviya ***@***.***> wrote:
What are the requirements for making a WP available offline?
While I think that's a good question to ask, it's a bit misleading.
A WP can go offline today without anything being done in the UA, such as
via technology like ServiceWorkers. So from a WP author's perspective,
there is nothing more to do - they have everything they need.
Maybe we think we need to make it easier for WP authors to have their
publications go offline, perhaps by putting requirements on UAs. If so,
then we need to look at the problem from the UA perspective and not the WP
one.
And here's where it gets fun - a user may wish to take a WP offline that
wasn't designed/authored to be taken offline. Is that a problem we are
trying to solve? And what about the author who wants to restrict the
"offline-ability" of their publications?
|
“What are the requirements for making a WP available offline?”
Coming in late to this conversation, but is it a requirement that offline=browser access only?
…-Rick
|
Agree that this is an issue to discuss. For us, the ability to be only online, only offline, or mixed is a supply chain decision that the owner/distributor controls. |
I'm getting to the point where I don't even understand what we are talking about. So, "A WP can go offline today without anything being done in the UA". What does that even mean? What is a WP in that sentence, a Web Publication? I didn't think those even EXISTED today, let alone had features like going offline. And this concept of a WP designed to go offline vs not go offline is strange. The ability to be functional while offline is one of the "most important and high-level characteristics" [from the charter] of a WP. And yes, today, I can write a web app that has all the characteristics of a WP as listed in the charter. But do we really intend to have every WP be a web app that can cache itself? Isn't the whole point of this spec to avoid that? Otherwise, what are we writing? I can already do everything the charter calls for in a web app, no new specs needed! I was under the impression our goal was to make a specification for creating a web page or pages that had certain fundamental characteristics not intrinsic to arbitrary web pages. |
On Tue, Aug 8, 2017 at 2:15 PM, bduga ***@***.***> wrote:
I'm getting to the point where I don't even understand what we are talking
about. So, "A WP can go offline today without anything being done in the
UA". What does that even mean? What is a WP in that sentence, a Web
Publication? I didn't think those even EXISTED today, let alone had
features like going offline.
Sorry @bduga - point well taken. I should have written that a web page can
go offline if it wishes to.
And this concept of a WP designed to go offline vs not go offline is
strange.
Why? It's just one of a series of possible approaches that we could take
- Put the burden of "offline-ability" entirely on the author
- Put the burden entirely on the UA
- Some combination of the two
And yes, today, I can write a web app that has all the characteristics of
a WP as listed in the charter. But do we really intend to have every WP be
a web app that can cache itself?
Maybe...I would say that this is a similar discussion/debate to where the
UX of the publication lives and who controls it.
Isn't the whole point of this spec to avoid that? Otherwise, what are we
writing?
That's one of the many things we are discussing (or need to be discussing).
I can already do everything the charter calls for in a web app, no new
specs needed! I was under the impression our goal was to make a
specification for creating a web page or pages that had certain fundamental
characteristics not intrinsic to arbitrary web pages.
For those things that we felt were necessary for publications that were not
already present - sure. But we have also said (multiple times) that if
something already exists in the OWP, we should use it.
|
On Tue, Aug 8, 2017 at 2:11 PM, Rick Johnson ***@***.***> wrote:
“What are the requirements for making a WP available offline?”
Coming in late to this conversation, but is it a requirement that
offline=browser access only?
User Agent (UA) access, yes. Since a WP is consumed by a UA, at least so
far as our definitions exist today
|
This is probably the root source of many of these disagreements. I'm definitely not working towards that goal and it's not the goal of my employer. I'm working towards extending the web's feature set to accommodate the working group's stated requirements for web publications. The authors of arbitrary web pages should be able to use the Working Group's output to add individual publication-style features to their web pages. If they add them all, congratulations: you have a web publication. That's a very different goal that requires a considerably different approach from the one you state. "The UA can make your publication function offline if you provide a list of its resources, no service worker needed" is a feature that can be assessed, specified and implemented independently from other features that fulfil the publication requirements. And that's it, that's the requirement right there: the UA needs the author to list the resources the publication wants to be made available offline. This could then potentially be widely used outside of publications specifically and would improve the web as a whole. If we are speccing offline-publications as an independent feature in a specification of its own (as I think we should but, again, that's a topic for a later debate) then we could have it as a MUST because otherwise offline is meaningless. But in the context of the manifest as a whole, it has to be a SHOULD. That is, if we're taking the approach of extending the web feature by feature. If people want to intentionally diverge from regular web pages, then that's a different thing entirely. "This webpage becomes something fundamentally different from a regular web page if you add these bunch of things together, and you must add them all for it to work" is a Different Thing. And, yes, that different thing requires an all or nothing approach because otherwise you're just extending the web's feature set imperfectly and one at a time and you're back at the other approach.
The let's-make-a-fundamentally-different-kind-of-web-page is a valid approach to take. But it's also uninteresting to many of us who are coming at this from the web end of things. And I suspect that includes many browser vendors. |
Irrespective of our differences in overall goals, offline is rather complicated. (As others have pointed out frequently.) Even just the idea of relying on a list of resources in the manifest (primary + secondary) to let the UA cache the publication raises some questions:
The answers to each of these questions have implications for what we need to put in the manifest for offline to work. E.g. if we don't want to require UAs to regularly re-request each resource, then offline requires modification times in the manifest in addition to a resource list for it to function properly. And much in the same way that the Fetch API is now how WHATWG is defining network requests in general, we probably will have to specify this behaviour in terms of how you'd implement it in a Service Worker, even if that's not how everybody will implement it. The simplest thing to do, AFAICT, is to define offline publications as pre-populating a default Cache store coupled with a pre-defined caching strategy (e.g. either cache and update, or cache, update, and refresh) that is overridden if the publication provides a service worker. The service worker can then take over managing that cache by opening it by its predefined name. That way we only need to provide a list of resources without worrying about an update strategy. |
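A minimal sketch of the modification-time idea mentioned above, assuming a hypothetical manifest shape in which each resource entry carries a `url` and a `modified` timestamp (neither field name comes from any draft). The helper compares those timestamps with what a UA already holds offline and returns only the URLs it would need to request again, so nothing else has to be re-fetched.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO 8601 timestamp such as '2017-08-08T00:00:00+00:00'."""
    return datetime.fromisoformat(value)


def resources_to_refresh(manifest_resources, cached):
    """Return URLs that are missing from the cache or older than the manifest says.

    manifest_resources: list of {"url": str, "modified": str}  (hypothetical shape)
    cached: dict mapping url -> ISO timestamp of the copy held offline
    """
    stale = []
    for entry in manifest_resources:
        url = entry["url"]
        if url not in cached:
            stale.append(url)  # never fetched
        elif parse_ts(cached[url]) < parse_ts(entry["modified"]):
            stale.append(url)  # changed since the cached copy was stored
    return stale


manifest = [
    {"url": "/book/ch1.html", "modified": "2017-08-01T00:00:00+00:00"},
    {"url": "/book/ch2.html", "modified": "2017-08-08T00:00:00+00:00"},
    {"url": "/book/style.css", "modified": "2017-07-01T00:00:00+00:00"},
]
cache_state = {
    "/book/ch1.html": "2017-08-02T00:00:00+00:00",
    "/book/ch2.html": "2017-08-05T00:00:00+00:00",
}

print(resources_to_refresh(manifest, cache_state))
```

In this example only `ch2.html` (stale) and `style.css` (never cached) would be requested. Without timestamps in the manifest, the UA would have to revalidate every listed resource over the network to learn the same thing.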
Assuming a default reading order is specified, is there any greater requirement to a manifest than to list all additional non-embedded resources that are rendered to the user? (e.g., non-primary resources referenced by an a tag so the UA doesn't get confused about which of those is in or out of scope) All subresources, with the exception of some script-used resources, can be determined by the user agent by inspecting those resources, even if it isn't as efficient as having the user list it all out. Anything else is optional to list, with perhaps strong encouragement to list those pesky script resources. And/or do we add a flag that content-inspection isn't necessary if a complete manifest is provided? Would that provide enough information for any offlining solutions, without us getting bogged down in what they might be, and without bringing onerous manifesting requirements to the web? Or is that too simplistic a thought? |
On Tue, Aug 8, 2017 at 5:05 PM, Matt Garrish ***@***.***> wrote:
Assuming a default reading order is specified, is there any greater
requirement to a manifest than to list all additional non-embedded
resources that are rendered to the user? (e.g., non-primary resources
referenced by an a tag so the UA doesn't get confused about which of
those is in or out of scope)
@bduga is the only person that has requested that. The rest of us are
perfectly fine without this...
All subresources, with the exception of some script-used resources, can be
determined by the user agent by inspecting those resources, even if it
isn't as efficient as having the user list it all out.
That's incorrect. Since HTML, CSS, SVG and other WP tech can also
reference other things - it's not necessarily only scripts.
|
But if they're explicitly referenced, they can be found by inspecting the resource, else how does anything load in a browser? The one fault in the model is resources that are dynamically needed, as discovering them requires initiating the script(s). Sometimes they'll be static resources and can be listed, other times... Aside from caching, listing the primary resources along with those that will be directly rendered provides a full context of what is in the scope of the publication. If you don't have this information, how does the user agent know the bounds of what belongs to the publication? (e.g., to unload itself?) |
The problem is that CSS and JS are both complex languages that can dynamically respond to context and user input. And SVG has a set of animation elements that can dynamically modify the attributes of other elements (and, quite frankly, are a security hazard). So if you want to statically analyse a publication (i.e. without actually rendering it in a browser view) to discover which resources are needed, you are going to miss a bunch of stuff. HTML is pretty simple to analyse: stylesheet links, app manifest links, the prefetch/preload/prerender trifecta (as these might indicate dynamically loaded resources), scripts, src attributes, srcset attributes, href and xlink:href attributes in SVG elements, then inline styles. Which leads you to CSS, which is a bit tricky as you need a proper CSS parser to find all of the url() values, but with a bit of work you can get a list of resources out of it that is exhaustive in all but the weirdest of edge cases. SVG is doable if you ignore things like and (those are a pain in the rear in general). With a bit of work you might even be able to cover the animation element edge case as well. JS requires running the actual code in an actual browser to get anything meaningful so that's out of the picture for static analysis. Assuming JS-loaded assets are by definition external to the publication (even if they aren't, really) and assuming that HTTP content negotiation always returns predictable and roughly equivalent resources (which it should if it's working properly) then yeah, static analysis can do the job. It won't do it perfectly, but it will do most of it. And if you rely on authors to plug the gaps in the rare cases they give a damn, you might even get to around 85-90%. I think that is Good Enough™, personally. But others have disagreed strongly and insisted that authors must provide an exhaustive list of resources.
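To make the HTML part of that list concrete, here is a rough sketch using only Python's standard `html.parser`. The class name, the sample markup and the `example.org` URLs are all made up for illustration, and the coverage is deliberately partial: it handles the link relations, `src`, `srcset` and `xlink:href` cases named above, treats `href` as a fetch only on the SVG `image` and `use` elements, and merely sets inline styles aside for a CSS pass. Anything loaded by script stays invisible, which is the limitation described above.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

# Link relations that point at (or hint at) resources the page will fetch.
FETCHING_RELS = {"stylesheet", "manifest", "prefetch", "preload", "prerender"}


class ResourceCollector(HTMLParser):
    """Collect URLs from the statically analysable parts of an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = set()
        self.inline_styles = []  # kept aside for a separate CSS pass

    def _add(self, value):
        if value:
            absolute = urljoin(self.base_url, value.strip())
            self.resources.add(urldefrag(absolute).url)  # drop any #fragment

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link":
            rels = set((attrs.get("rel") or "").lower().split())
            if rels & FETCHING_RELS:
                self._add(attrs.get("href"))
        elif tag != "a":  # hyperlinks are navigation, not subresources
            self._add(attrs.get("src"))
            self._add(attrs.get("xlink:href"))
            if tag in ("image", "use"):  # SVG elements that fetch via href
                self._add(attrs.get("href"))
            # Naive srcset split: breaks on URLs that contain commas.
            for candidate in (attrs.get("srcset") or "").split(","):
                parts = candidate.split()
                if parts:
                    self._add(parts[0])
        if attrs.get("style"):
            self.inline_styles.append(attrs["style"])


sample = """
<html><head>
<link rel="stylesheet" href="css/main.css">
<link rel="preload" href="fonts/body.woff2" as="font">
<script src="js/app.js"></script>
</head><body>
<img src="img/cover.png" srcset="img/cover-2x.png 2x, img/cover-3x.png 3x">
<svg><use xlink:href="img/icons.svg#star"></use></svg>
<a href="ch2.html">Next</a>
<p style="background: url(img/bg.png)">Hi</p>
</body></html>
"""

collector = ResourceCollector("https://example.org/book/ch1.html")
collector.feed(sample)
for url in sorted(collector.resources):
    print(url)
```

The link to `ch2.html` is skipped on purpose: whether a linked document belongs to the publication is a question of scope, not of subresource discovery.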
AMP solves this problem by limiting the format to a subset of CSS, SVG and HTML and forbidding non-AMP JS entirely. Which works. It is a pre-existing, offline-capable, and portable version of HTML. The fact that it's basically Google's version of HTML is supremely problematic, of course. As a technology it's full of interesting ideas, though. Chrome is (if I'm reading things correctly) hoping to solve this problem by rendering the page in the background and doing a full runtime evaluation and inspection to get all resources, static and dynamic. This might still miss out on stuff, e.g. things that will only load on mobile or on desktops, as the background render is a version of the current rendering context. But with a bit of clever querying it's a method that could get them very close to 100%. But this is not an easy path to take.
I'm not sure what you mean here. The UA will always know which resources it has stored offline so it follows that it can remove them if needed. At runtime it knows all of the resources, at least in this context, because it's running the publication's JavaScript and CSS. |
I must admit, and it may be my happy-pills (™Garth), that I do not really think you disagree. Indeed, I do not see where @bduga said otherwise. Maybe one should say "web site" rather than "web page", but I believe the goal is the same. |
Referring back to what @mattgarrish said in #22 (comment) I have the impression that, in fact, we do have some sort of a consensus for now (remember that the goal is to come up with a First Public Working Draft ASAP, and not solve all the problems between now and the end of the year!). Indeed, it seems that a list of the resources as part of the Manifest is Good Enough (™Baldur). In my view, this is the answer to the question raised in the issue. Indeed, this works for the important use cases that, at least, I have in mind (e.g., I want to be able to look at a research paper on the Web and be able to read it offline or online). Will there be cases when this set of information will not be enough (e.g., if the user uses all kinds of sexy javascripts dynamically loading things as a result of interaction)? You bet there are. Does it mean that not all Web sites can be turned, in fact, into a Web Publication? Yep, that is true. So what? We are not aiming to change the Web as a whole, we did not say every Web page can be a WP; we merely aim to provide a way for “publications” (like the F1000 article I referred to) to find their place on the Web as first class entities. And that is perfectly enough for me. Will there be open issues? Yes, that is possible. Let us list them, record them, refer to them from the spec and move on for now. |
On Wed, Aug 9, 2017 at 6:37 AM, Ivan Herman ***@***.***> wrote:
Referring back to what @mattgarrish <https://github.com/mattgarrish> said
in #22 (comment)
<#22 (comment)> I have the
impression that, in fact, we do have some sort of a consensus for now
You really have that impression??
I have the exact opposite impression. I see two *very* divided camps.
There are those that want to (mandate having to) list non-primary resources
and those that do not.
Indeed, it seems that a list of the resources as part of the Manifest is
Good Enough (™Baldur). In my view, this is the answer to the question
raised in the issue.
As long as the list is *optional* (not even a should, but a may!) - then I
agree we would have consensus. But I am not sure if everyone is even
willing to go with that.
|
Sorry, it's probably the weird terminology of not having a concept like a "web page" to refer to. Yes, the user agent will know what subresources a primary resource needs, but we've said in another issue that not every resource that is directly rendered to the user (i.e., not wrapped in html or svg but standing alone in the viewport) has to be listed as a primary resource (the non-linear issue). So, say I have a choose your own adventure book. The first document has a couple of I'm not suggesting that the authors list every script, style sheet, image, etc., although they could have the option to do so. But to establish the bounds of the publication we need to know everything that the user is possible to encounter in whatever reading progression they follow that is considered within the scope of the publication. If we require all those resources, the subresources can (for the most part) be programmatically determined. I had assumed that establishing the bounds was an important part of a web publication, as it is what allows for features like taking the publication offline.
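One way to picture "everything the user could encounter in whatever reading progression they follow" is a walk over the publication's link graph. The sketch below is only an illustration: the link graph, the URLs and the idea of expressing scope as a URL prefix are all assumptions, not anything agreed in this thread.

```python
from collections import deque


def reachable_in_scope(start_urls, links, scope_prefix):
    """Breadth-first walk of a link graph that never leaves scope_prefix.

    links: dict mapping a document URL to the URLs it links to (hypothetical data)
    """
    order = []
    visited = set()
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if url in visited or not url.startswith(scope_prefix):
            continue  # already seen, or outside the publication
        visited.add(url)
        order.append(url)
        queue.extend(links.get(url, []))
    return order


BOOK = "https://example.org/adventure/"
links = {
    BOOK + "start.html": [BOOK + "cave.html", BOOK + "forest.html"],
    BOOK + "cave.html": [BOOK + "ending-a.html", "https://elsewhere.example/map"],
    BOOK + "forest.html": [BOOK + "ending-b.html", BOOK + "start.html"],
}

for url in reachable_in_scope([BOOK + "start.html"], links, BOOK):
    print(url)
```

Only the default reading order would need to start the walk; the branch and ending documents are found by following links, while the external map is left outside the bounds.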
I stand to be proven wrong by saying this, but CSS is, in my mind, easier to determine the possible necessary resources for. Yes, which ones to apply can only be known at run time, but a user agent could grab them all for caching. At least more easily than JS. Maybe it grabs them all, maybe it only takes the CSS applicable to the current context -- those are issues we don't have to solve if we don't try to make our own caching mechanism. (Of course, if the CSS itself is dynamically generated on the server, all bets are off, but we can't try to handle everything.) |
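As a rough illustration of "grab them all", the sketch below pulls every `url()` and string-form `@import` reference out of a stylesheet without evaluating media queries or selectors. It uses regular expressions rather than a real CSS parser, so it is an approximation that will miss or mangle edge cases (escaped characters, unquoted URLs containing parentheses); the function name and sample stylesheet are invented for the example.

```python
import re
from urllib.parse import urljoin

# url(...) with optional matching quotes, and @import "..." without url().
URL_RE = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)
IMPORT_RE = re.compile(r"""@import\s+(['"])(.*?)\1""", re.IGNORECASE)
COMMENT_RE = re.compile(r"/\*.*?\*/", re.DOTALL)


def css_resources(css_text, stylesheet_url):
    """Return every external URL a stylesheet could request, in any context."""
    css_text = COMMENT_RE.sub("", css_text)  # ignore references inside comments
    found = set()
    for regex in (URL_RE, IMPORT_RE):
        for match in regex.finditer(css_text):
            ref = match.group(2).strip()
            if ref and not ref.lower().startswith("data:"):
                found.add(urljoin(stylesheet_url, ref))
    return found


sample_css = """
/* url(ignored.png) inside a comment */
@import "base.css";
@import url(print.css) print;
body { background: url('../img/paper.png'); }
@media (min-width: 60em) { .cover { background-image: url("../img/cover-wide.jpg"); } }
@font-face { font-family: X; src: url(data:font/woff2;base64,AAAA) format("woff2"); }
"""

for url in sorted(css_resources(sample_css, "https://example.org/book/css/main.css")):
    print(url)
```

The wide cover image is collected even though its media query may never match on the current device, which is the "take everything" choice; a UA that only wanted the CSS applicable to the current context would need a real style engine instead.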
On Wed, Aug 9, 2017 at 7:56 AM, Matt Garrish ***@***.***> wrote:
I'm not suggesting that the authors list every script, style sheet, image,
etc., although they could have the option to do so. But to establish the
bounds of the publication we need to know everything that the user is
possible to encounter in whatever reading progression they follow that is
considered within the scope of the publication.
Why do we need to know that? You are pre-supposing some sort of
implementation or requirement that has not been either stated or agreed to.
If we require all those resources, the subresources can (for the most
part) be programmatically determined.
True. But again, we don't have a requirement that says we need that...
I had assumed that establishing the bounds was an important part of a web
publication, as it is what allows for features like taking the publication
offline.
No it is not. It is *just one way* that would allow this. It is not the
only way nor necessarily the way we have agreed to.
The problem is that both CSS and JS are both complex languages that can
dynamically respond to context and user input.
I stand to be proven wrong by saying this, but CSS is, in my mind, easier
to determine the possible necessary resources for.
Compared to JS, yes CSS is easier. But there are still a whole lot of dark
corners where things can hide and be missed.
|
Yes it is, and we agreed upon it in a long mail thread titled "definition of Web Publication". |
@lrosenthol, in #22 (comment) you said:
You are right on this aspect. What I was reflecting on (forgetting this) is whether there is any other information that a Manifest must provide for offline usage and, I believe, we have not listed any. So you are right, there is the issue on whether all the secondary resources must be listed or not, which is mostly discussed in issue #6. Let us go back to the definition of the WP in the current draft:
and it also says
First of all, what this tells me is that the manifest must list all the primary resources of a WP. If it does not do it, it is simply not the manifest of a Web Publication. It can be useful for other purposes, but we are not talking about that. I also believe that the manifest in the abstract sense must contain information on the secondary resources. As @mattgarrish put it in another comment, the boundaries of a WP must be set, otherwise a WP might fold the whole of the Web. We are still talking about the abstract information that the UA has to know about via the Manifest. It may be (to be decided further) that this can be done via a means that does not require listing all the secondary resources (e.g., by some scoping mechanism, listing some of the base URLs whose discovered resources are considered to be secondary resources), but that is a matter of the practical realization. To summarize: I am strongly in favour of saying that a manifest MUST include information about all the resources, primary or secondary. Put another way, it MUST ensure that the UA is in a position to discover the boundaries of the WP, and to decide whether a particular resource is within or outside a Web Publication. |
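A small sketch of what such a boundary test could look like if both options were allowed: an explicit resource list plus a set of base URLs. The parameter names (`listed`, `scope_bases`) and the manifest location are hypothetical and are not taken from any draft.

```python
from urllib.parse import urldefrag, urljoin


def make_scope_check(manifest_url, listed, scope_bases=()):
    """Build a predicate answering: is this URL inside the publication?

    listed: URLs named explicitly in the manifest (relative to the manifest)
    scope_bases: base URLs whose descendants count as secondary resources
    """
    listed_abs = {urldefrag(urljoin(manifest_url, u)).url for u in listed}
    # Bases should end with "/" so that "img/" does not also match "imgs/".
    bases_abs = [urljoin(manifest_url, b) for b in scope_bases]

    def in_publication(url):
        absolute = urldefrag(urljoin(manifest_url, url)).url
        if absolute in listed_abs:
            return True
        return any(absolute.startswith(base) for base in bases_abs)

    return in_publication


in_publication = make_scope_check(
    "https://example.org/book/manifest.json",
    listed=["ch1.html", "ch2.html", "css/main.css"],
    scope_bases=["img/"],
)

for candidate in [
    "https://example.org/book/ch1.html#sec2",
    "https://example.org/book/img/fig1.png",
    "https://example.org/blog/post.html",
    "https://cdn.example.com/lib.js",
]:
    print(candidate, in_publication(candidate))
```

Either mechanism lets the UA make the within-or-outside decision; the scoping variant simply trades an exhaustive list for a rule.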
I have created a separate issue (#23) to concentrate on the question whether the manifest must contain information on secondary resources or not. |
@iherman The UA can implement other mechanisms to improve the offline user experience, but that's up to them. The author can use a service worker to add logic and dynamism to how the publication works offline (or just to improve its caching when online), but that's also just up to them. We will need to outline how the caching mechanism is going to interact with service workers but that's also a separate issue. Does that make sense? @mattgarrish My personal rule of thumb, which may or may not be useful, is:
Subresource being, using the SRI spec's definition: resources fetched by a web page. The boundaries of a web page and whether it can be fully offline are two separate issues, IMO. A publication is often going to have subresources within its boundaries that aren't listed in the manifest and thus aren't secondary resources in publication terms (even if we demand that authors not make such publications on pain of invalidation, it's going to happen). I don't know if that works as a formal definition but I've found it to be a useful guiding heuristic when I'm thinking about this topic. |
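Read that way, the distinction is a set comparison between what the pages actually fetch and what the manifest lists. The sketch below assumes that reading; the function name, the three labels and the sample URLs are invented for illustration.

```python
def classify_subresources(fetched, manifest_listed):
    """Compare fetched subresources with the manifest's resource list."""
    fetched = set(fetched)
    listed = set(manifest_listed)
    return {
        # Fetched and listed: secondary resources in publication terms.
        "secondary": sorted(fetched & listed),
        # Fetched but not listed: plain subresources with no offline promise.
        "unlisted": sorted(fetched - listed),
        # Listed but never observed being fetched.
        "unobserved": sorted(listed - fetched),
    }


report = classify_subresources(
    fetched=["/book/css/main.css", "/book/img/cover.png", "/analytics.js"],
    manifest_listed=["/book/css/main.css", "/book/img/cover.png", "/book/fonts/body.woff2"],
)
print(report)
```

A UA could cache the "secondary" set with confidence and treat the "unlisted" set as best effort, without either set changing where the publication's boundaries lie.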
This has been refined a bit in #23 (comment). We are getting there... |
Propose closing: the draft now has a number of references to this, and this issue has become extremely long and has lost focus a bit. We may be better off closing it and, if necessary, opening new, more focused issues when the time comes. |
Closing this issue and redirecting conversation to Issue #141 in Affordances |
from #15 (comment) by @bduga
To be discussed: which resources should be listed in manifest