should we disallow "out-of-container" relative URLs? #1912
Comments
The parsing may have been unclear, but I wouldn't say the intent was. From 3.2:
which is currently:
I can't find this now, or anything equivalent, in the proposal for #1888, so restoring it in some form would seem appropriate. There's also a question of whether |
I guess I'm sorta confused by the "i.e.": that the URL virtually represents something at or below the root directory doesn't mean the resource exists. That "existence" issue is yet another one, and only tangentially related to the current issue of out-of-container URL. In the example above, and assuming #1898, "
Yeah, I had raised that point in #1888 (comment), which was left unnoticed among the flow of the discussion 😅, copying it here:
Re:
Right. That's another issue 😊 |
The issue was discussed in a meeting on 2021-11-19 List of resolutions:
3. should we disallow "out-of-container" relative URLs? (issue epub-specs#1912)
See github issue epub-specs#1912.
Dave Cramer: there is a related problem of ../ repeated until it leaks out of the epub.
Ivan Herman: isn't it correct that we don't need to do anything about this problem because what we just merged avoids any sort of security issue.
Romain Deltour: this would add an authoring conformance requirement for URLs.
Romain Deltour: but to avoid conforming but non-interop friendly epubs, we just ban such URLs.
Matt Garrish: an epub2 RS must still be able to open epub3. And because of that it makes sense to keep things consistent from an authoring perspective.
Rick Johnson: seems like this obligates a change in epubcheck? How does that change happen?
Romain Deltour: epubcheck will be updated to implement epub 33, don't worry.
Rick Johnson: thank you!
Dave Cramer: alignment of epub 33 spec and epubcheck will be smooth, thanks to mgarrish and romain.
Ivan Herman: there is already an alpha version of epubcheck for epub spec 33.
Dave Cramer: i think we just disallow out of container URLs.
Romain Deltour: in the issue i propose an algorithm to identify what is an out of container URL string.
|
Before getting into a PR... and looking at
I presume the reason we duplicate the same steps with
|
Yes, exactly 👍
What if the chosen name collides with the tested relative URL? The proposed A/B algorithm was the simplest and most explicit I could find. I agree it feels a bit hacky 😊. Another approach could be to loop through path segments, and use a counter to make sure dot-dot segments are balanced by other path segments, which would avoid using a test URL altogether. That may be more elegant, although a tad more complex. |
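For reference, the counter idea mentioned above could look roughly like the following Python sketch. The function name `escapes_container` and the `base_dir_depth` parameter are hypothetical (neither comes from the spec or the proposal), and the sketch only looks at the path: it ignores query, fragment, and percent-encoded dots, so it is an illustration rather than a conforming implementation.

```python
def escapes_container(relative_path: str, base_dir_depth: int = 0) -> bool:
    """Return True if the dot-dot segments are not balanced by other segments.

    base_dir_depth is an assumption of this sketch: the number of directories
    between the container root and the document where the URL string occurs
    (0 means the URL string is resolved against the root directory itself).
    """
    depth = base_dir_depth
    for segment in relative_path.split("/"):
        if segment == "..":
            depth -= 1
            if depth < 0:
                # One more ".." than there are directories to climb.
                return True
        elif segment == ".":
            # A single-dot segment does not change the depth.
            continue
        else:
            depth += 1
    return False


print(escapes_container("../../../../EPUB/content.xhtml"))  # True
print(escapes_container("../images/cover.png", base_dir_depth=2))  # False
```

Note that the result depends on where the URL string occurs, which is why the depth of the containing document is passed in explicitly here.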
I am not sure what you mean. By the way, we could also simplify by using a simple ASCII string instead of File Name if that works otherwise. I can of course add a note to the algorithm making it explicit why this double dip... |
Oops, you mean the same way as for 'A' or 'B' above, right? Maybe it becomes o.k. by changing the description to make it clear that the decision should be identical for any chosen File Name. |
Yes 😊
Yeah, that's the idea but it's an algorithm: it has to explicitly describe how to test that. Testing for |
So I asked Anne for a quick sanity check and he says the approach is reasonable (both the definition override and the algorithm). He did suggest that we could leave that to a note, since our processing requirements make the point moot for conforming RS. But given we normatively forbade such out-of-container URL strings in previous EPUB versions, and since they do create interop issues with current RS, I believe it's better to normatively forbid them. |
Excuse me, could you explain why the /A/ and the /B/ redundancy in the above algorithm? And I also would like to understand whether the algorithm applies also to relative urls inside any-depth-level XHTML files (in such a case the algorithm seems not to be reliable to assess whether urls are leaking or not-leaking) or just to spine level references. |
@rdeltour, I yield :-). But I will do the PR only when the current set of PRs (or at least most of them) get merged. Some of those PR-s are very large and go all over the place; I am afraid to get into merge hell if I do the PR on |
This is to avoid any false negative. Take the URL string "
Yes, the restrictive definition of relative-URL string intends to apply everywhere within the EPUB container. I believe a note will be helpful in clarifying that. |
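To make the A/B discussion above a bit more concrete, here is a rough Python sketch of the two-test-base idea as I read it from this thread. The exact algorithm text is not reproduced here, so the details are assumptions: the test base URLs `https://a.example.org/A/` and `https://b.example.org/B/`, the function name `is_out_of_container`, and the use of `urllib.parse.urljoin` (which follows RFC 3986, not the URL Standard parser) are all illustrative choices. The sketch also treats the test base as the root directory, so it does not address URL strings found in documents deeper in the container.

```python
from urllib.parse import urljoin

# Hypothetical test bases; each stands in for the container root URL.
TEST_BASE_A = "https://a.example.org/A/"
TEST_BASE_B = "https://b.example.org/B/"


def is_out_of_container(relative_url: str) -> bool:
    """Flag a relative URL string that climbs above the test root."""
    resolved_a = urljoin(TEST_BASE_A, relative_url)
    resolved_b = urljoin(TEST_BASE_B, relative_url)
    # The URL string is accepted only if it stays below BOTH test roots.
    return not (
        resolved_a.startswith(TEST_BASE_A)
        and resolved_b.startswith(TEST_BASE_B)
    )


# "../A/x" climbs out of the root and back into a directory named "A".
# With base A alone it would look fine; base B catches it.
print(is_out_of_container("../A/x"))
print(is_out_of_container("EPUB/content.xhtml"))
```

The `../A/x` input is my own illustration of the kind of false negative a single test base would miss, not the example from the comment above.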
I think that the above-mentioned algorithm is not good for relative URLs that are found in some deeper XHTML file, because relative URLs are relative to the path of the file itself. About false negatives, I think you are underestimating all possible "strange cases", like for example A/B forms existing in the original relative URL with a sufficiently complex .. sequence, that could trick the algorithm when flattened. |
Very good point 🤦♂️ I guess the algorithm needs to be tweaked by either saying we use the test URLs as the URL of the root directory, or by somehow constructing a test base with the path of the file where the URL occurs. Of course, this becomes more complicated when the underlying document format allows overriding the base URL (like in HTML with the |
Do we really need to define an algorithm at all? We could just mention that container relative URL strings should not reference files outside of the abstract container, and that if they do, such URLs will not work, and maybe point them to the new section on URL processing for more details. That way existing content will not suddenly start failing epub check. I am worried we are making rejection due to epub check failures an even greyer area. That is, if we forbid such URLs and epub check returns an error, can a RS reject that epub? Aren't they also required to process such URLs? It seems like we are tackling something that is difficult to define, introducing backwards compatibility problems, and potentially making RS pipelines more complex, and it isn't really clear what we are gaining. |
About the algorithm Maybe there is a solution: About the BASE tag @bduga So to summarize: Regards |
I'm not sure I see any alternative option, except maybe some kind of syntactic definition.
That's the idea, indeed. But "reference files outside of the abstract container" isn't well-defined. Before #1898, it was under-specified. After #1898, both "
I don't think a lot of existing content will suddenly start failing EPUBCheck if the proposal in this issue becomes normative. Again, it was already the intention behind the prose of previous EPUB versions, so it's likely already somewhat (incompletely) implemented in EPUBCheck. I'll need to run some more tests to back this up.
I'm not sure I fully understand your concerns, do you have a concrete example?
Yes, the proposal in this issue is an authoring requirement. Basically, the idea is to forbid such URLs, formalizing what has always been stated in EPUB (but not very precisely).
We're gaining better interoperability by staying closer to previous versions of EPUB. |
I must admit, I am compelled by the arguments of @bduga and @P5music that we may want to do too much. On our call, I believe, we did agree that the section on root URL-s (and its counterpart in the RS spec) avoids the pressing issue, namely the possible security leak that would happen through an "out-of-container" relative URL. The main argument put forward to add a normative constraint nevertheless was that not all RS-s are conformant, so it is good to have this prevented in the content specification. However, aren't we too cautious? Shouldn't we trust that RS-s will, eventually, come around? So... what about putting something like the following (inspired by #1922, which was born of a somewhat similar issue about older and newer RS-s) into the spec:
(This is a somewhat more specific version of @bduga's comment in #1912 (comment)) |
The main argument, I think, is that previous EPUB versions disallowed out-of-container URLs. The specification was slightly ambiguous and not super explicit in its definitions, but the intention was pretty clear. By not saying anything, I'm (slightly) concerned that we open a (tiny) can of worms. Authors can start having EPUBs that pass EPUBCheck but that aren't readable. For instance, if all the
it will pass EPUBCheck successfully (post-#1898), but I doubt the resulting EPUB will actually be readable in any of today's reading systems. Now, the risk is mitigated by the fact that no one would likely start using such URLs on purpose. But still.
The current working draft already has a note that says:
Is your proposal to replace that note with your suggested text? Or move that normative? My main concern is that all normative statements have to be well-defined. At the end of the day, I can see 3 options to say that out-of-container URLs shouldn't be used:
It seems that @bduga and @iherman suggest approaches 1 or 2. I'm more in favor of 3, but won't die on that hill 😊. |
Right, this is already disallowed by epubcheck and has been for as long as I can remember. If you use a relative path outside the container, you'll get an error from epubcheck that the resource could not be located. Where I agree with @bduga is in the severity. With the additional processing for reading systems, do we need to still reject EPUBs with these paths? Authors should not use relative references to files outside the container and Reading Systems must not allow access to them seems like appropriate security. I believe what @bduga is proposing is that instead of another algorithm we point to the one that has already been developed. Doesn't that overcome the ambiguity problem? |
I am a fool, because I did not read through the text and did not remember about that note being already present at the end of §6.1.5. Sigh. I guess the only (very minor) difference between what is there and the text I wrote down is to make it clear that the security leak is avoided and, also, that the relative URL, when parsed, will refer to a different place than what the author thought it would be. Nothing major, so I am indeed a fool:-). I would be o.k. turning that Note into a proper text, with the "should avoid" turned into "SHOULD avoid". That means there is an expectation for, say, epubcheck to report the problem if it occurs (and epubcheck already does that), and the ambiguity stays, in my view, within acceptable bounds. I am a bit wary indeed to go into something mathematically 100% correct (and we are still not sure what it means) for very much a side issue after all. If we want to make one step more, we could use the original algorithm in your proposal but as a note, making it clear that this is what we mean, acknowledging that in some cases, as @P5music notes, it is not absolutely correct. |
Right. But post-#1898, and assuming we do not re-instate that normatively, EPUBCheck would either no longer report that, or would deviate from the spec (which I'd argue is a bad thing).
Yeah, maybe. I'd like to see first if a less-ambiguous approach is reasonable.
That would be my preference.
The not-so-well-defined part is "the base URL defined in its usage context". Usually it is the content URL of the document where the URL string appears, but sometimes it can be overridden (e.g. in HTML with the |
So far we do not say that "Authors should not use relative references to files outside the container". This issue is precisely trying to say that, explicitly. The severity is not critical for conforming RS, but based on quick-and-dirty testing it would appear at least a couple big RS are non-conforming, so I'd say it is important to add an authoring safeguard.
I'm not sure I understand the proposal here. Can you please clarify? |
Right, I think we're in agreement that this needs stating. That was how I read this quote from Brady:
which continues on to say:
Maybe I need to go back and read what's in the processing, but doesn't it prevent a relative path from resolving to a level above the container root by having the custom domain to parse against? This should prevent reading systems from resolving outside the container. That explains why authors should not use them and why the recommendation against exists. Whether we need to define a new algorithm to let tools like epubcheck determine if the path leads outside the container sounds like it could be more of an implementation detail. If there's more than one way to do it, should we define a required way of enforcing, or are you anticipating this is only informative? |
Yes, that is the reason I was proposing to turn what today is a note into a clear text with a SHOULD NOT. The only difference would be then, if my understanding is correct, that epubcheck would report a warning and not an error. Which I think is fine.
We are converging... :-)
You are right about the "conceptually" not being spec-level language. To make it clear, what I originally proposed is simply to use the text that is in the current spec as a note (quoted in #1912 (comment)) and not what I wrote in #1912 (comment). The language of the former is probably better. B.t.w., the note that the base element is discouraged. With that in mind, and with the fact that we are talking about a non-normative text, I could imagine saying that the usage context is either the content URL of the document or the container root URL (if we are in a package document), with a separate mention that the base element may complicate things further (or not to mention it altogether, the text in the spec already refers to the problems of relative URLs). At that point, it is probably better to be comprehensible rather than mathematically fully correct... |
Correct.
Yes. I think I understand what you're suggesting now: that the authoring requirement be a SHOULD NOT instead of a MUST NOT, as a conforming RS will already ensure the URL is parsed in a safe way. Correct?
Yes, after last week's WG +1ed the idea, I assumed we were at that implementation detail 😊
I'm not sure I understand the above. I'm suggesting we normatively define as precisely as possible what is an out-of-container URL string, and that we normatively recommend authors not to use them (SHOULD NOT is fine, as far as I'm concerned). |
Yes, it seems we are 😅
Yeah, not quite sure that "more double-dot path segments than needed" is much better.
I agree. Especially when we explain the intent in a side explanatory note. |
Just a note:
one could imply that it has to include the container folder filename itself. Indeed the usage context has to be normalized before applying the algorithm. If the check wants to assess the usage context it has to have it already, and it is likely it is a file:/// reference like in internal representations. Hence there is an underlying filesystem approach that is implicit in the apparently general relative-URL check approach. It sounds strange to me. So in the end this needs the entire subpath to be known, including the epub container folder and, when you have a relative url inside ch01.html you can check this against the http://example.org/mobydick The mobydick folder was certainly created by the system. So you could replace A with the folder name the system certainly knows. |
But the container root folder does not have a file name. Or rather, by definition (implicitly) its name is the empty string.
The URL parser does what I understand you call the normalization. We do not have to specify how this normalization works.
The URL of a container file, that is basically what I called "usage context" in the simplified algorithm above, is defined by #1898.
Many implementations are not using
I think I see what you mean by "underlying filesystem approach": URLs in EPUB do not have opaque paths but are path segments separated by
Yes, that's what the algorithm proposes to do, except that it uses an arbitrary "
No, it's not. The root URL is implementation-specific.
Again, the "system" does not know of any "folder name" to use for the root directory. Our specification does not say that. There's a lot of assumptions there.
Yeah, unfortunately assuming specification readers "know what is meant" is always very hazardous when writing specifications. The more explicit, the better. |
@rdeltour -You say that many systems have an http:// approach instead of having the file:/// approach. -I think that the usage context is not the base URL in
It refers to the fact that when a HTML file is loaded, the relative urls in references are interpreted as relative to the folder where the HTML file resides. Believe me, browsers and webviews act this way. -if the container root url is implementation-specific, So one thing is serving the ePub content via http://, another thing is the checks the system does on the underlying filesystem structure. You are right that some constraints have to be enforced in specs about the out-of-container URLs, Regards |
Unfortunately it is not true. Also, it is not because a concrete file exists that any of the components uses a The Web is a complex machine, and one person can only have a partial understanding of how it exactly works. So in discussions like this one, it's best to not make assumptions.
With all due respect, I prefer to not simply "believe you" here, but to trust what is specified in the HTML standard 😊.
I'm not sure who "them" is here. I'm not sure what part of the algorithm "hard work" refers to, nor "clever check". The algorithm basically says (to simplify)
That algorithm is rather unambiguous. The admittedly vaguely-defined part is that the base URL used to parse a URL string is defined by the host language, so it's not always defined as the URL of the document due to overriding mechanisms (e.g. the
what's a "normalized tree"? how do you "normalize"? what exactly do you test a URL string is leaking or not?
Again, it's not always the case, and the EPUB specification does not mandate that. So it cannot be assumed.
It is not "strange", it is a well-known, arbitrary, willfully non-conforming, root URL to test out-of-container URLs.
Who's "they"? What if they do not create a folder? What if the folder is not exposed in the URL?
What I'm proposing has nothing to do with serving EPUB content. I'm proposing an explicit way to statically evaluate what URLs are conforming or not.
OK, we agree. That is the very essence of this issue (#1912).
It may be "just and editorial prescription", and yet we're on the 31st comment to decide how to "just word it". My main point is precisely that "only self contained archives with no out-of-container relative URLs are genuine ePub publications" is not specification-level precise language. You're obviously free to disagree, but I won't change my mind on that 😊. |
First I want to say that I am not a native English speaker so some Italian nuances are under the words in English, and it is likely that something I write can be felt as more "strong" than it really is in Italian. I realized that I made some assumptions, and I also see that you are also doing the same. |
No problem. I understand the language barrier is real (all the more as I'm not a native English speaker myself). In fact it's one of the reasons why I insist (hopefully not too pedantically) on using standard and well-defined terminology whenever possible 😊. I for one think all contributions and comments are welcome! 😊 My understanding of your position is that the definition of "out-of-container URL string" need not be formally specified and can be described in prose using commonly accepted concepts. We agree to disagree here, but the point is made! 👍 |
Right, I like when the reading system does the right thing so that security isn't dependent on authors behaving or following validation rules. I think we can be a bit more relaxed when this is the case.
What I mean is how exactly you implement the check is an implementation-specific detail that we don't necessarily need to enforce. I'm wary of writing required processing for aspects of the spec that aren't critical to always process in a single way. Take the obfuscation algorithm for example. It defines pseudo-code for the algorithm, but it's not a requirement that you must translate that pseudo-code into your implementation. Is the outcome here the same, that we exemplify how to process URLs, or is this a requirement? If it's a requirement, what fails if this algorithm isn't followed? As a silly example, if an author just eyeballs their URLs and figures out in their head that they are inside the container, are said users invalid to the EPUB specs? 😛 |
Totally agree 👍
Oh no, implementations certainly do not have to exactly follow the algorithm! As you say, an author is perfectly authorized to review URLs by looking at their markup. Even a tool like EPUBCheck may not follow that exact same algorithm. But the algorithm is used to say what's what. For instance, if someone creates a bug report on EPUBCheck, saying it falsely reports a URL string that is supposed to be conforming, we can refer to the normative algorithm to check if it's a bug indeed, or if the URL is really non-conforming. |
Btw, here's what Infra says:
In our case, eyeballing the URLs would likely be more performant for an author than mentally following the algorithm for each URL. For a tool like EPUBCheck, this might be checked at URL-parsing time, by raising an error if the parser tries to shorten an already empty path. As long as URLs are detected as non-conforming like the algorithm would, we're good. |
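As a hedged illustration of the "raise an error at URL-parsing time" idea, here is a simplified Python resolver that fails when it is asked to shorten an already empty path. It is not the URL Standard parser: `resolve_in_container`, `OutOfContainerError`, and the `document_path` parameter are hypothetical names, paths are plain slash-separated strings relative to the root directory, and query, fragment, percent-encoding, and base overrides are ignored.

```python
class OutOfContainerError(ValueError):
    """Raised when a relative path tries to climb above the root directory."""


def resolve_in_container(document_path: str, relative_path: str) -> str:
    # Start from the directory of the document where the URL string occurs.
    path = document_path.split("/")[:-1]
    for segment in relative_path.split("/"):
        if segment == ".":
            continue
        if segment == "..":
            if not path:
                # The path is already empty: shortening it would leave the container.
                raise OutOfContainerError(
                    f"{relative_path!r} leaks out of the container "
                    f"when used in {document_path!r}"
                )
            path.pop()
        else:
            path.append(segment)
    return "/".join(path)


print(resolve_in_container("EPUB/xhtml/ch01.xhtml", "../images/cover.png"))
# EPUB/images/cover.png
```

Any other implementation that flags the same URL strings would be equally fine; the point is only that the check can fall out of normal path resolution.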
I am sorry if I missed something. Are we talking about URLs in container.xml? Or, are we talking about URLs in EPUB publications including those in content documents? |
The latter. |
Definitions
By out-of-container relative URL string, I mean a relative URL string that has a number of double-dot path segments ("..") high enough to conceptually go outside the container.
For instance, "../../../../EPUB/content.xhtml" is an out-of-container relative URL string given the following container:
Problem
In previous versions of EPUB, the URL definition was unclear (see #1888), but I believe the intent was to disallow them.
In the #1898 proposal, out-of-container URL strings are conforming, but the base URL of the container is defined such that an out-of-container URL string is necessarily parsed into an in-container URL.
For instance, after #1898, the URL string "../../../../EPUB/content.xhtml" will be parsed to the same URL as the URL string "EPUB/content.xhtml".
But as we added to a note in the #1898 proposal, using an out-of-container URL string will likely lead to interoperability issues with legacy or non-conforming RS.
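The equivalence between the two URL strings can be checked quickly with Python's `urllib.parse.urljoin`, used here as a stand-in for the URL parser (it implements RFC 3986 resolution, which drops excess dot-dot segments the same way for this case). The root URL `https://epub.example.org/` is an arbitrary assumption for the test; the actual container root URL is implementation-specific.

```python
from urllib.parse import urljoin

# Hypothetical container root URL, chosen only for this illustration.
ROOT_URL = "https://epub.example.org/"

leaking = urljoin(ROOT_URL, "../../../../EPUB/content.xhtml")
plain = urljoin(ROOT_URL, "EPUB/content.xhtml")

# Excess ".." segments cannot climb above the root, so both resolve alike.
print(leaking)
print(plain)
print(leaking == plain)
```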
In addition, as I said earlier, I believe the intent in previous versions of EPUB was to disallow them.
Proposal
I think we should forbid out-of-container URL strings.
Here's a proposal (assuming #1898 is merged).
Replace:
by something along the lines of:
Explanation
The proposal above intends to override the URL standard definition of relative-URL string, so that:
[...] "/" are not allowed)
[...] ".." path segments to go "outside" the container are not allowed)
The intent is that even if we refer to a broader "category" of URL strings, like a relative-URL-with-fragment string, our restrictions on relative-URL string apply.
In some way, it is monkey patching the URL standard definition. Monkey patches are usually not considered a good thing. But I do not see how to do otherwise: for the document formats we own (e.g. Package Document), we can easily define what is a valid URL string; but for other formats used in EPUB (e.g. HTML), they directly refer to the URL standard so I don't see an alternative to tweaking the definition.
Editorial consequences
We will be able to replace all our use of:
We may no longer need to assume the properties of the container root URL in the core spec, as they really only apply to out-of-container URLs.
We still need those in the RS spec, to specify how reading systems must process non-conforming URLs.