
should we disallow "out-of-container" relative URLs? #1912

Closed
rdeltour opened this issue Nov 16, 2021 · 38 comments · Fixed by #1939
Labels
EPUB33 (Issues addressed in the EPUB 3.3 revision) · Spec-EPUB3 (The issue affects the core EPUB 3.3 Recommendation) · Topic-OCF (The issue affects the OCF section of the core EPUB 3 specification)

Comments

@rdeltour
Member

Definitions

By out-of-container relative URL string, I mean a relative URL string that contains enough double-dot path segments ("..") to conceptually point outside the container.

For instance, "../../../../EPUB/content.xhtml" is an out-of-container relative URL string given the following container:

/
├── mimetype
├── META-INF
│   └── container.xml
└── EPUB
    └── content.xhtml

Problem

In previous versions of EPUB, the URL definition was unclear (see #1888), but I believe the intent was to disallow them.
In the #1898 proposal, out-of-container URL strings are conforming, but the base URL of the container is defined such that an out-of-container URL string is necessarily parsed into an in-container URL.

For instance, after #1898, the URL string "../../../../EPUB/content.xhtml" will be parsed to the same URL as the URL string "EPUB/content.xhtml".
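To make this concrete, here is a small Python sketch. It uses `urllib.parse.urljoin`, whose RFC 3986 resolution matches the WHATWG URL parser for this case, and `"https://example.org/"` is just a stand-in for the container root URL (the real value is chosen by the Reading System):

```python
from urllib.parse import urljoin

# Stand-in for the container root URL; the real value is RS-chosen.
CONTAINER_ROOT = "https://example.org/"

# Excess ".." segments are discarded once the path reaches the root,
# so the out-of-container string resolves to the same URL as the plain one.
leaky = urljoin(CONTAINER_ROOT, "../../../../EPUB/content.xhtml")
plain = urljoin(CONTAINER_ROOT, "EPUB/content.xhtml")
print(leaky)           # https://example.org/EPUB/content.xhtml
print(leaky == plain)  # True
```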

But as we noted in the #1898 proposal, using an out-of-container URL string will likely lead to interoperability issues with legacy or non-conforming RS.
In addition, as I said earlier, I believe the intent in previous versions of EPUB was to disallow them.

Proposal

I think we should forbid out-of-container URL strings.

Here's a proposal (assuming #1898 is merged).
Replace:

In the OCF Abstract Container, when a file uses a URL string to reference another file in the container, the string MUST be a path-relative-scheme-less-URL string, optionally followed by U+0023 (#) and a URL-fragment string.

by something along the lines of:

In the OCF Abstract Container, a relative-URL string MUST be a container relative URL string.

A URL string url is a container relative URL string if it is a path-relative-scheme-less-URL string and the following steps return true:

  1. let testURLRecord be the result of applying the URL parser to url with "https://example.org/A/".
  2. let testURLString be the result of applying the URL Serializer to testURLRecord.
  3. if testURLString does not start with "https://example.org/A/", then return false.
  4. set testURLRecord to the result of applying the URL parser to url with "https://example.org/B/".
  5. let testURLString be the result of applying the URL Serializer to testURLRecord.
  6. if testURLString does not start with "https://example.org/B/", then return false.
  7. Return true.
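Translated into code, the steps above might look like this sketch. Python's `urljoin` stands in for the URL parser plus serializer, and the precondition that url is a path-relative-scheme-less-URL string is not re-checked here:

```python
from urllib.parse import urljoin

def is_container_relative(url: str) -> bool:
    """Return True if resolving url never escapes either test base."""
    for base in ("https://example.org/A/", "https://example.org/B/"):
        # Steps 1-6: parse against the base, serialize, check the prefix.
        if not urljoin(base, url).startswith(base):
            return False
    return True
```

For example, `is_container_relative("EPUB/content.xhtml")` is true, while `is_container_relative("../../../../EPUB/content.xhtml")` is false; the second base is what catches strings like "../../A/doc.xhtml" that escape the container and then re-enter a directory named A.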

Explanation

The proposal above intends to override the URL standard definition of relative-URL string, so that:

  • scheme-relative-URL strings and path-absolute-URL strings are not allowed (in other words, URL strings starting with "/" are not allowed)
  • "exceeding" or "leaky" URL strings are not allowed (in other words, URLs with enough ".." path segments to go "outside" the container are not allowed)

The intent is that even if we refer to a broader "category" of URL strings, like a relative-URL-with-fragment string, our restrictions on relative-URL strings apply.

In some way, it is monkey patching the URL standard definition. Monkey patches are usually not considered a good thing, but I do not see how to do otherwise: for the document formats we own (e.g. the Package Document), we can easily define what a valid URL string is; but other formats used in EPUB (e.g. HTML) refer directly to the URL standard, so I don't see an alternative to tweaking the definition.

Editorial consequences

We will be able to replace all our use of:

path-relative-scheme-less-URL string, optionally followed by U+0023 (#) and a URL-fragment string
by
relative-URL-with-fragment string (which is a bit more readable).

We may no longer need to assume the properties of the container root URL in the core spec, as they really only apply to out-of-container URLs.
We still need those in the RS spec, to specify how reading systems must process non-conforming URLs.

@mattgarrish added the Topic-OCF label Nov 16, 2021
@mattgarrish
Member

In previous versions of EPUB, the URL definition was unclear (see #1888), but I believe the intent was to disallow them.

The parsing may have been unclear, but I wouldn't say the intent was. From 3.2:

All relative IRI references MUST resolve to resources within the OCF Abstract Container (i.e., at or below the Root Directory).

which is currently:

All relative-URL-with-fragment strings [URL] MUST, after parsing to URL records [URL], identify resources within the OCF Abstract Container (i.e., at or below the Root Directory).

I can't find this now, or anything equivalent, in the proposal for #1888, so restoring it in some form would seem appropriate.

There's also a question of whether file: paths should be allowed, since they also let you reach outside the container. Right now, they're only discouraged in the package document via a restriction on the href attribute. Should this restriction be generalized, too?

@rdeltour
Member Author

The parsing may have been unclear, but I wouldn't say the intent was. From 3.2:

All relative IRI references MUST resolve to resources within the OCF Abstract Container (i.e., at or below the Root Directory).

I guess I'm sorta confused by the "i.e.": that the URL virtually represents something at or below the root directory doesn't mean the resource exists.

That "existence" issue is yet another one, and only tangentially related to the current issue of out-of-container URL.

In the example above, and assuming #1898, "../../../EPUB/content.xhtml" does resolve after parsing to an existing in-container resource. The proposal is precisely to disallow that kind of URL.

I can't find this now, or anything equivalent, in the proposal for #1888, so restoring it in some form would seem appropriate.

Yeah, I had raised that point in #1888 (comment), but it went unnoticed in the flow of the discussion 😅; copying it here:

I removed the paragraph saying "All relative URLs MUST, after parsing, identify resources within the OCF Abstract Container (i.e., at or below the Root Directory)(…) I don't know if it was meant to explicitly require that resources exist? If that's the case, then we can reword it along the lines of: "All relative-URL-with-fragment strings MUST, after parsing, be equal to the container URL of an existing file in the OCF Abstract Container."

Re:

There's also a question of whether file: paths should be allowed

Right. That's another issue 😊

@iherman
Member

iherman commented Nov 19, 2021

The issue was discussed in a meeting on 2021-11-19

List of resolutions:


3. should we disallow "out-of-container" relative URLs? (issue epub-specs#1912)

See github issue epub-specs#1912.

Dave Cramer: there is a related problem of ../ repeated until it leaks out of the epub.
… should we just disallow such URLs?.
… i think yes.

Ivan Herman: isn't it correct that we don't need to do anything about this problem because what we just merged avoids any sort of security issue.
… this sounds like adding a "good practice" to the text, but it's an extra measure to guard against RS that don't follow what we merged in #1898.

Romain Deltour: this would add an authoring conformance requirement for URLs.
… such leaky URLs will likely create interop issues with legacy/non-conforming RS. Indeed, the result of #1898 is that these two are identical:

<item href="../../../../EPUB/content.xhtml"/>
<item href="EPUB/content.xhtml"/>

Romain Deltour: but to avoid conforming but non-interop friendly epubs, we just ban such URLs.
#1898 makes the two URLs above equivalent, but i'm doubtful that in practice all RS will handle this properly.

Matt Garrish: an epub2 RS must still be able to open epub3. And because of that it makes sense to keep things consistent from an authoring perspective.

Rick Johnson: seems like this obligates a change in epubcheck? How does that change happen?.

Romain Deltour: epubcheck will be updated to implement epub 33, don't worry.

Rick Johnson: thank you!

Dave Cramer: alignment of epub 33 spec and epubcheck will be smooth, thanks to mgarrish and romain.

Ivan Herman: there is already an alpha version of epubcheck for epub spec 33.

Romain Deltour: See Epubcheck alpha release for 3.3.

Dave Cramer: i think we just disallow out of container URLs.

Romain Deltour: in the issue i propose an algorithm to identify what is an out of container URL string.
… please comment in issue if you spot error there.

Proposed resolution: disallow "out-of-container" relative URLs. (Dave Cramer)

Dave Cramer: +1.

Rick Johnson: +1.

Ben Schroeter: +1.

Matt Garrish: +1.

Dan Lazin: +1.

Victoria Lee: +1.

Matthew Chan: +1.

Ivan Herman: +1.

Romain Deltour: +1.

Masakazu Kitahara: +1.

Bill Kasdorf: +1.

Toshiaki Koike: +1.

Wendy Reid: +1.

Murata Makoto: +1.

Resolution #3: disallow "out-of-container" relative URLs.

@iherman
Member

iherman commented Nov 22, 2021

Before getting into a PR... and looking at

A URL string url is a container relative URL string if it is a path-relative-scheme-less-URL string and the following steps return true:

  1. let testURLRecord be the result of applying the URL parser to url with "https://example.org/A/".
  2. let testURLString be the result of applying the URL Serializer to testURLRecord.
  3. if testURLString does not start with "https://example.org/A/", then return false.
  4. set testURLRecord to the result of applying the URL parser to url with "https://example.org/B/".
  5. let testURLString be the result of applying the URL Serializer to testURLRecord.
  6. if testURLString does not start with "https://example.org/B/", then return false.
  7. Return true.

I presume the reason we duplicate the same steps with /A/ and /B/ is to avoid a peculiar situation where the specific URL sort of 'collides' with the /A/ path segment and thereby creates a false positive. I just wonder whether it is not simpler to say something like

A URL string url is a container relative URL string if it is a path-relative-scheme-less-URL string and the following steps return true:

  1. let topdir be an arbitrary File Name
  2. let testRoot be the result of parsing topdir with https://example.org/ as a base
  3. let testURLRecord be the result of applying the URL parser to url with testRoot as base.
  4. let testURLString be the result of applying the URL Serializer to testURLRecord.
  5. if testURLString does not start with testRoot, then return false.
  6. otherwise return true.

@rdeltour
Member Author

I presume the reason we duplicate the same steps with /A/ and /B/ is to avoid a peculiar situation where the specific URL sort of 'collides' with the /A/ path segment and thereby creates a false positive.

Yes, exactly 👍

I just wonder whether it is not simpler to say something like

  1. let topdir be an arbitrary File Name

what if the chosen name collides with the tested relative URL?

The A/B proposed algorithm was the simplest and most explicit I could find. I agree it feels a bit hacky 😊. Another approach could be to loop through path segments, using a counter to make sure dot-dot segments are balanced by other path segments, which would avoid using a test URL altogether. That may be more elegant, although a tad more complex.
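A rough sketch of that counter idea (illustrative only: it strips the fragment and query, and does not handle percent-encoded dot segments like "%2e%2e", which the WHATWG parser also treats as double-dot):

```python
def dotdot_balanced(url: str) -> bool:
    """Walk the path segments, tracking depth; a ".." that would take
    the depth below zero means the URL escapes its starting directory."""
    path = url.split("#", 1)[0].split("?", 1)[0]  # drop fragment and query
    depth = 0
    for segment in path.split("/"):
        if segment == "..":
            depth -= 1
            if depth < 0:
                return False
        elif segment not in ("", "."):
            depth += 1
    return True
```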

@iherman
Member

iherman commented Nov 23, 2021

I presume the reason we duplicate the same steps with /A/ and /B/ is to avoid a peculiar situation where the specific URL sort of 'collides' with the /A/ path segment and thereby creates a false positive.

Yes, exactly 👍

I just wonder whether it is not simpler to say something like

  1. let topdir be an arbitrary File Name

what if the chosen name collides with the tested relative URL?

I am not sure what you mean. By the way, we could also simplify by simply using ASCII instead of File Name, if that works otherwise.

I can of course add a note to the algorithm making it explicit why this double dip...

@iherman
Member

iherman commented Nov 23, 2021

I just wonder whether it is not simpler to say something like

  1. let topdir be an arbitrary File Name

what if the chosen name collides with the tested relative URL?

I am not sure what you mean. By the way, we could also simplify by simply using ASCII instead of File Name, if that works otherwise.

Oops, you mean the same way as for 'A' or 'B' above, right?

Maybe it becomes o.k. by changing the description to make it clear that the decision should be identical for any chosen File Name.

@rdeltour
Member Author

Oops, you mean the same way as for 'A' or 'B' above, right?

Yes 😊

Maybe it becomes o.k. by changing the description to make it clear that the decision should be identical for any chosen File Name.

Yeah, that's the idea, but it's an algorithm: it has to describe explicitly how to test that. Testing for /A/, then for /B/, means it works for any chosen name.
If we just say "for any chosen file name", I'm afraid the algorithm isn't precise enough.

@rdeltour
Member Author

So I asked Anne for a quick sanity check, and he says the approach is reasonable (both the definition override and the algorithm).

He did suggest that we could leave that to a note, since our processing requirements make the point moot for conforming RS. But given that we normatively forbade such out-of-container URL strings in previous EPUB versions, and since they do create interop issues with current RS, I believe it's better to normatively forbid them.

@P5music

P5music commented Nov 23, 2021

Excuse me, could you explain why the /A/ and /B/ redundancy in the above algorithm?
Is not /A/A unique (not colliding) in any circumstance? If yes, why not just use the base URL without the /A or /B check?

And I also would like to understand whether the algorithm also applies to relative URLs inside any-depth-level XHTML files (in such a case, the algorithm seems not to be reliable for assessing whether URLs are leaking or not) or just to spine-level references.
Thanks in advance

@iherman
Member

iherman commented Nov 23, 2021

@rdeltour, I yield :-). But I will do the PR only when the current set of PRs (or at least most of them) get merged. Some of those PR-s are very large and go all over the place; I am afraid to get into merge hell if I do the PR on main right now...

@rdeltour
Member Author

Excuse me, could you explain why the /A/ and the /B/ redundancy in the above algorithm? Is not /A/A unique (not colliding) in any circumstance? If yes, why not just use the base url without the /A or /B check?

This is to avoid any false negative. Take the URL string "../../A/doc.xhtml". The result of parsing it with "http://example.org/A/" as base starts with "http://example.org/A/"; so we need the extra check that the result of parsing it with "http://example.org/B/" as base does not start with "http://example.org/B/" to detect that it is an out-of-container URL.
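This is easy to reproduce (Python's RFC 3986-based `urljoin` agrees with the WHATWG parser on these inputs):

```python
from urllib.parse import urljoin

url = "../../A/doc.xhtml"  # escapes the container, then re-enters "A"
with_a = urljoin("http://example.org/A/", url)
with_b = urljoin("http://example.org/B/", url)
print(with_a)  # http://example.org/A/doc.xhtml -- the leak is invisible
print(with_b)  # http://example.org/A/doc.xhtml -- fails the /B/ prefix test
```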

And I also would like to understand whether the algorithm applies also to relative urls inside any-depth-level XHTML files (in such a case the algorithm seems not to be reliable to assess whether urls are leaking or not-leaking) or just to spine level references. Thanks in advance

Yes, the restrictive definition of relative-URL string intends to apply everywhere within the EPUB container. I believe a note will be helpful in clarifying that.

@P5music

P5music commented Nov 23, 2021

@rdeltour

I think that the above-mentioned algorithm is not good for relative URLs found in some deeper XHTML file, because relative URLs are relative to the path of the file itself.
Example:
../images/image1.png
could be found in an XHTML file in the myfolder/ directory, where myfolder/ and images/ are at the same level.
This is the simplest case, but imagine an XHTML file in a nested folder: its relative URLs could point to resources inside the folder itself, or .., or ../.., or even a subfolder.
The relative URLs should be composed with the full tree before being attached to the http://www.example.org/A base URL to be checked against.
So one example could be
../../images/image1.png
but that would be relative to the folder containing the XHTML file, so
the algorithm would process
http://www.example.org/A/../../images/image1.png
which is clearly flagged as leaking because the check starts at a deeper level; but through no fault of the URL, it does not really leak.
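To illustrate the point with the resolver itself (the nested file path here is hypothetical):

```python
from urllib.parse import urljoin

root = "http://www.example.org/A/"
url = "../../images/image1.png"

# Resolved against the root, the URL appears to leak out of /A/...
against_root = urljoin(root, url)
# ...but resolved against the (hypothetical) nested file that contains it,
# it stays inside the container.
file_url = root + "part1/chapter1/doc.xhtml"
against_file = urljoin(file_url, url)
print(against_root)  # http://www.example.org/images/image1.png
print(against_file)  # http://www.example.org/A/images/image1.png
```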

About false negatives, I think you are underestimating all the possible "strange cases", for example A/B forms existing in the original relative URL with a sufficiently complex ".." sequence that could trick the algorithm when flattened.

@rdeltour
Member Author

I think that the above-mentioned algorithm is not good for relative URLs found in some deeper XHTML file, because relative URLs are relative to the path of the file itself.
Example:
../images/image1.png
could be found in an XHTML file in the myfolder/ directory, where myfolder/ and images/ are at the same level.
This is the simplest case, but imagine an XHTML file in a nested folder: its relative URLs could point to resources inside the folder itself, or .., or ../.., or even a subfolder.
The relative URLs should be composed with the full tree before being attached to the http://www.example.org/A base URL to be checked against.

Very good point 🤦‍♂️

I guess the algorithm needs to be tweaked, either by saying that we use the test URLs as the URL of the root directory, or by somehow constructing a test base with the path of the file where the URL occurs.

Of course, this becomes more complicated when the underlying document format allows overriding the base URL (like in HTML with the base element). I'm not quite sure how to write the algorithm now. 🤔

@bduga
Collaborator

bduga commented Nov 23, 2021

Do we really need to define an algorithm at all? We could just mention that container relative URL strings should not reference files outside of the abstract container, and that if one does, such URLs will not work, and maybe point them to the new section on URL processing for more details. That way existing content will not suddenly start failing epub check. I am worried we are making rejection due to epub check failures an even greyer area. That is, if we forbid such URLs and epub check returns an error, can an RS reject that epub? Aren't they also required to process such URLs? It seems like we are tackling something that is difficult to define, introducing backwards-compatibility problems, and potentially making RS pipelines more complex, and it isn't really clear what we are gaining.

@P5music

P5music commented Nov 23, 2021

@rdeltour

About the algorithm
I think that the references inside an EPUB 3 can be absolute HTTP URLs or relative URLs.
Both can have fragments, so they are URLs, and not like filesystem entries.
That is your claim, if I understand correctly.

Maybe there is a solution:
for the URL check's sake,
a relative URL can undergo this processing (instead of the A/B check):
- strip the fragment, so that the URL is now a filesystem-like entry, in the form of a relative path
- process it as a filesystem entry, that is, using the filesystem API methods that are available in every system and language; this works both on web servers and personal devices
- the result is also valid as a URL if the fragment is restored at the end and the complete relative segment is appended to the root URL, for example http://www.example.org/ or real filesystem paths like file:///mainFolder/folder/subfolder/

About the BASE tag:
I think no EPUB is using it, so is it to be disallowed too?
And isn't it usually absolute? So it is unlikely that some EPUB author puts a relative URL in a BASE tag.
If it is absolute, it could be allowed when some remote resources have a certain origin, like http://www.myepubresourcesite.com/epubresources/mobydick/bonusimages
If it is also allowed in the relative form, the check has to prepend it to the relative URLs in the XHTML file that contains the BASE tag.
However, maybe it is better to forbid the relative form,
because
the file:///mainFolder/folder/subfolder form is likely to be the internal representation in webviews and browsers.
And references could contain something like image.png but also /image.png; the latter means "relative to the base URL at top level" in HTML, but it could mean something I do not know on filesystems.

@bduga
I think that relative URLs are already checked in EPUBs not to be leaking, at least because a leak would mean that the EPUB has an error in it and cannot be viewed correctly.
Also, RSs are likely to be already checking for leaks.
If this is explicitly said, that is good.
The proposed algorithm just tries to say explicitly what is likely already being done.
The URL-standard-compliant check seems reasonable, for example because there are fragments, so these are URLs and not pure filesystem paths (see also above).
But the proposed algorithm has issues.

So to summarize:
I think it would be enough to express the above-mentioned compliance needs "in words", as they are likely already part of EPUB producers' and RSs' knowledge and concerns.

Regards

@rdeltour
Member Author

Do we really need to define an algorithm at all?

I'm not sure I see any alternative option, except maybe some kind of syntactic definition.

We could just mention that container relative URL strings should not reference files outside of the abstract container

That's the idea, indeed. But "reference files outside of the abstract container" isn't well-defined.

Before #1898, it was under-specified. After #1898, both "../../EPUB/doc.xhtml" and "EPUB/doc.xhtml" do refer to the same file, inside the container.

That way existing content will not suddenly start failing epub check.

I don't think a lot of existing content will suddenly start failing EPUBCheck if the proposal in this issue becomes normative. Again, it was already the intention behind the prose of previous EPUB versions, so it's likely already somewhat (incompletely) implemented in EPUBCheck. I'll need to run some more tests to back this up.

I am worried we are making rejection due to epub check failures an even greyer area.

I'm not sure I fully understand your concerns, do you have a concrete example?

That is, if we forbid such URLs and epub check returns an error, can an RS reject that epub? Aren't they also required to process such URLs?

Yes, the proposal in this issue is an authoring requirement. Basically, the idea is to forbid such URLs, formalizing what has always been stated in EPUB (but not very precisely).
EPUBCheck would reject those.
Currently we do not say if an RS must process such invalid URL strings or may process invalid URL strings. I think the latter makes more sense.

It seems like we are tackling something that is difficult to define, introducing backwards compatibility problems, and potentially making RS pipelines more complex, and it isn't really clear what we are gaining.

We're gaining better interoperability by staying closer to previous versions of EPUB.
I don't think it introduces any backward compatibility issue.
But it is somewhat difficult to define, yes 😊

@iherman
Member

iherman commented Nov 24, 2021

I must admit, I am compelled by the arguments of @bduga and @P5music that we may be trying to do too much. On our call, I believe, we did agree that the section on root URL-s (and its counterpart in the RS spec) averts the pressing issue, namely the possible security leak that could happen through an "out-of-container" relative URL. The main argument put forward to add a normative constraint nevertheless was that not all RS-s are conformant, so it is good to have this prevented in the content specification. However, aren't we too cautious? Shouldn't we trust that RS-s will, eventually, come around?

So... what about putting something like the following (inspired by #1922, which was born of a somewhat similar issue about older and newer RS-s) into the spec:

EPUB Creators should use caution when using relative URL-s of the form ../../../...FILENAME. In case of an error, such a URL may refer, conceptually, to a resource outside the container. Although the definition of the URL of the Root Directory avoids such leaks, and the resulting potential security issues, older tools or toolchains may not implement the Root Directory URL according to the specification, which may lead to unintended consequences. As a result, authors SHOULD NOT use such "out-of-container" URL-s.

(This is a somewhat more specific version of @bduga's comment in #1912 (comment).)

@rdeltour
Member Author

The main argument put forward to add a normative constraint nevertheless was that not all RS-s are conformant, so it is good to have this prevented in the content specification. However, aren't we too cautious? Shouldn't we trust that RS-s will, eventually, come around?

The main argument, I think, is that previous EPUB versions disallowed out-of-container URLs. The specification was slightly ambiguous and not super explicit in its definitions, but the intention was pretty clear.

By not saying anything, I'm (slightly) concerned that we open a (tiny) can of worms. Authors can start having EPUBs that pass EPUBCheck but that aren't readable.

For instance, if all the item elements of my Package Documents look like:

<item href="../../../../EPUB/chapter1.xhtml" />

it will pass EPUBCheck successfully (post-#1898), but I doubt the resulting EPUB will actually be readable in any of today's reading systems.

Now, the risk is mitigated by the fact that no one would likely start using such URLs on purpose. But still.

So... what about putting something like the following (inspired by #1922, which was born of a somewhat similar issue about older and newer RS-s) into the spec (…)

The current working draft already has a note that says:

The properties of the container root URL are such that whatever the amount of double-dot path segments in a URL string (for example, ../../../secret), it will be parsed to a content URL (and not "leak" outside the container). However, for better interoperability with non-conforming or legacy Reading Systems, EPUB Creators should avoid using more double-dot path segments than needed to reach the target container file.

Is your proposal to replace that note with your suggested text? Or to make it normative?

My main concern is that all normative statements have to be well-defined.
I would argue that "a URL that conceptually refers to a resource outside the container" is not well-defined, and cannot be used for a normative statement.

At the end of the day, I can see 3 options to say that out-of-container URLs shouldn't be used:

  • we say that in a note.
    • pro: easy to edit
    • con: we open a little can of worms, as EPUB 3.3 becomes more permissive than EPUB 3.2.
  • we say that as a vaguely-defined normative statement
    • pro: closer alignment to EPUB 3.2
    • con: our spec has some vaguely-defined statement
  • we say that as a well-defined normative statement
    • pro: closer alignment to EPUB 3.2 (same as above) + well-defined spec language
    • con: more difficult to define, less readable for laypeople

It seems that @bduga and @iherman suggest approaches 1 or 2. I'm more in favor of 3, but won't die on that hill 😊.

@mattgarrish
Member

The main argument, I think, is that previous EPUB version disallowed out-of-container URLs.

Right, this is already disallowed by epubcheck and has been for as long as I can remember. If you use a relative path outside the container, you'll get an error from epubcheck that the resource could not be located.

Where I agree with @bduga is on the severity. With the additional processing for reading systems, do we need to still reject EPUBs with these paths? "Authors should not use relative references to files outside the container, and Reading Systems must not allow access to them" seems like appropriate security.

I believe what @bduga is proposing is that instead of another algorithm we point to the one that has already been developed. Doesn't that overcome the ambiguity problem?

@iherman
Member

iherman commented Nov 24, 2021

I am a fool, because I did not read through the text and did not remember that that note was already present at the end of §6.1.5. Sigh. I guess the only (very minor) difference between what is there and the text I wrote is to make it clear that the security leak is avoided and, also, that the relative URL, when parsed, will refer to a different place than what the author thought it would. Nothing major, so I am indeed a fool :-).

I would be o.k. turning that Note into proper text, with the "should avoid" turned into "SHOULD avoid". That means there is an expectation for, say, epubcheck to report the problem if it occurs (and epubcheck already does that), and the ambiguity stays, in my view, within acceptable bounds. I am a bit wary indeed of going for something mathematically 100% correct (and we are still not sure what that means) for what is, after all, very much a side issue.

If we want to make one step more, we could use the original algorithm in your proposal but as a note, making it clear that this is what we mean, acknowledging that in some cases, as @P5music notes, it is not absolutely correct.

@rdeltour
Member Author

@mattgarrish

The main argument, I think, is that previous EPUB version disallowed out-of-container URLs.

Right, this is already disallowed by epubcheck and has been for as long as I can remember. If you use a relative path outside the container, you'll get an error from epubcheck that the resource could not be located.

Right. But post-#1898, and assuming we do not re-instate that normatively, EPUBCheck would either no longer report that, or would deviate from the spec (which I'd argue is a bad thing).

@iherman

I would be o.k. turning that Note into a proper text, with the "should avoid" turned into "SHOULD avoid" (…) the ambiguity stays, in my view, within acceptable bounds

Yeah, maybe. I'd like to see first if a less-ambiguous approach is reasonable.

If we want to make one step more, we could use the original algorithm in your proposal but as a note, making it clear that this is what we mean, acknowledging that in some cases, as @P5music notes, it is not absolutely correct.

That would be my preference.
For instance we could tweak the algorithm to something like (simplifying):

  1. set the container root URL to "http://example.org/A/"
  2. parse url with the base URL defined in its usage context
  3. test that it doesn't start with "http://example.org/A/"

The not-so-well-defined part is "the base URL defined in its usage context". Usually it is the content URL of the document where the URL string appears, but it can sometimes be overridden (e.g. in HTML with the base element).
Also, it's kinda not-Infra to set the "container root URL", as if it were a variable.
So it's not 100% precise or correct, but still better or less ambiguous than "conceptually refers to a resource outside the container".
The "conceptually" explanation is really note-level language, not spec-level language, IMO.
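Under those caveats, the simplified check might be sketched as follows (`leaks` and the `context_base` parameter are illustrative names; computing the context base itself, e.g. honoring an HTML base element, is left to the caller):

```python
from urllib.parse import urljoin

CONTAINER_ROOT = "http://example.org/A/"  # step 1: the chosen test root

def leaks(url: str, context_base: str) -> bool:
    """context_base is the content URL of the document where the URL
    string appears (or whatever base its format defines instead)."""
    # Steps 2-3: parse against the context base, then test the prefix.
    return not urljoin(context_base, url).startswith(CONTAINER_ROOT)
```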

@rdeltour
Member Author

@mattgarrish

Where I agree with @bduga is in the severity. With the additional processing for reading systems, do we need to still reject EPUBs with these paths? Authors should not use relative references to files outside the container and Reading Systems must not allow access to them seems like appropriate security.

So far we do not say that "Authors should not use relative references to files outside the container". This issue is precisely trying to say that, explicitly.

The severity is not critical for conforming RS, but based on quick-and-dirty testing it would appear that at least a couple of big RSs are non-conforming, so I'd say it is important to add an authoring safeguard.

I believe what @bduga is proposing is that instead of another algorithm we point to the one that has already been developed. Doesn't that overcome the ambiguity problem?

I'm not sure I understand the proposal here. Can you please clarify?

@mattgarrish
Member

So far we do not say that "Authors should not use relative references to files outside the container".

Right, I think we're in agreement that this needs stating. That was how I read this quote from Brady:

We could just mention that container relative URL strings should not reference files outside of the abstract container

which continues on to say:

and if it does such URLs will not work, and maybe point them to the new section on URL processing for more details

Maybe I need to go back and read what's in the processing, but doesn't it prevent a relative path from resolving to a level above the container root by having the custom domain to parse against? This should prevent reading systems from resolving outside the container. That explains why authors should not use them and why the recommendation against them exists.

Whether we need to define a new algorithm to let tools like epubcheck determine if the path leads outside the container sounds like it could be more of an implementation detail. If there's more than one way to do it, should we define a required way of enforcing, or are you anticipating this is only informative?

@iherman
Member

iherman commented Nov 24, 2021

Right. But post-#1898, and assuming we do not re-instate that normatively, EPUBCheck would either no longer report that, or would deviate from the spec (which I'd argue is a bad thing).

Yes, that is the reason I was proposing to turn what today is a note into a clear text with a SHOULD NOT. The only difference would be then, if my understanding is correct, that epubcheck would report a warning and not an error. Which I think is fine.

If we want to make one step more, we could use the original algorithm in your proposal but as a note, making it clear that this is what we mean, acknowledging that in some cases, as @P5music notes, it is not absolutely correct.

That would be my preference.

We are converging... :-)

For instance we could tweak the algorithm to something like (simplifying):

  1. set the container root URL to "http://example.org/A/"
  2. parse url with the base URL defined in its usage context
  3. test that it doesn't start with "http://example.org/A/"

The not-so-well-defined part is "the base URL defined in its usage context". Usually it is the content URL of the document where the URL string appears, but sometimes it can be overridden (e.g. in HTML with the base element). Also, it's kinda not-Infra to set "container root URL" as if it were a variable. So it's not 100% precise or correct, but still better or less ambiguous than "conceptually refers to a resource outside the container". The "conceptually" explanation is really note-level language, not spec-level language, IMO.

You are right about the "conceptually" not being spec-level language. To make it clear, what I originally proposed is simply to use the text that is in the current spec as a note (quoted in #1912 (comment)) and not what I wrote in #1912 (comment). The language of the former is probably better.

B.t.w., there is already a note saying that the base element is discouraged. With that in mind, and given that we are talking about non-normative text, I could imagine saying that the usage context is either the content URL of the document or the container root URL (if we are in a package document), with a separate mention that the base element may complicate things further (or not mentioning it at all; the text in the spec already refers to the problems of relative URLs). At that point, it is probably better to be comprehensible rather than mathematically fully correct...

@rdeltour
Member Author

@mattgarrish

Maybe I need to go back and read what's in the processing, but doesn't it prevent a relative path from resolving to a level above the container root by having the custom domain to parse against?

Correct.

This should prevent reading systems from resolving outside the container. That explains why authors should not use them and why the recommendation against exists.

Yes.

I think I understand what you're suggesting now: that the authoring requirement be a SHOULD NOT instead of a MUST NOT, as a conforming RS will already ensure the URL is parsed in a safe way. Correct?
If yes, then I agree, it's perfectly reasonable.

Whether we need to define a new algorithm to let tools like epubcheck determine if the path leads outside the container sounds like it could be more of an implementation detail.

Yes, after last week's WG +1ed the idea, I assumed we were at that implementation detail 😊

If there's more than one way to do it, should we define a required way of enforcing, or are you anticipating this is only informative?

I'm not sure I understand the above. I'm suggesting we normatively define as precisely as possible what is an out-of-container URL string, and that we normatively recommend authors not to use them (SHOULD NOT is fine, as far as I'm concerned).

@rdeltour
Member Author

@iherman

We are converging... :-)

Yes, it seems we are 😅

The language of the [current spec note] is probably better.

Yeah, not quite sure that "more double-dot path segments than needed" is much better.
To be clear, I don't know if some prose could be clear enough or if an algorithmic version is better. As long as a normative version is unambiguous/precise, I'm OK!

(…) At that point, it is probably better to be comprehensible rather than mathematically fully correct…

I agree. Especially when we explain the intent in a side explanatory note.

@P5music

P5music commented Nov 24, 2021

Just a note:
if the usage context of a relative url is taken into account
as in

1 set the container root URL to "http://example.org/A/"
2 parse url with the base URL defined in its usage context
3 test that it doesn't start with "http://example.org/A/"

one could infer that it has to include the container folder filename itself.

Indeed the usage context has to be normalized before applying the algorithm.
It is going to be a strange algorithm because it requires a similar check to be made before the official check.

If the check wants to assess the usage context, it has to have it already, and it is likely a file:/// reference, as in internal representations.
(It can contain ../.. but it is likely that the file:/// form is already normalized.)
In fact, the relative URLs only surface when accessing the XHTML files.

Hence there is an underlying filesystem approach that is implicit in the apparently general relative-URL check approach.

It sounds strange to me.

So in the end this requires that the entire subpath be known, including the ePub container folder
like
mobydick/OEBPS/chapters/firstpart/ch01.html

and, when you have a relative url inside ch01.html
it could be ../../images/cover.jpg
that in normalized (as a path) form is
mobydick/OEBPS/images/cover.jpg

you can check this against the http://example.org/mobydick

The mobydick folder was certainly created by the system
because, if I understand, it is sort of mandatory in the concept of container root url.

So you could replace A with the folder name the system certainly knows.
This way it is clear that the complete tree has to be considered and then the url serializer will normalize the url, flattening the ../.. if any.
But it seems that it can be said "in words" too, because they know what is meant.
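For what it's worth, the "flatten the path, then check" idea described above can be illustrated with Python's posixpath (the paths are hypothetical; this is the filesystem-style view of the problem, not the URL-based check the spec uses):

```python
import posixpath

# Sketch of the "flatten the path, then check" idea (hypothetical paths).
doc_dir = "mobydick/OEBPS/chapters/firstpart"

flattened = posixpath.normpath(posixpath.join(doc_dir, "../../images/cover.jpg"))
print(flattened)  # mobydick/OEBPS/images/cover.jpg -- stays inside "mobydick"

# Too many double-dot segments survive normalization as leading "..",
# which is itself the sign of an out-of-container reference:
leaked = posixpath.normpath(posixpath.join(doc_dir, "../../../../../x.html"))
print(leaked)  # ../x.html
```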

@rdeltour
Member Author

rdeltour commented Nov 25, 2021

one could infer that it has to include the container folder filename itself.

But the container root folder does not have a file name. Or rather, by definition (implicitly) its name is the empty string.

Indeed the usage context has to be normalized before applying the algorithm. It is going to be a strange algorithm because it requires a similar check to be made before the official check.

The URL parser does what I understand you call the normalization. We do not have to specify how this normalization works.

If the check wants to assess the usage context, it has to have it already

The URL of a container file, that is basically what I called "usage context" in the simplified algorithm above, is defined by #1898.
We're assigning a well-known URL to the root directory for the purpose of this algorithm only, precisely so that a URL string has a well-known base URL that we can use for parsing.

and it is likely a file:/// reference, as in internal representations. (It can contain ../.. but it is likely that the file:/// form is already normalized.) In fact, the relative URLs only surface when accessing the XHTML files.

Many implementations are not using file: URLs, but rather a custom scheme or HTTP URLs.

Hence there is an underlying filesystem approach that is implicit in the apparently general relative-URL check approach.

I think I see what you mean by "underlying filesystem approach": URLs in EPUB do not have opaque paths; their paths are made of segments separated by "/". But in a normative specification language, we need precise definitions and operations, and commonly accepted operations on filesystem-like path strings aren't well-defined enough.

you can check this against the http://example.org/mobydick

Yes, that's what the algorithm proposes to do, except that it uses an arbitrary "A" (and "B") "directory" instead of "mobydick".

The mobydick folder was certainly created by the system because, if I understand, it is sort of mandatory in the concept of container root url.

No, it's not. The root URL is implementation-specific.
In fact, if a RS used "http://example.org/A" or "http://example.org/mobydick" for the root URL it would not be conforming; but here we're willfully violating the assumed properties of the root URL, precisely to check that URL strings would not go out of the container in a non-conforming RS.

So you could replace A with the folder name the system certainly knows.

Again, the "system" does not know of any "folder name" to use for the root directory. Our specification does not say that. There are a lot of assumptions there.

But it seems that it can be said "in words" too, because they know what is meant.

Yeah, unfortunately assuming specification readers "know what is meant" is always very hazardous when writing specifications. The more explicit, the better.

@P5music

P5music commented Nov 25, 2021

@rdeltour
I understand your objections but I have something more to my points.

-You say that many systems have an http:// approach instead of having the file:/// approach.
But I think that you cannot find a system that does not have file:/// representations under the hood,
because in fact hosts serve web pages to the clients
but they are just computers, so the whole internet relies on filesystems.
So that's the reality you can count on.

-I think that the usage context is not the base URL in

2 parse url with the base URL defined in its usage context

It refers to the fact that when a HTML file is loaded, the relative urls in references are interpreted as relative to the folder where the HTML file resides. Believe me, browsers and webviews act this way.
This is why I pointed out that the entire path tree has to be calculated for the parsing to be done, before applying the algorithm,
so it is like saying to them "do the hard work, then perform this clever check". Once you have the normalized tree you know if a URL is leaking; no further application of the algorithm is needed.

-if the container root url is implementation-specific,
then it does exist,
and I am confident that there is a folder that contains all the ePub sub-folders and files over that remote server.
So you do not have to provide a strange example like
http://example.org/A
Also I agree that
http://example.org/mobydick is useless too
I mean that mobydick is not like A, I was wrong in my previous message,
it is better
because it is very likely that you can just let them check against whatever folder they have created for the ePub, and check for URL leaks.
And it is like http://mywebsite.com/mobydick

So one thing is serving the ePub content via http://, another thing is the checks the system does on the underlying filesystem structure.
I think they are two separate operations.
The http:// approach is just what will "appear" outside;
the filesystem approach is what really happens.

You are right that some constraints have to be enforced in specs about the out-of-container URLs,
but I think they are just an editorial prescription, like you say
"only self contained archives with no out-of-container relative URLs are genuine ePub publications".

Regards

@rdeltour
Member Author

-You say that many systems have an http:// approach instead of having the file:/// approach. But I think that you cannot find a system that does not have file:/// representations under the hood, because in fact hosts serve web pages to the clients but they are just computers, so the whole internet relies on filesystems. So that's the reality you can count on.

Unfortunately, that is not true.
Some reading systems can serve content to web views using HTTP streaming directly from a zipped EPUB. In that case there is no actual concrete file involved.

Also, even when a concrete file exists, it does not follow that any component uses a file: URL at any point, internally or otherwise.

The Web is a complex machine, and one person can only have a partial understanding of how exactly it works. So in discussions like this, it's best not to make assumptions.
Relying on standards and their terminology (HTML standard, URL standard, Fetch, etc.) that are stable and close to implementations is a good way to leave out the assumptions and have a common understanding of what we're discussing.

-I think that the usage context is not the base URL in

  1. parse url with the base URL defined in its usage context

It refers to the fact that when a HTML file is loaded, the relative urls in references are interpreted as relative to the folder where the HTML file resides. Believe me, browsers and webviews act this way.

With all due respect, I prefer to not simply "believe you" here, but to trust what is specified in the HTML standard 😊.
But basically, yes, when an HTML UA processes HTML content, it parses relative URL strings into full URL records by using the document base URL, which by default is the URL of the HTML document (to simplify).

This is why I pointed out that the entire path tree has to be calculated for the parsing to be done, before applying the algorithm, so it is like saying to them "do the hard work, then perform this clever check".

I'm not sure who "them" is here.
What we (the EPUB WG) want is to specify that authors should not use out-of-container URLs. The entities that have to understand this requirement can be EPUB authors, authoring systems, checking tools. Likely not Reading Systems, who may not need to check that the content is conforming, since they likely implement safeguarded URL-processing logic already.

I'm not sure what part of the algorithm "hard work" refers to, nor "clever check".

The algorithm basically says (to simplify)

  • use the explicitly given well-known base URL for the root (http://example.org/A/)
  • compute the URL of the document where the URL string occurs (doing so is well-defined by the EPUB spec, in the "URLs in the OCF Abstract Container" section).
  • parse the URL string with the resulting base URL obtained in the previous step (doing so is well-defined by the URL parsing algorithm).
  • test that the result starts with the explicitly given root URL (if it does not, the URL string points outside the container).
  • repeat with another root URL (http://example.org/B/), to discard any false-negative.

That algorithm is rather unambiguous. The admittedly vaguely-defined part is that the base URL used to parse a URL string is defined by the host language, so it's not always defined as the URL of the document due to overriding mechanisms (e.g. the base element).

Once you have the normalized tree you know if a URL is leaking; no further application of the algorithm is needed.

what's a "normalized tree"? how do you "normalize"? how exactly do you test whether a URL string is leaking or not?
I can't say I understand what you propose, since those terms may mean very different things, and none of the standards normatively referenced by EPUB define them.

-if the container root url is implementation-specific, then it does exist, and I am confident that there is a folder that contains all the ePub sub-folders and files over that remote server.

Again, it's not always the case, and the EPUB specification does not mandate that. So it cannot be assumed.
Also, the container root URL does exist, but it can be for instance "epub:/". I'm curious how you would test out-of-container URLs with that one?

So you do not have to provide a strange example like http://example.org/A

It is not "strange"; it is a well-known, arbitrary, willfully non-conforming root URL used to test out-of-container URLs.

Also I agree that http://example.org/mobydick is useless too I mean that mobydick is not like A, I was wrong in my previous message, it is better because it is very likely that you can just let them check against whatever folder they have created for the ePub, and check for URL leaks.

who's "they"? what if they do not create a folder? what if the folder is not exposed in the URL?
Again, there are too many assumptions in your "just do" proposal.

So one thing is serving the ePub content via http://, another thing is the checks the system does on the underlying filesystem structure. I think they are two separate operations. The http:// approach is just what will "appear" outside; the filesystem approach is what really happens.

What I'm proposing has nothing to do with serving EPUB content. I'm proposing an explicit way to statically evaluate what URLs are conforming or not.

You are right that some constraints have to be enforced in specs about the out-of-container URLs,

OK, we agree. That is the very essence of this issue (#1912).

but I think they are just an editorial prescription, like you say "only self contained archives with no out-of-container relative URLs are genuine ePub publications".

It may be "just an editorial prescription", and yet we're on the 31st comment to decide how to "just word it".

My main point is precisely that "only self contained archives with no out-of-container relative URLs are genuine ePub publications" is not specification-level precise language. You're obviously free to disagree, but I won't change my mind on that 😊.

@P5music

P5music commented Nov 25, 2021

First I want to say that I am not a native English speaker, so some Italian nuances lie under the English words, and it is likely that something I write can be felt as "stronger" than it really is in Italian.
Excuse me in advance if you are well-known persons in the IT field.
My original proposals have been abandoned and closed, so I have no official role here.

I realized that I made some assumptions, and I see that you are doing the same.
So I think it is better not to make things more difficult by correcting each other.
I think my opinion was clearly conveyed and sufficiently considered for this issue.
Thank you
Regards

@rdeltour rdeltour reopened this Nov 25, 2021
@rdeltour
Member Author

First I want to say that I am not a native English speaker, so some Italian nuances lie under the English words, and it is likely that something I write can be felt as "stronger" than it really is in Italian.
Excuse me in advance if you are well-known persons in the IT field.
My original proposals have been abandoned and closed, so I have no official role here.

No problem. I understand the language barrier is real (all the more as I'm not a native English speaker myself). In fact it's one of the reasons why I insist (hopefully not too pedantically) on using standard and well-defined terminology whenever possible 😊.

I for one think all contributions and comments are welcome! 😊

My understanding of your position is that the definition of "out-of-container URL string" need not be formally specified and can be described in prose using commonly accepted concepts. We agree to disagree here, but the point is made! 👍

@mattgarrish
Member

that the authoring requirement be a SHOULD NOT instead of a MUST NOT, as a conforming RS will already ensure the URL is parsed in a safe way. Correct?

Right, I like when the reading system does the right thing so that security isn't dependent on authors behaving or following validation rules. I think we can be a bit more relaxed when this is the case.

Yes, after last week's WG +1ed the idea, I assumed we were at that implementation detail

What I mean is how exactly you implement the check is an implementation-specific detail that we don't necessarily need to enforce. I'm wary of writing required processing for aspects of the spec that aren't critical to always process in a single way.

Take the obfuscation algorithm for example. It defines pseudo-code for the algorithm, but it's not a requirement that you must translate that pseudo-code into your implementation.

Is the outcome here the same, that we exemplify how to process URLs, or is this a requirement?

If it's a requirement, what fails if this algorithm isn't followed? As a silly example, if an author just eyeballs their URLs and figures out in their head that they are inside the container, are said users invalid to the EPUB specs? 😛

@rdeltour
Member Author

Right, I like when the reading system does the right thing so that security isn't dependent on authors behaving or following validation rules. I think we can be a bit more relaxed when this is the case.

Totally agree 👍

What I mean is how exactly you implement the check is an implementation-specific detail that we don't necessarily need to enforce. I'm wary of writing required processing for aspects of the spec that aren't critical to always process in a single way.

Oh no, implementations certainly do not have to follow the algorithm exactly!
The algorithmic approach is only used as a way to provide an unambiguous definition. After all, we're talking of an authoring requirement here, not a processing requirement.

As you say, an author is perfectly authorized to review URLs by looking at their markup. Even a tool like EPUBCheck may not follow that exact same algorithm.

But the algorithm is used to say what's what. For instance, if someone creates a bug report on EPUBCheck, saying it falsely reports a URL string that is supposed to be conforming, we can refer to the normative algorithm to check whether it is indeed a bug, or whether the URL is really non-conforming.

@rdeltour
Member Author

Btw, here's what Infra says:

Conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent. (In particular, the algorithms are intended to be easy to follow, and not intended to be performant.)

In our case, eyeballing the URLs would likely be more performant for an author than mentally following the algorithm for each URL.

For a tool like EPUBCheck, this might be checked at URL-parsing time, by raising an error if the parser tries to shorten an already empty path.

As long as URLs are detected as non-conforming like the algorithm would, we're good.
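As a non-normative sketch of that parse-time variant (the function name and segment representation below are made up, not EPUBCheck's actual code): walk the path segments of a reference resolved against the document's directory, and flag any ".." that would shorten an already empty path.

```python
# Hypothetical parse-time check: flag any ".." that would shorten an
# already empty path, i.e. a step above the container root.
def leaks_outside(segments):
    stack = []
    for seg in segments:
        if seg == "..":
            if not stack:
                return True  # tried to shorten an already empty path
            stack.pop()
        elif seg not in (".", ""):
            stack.append(seg)
    return False

# "../../other.xhtml" as seen from EPUB/content.xhtml:
print(leaks_outside(["EPUB", "..", "..", "other.xhtml"]))  # True
print(leaks_outside(["EPUB", "images", "cover.jpg"]))      # False
```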

@murata2makoto
Contributor

I am sorry if I missed something. Are we talking about URLs in container.xml? Or, are we talking about URLs in EPUB publications including those in content documents?

@iherman
Member

iherman commented Dec 1, 2021

I am sorry if I missed something. Are we talking about URLs in container.xml? Or, are we talking about URLs in EPUB publications including those in content documents?

The latter.

@mattgarrish mattgarrish added the EPUB33 Issues addressed in the EPUB 3.3 revision label Dec 17, 2021
@mattgarrish mattgarrish added the Spec-EPUB3 The issue affects the core EPUB 3.3 Recommendation label Sep 14, 2022