Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for base URLs to be used for URL parsing #1898

Merged
merged 26 commits into from
Nov 19, 2021
Merged

Conversation

iherman
Copy link
Member

@iherman iherman commented Nov 10, 2021

This is a PR version of @rdeltour's #1888 (comment) proposal.

Notes:

  • I kept the term "Path Name" (for now). There is a symmetry with the usage of "File Name", and the term is referred to all over the place. It would require an extensive search-replace to get them all settled. We may do that later if we want to.
  • I modified some statements to make them more spec-text, e.g., by using MUST if appropriate
  • I agree with @rdeltour that the order of the sections 6.1.3 and 6.1.4. should be switched. I have not done that, though, because it would make the diff file much messier. We can do that once all the dust is settled...
  • I have modified the scripting section on the RS spec (§4.4) by adding a reference to the new section on the root directory URL and removing the unique origin requirement from the bullet list. There were also two notes that appeared in that section; I moved one of the two to the new section and added the other as an editorial note; the latter does not sound to be relevant anymore, but better check.
  • @mattgarrish, I have added an explicit reference to the two new terms in the RS text, because the magic script did not work well (I guess it would work once everything is published). I have done that so that ECHIDNA would not complain, we can improve that later.

Fix #1888
Fix #1374

See:


Preview | Diff

@rdeltour
Copy link
Member

Thanks @iherman!

Copy link
Member

@rdeltour rdeltour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some editorial suggestions as inline comments.

epub33/core/index.html Outdated Show resolved Hide resolved
epub33/core/index.html Outdated Show resolved Hide resolved
epub33/core/index.html Outdated Show resolved Hide resolved
epub33/core/index.html Outdated Show resolved Hide resolved
epub33/core/index.html Outdated Show resolved Hide resolved
epub33/core/index.html Outdated Show resolved Hide resolved
epub33/rs/index.html Outdated Show resolved Hide resolved
Copy link
Collaborator

@bduga bduga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am philosophically aligned with this direction. Added some comments on specifics.

epub33/core/index.html Outdated Show resolved Hide resolved
epub33/core/index.html Outdated Show resolved Hide resolved
epub33/core/index.html Outdated Show resolved Hide resolved

<ul>
<li>The result of <a data-cite="url#concept-url-parser">parsing </a> "<code>/</code>" with the <a>container root URL</a> as <a data-cite="url#concept-base-url">base</a> MUST be the <a>container root URL</a>.</li>
<li>The result of <a data-cite="url#concept-url-parser">parsing </a> "<code>..</code>" with the <a>container root URL</a> as <a data-cite="url#concept-base-url">base</a> MUST be the <a>container root URL</a>.</li>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I see what you are trying to say here, but it also seems overly specific. For instance, given these, what is the required result of parsing ../.. with the root as base? Maybe more generally what we want is "The result of parsing any relative URL with the container root URL as base MUST begin with the Container Root URL".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it also seems overly specific

The requirements are very specific, but are sufficient to ensure that it covers all the cases (if I'm not mistaken). It's a minimal characterization.

what is the required result of parsing ../.. with the root as base?

If parse('..',root) == root, then parse('../..', root) == root, necessarily

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no reason why that follows from what is defined in the spec. That is, nowhere do we say that parse("x/y", root) is the equivalent parse("y", parse("x", root)). If parse('...', root) is never a step in parse('../..', root) then I don't see how the restriction on the first applies to the second.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no reason why that follows from what is defined in the spec.

It's not directly defined in the spec, but it follows from the (normatively referenced) URL parsing algorithm.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bduga are you o.k. with this explanation (ie, can this be considered as resolved?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bduga, to answer your question more directly:

I guess my question is, what do we expect when parsing ../OPS with a root of MyRoot/? Is it MyRoot/ or MyRoot/OPS? Or even OPS, which is what I think the current spec text says.

It's a bit difficult to say since MyRoot/ is not a URL (it's a valid URL string, being a valid relative URL, but not a serialization of a full URL record).

But, assuming root URL is http://localhost/MyRoot/, the result of parsing ../OPS is indeed http://localhost/OPS/ (so, as a root-relative path, that is OPS).

That's why http://localhost/MyRoot/ is not conforming to my proposal.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think I finally see what you are saying. It is not that the RS must make sure to do these things, but rather that whatever implementation the RS chooses must have these properties. That is we are not requiring the RS to change parsing, we are saying the way the RS represents the container root URL cause the defined parsing algorithm to behave in this manner. Taking a quick stab at clarifying, maybe something like this:

The container root URL is the URL [[URL]] of the Root Directory. It is implementation specific, but the implementation MUST have the following properties:

The result of parsing "/" with the container root URL as base is the container root URL.
The result of parsing ".." with the container root URL as base is the container root URL.

I still think this may be awkward, but at least it highlights the fact that these are properties of the implementation and not changes to parsing. I also remove the musts from the elements to make it clear that the RS is not expected to do anything itself in these cases, the must only applies to the implementation of the root. It's also strange to must musts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not that the RS must make sure to do these things, but rather that whatever implementation the RS chooses must have these properties. That is we are not requiring the RS to change parsing, we are saying the way the RS represents the container root URL cause the defined parsing algorithm to behave in this manner.

Exactly!

Your suggested rewording also works for me. I'm all for not musting musts 😁.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bduga @rdeltour I have made the editorial change that you have converged to (will come with the next commit). Just to be on the safe side, I do not push the 'Resolve conversation' button, I leave that to you guys if you feel this is o.k.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK to resolve this conversation. The change is editorial but a little clearer.
I'll let @bduga have the last word on it 😊

epub33/rs/index.html Outdated Show resolved Hide resolved
epub33/rs/index.html Outdated Show resolved Hide resolved
epub33/rs/index.html Show resolved Hide resolved
@iherman
Copy link
Member Author

iherman commented Nov 11, 2021

Better use these URLs for preview and diff:

See:

The one I've put in the original comment are static... (my fault).

@iherman
Copy link
Member Author

iherman commented Nov 11, 2021

@bduga @rdeltour @mattgarrish

I have made changes on the path name definition as follows:

  • I have created a new section for the algorithmic part. The algorithm reflects, I believe, what seemed to be the consensus among you (and it certainly looks o.k. to me :-)
  • The terminology section has now a Path Names entry that aligns with the style of the terminology entries, referring to the new section.

There is still the outstanding issue of the order of the subsection with §6. My choice for the order would be:

  1. Introduction
  2. File and Directory Structure
  3. Path and File Names
  4. Deriving Path Names from Files
  5. URLs in the OCF Abstract Container
  6. META-INF Directory

However, I did not want to change the order; I am a bit worried of a mess of merge conflicts with #1899 which changes the section on the path and file names; @mattgarrish I am not sure how you prefer to handle these issues...

epub33/core/index.html Outdated Show resolved Hide resolved
@iherman
Copy link
Member Author

iherman commented Nov 12, 2021

The issue was discussed in a meeting on 2021-11-11

  • no resolutions were taken
View the transcript

3. Proposal for base URLs to be used for URL parsing (pr epub-specs#1898)

See github pull request epub-specs#1898.

Brady Duga: do we want to wait until we have Romain?.
… looking at comments on PR, it seems like most people are okay with the idea, but the specific details need to be hashed out.

Dave Cramer: i'm fine with not addressing this tonight, we can probably let the conversation continue over in github.

epub33/core/index.html Outdated Show resolved Hide resolved
epub33/rs/index.html Show resolved Hide resolved
@iherman
Copy link
Member Author

iherman commented Nov 12, 2021

(Putting this into a separate comment to resolve the various comments, thereby cleaning up our status.)

@rdeltour, to pick up on your #1898 (comment): what you said reminded me of a comment you made in #1888 (comment) on the usage of the "Path Name" term. I argued to keep it as is, but I am now convinced that I was wrong, although I do not think we should use the term "Path" instead.

We have three notions that are a bit mixed up. Say we have a file A\B\C\file name.html in the container (I deliberately use the windows separator character to make the point). Then we have:

  • File Name: file name.html
  • Path Name: A/B/C/file name.html
  • The path in the content URL: A/B/C/file%20name.html

On the one hand, there is a big difference between a File Name and Path Name insofar as the former refers only to the last "portion" of the file's denomination, as opposed to the latter. I find this inconsistent in terms of naming. On the other hand, using the term Path instead of Path Name is also misleading because, as we said in #1898 (comment), the path may be different.

May I propose an editorial change for Path Name -> File Path? I think that makes things suddenly clearer...

@iherman
Copy link
Member Author

iherman commented Nov 12, 2021

Just for the records/reminder:

If we merge this PR we will have to open two issues for further discussions

  • Is it o.k. to keep the algorithmic definition for what was called Path Name?
  • Do we want to have an additional restriction whereby all content would share the same (unique) origin.

@rdeltour
Copy link
Member

May I propose an editorial change for Path Name -> File Path?

Works for me 👍

Useless aside comment, for EPUB 3.42:

This is another argument IMO for defining terms in their own section. In a section related to files, it is quite obvious that path refers to a file path; but in a big stand alone terminology, several kind of paths can co-exist and we have to disambiguate by using the fully-qualified term File Path. That may lead to editorial niceties like "a file's File Path" or "the File Name of the file" 😁

@iherman
Copy link
Member Author

iherman commented Nov 12, 2021

May I propose an editorial change for Path Name -> File Path?

Works for me 👍

Useless aside comment, for EPUB 3.42:

This is another argument IMO for defining terms in their own section. In a section related to files, it is quite obvious that path refers to a file path; but in a big stand alone terminology, several kind of paths can co-exist and we have to disambiguate by using the fully-qualified term File Path. That may lead to editorial niceties like "a file's File Path" or "the File Name of the file" 😁

I would rather propose to leave this for the mysterious EPUB 4... :-)

@P5music
Copy link

P5music commented Nov 12, 2021

@rdeltour

I know my message sounds not polite, but it is in fact like "please don't". So no offence is intended.
I see that you have gone so far, but I warned you all from the very first post of your thread.
Now I see that
../../../../../../../../../../../file.xhtml
could be equivalent
to
../../file.xhtml
if file.xhtml is just two levels below the container root.
(I think no web server would tolerate or process that, but I am not a web admin).

When you say that no intercepting is needed, but just parsing, it's just what I am talking about, there is no parsing of the entire XHTML file to find and parse URLs in real world for modern RSs, only intercepting is reasonable, because it is not the reader that provides the WebView its resources like images or css, but the WebView picks them up by itself as default.
There are no wandering developers asking themselves what is the straightforward way to do that and what the specs say on this because it is underspecified.
As an example let's take "EPub3 best practices" ebook. I have the O'Reilly edition, if I am not wrong, that is full CSS, not for old readers, so it is rendered at its best with CSS and images, and HTML5, and so on with a full WebView reader.
A reader that renders such editions at full layout is not going to process the URLs inside the XHTML files and feed the WebView, believe me, it lets the WebView do its job.

Developers will find this new specification as it's breaking their RSs.

I gave up some days ago on this thread, but I see that the proposal is going to change the specs in a way that is not good IMHO, so I chimed in again.
I do not read other RSs opinions so I wanted to alert you again, but I know this is your area so I will not insist anymore.

Regards

@rdeltour
Copy link
Member

I know my message sounds not polite, but it is in fact like "please don't". So no offence is intended.

for this discussion to go on, all I'm asking is for you to follow W3C's code of ethics and professional conduct.

I see that you have gone so far, but I warned you all from the very first post of your thread. Now I see that ../../../../../../../../../../../file.xhtml could be equivalent to ../../file.xhtml if file.xhtml is just two levels below the container root. (I think no web server would tolerate or process that, but I am not a web admin).

and yet,
https://github.com/w3c/../P5music
is the same URL as
https://github.com/w3c/../../../../../../../../P5music
and no web server behave otherwise.

When you say that no intercepting is needed, but just parsing, it's just what I am talking about, there is no parsing of the entire XHTML file to find and parse URLs in real world for modern RSs

right, RS are mostly backed by browser engines nowadays. That doesn't mean URL parsing isn't performed; for instance, it's performed by an underlying WebView implementation

only intercepting is reasonable, because it is not the reader that provides the WebView its resources like images or css, but the WebView picks them up by itself as default. There are no wandering developers asking themselves what is the straightforward way to do that and what the specs say on this because it is underspecified. As an example let's take "EPub3 best practices" ebook. I have the O'Reilly edition, if I am not wrong, that is full CSS, not for old readers, so it is rendered at its best with CSS and images, and HTML5, and so on with a full WebView reader. A reader that renders such editions at full layout is not going to process the URLs inside the XHTML files and feed the WebView, believe me, it lets the WebView do its job.

No disagreement here.

Developers will find this new specification as it's breaking their RSs.

Again, maybe. I don't think so.
That's why we standardize by consensus, with input welcome from everyone.

I gave up some days ago on this thread, but I see that the proposal is going to change the specs in a way that is not good IMHO, so I chimed in again. I do not read other RSs opinions so I wanted to alert you again, but I know this is your area so I will not insist anymore.

I'm certainly not against criticism. You think it is not a good proposal, but I still don't understand your rationale. I think one reason of this misunderstanding is that we're not having a common understanding of the terms being used in the discussion (URL, wrong/bad/valid URL, parsing, etc), and how it works on the Web.
If you'd like to clarify that, I'm happy to read carefully and take the time to answer respectfully.

@mattgarrish
Copy link
Member

Just noting that the diff in the first comment is not the same as the diff generated by pr-preview. I'd delete it, as when I went to the github diff I couldn't find the text the first diff was showing. Probably a caching problem with the w3c diff or the cdn.

epub33/core/index.html Outdated Show resolved Hide resolved
@P5music
Copy link

P5music commented Nov 12, 2021

@rdeltour

and yet,
https://github.com/w3c/../P5music
is the same URL as
https://github.com/w3c/../../../../../../../../P5music
and no web server behave otherwise.

Ok I admit that it's like the web servers work, as I see. I did not know that.
I hope that the WebView behaves the same with the relative paths on filesystem that it uses when the implementation encompasses unzipping the ePub archive.
You should check it with a real Android or iOS test application.
If the WebView is already parsing the URLs that way, no leaks were possible in the first place, I think, for applications using the WebView.
If not, the RSs have to change their source code if they do not implement such parsing algorithm as now.
Regards

Copy link
Member

@rdeltour rdeltour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @iherman 😊👍

@iherman
Copy link
Member Author

iherman commented Nov 15, 2021

@wareid @dauwhe @shiestyle the discussions on this PR have got to a consensus among all participants. Some (comparatively very minor) issues will have to be raised once the merge done, but the bulk is, I believe, ready to be merged.

Can we merge it right now, or do you prefer to go through a formal resolution on the WG call later this week?

@rdeltour
Copy link
Member

Can we merge it right now, or do you prefer to go through a formal resolution on the WG call later this week?

@dauwhe @wareid @shiestyle note that this PR creates a significant new conformance requirement for reading systems. We have not fully evaluated how many existing RS would fail these criteria, or to which extent it is reasonable for them to implement. Editorial or roadmap considerations aside, I for one believe it is crucial to hear from RS and/or go through a formal resolution.

That said, practically or editorially, we could always merge this PR and debate this specific question in another issue 😊

epub33/core/index.html Outdated Show resolved Hide resolved
epub33/core/index.html Outdated Show resolved Hide resolved
@iherman iherman merged commit 761a14d into main Nov 19, 2021
@iherman
Copy link
Member Author

iherman commented Nov 19, 2021

The issue was discussed in a meeting on 2021-11-19

List of resolutions:

View the transcript

1. Proposal for base URLs to be used for URL parsing (pr epub-specs#1898)

See github pull request epub-specs#1898.

Romain Deltour: basically the intent of the PR is to clarify how URLs are passed to epub.
… basic idea is to more clearly define root URL of the OCF container, and based on that precisely say that all URLs are to be passed relative to this root.
… one addition is that we do not define the root URL of the container, we say it is implementation specific.
… we just say RS must implement it such that it has certain properties.

Murata Makoto: :-) I know that it's difficult..

Dave Cramer: we want certain behaviours to come out of this definition, but we don't want to control RS implementation.

Ivan Herman: implication of this PR is that the problem of relative URIs being able to leak out of the container is gone.
… this PR defines a hard stop at the top of the container.

Murata Makoto: Good point.

Romain Deltour: to be clear, this is enforced in the RS spec.
… RS must implement the root URL in such a way that whatever the root URL looks like, it is something inside the container.
… so far we say that any relative URL is valid.
… we may limit this in the next agenda item.
… re. these properties we say must be defined on the root URL, it is not certain that all RS are currently done this way, so this PR may make some existing RS non-conforming.

Dave Cramer: can we test for this?.

Romain Deltour: I have created a test epub that uses js to display the URL of a content document, and based on this we may be able to infer the RS's root URL for the container.
… based on this naive test (done on Readium, iBooks, and ADE), ibooks and Readium do not implement the root URL internally the way we have it defined in the PR.
… not yet applied this test to other RS.
… to answer the question, no, it's not really testable.
… we can only try to infer it.

Ivan Herman: I think we do the right thing if we show there is a discrepancy between RS, that's why we're here, that's what CR is for.
… also, a warning for the wg, if we get a green light to merge today, there will be some follow up issues raised afterwards, mostly editorial issues.
… these may require further discussion, but they are mostly minor and editorial.

Romain Deltour: yes, i agree.

Ivan Herman: so don't be alarmed if you see follow up issues, it is intentional.

Dave Cramer: agree. Better to split the issue into manageable pieces.

Romain Deltour: in this PR we use one way to characterize the root URL, there may be something more minimal that would result in more RS being conforming.
… but we haven't heard for that many RS folks.
… the benefit of this PR is that it clarifies a lot of things. We may need to lose some of this precision later..

Ivan Herman: we had one RS contribute in the PR.

Dave Cramer: any other comments?.

Proposed resolution: Merge #1898. (Dave Cramer)

Ivan Herman: +1.

Gregorio Pellegrino: +1.

Romain Deltour: +1.

Matthew Chan: +1.

Toshiaki Koike: +1.

Dan Lazin: +1.

Murata Makoto: +1.

Masakazu Kitahara: +1.

Dave Cramer: +1.

Murata Makoto: +i.

Ben Schroeter: +1.

Rick Johnson: +1.

Resolution #1: Merge #1898.

Ivan Herman: we should thank Romain for this. This situation has existed for a long time in the spec before Romain proposed a solution.

Dave Cramer: thank you Romain.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

What base URLs to use for URL parsing in EPUB? "IRI of the Package Document": what is this exactly?
7 participants