-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal for base URLs to be used for URL parsing #1898
Conversation
Thanks @iherman! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some editorial suggestions as inline comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am philosophically aligned with this direction. Added some comments on specifics.
epub33/core/index.html
Outdated
|
||
<ul> | ||
<li>The result of <a data-cite="url#concept-url-parser">parsing </a> "<code>/</code>" with the <a>container root URL</a> as <a data-cite="url#concept-base-url">base</a> MUST be the <a>container root URL</a>.</li> | ||
<li>The result of <a data-cite="url#concept-url-parser">parsing </a> "<code>..</code>" with the <a>container root URL</a> as <a data-cite="url#concept-base-url">base</a> MUST be the <a>container root URL</a>.</li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I see what you are trying to say here, but it also seems overly specific. For instance, given these, what is the required result of parsing ../..
with the root as base? Maybe more generally what we want is "The result of parsing any relative URL with the container root URL as base MUST begin with the Container Root URL".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it also seems overly specific
The requirements are very specific, but are sufficient to ensure that it covers all the cases (if I'm not mistaken). It's a minimal characterization.
what is the required result of parsing ../.. with the root as base?
If parse('..',root) == root
, then parse('../..', root) == root
, necessarily
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see no reason why that follows from what is defined in the spec. That is, nowhere do we say that parse("x/y", root)
is the equivalent parse("y", parse("x", root))
. If parse('...', root)
is never a step in parse('../..', root)
then I don't see how the restriction on the first applies to the second.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see no reason why that follows from what is defined in the spec.
It's not directly defined in the spec, but it follows from the (normatively referenced) URL parsing algorithm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bduga are you o.k. with this explanation (ie, can this be considered as resolved?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bduga, to answer your question more directly:
I guess my question is, what do we expect when parsing
../OPS
with a root ofMyRoot/
? Is itMyRoot/
orMyRoot/OPS
? Or evenOPS
, which is what I think the current spec text says.
It's a bit difficult to say since MyRoot/
is not a URL (it's a valid URL string, being a valid relative URL, but not a serialization of a full URL record).
But, assuming root URL is http://localhost/MyRoot/
, the result of parsing ../OPS
is indeed http://localhost/OPS
/ (so, as a root-relative path, that is OPS
).
That's why http://localhost/MyRoot/
is not conforming to my proposal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I think I finally see what you are saying. It is not that the RS must make sure to do these things, but rather that whatever implementation the RS chooses must have these properties. That is we are not requiring the RS to change parsing, we are saying the way the RS represents the container root URL cause the defined parsing algorithm to behave in this manner. Taking a quick stab at clarifying, maybe something like this:
The container root URL is the URL [[URL]] of the Root Directory. It is implementation specific, but the implementation MUST have the following properties:
The result of parsing "/" with the container root URL as base is the container root URL.
The result of parsing ".." with the container root URL as base is the container root URL.
I still think this may be awkward, but at least it highlights the fact that these are properties of the implementation and not changes to parsing. I also remove the musts from the elements to make it clear that the RS is not expected to do anything itself in these cases, the must only applies to the implementation of the root. It's also strange to must musts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not that the RS must make sure to do these things, but rather that whatever implementation the RS chooses must have these properties. That is we are not requiring the RS to change parsing, we are saying the way the RS represents the container root URL cause the defined parsing algorithm to behave in this manner.
Exactly!
Your suggested rewording also works for me. I'm all for not musting musts 😁.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK to resolve this conversation. The change is editorial but a little clearer.
I'll let @bduga have the last word on it 😊
I have made changes on the path name definition as follows:
There is still the outstanding issue of the order of the subsection with §6. My choice for the order would be:
However, I did not want to change the order; I am a bit worried of a mess of merge conflicts with #1899 which changes the section on the path and file names; @mattgarrish I am not sure how you prefer to handle these issues... |
The issue was discussed in a meeting on 2021-11-11
View the transcript3. Proposal for base URLs to be used for URL parsing (pr epub-specs#1898)See github pull request epub-specs#1898. Brady Duga: do we want to wait until we have Romain?. Dave Cramer: i'm fine with not addressing this tonight, we can probably let the conversation continue over in github. |
(Putting this into a separate comment to resolve the various comments, thereby cleaning up our status.) @rdeltour, to pick up on your #1898 (comment): what you said reminded me of a comment you made in #1888 (comment) on the usage of the "Path Name" term. I argued to keep it as is, but I am now convinced that I was wrong, although I do not think we should use the term "Path" instead. We have three notions that are a bit mixed up. Say we have a file
On the one hand, there is a big difference between a File Name and Path Name insofar as the former refers only to the last "portion" of the file's denomination, as opposed to the latter. I find this inconsistent in terms of naming. On the other hand, using the term Path instead of Path Name is also misleading because, as we said in #1898 (comment), the path may be different. May I propose an editorial change for Path Name -> File Path? I think that makes things suddenly clearer... |
Just for the records/reminder: If we merge this PR we will have to open two issues for further discussions
|
Works for me 👍 Useless aside comment, for EPUB 3.42: This is another argument IMO for defining terms in their own section. In a section related to files, it is quite obvious that path refers to a file path; but in a big stand alone terminology, several kind of paths can co-exist and we have to disambiguate by using the fully-qualified term File Path. That may lead to editorial niceties like "a file's File Path" or "the File Name of the file" 😁 |
I would rather propose to leave this for the mysterious EPUB 4... :-) |
I know my message sounds not polite, but it is in fact like "please don't". So no offence is intended. When you say that no intercepting is needed, but just parsing, it's just what I am talking about, there is no parsing of the entire XHTML file to find and parse URLs in real world for modern RSs, only intercepting is reasonable, because it is not the reader that provides the WebView its resources like images or css, but the WebView picks them up by itself as default. Developers will find this new specification as it's breaking their RSs. I gave up some days ago on this thread, but I see that the proposal is going to change the specs in a way that is not good IMHO, so I chimed in again. Regards |
for this discussion to go on, all I'm asking is for you to follow W3C's code of ethics and professional conduct.
and yet,
right, RS are mostly backed by browser engines nowadays. That doesn't mean URL parsing isn't performed; for instance, it's performed by an underlying WebView implementation
No disagreement here.
Again, maybe. I don't think so.
I'm certainly not against criticism. You think it is not a good proposal, but I still don't understand your rationale. I think one reason of this misunderstanding is that we're not having a common understanding of the terms being used in the discussion (URL, wrong/bad/valid URL, parsing, etc), and how it works on the Web. |
Just noting that the diff in the first comment is not the same as the diff generated by pr-preview. I'd delete it, as when I went to the github diff I couldn't find the text the first diff was showing. Probably a caching problem with the w3c diff or the cdn. |
Ok I admit that it's like the web servers work, as I see. I did not know that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @iherman 😊👍
@wareid @dauwhe @shiestyle the discussions on this PR have got to a consensus among all participants. Some (comparatively very minor) issues will have to be raised once the merge done, but the bulk is, I believe, ready to be merged. Can we merge it right now, or do you prefer to go through a formal resolution on the WG call later this week? |
@dauwhe @wareid @shiestyle note that this PR creates a significant new conformance requirement for reading systems. We have not fully evaluated how many existing RS would fail these criteria, or to which extent it is reasonable for them to implement. Editorial or roadmap considerations aside, I for one believe it is crucial to hear from RS and/or go through a formal resolution. That said, practically or editorially, we could always merge this PR and debate this specific question in another issue 😊 |
The issue was discussed in a meeting on 2021-11-19 List of resolutions: View the transcript1. Proposal for base URLs to be used for URL parsing (pr epub-specs#1898)See github pull request epub-specs#1898. Romain Deltour: basically the intent of the PR is to clarify how URLs are passed to epub.
Dave Cramer: we want certain behaviours to come out of this definition, but we don't want to control RS implementation. Ivan Herman: implication of this PR is that the problem of relative URIs being able to leak out of the container is gone.
Romain Deltour: to be clear, this is enforced in the RS spec. Dave Cramer: can we test for this?. Romain Deltour: I have created a test epub that uses js to display the URL of a content document, and based on this we may be able to infer the RS's root URL for the container. Ivan Herman: I think we do the right thing if we show there is a discrepancy between RS, that's why we're here, that's what CR is for. Romain Deltour: yes, i agree. Ivan Herman: so don't be alarmed if you see follow up issues, it is intentional. Dave Cramer: agree. Better to split the issue into manageable pieces. Romain Deltour: in this PR we use one way to characterize the root URL, there may be something more minimal that would result in more RS being conforming. Ivan Herman: we had one RS contribute in the PR. Dave Cramer: any other comments?.
Ivan Herman: we should thank Romain for this. This situation has existed for a long time in the spec before Romain proposed a solution. Dave Cramer: thank you Romain. |
This is a PR version of @rdeltour's #1888 (comment) proposal.
Notes:
Fix #1888
Fix #1374
See:
Preview | Diff