
How should "everything after the scheme" URLs work? #385

Open
domenic opened this issue May 8, 2018 · 6 comments
Labels
clarification Standard could be clearer

Comments

@domenic
Member

domenic commented May 8, 2018

There are several URL types that are basically of the form scheme:<some arbitrary data>. For example, data:, mailto:, javascript:, and urn:.

The question is, how should software process these URLs? I see three main models:

  1. Treat these as non-URLs: check if the string has a leading scheme:, then look at everything after that.
    • Nothing specced does this. (Although I suspect a decent amount of un-specced non-browser software might.)
    • This is probably not a good idea, if we want to call these things URLs at all. For example, it misses canonicalizations like percent-decoding and whitespace-stripping that are otherwise common to URLs.
  2. Parse the URL. Check if its scheme is the one you want. Then serialize it and strip the leading scheme. (Maybe also strip the fragment?) Now process the remaining code units. (A sketch follows this list.)
  3. Parse the URL. Now, validate it according to some strict criteria, such as: no username, no password, no host, no port, maybe no query, maybe no fragment. Now, process the path, and optionally process the query or fragment, if those are allowed for your scheme.
    • Nothing specced does this, yet.
    • This might be better than (2), as it is stricter validation, and more in line with the traditional RFCs, which consider these "everything after the scheme" URLs as having paths only.
    • This model seems a bit weird: if your <some arbitrary data> contains ?s or #s, you have to model that as allowing queries and fragments, and then process ${path}?${query}#${fragment}, whereas (2) just lets you process the whole string at once.
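
As a rough illustration, here is what model (2) might look like on top of the WHATWG URL API. This is just a sketch; extractOpaqueData is a made-up name, not a spec operation, and whether to strip the fragment is left as the open question above.

```js
// Sketch of model (2): parse, check the scheme, re-serialize, strip "scheme:".
function extractOpaqueData(input, expectedScheme) {
  let url;
  try {
    url = new URL(input); // parsing also handles whitespace-stripping etc.
  } catch {
    return null; // not parseable as a URL at all
  }
  if (url.protocol !== expectedScheme + ":") return null;
  url.hash = ""; // the "maybe also strip the fragment?" option
  return url.href.slice(expectedScheme.length + 1);
}

extractOpaqueData("javascript:alert(1)", "javascript"); // "alert(1)"
```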

An interesting example contrasting (2) and (3) is the following: javascript://somehost/%0Aalert(1)

  • In (2), it would work, and cause an alert, because the source string //somehost/\nalert(1) is interpreted as a comment followed by an alert.
  • In (3), it would fail, since we'd validate that hosts aren't present in javascript: URLs.
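
Model (3)'s extra validation, expressed against the same API, would catch this. Again just a sketch with a made-up helper name, treating query and fragment as disallowed:

```js
// Sketch of model (3): same parse, then strict validation before use.
function extractPathStrict(input, expectedScheme) {
  let url;
  try {
    url = new URL(input);
  } catch {
    return null;
  }
  if (url.protocol !== expectedScheme + ":") return null;
  // No username, password, host, or port; here also no query or fragment.
  if (url.username || url.password || url.host || url.port) return null;
  if (url.search || url.hash) return null;
  return url.pathname;
}

extractPathStrict("javascript://somehost/%0Aalert(1)", "javascript"); // null (host present)
extractPathStrict("javascript:%0Aalert(1)", "javascript");           // "%0Aalert(1)"
```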

Another example: mailto:///[email protected] is interpreted as containing a <some arbitrary data> of ///[email protected] in (2), but a path of /[email protected] in (3). Maybe not relevant, since I doubt many mail clients will let you send email to such an address?

There are probably more interesting examples of this sort.


The purpose of this thread is to gather community thoughts on these scenarios, with an eye toward setting a precedent for future such schemes, and providing recommendations for software that processes such URLs (including both the web's specced data: and javascript:, and other schemes like mailto: or urn:).

If we decide (2) is better, we should provide better spec support for it, including helper operations and explicit recommendations to continue doing this pattern. If we decide (3) is better, we should do the same, and we should either explicitly note data: and javascript:'s processing models as legacy, or we should try to change them (which might be possible if interop is bad).

/ccing some people who might have thoughts: @mnot @jasnell @sleevi @masinter

@masinter

I'd lean toward (1), under the theory that there are likely registered schemes where percent-decoding and whitespace-stripping are inappropriate.

domenic added a commit to drufball/layered-apis that referenced this issue Jun 18, 2018
Closes #19. This technique is inspired by the data: URL processor's initial steps: https://fetch.spec.whatwg.org/commit-snapshots/7307d282dd7d1293d5697d63f73522007849e0db/#data-url-processor.

Whether or not this technique is ideal, is an open question. See whatwg/url#385.
@annevk annevk added the clarification Standard could be clearer label Apr 26, 2020
@ExE-Boss

ExE-Boss commented Oct 5, 2020

See also: nodejs/node#35434 (comment)

@alwinb
Contributor

alwinb commented Jun 29, 2021

I'm puzzling over (my own) characterisation of the WHATWG resolution, and this issue came to mind.
Some observations, in case they help.

Let's look at the properties of parsed/resolved URLs:

  • File URLs have an authority (may be empty) and an absolute path (may be just / or just a drive letter).
  • Other special URLs have a non-empty authority and an absolute path (may be just /).

These properties are natural consequences of the protocols.
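
Both observations are easy to check with the URL constructor (assuming a spec-conformant implementation):

```js
new URL("file:/a").href;      // "file:///a" – an (empty) authority is always present
new URL("http://x").pathname; // "/" – special URLs always get an absolute path
```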

For non-special URLs, the parser/resolver uses the 'cannot-be-a-base-url' flag to decide whether the URL can serve as a base URL. This amounts to the following:

  • If a non-special URL has an authority, or a path that starts with / then it is used as a base-URL.

So javascript:foo is not considered a base URL, but javascript:/foo and javascript:// are.
Note that resolving foo against javascript: throws an error, whereas resolving foo against javascript:/ results in javascript:/foo.
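
For what it's worth, that asymmetry is directly observable (again assuming a spec-conformant implementation):

```js
new URL("foo", "javascript:");  // throws TypeError – the base has an opaque path
new URL("foo", "javascript:/"); // javascript:/foo – the base path starts with "/"
```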

I think it makes sense to define what is and what is not a base URL, based on the protocol only.
The protocol would then select one of the following options:

  1. An authority and an absolute path (file URLs)
  2. A non-empty authority and an absolute path (other special URLs)
  3. (a) An absolute path only, or (b) an absolute path or an authority (some subset of current non-special URLs)
  4. No authority and an opaque path (such as javascript: URLs)

That requires a hardcoded list of protocols and their associated URL 'type' (i.e. parsing/resolving behaviour), though.
It could also be useful to provide a way to manually register protocols to map to a certain parsing/resolving behaviour.
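
A registration mechanism along those lines might look something like this hypothetical sketch; none of these names exist in the standard:

```js
// Hypothetical: map each scheme to one of the four URL types above.
const schemeTypes = new Map([
  ["file", 1],       // authority (may be empty) + absolute path
  ["https", 2],      // non-empty authority + absolute path
  ["javascript", 4], // no authority, opaque path
]);

// The manual registration suggested above.
function registerScheme(scheme, type) {
  schemeTypes.set(scheme, type);
}
```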

Just some ideas.

@annevk
Member

annevk commented Jul 20, 2021

Having a largely protocol-agnostic parser is a design goal. Having to tweak the parser or getting different parser outcomes over time is far from ideal. (While at the moment this still happens due to convergence between implementations, my hope is that long term it won't.)

@alwinb
Contributor

alwinb commented Jul 21, 2021

Completely agree; however, it does seem accurate to distinguish a few categories.

It is very strange to apply path normalisation to javascript: URLs, for example.
The same would be true for, say, data:, news:, urn:, mailto:.
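
Concretely (with a spec-conformant implementation), normalisation kicks in as soon as the path is non-opaque:

```js
new URL("javascript:/a/../b").href; // "javascript:/b" – ".." collapsed
new URL("javascript:a/../b").href;  // "javascript:a/../b" – opaque path left alone
```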

I think there is a consistent, more general pattern here.

@ghost

ghost commented Jul 21, 2021

Maybe it makes sense to define a few “special exceptions” like http:, https:, etc. being treated uniquely, and the same for javascript:, data:, etc. But then also allow URLs to somehow opt into one of those parsing modes explicitly, e.g. with a prefix: web-myscheme: would work the same as http[s]:, and raw-myscheme: would work the same as javascript: and data:.

However, maybe it also makes sense to allow implementations that give meaning to specific URLs to interpret and parse them specially. I know that hyper: URLs (for the Hypercore stuff) actually use a hash instead of an address for the host. I think the hash will currently be parsed as a domain under the WHATWG spec, but that’s not accurate to what it actually represents (it can’t have a port, for example).

Of course, that would be awful in a way, because then different implementations would parse the same URL differently, so people couldn’t rely on manipulating URLs working the same way across implementations, which is what this spec is aiming to solve.

Maybe a good approach could be to establish a (limited) set of normalization rules that can be applied to URLs by implementations, enforcing specific normalization rules for certain URLs like http[s]:, but allowing the implementations to choose among other normalization rules for their own URLs.

So, for example, the spec could allow implementations to change the port of URLs freely depending on the scheme without requiring it to be fetched and redirected (as long as they do it consistently), then e.g. http: would take away the port if it is 80 (enforced by the spec), and hyper: URLs would always take away the port in implementations that support it (allowed by the spec).

Other modifications and normalizations could be handled the same way: required for well-known URLs, allowed for other URLs.

The key here, I think, is that the set of normalization rules that can even be applied to URLs is already well known beforehand and is not arbitrary, so it is possible for authors to enjoy a consistent URL handling across implementations.
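
To make that concrete, here is a hypothetical sketch of one rule from such a fixed menu; dropDefaultPort and the hyper: default port are made up for illustration:

```js
// Hypothetical normalisation rule from a well-known, non-arbitrary menu:
// drop the port when it equals the scheme's default.
function dropDefaultPort(url, defaultPort) {
  if (url.port === String(defaultPort)) url.port = "";
}

// The parser already enforces this for http: (port 80 never survives parsing);
// an implementation supporting hyper: could opt in to the same rule.
const u = new URL("hyper://somehashvalue:80/");
dropDefaultPort(u, 80);
u.href; // "hyper://somehashvalue/"
```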
