Review slashes after file:/// and file path normalization #405

alwinb · 2018-07-23T13:54:29Z

The current standard asks to remove leading empty directories from file paths.
This seems to have been motivated by the desire to ignore slashes after file:/// whilst avoiding reparsing issues. However,

It introduces new behavior that has no precedence in browsers.
It is inconsistent with path normalization of other special URLs, and less expressive.

Examples, tested in Chrome canary, Safari preview and Firefox nightly:

	Input	Currently specified	IE8 upto Edge18	Firefox and Safari	Chrome
1	`/..//localhost//pig` against `file://lion/`	`file://lion/localhost//pig`	`file://lion//localhost//pig`	`file:////localhost//pig`	`file://lion//localhost//pig`
2	`file://localhost//a//../..//`	`file:///`	`file:///.//`	`file://///`	`file://localhost///`
3	`file://localhost//a//../..//foo`	`file:///foo`	`file:///.//foo`	`file://///foo`	`file://localhost///foo`
4	`file://localhost////foo`	`file:///foo`	`file://foo/`	`file://////foo`	`file://localhost////foo`
5	`file:////foo`	`file:///foo`	`file://foo/`	`file:////foo`	`file:///foo`

Notes:

The spec agrees only with Chrome, only in the fourth case.
Browsers agree on the pathName except for the fourth case.
(Dropping the host in the first seems off).
Firefox and Safari do not ignore slashes and thereby avoid reparsing issues.
Chrome only ignores slashes immediately following file:/// and keeps localhost, avoiding a reparsing issue.

The text was updated successfully, but these errors were encountered:

alwinb · 2018-07-23T14:06:21Z

As I wrote in #232 maybe there are other solutions,

Some ideas:

Always replace the empty host with localhost.
Do so only in ambiguous cases.
Prefix ambiguous pathNames with ./.
Don't collapse slashes at all.
Come up with something else?

No 4. seems most strict/ traditional but does not unify file:////foo with file:///foo. If that is a hard requirement, then perhaps 1. or 2. can be considered.

annevk · 2020-05-10T05:46:25Z

FWIW, I'm open to revisiting this as it's clearly not implemented as-is (apart from perhaps in Node.js). It would be good to test IE6 and especially browsers on Windows in general. And we need to ensure that file:foo and file:/foo and such continue to work and not present idempotence issues.

Is this something you're still interested in and willing to help out with perhaps?

alwinb · 2020-05-13T10:13:56Z

I don't know if I can help, but I'm willing to give it a try.
I'll try and get some info on the older browsers using browserstack first (hoping that'll work) and post back here.

alwinb · 2020-05-13T16:08:15Z

I'm still trying to get results on IE6/7.
Meanwhile here are the results for IE 8 upto Edge 18 added to the table. (I've updated the table in the original issue also.)

	Input	Currently specified	IE8 upto Edge18	Firefox and Safari	Chrome
1	`/..//localhost//pig` against `file://lion/`	`file://lion/localhost//pig`	`file://lion//localhost//pig`	`file:////localhost//pig`	`file://lion//localhost//pig`
2	`file://localhost//a//../..//`	`file:///`	`file:///.//`	`file://///`	`file://localhost///`
3	`file://localhost//a//../..//foo`	`file:///foo`	`file:///.//foo`	`file://///foo`	`file://localhost///foo`
4	`file://localhost////foo`	`file:///foo`	`file://foo/`	`file://////foo`	`file://localhost////foo`
5	`file:////foo`	`file:///foo`	`file://foo/`	`file:////foo`	`file:///foo`

I tested a bunch more examples but I need some more time to make sense of them.

alwinb · 2020-05-20T15:09:06Z

I ran another batch of examples. An attempt to summarize:

Safari and Firefox behave according to the RFC and don't collapse slashes in file URLs.
Firefox always sets the host to the empty string.
Chrome/Mac does remove the leading empty path segments.
- However, it does parse e.g. //one as having a host.
Both Edge and Chrome/Windows ignore slashes after file:/// and interpret the first nonempty path segment as the host.
- However, file://one and file:///one are parsed as just that.

Given file:/// as a base URL:

Input	Edge	Chrome/Windows	Chrome/Mac	Safari	Firefox
`file:///one/two`	`file:///one/two`	`file:///one/two`	`file:///one/two`	`file:///one/two`	`file:///one/two`
`file:////one/two`	`file://one/two`	`file://one/two`	`file:///one/two`	`file:////one/two`	`file:////one/two`
`//one/two`	`file://one/two`	`file://one/two`	`file://one/two`	`file://one/two`	`file:///two`
`///one/two`	`file:///.///one/two`	`file:///one/two`	`file:///one/two`	`file:///one/two`	`file:///one/two`
`////one/two`	`file:///.////one/two`	`file://one/two`	`file:///one/two`	`file:////one/two`	`file:////one/two`

Tested on Edge 17 and 18 and Chrome 71 on Windows (via browserstack), Safari 13, Chrome 81 and Firefox 76 on Mac.

domenic · 2020-09-16T17:20:37Z

I'll add another case (which came up in https://bugs.chromium.org/p/chromium/issues/detail?id=1128999): file:///.// becomes file://// (four slashes) in browsers, but file:/// (three slashes) in the spec.

domenic · 2020-09-16T17:24:10Z

@alwinb (or anyone else): would you be able to propose a spec patch, or just a general algorithm for this? Then we can prototype your algorithm in jsdom/whatwg-url, run it against all the test cases you have compiled, and see how well it works and how well it matches browsers. If it's idempotent and closer to browser behavior than the current spec, then we can update the spec.

I'm happy to help with the prototyping/testing/spec patch process, but I would need an algorithm to start from.

domenic · 2020-09-16T17:36:15Z

Reading through the above in a bit more detail, it looks like you were asking for feedback on what approach would be most reasonable. I'll check with coworkers to see if they have any stronger opinions, but for my part, my requirements are very minimal. This is a crusty legacy area and just getting to interop is the important thing. So:

(Must have) Idempotency: i.e., parsing-serializing-parsing should produce the same result.
(Must have) Platform-independence: no differing behavior on Windows/Linux/Mac.
(Important) Match the maximum number of modern browser engines in the maximum number of cases as possible.
(Nice to have) Make the algorithm relatively simple.
(Nice to have) Make the mental model relatively simple. E.g. if we could explain to developers something like "localhost always goes away in this particular fashion" or "N slashes will always collapse to 4 slashes" or "slashes will never collapse" or "file URLs are just like non-file URLs in most ways" then that'd be nice.
(Nice to have) also try to solve related issues like Why is the host dropped if the path contains a Windows drive letter? #302 and "D:\foo" should be parsed as "file:///D:/foo" #271 in ways that are satisfactory to the folks opening them.
(Nice to have) match legacy IE/Edge versions, since those often reflect "what enterprises expect".

Trading (3) off against the nice to haves is also reasonable. E.g. if there's a much simpler version of the algorithm that only matches 1/3 modern browser engines, then we should probably go with that. (But if it matches 0 engines, then it's not good.)

masinter · 2020-09-16T19:03:06Z

Platform independence is difficule since there are many file system types (case sensitive, case insensitive, and many more with unicode and different kinds of unicode normalization) and different behavior for one platform. The platform-independence of box, dropbox, onedrive, icloud, google drive when mounted might be a more achievable goal.

domenic · 2020-09-16T21:35:57Z

The URL parser, i.e. the algorithm we're discussing here, doesn't need to worry about those things. There may be further troubles translating a parsed URL into a file on the filesystem, but that's out of scope for this repository, and more of the providence of e.g. HTTP server software.

alwinb · 2020-09-18T10:21:24Z

This turns out to be a very complicated problem.

The only viable approach I've been able to come up with is to not ignore multiple slashes in file URLs. At least this agrees with Safari and Firefox and the RFCs, and it prevents messy interactions with #302 that break idempotency. (It even agrees with IE/Edge in some cases).

Then for #302 I would propose to preserve the host in file URLs (also in the presence of drive letters) except for localhost, which would be replaced with the empty string. That then disagrees with Firefox but is more in line with the others (considering just the host).

Together these changes will cause quite a bit of disagreement though. But I don't see any better alternative.

Is this clear enough? I can have a look at the state machine to see where to make the changes, but I'd like to be sure that there's some consensus about the approach first.

Somewhat off-topic, and I'm hesitant to mention this, but I decided to go ahead because it could be useful for the standardisation process.

I have tried to build an URL library that is compatible with the standard, but has separate parsing, resolving and normalisation phases. It uses a simple 'theory' of URLs, modelling them as a sequence of tokens. They're similar, but different from the parse trees as per the RFCs. I wrote it down in the Theory section of the README.

It can be useful to describe the different browser behaviours as operations on such token lists. And in general it could help with points 4 and 5 that @domenic mentions above, amongst other things. – Just in case.

domenic · 2020-09-18T15:08:18Z

Is this clear enough? I can have a look at the state machine to see where to make the changes, but I'd like to be sure that there's some consensus about the approach first.

This makes sense to me, and I'm happy to say that Chrome is in favor of it at a high level. (In practice, Chrome is very slow to make changes to its URL parser, so this support doesn't necessarily translate into action. But it should be enough to help proceed with consensus-gathering.)

It sounds like your proposal mostly matches Safari, so probably if we can get @annevk's high-level sign-off for Mozilla, then we're ready to move on to the next step. We can perform a second round of checks with all the browser representatives once we have a spec patch/test results patch/complete listing of mismatches with current browsers.

annevk · 2020-09-21T09:13:09Z

If it works for Chrome and Safari, it'll likely work for Firefox. @jasnell for thoughts from Node.js.

cc @valenting

valenting · 2020-09-21T10:19:24Z

I'm all for it. Changing the Firefox implementation should be relatively straight-forward.

achristensen07 · 2020-09-21T15:10:44Z

alwinb's proposal sounds good to me. It sounds very close to if not exactly what I implemented in WebKit.

alwinb · 2020-09-22T09:33:17Z

Excellent!

I will start with updating the tests (urltestdata.json), then try and make the changes to whatwg-url, and post back here.

I may need some time to learn the process (forks and merge requests and the like), I have not done that before, but I suppose it'll become clear to me as I go along.

alwinb · 2020-09-23T10:04:00Z

I have updated the tests and made the few changes needed to to jsdom/whatwg-url.
My forks are here alwinb/whatwg-url, alwinb/wpt.

I'm not sure how to proceed now. I assume these have to be reviewed before I open any merge request? It seems inappropriate to do so before the standard has been updated.

I can have a look at making those changes to the standard as well (i.e. this repo).

Thanks,

annevk · 2020-09-23T10:20:19Z

@alwinb nice! You should feel free to open PRs and we can label them "pending specification change" (paraphrased). It's not a problem. If you can make the change to the specification too that'd be great. Note that you'll need to sign https://participate.whatwg.org/agreement as well. Hope that helps.

domenic · 2020-09-23T14:58:27Z

Yes, this is amazing work! Please do open pull requests; it's only merging of them that we'll block until the specification is updated. We'll also get an extra benefit, where if you open the web platform tests pull request, a bot will run the tests on all browsers so we can see the results.

The changes to whatwg-url are promising, in that they make the algorithm simpler. Hooray!

domenic · 2020-09-23T17:18:27Z

Looking at https://wpt.fyi/results/url/url-constructor.html?label=pr_head&max-count=1&pr=25716 (which is generated from web-platform-tests/wpt#25716 (comment)) most things look good, but there is a section of tests which fail in more browsers more than would be ideal:

Parsing: </..//localhost//pig> against <file://lion/>, Parsing: </rooibos> against <file://tea/>, and Parsing: </?chai> against <file://tea/> fail in Firefox and Safari, passing only in Chrome.
Parsing: <C|> against <file://host/dir/file> through Parsing: <C|\> against <file://host/dir/file> fail in all three browsers.
Parsing: <C|a> against <file://host/dir/file> fails in Chrome and Firefox, passing in Safari.
Parsing: </c:/foo/bar> against <file://host/path> fails in Firefox and Safari, passing in Chrome.

It looks like most of the cases with 2 failures were also 2 failures before (referencing https://wpt.fyi/results/url/url-constructor.html?label=master&label=experimental&product=chrome&product=firefox&product=safari&aligned), albeit sometimes a different 2. So that's not a regression. Indeed, in cases where all 3 browsers give different results, that's the best we can hope for.

But the conversion of some of them from 2-failures to 3-failures seems suboptimal. It seems to be concentrated around C|-style drive letter handling. Do you think there's something that could be tweaked there to match at least one browser?

alwinb · 2020-09-24T08:47:25Z

Hmm. This is actually about a separate issue – the handling of C| style drive letters.

Apparently, Chrome and Firefox do not handle the C| style drive letters according to the standard (they interpret it as an ordinary path component).

What is interesting is that IE/Edge prior to version 18 also did not interpret C| as a drive letter in a lot of cases. I discovered that whilst making browsershots for this issue and for #302.

I just checked. I have no data on e.g. file:C|, file:/C| and file://C|, nor file://host/C|, But

c|/dir/abc against file:/// gave file:///c%7C/dir/abc.
file:///c|/dir/abc did result in file:///c:/dir/abc.

This has changed since version 18, which does interpret c| as a drive letter.

As an aside, Safari fails the mentioned tests because it (sometimes? or always?) does not copy the host from the base URL when parsing host-less URLs. I suspect that this is not intentional; it seems inconsistent with the way in which it parses absolute file URLs. It does however handle the C| style drive letters according to the standard.

alwinb · 2020-09-24T08:52:12Z

Wait, no there's more going on. I think that Chrome on windows will pass those drive letter tests. If I remember correctly, Chrome disables drive-letter handling on non-windows platforms.

annevk · 2020-09-24T10:22:40Z

Yeah, wpt.fyi does not run Chrome on Windows and what Chrome on Windows does is somewhat important for file: URLs.

domenic · 2020-09-24T21:39:14Z

Ah, OK. Yeah, for drive letter normalization I'd rather align with Windows browsers. Since we have two to choose from these days, aligning with Chrome Windows seems reasonable to me. (Firefox seems to not treat pipe characters specially.)

OK, then yeah, let's proceed with the spec update! Let us know if we can help with that; it should be mostly mechanical given the changes you made to whatwg-url.

alwinb · 2020-09-25T09:41:54Z

Ok, I made the changes, I hope it all went ok!

alwinb mentioned this issue Jul 23, 2018

Needs to ignore repeated slashes in file URL #232

Closed

domenic added the topic: parser label Jul 30, 2018

piscisaureus mentioned this issue Jul 8, 2019

Joining relative path onto a file:// url with a hostname is non spec compliant servo/rust-url#504

Open

alwinb mentioned this issue May 14, 2020

Why is the host dropped if the path contains a Windows drive letter? #302

Closed

annevk mentioned this issue Jun 11, 2020

Non-special schema URL behaviors differently... #526

Closed

ghost mentioned this issue Sep 21, 2020

Support relative URLs #531

Open

This was referenced Sep 23, 2020

Changes URL tests for file path normalisation proposal web-platform-tests/wpt#25716

Merged

Implement proposal for file URL path normalisation jsdom/whatwg-url#170

Merged

alwinb mentioned this issue Sep 25, 2020

Changes to file URL path normalization #544

Merged

annevk closed this as completed in #544 Sep 30, 2020

alwinb mentioned this issue Oct 6, 2020

Fix file URL reparse issue #550

Merged

3 tasks

guybedford mentioned this issue Oct 8, 2020

file: URL separator deduplication API #552

Open

alwinb mentioned this issue Jan 14, 2022

Are drive letters always invalid? #612

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Review slashes after file:/// and file path normalization #405

Review slashes after file:/// and file path normalization #405

alwinb commented Jul 23, 2018 •

edited

Loading

alwinb commented Jul 23, 2018

annevk commented May 10, 2020

alwinb commented May 13, 2020

alwinb commented May 13, 2020 •

edited

Loading

alwinb commented May 20, 2020

domenic commented Sep 16, 2020

domenic commented Sep 16, 2020

domenic commented Sep 16, 2020

masinter commented Sep 16, 2020

domenic commented Sep 16, 2020

alwinb commented Sep 18, 2020

domenic commented Sep 18, 2020

annevk commented Sep 21, 2020

valenting commented Sep 21, 2020

achristensen07 commented Sep 21, 2020

alwinb commented Sep 22, 2020

alwinb commented Sep 23, 2020

annevk commented Sep 23, 2020

domenic commented Sep 23, 2020

domenic commented Sep 23, 2020

alwinb commented Sep 24, 2020

alwinb commented Sep 24, 2020

annevk commented Sep 24, 2020

domenic commented Sep 24, 2020

alwinb commented Sep 25, 2020

Review slashes after file:/// and file path normalization #405

Review slashes after file:/// and file path normalization #405

Comments

alwinb commented Jul 23, 2018 • edited Loading

alwinb commented Jul 23, 2018

annevk commented May 10, 2020

alwinb commented May 13, 2020

alwinb commented May 13, 2020 • edited Loading

alwinb commented May 20, 2020

domenic commented Sep 16, 2020

domenic commented Sep 16, 2020

domenic commented Sep 16, 2020

masinter commented Sep 16, 2020

domenic commented Sep 16, 2020

alwinb commented Sep 18, 2020

domenic commented Sep 18, 2020

annevk commented Sep 21, 2020

valenting commented Sep 21, 2020

achristensen07 commented Sep 21, 2020

alwinb commented Sep 22, 2020

alwinb commented Sep 23, 2020

annevk commented Sep 23, 2020

domenic commented Sep 23, 2020

domenic commented Sep 23, 2020

alwinb commented Sep 24, 2020

alwinb commented Sep 24, 2020

annevk commented Sep 24, 2020

domenic commented Sep 24, 2020

alwinb commented Sep 25, 2020

alwinb commented Jul 23, 2018 •

edited

Loading

alwinb commented May 13, 2020 •

edited

Loading