Are drive letters always invalid? #612

Open
alwinb opened this issue Jun 5, 2021 · 10 comments
Labels
topic: file (Aren't file: URLs the best?)
topic: validation (Pertaining to the rules for URL writing and validity, as opposed to parsing)

Comments

@alwinb
Contributor

alwinb commented Jun 5, 2021

The Writing section suggests that drive letters are allowed in file URLs only if they do not have a host.
This is reflected in the parser (the file-slash state does not issue a validation warning on a drive letter, the other states do).
However, step 5.1 in the scheme state ensures that hostless file URLs are always invalid, so it is out of sync with the Writing section, and it suggests there cannot be a valid file URL with a drive letter.

Maybe the idea is that e.g. file:/c:/etc/ has a drive, whereas file:c:/etc/, file://c:/etc/ and file:///c:/etc/ do not. That makes sense as a way to disambiguate drive letters from path components that 'look like' drive letters. But the parser/resolver does treat the c: part as a drive letter in all of them.
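
For reference, the URL Standard defines a Windows drive letter as two code points: an ASCII alpha followed by `:` or `|`. A minimal Python sketch of that check follows. The function name is illustrative and not taken from the standard or any library, and the sketch looks at a single path segment only; it does not reproduce the parser states discussed above.

```python
import string


def is_windows_drive_letter(segment: str) -> bool:
    """Return True for two code points: an ASCII alpha, then ':' or '|'."""
    return (
        len(segment) == 2
        and segment[0] in string.ascii_letters
        and segment[1] in ":|"
    )


print(is_windows_drive_letter("c:"))   # True
print(is_windows_drive_letter("c|"))   # True
print(is_windows_drive_letter("etc"))  # False
```

By this definition the `c:` segment in all four example URLs is a drive letter candidate; the number of slashes before it does not change the check itself.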

@annevk
Member

annevk commented Jun 14, 2021

I don't think they should be invalid, except perhaps if Windows moved away from them.

@karwa
Contributor

karwa commented Jun 15, 2021

> The Writing section suggests that drive letters are allowed in file URLs only if they do not have a host.

I'm not sure if that is correct. Of the 4 kinds of paths Windows supports (yes, really), one of them is DOS device paths, for example: \\.\C:\Windows\. Since it has the same format as a UNC path, I wouldn't be surprised if some systems turned that into file://./C:/Windows/ (i.e. a hostname which is a single dot). Sometimes, it's the only way to refer to locations on disk (e.g. if you need to refer to a folder named "con", which is otherwise reserved for legacy reasons).

> Maybe the idea is that e.g. file:/c:/etc/ has a drive, whereas file:c:/etc/, file://c:/etc/ and file:///c:/etc/ do not. That makes sense as a way to disambiguate drive letters from path components that 'look like' drive letters. But the parser/resolver does treat the c: part as a drive letter in all of them.

The latter 2 URLs certainly do have drives. The version with 2 slashes was used by legacy software that basically just stuck file:// at the front of a DOS path, and the version with 3 slashes is considered the modern way to express a Windows path as a file URL (it's what the Windows system APIs will give you, for instance). It's likely valuable to support the version with no slashes as well (file:c:/etc), given that Windows paths do not begin with a leading slash like POSIX paths do.

I don't think it's worth trying too hard to disambiguate drive letters from path components that look like drive letters. There are obviously some trade-offs that need to be accepted when trying to represent both Windows and POSIX paths in a single format.

@alwinb
Contributor Author

alwinb commented Jun 16, 2021

Thank you, this is useful information.

I think I made mistakes reading the standard, so there are mistakes in my question (I think...).

My expectation is that three slashes followed by a c: or c| style drive letter are valid, e.g. file:///c:, file:///c| and the like (but not zero, one or two slashes, which are tolerated by the parser yet not valid, in line with the other special URLs).

> I don't think it's worth trying too hard to disambiguate drive letters from path components that look like drive letters. There are obviously some trade-offs that need to be accepted when trying to represent both Windows and POSIX paths in a single format.

I am interested in this, mostly because it is exactly what I am trying to do... So I should try to understand your point.
The thing is, drive letters act differently from path components when resolving URLs, so I would not know a way around it.

@karwa
Contributor

karwa commented Jun 16, 2021

You could work around it by percent-encoding the : or | character; c%3A is not considered a drive letter.

When creating a file URL from a file path, you already need to encode characters which the URL parser would otherwise interpret (e.g. ? and # may be valid characters on the filesystem), so this is just another one.
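
As a rough illustration of that workaround, here is a minimal Python sketch that builds a file URL from an absolute POSIX path and percent-encodes every segment. The function name `posix_path_to_file_url` is hypothetical, and the sketch encodes more aggressively than strictly necessary (anything outside the unreserved characters), which is an assumption made for simplicity rather than something the URL Standard requires.

```python
from urllib.parse import quote


def posix_path_to_file_url(path: str) -> str:
    """Build a file URL from an absolute POSIX path, encoding each segment."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute POSIX path")
    # safe="" makes quote() encode ':', '|', '?' and '#' as well,
    # so a segment such as 'c:' becomes 'c%3A' and is no longer
    # two code points that look like a drive letter.
    encoded = "/".join(quote(segment, safe="") for segment in path.split("/"))
    return "file://" + encoded


print(posix_path_to_file_url("/c:/etc/hosts"))    # file:///c%3A/etc/hosts
print(posix_path_to_file_url("/tmp/what?#.txt"))  # file:///tmp/what%3F%23.txt
```

The sketch does not handle segments that are literally `.` or `..`, which a URL parser would still interpret as dot segments.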

@alwinb
Contributor Author

alwinb commented Jun 17, 2021

Exactly, either that, or always prefixing the drive-letter-like component with ./ to prevent it from being handled as a drive letter, i.e. a very slight addition to #505

@alwinb
Contributor Author

alwinb commented Jun 23, 2021

Step 2.4.1 in the path state suggests at least that both file:///c| and file:///c: are valid.
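
As I read that step, a drive letter is only recognised when the path is still empty, and its second code point is then rewritten to `:`. A small Python sketch of that reading follows. The function name is made up for illustration, and it ignores the scheme check, host handling and validation errors that the real path state also deals with.

```python
def append_file_path_segment(path: list[str], buffer: str) -> None:
    """Append buffer to path, normalizing a leading drive letter to 'X:'."""
    looks_like_drive = (
        len(buffer) == 2
        and buffer[0].isascii()
        and buffer[0].isalpha()
        and buffer[1] in ":|"
    )
    # Only the first segment of the path is treated as a drive letter.
    if not path and looks_like_drive:
        buffer = buffer[0] + ":"
    path.append(buffer)


path: list[str] = []
for segment in ("c|", "d|", "etc"):
    append_file_path_segment(path, segment)
print(path)  # ['c:', 'd|', 'etc']
```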

@karwa
Contributor

karwa commented Jul 1, 2021

> I'm not sure if that is correct. Of the 4 kinds of paths Windows supports (yes, really), one of them is DOS device paths, for example: \\.\C:\Windows\. Since it has the same format as a UNC path, I wouldn't be surprised if some systems turned that into file://./C:/Windows/ (i.e. a hostname which is a single dot).

On further investigation, it appears that, while DOS device paths can semantically contain drive letters, they are not considered the path root when they are in such paths. That's what Microsoft's documentation says, and indeed, using the Windows API function GetFullPathName to normalize \\.\C:\foo\bar\..\..\..\..\.. returns \\.\

Buuuuut, since we're talking about file paths, of course the situation is actually more complex than that. If the file URL has a hostname, and that hostname is not . or ? (and it won't be the latter, because we don't support that), then the URL represents a UNC path. That means - regardless of whether it looks like a drive letter - we should consider the first component the path root, since it is the UNC share. You shouldn't be able to pop the UNC share with .. in the same way you can't pop the drive letter. Normalizing \\server\share\foo\..\..\..\..\..\.. via the same Windows API returns \\server\share.

So basically: for most hostnames, we should always consider the share name as the root, which is semantically not a drive letter (i.e. shouldn't be normalized). Today, for every hostname, we only consider some share names to be the root, because they coincidentally look like drive letters, and we normalize them when really we shouldn't.
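
To make the difference concrete, here is a minimal Python sketch of dot-segment removal with a protected root. The `protected` parameter is hypothetical: `protected=1` models treating the first component (a UNC share or a drive letter) as a root that `..` cannot pop, and `protected=0` models a first component that gets popped like any other segment.

```python
def remove_dot_segments(segments: list[str], protected: int = 0) -> list[str]:
    """Resolve '.' and '..', never popping the first `protected` segments."""
    out: list[str] = []
    for segment in segments:
        if segment == ".":
            continue
        if segment == "..":
            # Only pop when something above the protected root remains.
            if len(out) > protected:
                out.pop()
            continue
        out.append(segment)
    return out


unc = ["share", "foo", "..", "..", "..", ".."]
print(remove_dot_segments(unc, protected=1))  # ['share']
print(remove_dot_segments(unc, protected=0))  # []
```

With `protected=1` the result matches what the Windows API returns for \\server\share\foo\..\..\..\..\..\.., namely a path that still ends at the share.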

As far as I can tell, neither Edge/Chrome nor IE actually handle UNC path roots correctly (file://127.0.0.1/SomeShare/somedir/../../../ won't be found, even though file://127.0.0.1/SomeShare/ works), so I'd predict that using a file: URL to follow relative links on a static HTML page accessed via UNC might also not be very robust.

annevk added the "topic: file" label on Oct 20, 2021
@alwinb
Contributor Author

alwinb commented Jan 14, 2022

Looking at this again, I think that the writing section for file URLs still reflects the situation from before #405 and specifically #302, where (normalised/resolved) file URLs that had a drive could not have a host. Which is great. Updating it to reflect the changes should simplify things, which is nice!

One question is whether the drive letter should be included as a production in the grammar (the writing rules) for file URLs. I expected that to be the case, but it seems that drive letters are not distinguished from ordinary path components right now.

@alwinb
Contributor Author

alwinb commented Jan 7, 2024

In the writing section, in the definition of relative-URL string, in the "file" case, the following:

  • a scheme-relative-file-URL string
  • a path-absolute-URL string if base URL’s host is an empty host
  • a path-absolute-non-Windows-file-URL string if base URL’s host is not an empty host
  • a path-relative-scheme-less-URL string

can be replaced with:

  • a scheme-relative-file-URL string
  • a path-absolute-URL string
  • a path-relative-scheme-less-URL string

annevk added the "topic: validation" label on Dec 2, 2024
@annevk
Member

annevk commented Dec 2, 2024

Thanks for pointing that out. This text was written long ago and hasn't seen a lot of review.

I think that also means we should update https://url.spec.whatwg.org/#scheme-relative-file-url-string to say something like

//, optionally followed by a valid host string, optionally followed by a path-absolute URL string

right?

And that in turn means "path-absolute-non-Windows-file-URL string" can be removed.


3 participants