Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more restrictions to the host parser #159

Closed
gijsk opened this issue Nov 2, 2016 · 16 comments
Closed

Add more restrictions to the host parser #159

gijsk opened this issue Nov 2, 2016 · 16 comments

Comments

@gijsk
Copy link

gijsk commented Nov 2, 2016

AFAICT Safari and Chrome already fail to parse this, both when trying to click links to e.g. "http://www.;.com/" and when trying to use the URL constructor to create that URL.

Edge and Firefox both seem to think it's valid. IE shows an error page but it's not clear to me what it thinks, and it doesn't seem to implement the URL constructor.

@valenting
Copy link
Collaborator

(In reply to zeusex81 from comment #8)

";" was just an example.
here's the characters that only firefox accepts :
http://www.$*+=!;,<>|&~^`'"(){}.com/

We should consider whether these characters should also be valid in a hostname.

@achristensen07
Copy link
Collaborator

I noticed Safari doesn't accept '<' or '>' but Firefox does, and Chrome percent-encodes these characters.

According to https://tools.ietf.org/html/rfc3986 we should only accept ALPHA DIGIT -._~!$&'()*+,;= or properly percent-encoded values, which would have been decoded at this point. Where did the current list of forbidden characters come from? I would support a change to match rfc3986 more closely by having a list of allowed characters instead of a list of forbidden characters.

@annevk
Copy link
Member

annevk commented Dec 9, 2016

The current list is basically maximally liberal while avoiding deeply problematic code points. I did this because in theory hosts basically have no restrictions on what they can contain. But yeah, we should probably move into the other direction given what user agents already do.

Note that since RFC3986 allows percent-encoded characters they do allow basically anything.

@annevk annevk changed the title Semi-colon (;) should be illegal in URL hostname parsing Add more restrictions to the host parser Dec 19, 2016
annevk added a commit to web-platform-tests/wpt that referenced this issue Jan 4, 2017
@annevk
Copy link
Member

annevk commented Jan 4, 2017

I created a test and Safari TP appears to follow the URL Standard, apart from setting to # and ? not failing properly.

Firefox is wrong for U+000B, U+000C, and %.

Chrome is all over the map. Given that Firefox and Safari TP are this close I'm a little more hesitant to make changes, but I'm open to suggestions.

@annevk
Copy link
Member

annevk commented Jan 4, 2017

Chrome's results do mean that moving to a safelist as @achristensen07 suggests is a potential option, although Chrome does allow a lot more through setters... If we discount Chrome's setter results, the post-IDNA safelist would be a-Z, 0-9, +, -, ., and _, which is fairly small.

@achristensen07
Copy link
Collaborator

Don't let Safari TP sway your decision here. I just blindly copied the spec here. I would still support changing the spec

@achristensen07
Copy link
Collaborator

The reason I would still support changing the spec is that NSURL's parser follows rfc3986. I would like WebKit's URL parser to not give NSURL any "valid" URLs it considers to be invalid. I'm not sure what would happen if you did a DNS lookup with invalid characters.

@annevk
Copy link
Member

annevk commented Jan 4, 2017

My analysis of Chrome discounted the fact that while Chrome throws for a lot more hosts, it also happily emits percent-encoded hosts sometimes. So https://}x/ becomes https://%7Dx/ and is not treated as an error.

If we wanted to make a change, I would remove the + from the earlier safelist (fails in stable Safari at least) which leaves us with the ASCII code points I am certain of we need to support.

Namely: a-Z, 0-9, -, ., and _ (only _ is not technically allowed in domain names, but is sometimes used in hostnames around the web and therefore browsers have to special case there HTTPS code around it).

@annevk
Copy link
Member

annevk commented Jan 4, 2017

That would be the most conservative we can go with host names (for special schemes). If that sounds reasonable I'll propose a commit to the test and the standard.

@sleevi
Copy link

sleevi commented Jan 4, 2017

@annevk Not sure I understand your comment on conservatism.

It sounds like you're proposing requiring the host component of the authority (for non-special schemes) to follow the LDH rule (plus _ for historic compat) of DNS? And to not allow any escape sequences? Is that a fair read?

@annevk
Copy link
Member

annevk commented Jan 5, 2017

That is a fair read. (It's conservative with respect to how much input ends up being parsed into something. More would end up rejected.)

@annevk
Copy link
Member

annevk commented Jan 10, 2017

@valenting @achristensen07 @sleevi shall we go with the proposal in #159 (comment)?

@valenting
Copy link
Collaborator

👍 The simpler the better.

@sleevi
Copy link

sleevi commented Jan 10, 2017

@annevk I would say that switching the host parser to observe DNS rules is, right now, for us, a non-goal. It's something that I think we'd be very unlikely to implement in Chrome, in part, because there's more ways than DNS to name things. That's not to say "No, never," but one with known sharp edge cases for our usage, so would require a lot more time and energy to work with internal teams to determine an appropriate suggestion.

In this space, at least, the 'obsoleted' IETF RFC was far more lenient with respect to DNS, and as much as possible, I think we'd want to avoid coupling URLs to DNS for the time being, including any format rules.

(I'm not sure about any thoughts with respect to whether encoding would be an acceptable compromise)

@annevk
Copy link
Member

annevk commented Jan 24, 2017

So given that non-DNS systems are still in use here and there and that at least Google still needs to support them, I'm not going to change the current approach in the standard until we have more data in some way.

If browsers want to be more restrictive in their address bar they can already do so, as long as parsing for <a> and such is not affected.

@annevk
Copy link
Member

annevk commented Feb 8, 2017

Closing this per the above comment due to lack of further feedback. #218 tweaks host parsing a little further, but doesn't really increase the number of restrictions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

5 participants