-
Notifications
You must be signed in to change notification settings - Fork 682
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
URL encoding of CSS values #9301
Comments
Using utf-8 as the URL encoding for all CSS features is what the test currently expects, and what Firefox does. I tried to find compat bugs but didn't find any. Is it possible for webkit and chromium to try switching to utf-8? |
If Chromium is willing to switch around the same time I think that would be ok for WebKit. I'm personally not too worried about compatibility fallout for this. |
Forgot to copy @chrishtr on this issue. I suspect he can weigh in for Chromium and then maybe it can be removed from the agenda. |
I don't see any particular reason to not align, only the possibility of compat issues. So long as those end up fine, let's please align with the rest of the platform. |
What kind of compat issues could there plausibly be here? If I'm reading https://url.spec.whatwg.org/#concept-basic-url-parser correctly the encoding only matters in the query string, so is it cases like |
Correct, coupled with the stylesheet not using UTF-8 as its encoding. |
I tried the change in https://chromium-review.googlesource.com/4846986 and the only test failures are in WPT. That's good. With my Blink API owner hat on, to ship this change in Blink now I'd want to see some kind of compat analysis showing the risk is low, either by looking for such cases in httparchive, or via Chrome use counters. However, if this behavior was already shipped in Firefox and Safari without hiccups I'd be convinced is low enough risk to just try it. |
WebKit is not compliant. I have a patch, but there are similar concerns around stability that would be resolved if Chromium shipped: WebKit/WebKit#17383. 😅 |
The CSS Working Group just discussed
The full IRC log of that discussion<emilio> TabAtkins: basic question is what encodings are used by things in the platform to parse URLs<emilio> ... there's very little interop <emilio> ... but with the tests that show up right now it seems everybody uses utf-8 for @import <emilio> ... but in a stylesheet it uses whatever the stylesheet encoding is <emilio> ... annevk wants to make it explicit <emilio> ... is everyone fine with that? <miriam> q? <miriam> ack TabAtkins <emilio> ... zcorpan mentions that firefox using utf-8 for all URLs <emilio> ... and seems to be fine with that <emilio> ... nobody is concerned about compat fallout <chris> I would prefer to use UTF-8 always <chris> q+ <emilio> ... nobody really uses non-ascii-compatible stuff <emilio> ... so really just a question of standardizing it <miriam> ack chris <emilio> ... two choices: all urls are utf-8, or @import utf-8 and sheet encoding for the rest <fremy> q+ <emilio> chris: I'd prefer utf-8 everywhere <emilio> emilio: +1 to that, seems better to be consistent <miriam> ack fremy <emilio> fremy: is that likely to create invalid files? <myles> q+ to ask a silly question <bramus> +1 to chris’s remark <emilio> emilio: wdym as invalid? <emilio> fremy: you may end up with content that is not parsable in the encoding <emilio> TabAtkins: fix is using utf-8 <emilio> fremy: why do we bother to allow utf-8 if the file is in a different encoding <emilio> TabAtkins: it's not when you parse the file, it's about how you feed it to the url parser <miriam> ack myles <Zakim> myles, you wanted to ask a silly question <emilio> myles: is the proposal that a stylesheet is in an encoding, you find a url and switch encoding? <emilio> TabAtkins: no, you parse as normal but you feed the url parser telling it that it's utf-8 <andreubotella> IIUC this is for url-encoded bytes (%20 and so on) <miriam> ack dbaron <emilio> dbaron: there is a very old backwards compat behavior which is that URLs carry around the encoding of the document that contained them <emilio> ... so that when the network fetch happens you send the bytes to the server instead of a decoded version of them <emilio> ... ideally that should only happen when this backwards compat hack is required <emilio> ... and has been phased out generally to use utf-8 rather than that <emilio> myles: so you get a file, decode decode decode, some stylesheet has a URL and you need to go back and send those bytes rather than the decoded stuff? <emilio> dbaron: the new old way of doing it, not the old old way, is that you store the encoding of the thing in which you found the url along the url <emilio> ... purpose of that is that the server gets the same bytes as the document <emilio> ... which is a horrible hack to mimic the old old behavior <emilio> ... which was where the web just carried bytes around <emilio> myles: So the new old behavior is you round-trip (go bytes to encoding, and then when the request go back to bytes) <emilio> dbaron: yeah, and there's a migration away from that where we just send utf-8 to the server <emilio> ... not sure what the status of that migration is <emilio> ... clearly there's a difference between @import and other urls here <emilio> myles: so proposal at hand is you decode, see a url, then decode those bytes as utf-8? <emilio> TabAtkins: dbaron explained it better, when we put the url in the network stack we just stop carrying that encoding and always put it as utf-8 <emilio> myles: perfect <emilio> dbaron: hope my memory about this is right <emilio> TabAtkins: proposal is for all urls we encode them as utf-8 when they go down the network stack in accordance with the url standard's recomendation <emilio> RESOLVED: for all urls we encode them as utf-8 when they go down the network stack |
@annevk sounds like we're both saying we'd be happy to ship if the other proves it safe by shipping first. Deadlock. Another way forward is to do the kind of web compat analysis required by the Blink launch process. In this case, I think an analysis of CSS resources in http archive could be enough. Concretely, look for any CSS files served with a non-UTF8 encoding, and within those look for any non-ASCII bytes in the query component of any URL. Make a list of all of them, then randomly sample 50. For each, load the site including that CSS in Chrome, Firefox and Safari and see if there's any observable difference. Is that something you'd be willing to do? I could do the blink-dev paperwork pointing to such an analysis and predict we'd try to ship it. |
I can take a stab at writing a query. |
Okay, I've dropped in some text about this.
I think this sufficiently captures the context? Let me know if I can word anything better. |
@tabatkins the URL parser will percent-encode the query, so the bytes that go over the network will be limited to the ASCII range. The issue is which encoding to use when percent-encoding the query string (always utf-8, or the encoding of the stylesheet or document). See https://url.spec.whatwg.org/#query-encoding-example So maybe a bit clearer would be to say that for URLs in CSS, the URL parser's encoding argument must be omitted (i.e. use the default, UTF-8). https://url.spec.whatwg.org/#concept-url-parser |
I've now queried httparchive. Getting the actual encoding is quite tricky, and so I gave up on that. I assume a big chunk is utf-8, but certainly not everything. The dataset is from 2022-07-01 (same as 2022 Web Almanac). Total number of pages is 7,303,959. Number of pages with non-ASCII in the query string in An example match is (I excluded the cases where the first non-ASCII character in the query string is "â" because there were some that used non-ASCII quote marks and with encoding mismatch it became e.g. Full results: https://docs.google.com/spreadsheets/d/1i9Gvs1JIDo5mOw-rwPc5ppI6KrbXC8ol2XJ7h9PGAVs/edit?usp=sharing |
Okay, rephrased. @zcorpan, does this wording look better? (Also, thanks for the archive trawling, and further review in general! Looks pretty safe, then.) |
Two things:
|
As @annevk said, https://url.spec.whatwg.org/#concept-url-parser is the appropriate entry point.
|
Yeah, after discussing with @annevk out-of-thread, I realized that the URL design is, for historical reasons, really weird. I had assumed that URLs stored their value normally, as codepoints, and then percent-encoded when you asked them to be serialized; instead, they store ASCII only, and percent-encode any non-ASCII (or ASCII-but-weird) codepoints during parsing according to a specified encoding. Fix incoming. |
Okay, I'm finally confident I've gotten this right. Thanks for the wording suggestion, @zcorpan, I basically used that. |
LGTM :) |
@zcorpan Sorry to bug you again, but, I kindof insisted we re-write the paragraph so that it can be understandable what's happening to non-URL-spec-editors-and-peers. :) New text is:
(The core sentence hasn't changed, we just added a bunch of explanatory text, which is hopefully correct.) |
@fantasai it's correct, but it really only applies to the query portion of the URL. The remainder was already using UTF-8. And all of this is also the recommended default, so I'm not sure it's worth calling out as if it's something special, but that's up to you. (As in, I'd have a note when you do supply that argument.) |
Yes it looks correct. It's the recommended default but it's different from URLs in HTML links, so it can still be surprising if one isn't already familiar with these details. |
A long time ago @zcorpan wrote a great set of tests to determine whether features used the URL parser with a non-UTF-8 encoding. As you might expect, there is no interoperability:
https://wpt.fyi/results/html/infrastructure/urls/resolving-urls/query-encoding/windows-1252.html%3Finclude%3Dcss
Everyone seems to agree that
@import
uses UTF-8, but CSS values depend on the stylesheet encoding in Chromium and WebKit. I wanted to double check with the WG that this is intentional and everyone is on board with aligning with the standard or if this is something that should be changed.The text was updated successfully, but these errors were encountered: