Proposal: Add a normalization interface #729

Open
gibson042 opened this issue Dec 16, 2022 · 1 comment
Labels
addition/proposal New features or enhancements needs implementer interest Moving the issue forward requires implementers to express interest topic: api

gibson042 commented Dec 16, 2022

As noted in #606 and elsewhere, the URL APIs strongly lean towards preserving input in path and query components, and therefore differentiating URIs that are equivalent per e.g. https://www.rfc-editor.org/rfc/rfc9110#section-4.2.3. But users need to compare such URIs and/or map them to resources, and doing so robustly requires normalization. I think it therefore makes sense to provide a normalization interface, probably one that is configurable (or can become so in the future) to account for the various levels of the "comparison ladder": generic percent-decoding (with case normalization of the percent-encodings that survive), dot-segment removal, component-sensitive percent-decoding, scheme-based rules, and possibly even higher-order considerations such as full case normalization and/or query parameter ordering, combining, and value normalization.

One possibility would be adding a `normalize` method to the `URL` class with reasonable behavior in the absence of any arguments (e.g., as much normalization as possible without conflating URIs that implementations supporting the scheme are permitted to differentiate). Then `new URL("httpS://EXAMPLE.com:443/%7ESMith/./home.html").normalize() === "https://example.com/~SMith/home.html"` would be true, but so would `new URL("http://example.com/data/").normalize() !== new URL("http://example.com/data").normalize()` (because a path with a trailing slash and one without are not equivalent at the level of an http-scheme URL).
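
To make the default behavior described above concrete, here is a minimal sketch written as a standalone function rather than a method. The names `normalizePercentEncoding` and `normalizeUrl` are hypothetical and not part of the URL Standard or of this proposal. The sketch assumes that "as much normalization as possible" means decoding percent-encoded unreserved characters in the path and uppercasing the hex digits of the escapes that remain, on top of what the parser already does.

```javascript
// Hypothetical helpers for illustration only; not an existing or proposed API.
// RFC 3986 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~".
const UNRESERVED = /^[A-Za-z0-9\-._~]$/;

function normalizePercentEncoding(component) {
  return component.replace(/%([0-9A-Fa-f]{2})/g, (match, hex) => {
    const char = String.fromCharCode(parseInt(hex, 16));
    // Decode escapes of unreserved characters; uppercase the rest.
    return UNRESERVED.test(char) ? char : "%" + hex.toUpperCase();
  });
}

function normalizeUrl(input) {
  // The parser already lowercases scheme and host, drops the default
  // port, and removes dot segments.
  const url = new URL(input);
  // Assumption: only the path is normalized in this sketch.
  url.pathname = normalizePercentEncoding(url.pathname);
  return url.href;
}

console.log(normalizeUrl("httpS://EXAMPLE.com:443/%7ESMith/./home.html"));
console.log(normalizeUrl("http://example.com/data/") !== normalizeUrl("http://example.com/data"));
```

The trailing-slash case stays distinct because the sketch never adds or removes path segments; it only rewrites escapes inside them.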

@annevk annevk added topic: api needs implementer interest Moving the issue forward requires implementers to express interest addition/proposal New features or enhancements labels Dec 16, 2022

annevk (Member) commented Jan 17, 2023

From https://www.rfc-editor.org/rfc/rfc3986.html#section-6.2 I think we would want this method to perform "Case Normalization" (essentially only of the %3a to %3A variety) and "Percent-Encoding Normalization".

The other aspects there are either already handled by the URL parser (e.g., httpS://EXAMPLE.com:443/%7ESMith/./home.html is already normalized to https://example.com/%7ESMith/home.html) or out-of-scope. We wouldn't want to offer scheme-based or protocol-based normalization as that's not tenable and better handled by the standards for those schemes and protocols. HTTP(S) schemes end up being covered anyway, but in general schemes are supposed to build on top of the URL Standard.
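
As a quick check of which parts the parser already covers, the following sketch parses the example from this thread and a second, made-up URL containing `%3a`. It assumes an engine whose `URL` follows the URL Standard; the variable names are only illustrative.

```javascript
// Scheme, host, default port, and dot segments are normalized at parse time.
const parsed = new URL("httpS://EXAMPLE.com:443/%7ESMith/./home.html");
console.log(parsed.href); // "https://example.com/%7ESMith/home.html"

// Percent-encodings in the path are kept as written, including hex case,
// which is what a normalization method would still have to handle.
const lowerHex = new URL("https://example.com/%3a");
console.log(lowerHex.pathname);
```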

Now there are some difficulties with "Percent-Encoding Normalization", e.g., https://test/?%%33a. That would have to become https://test/?%253a presumably, but it's not entirely clear as the input is invalid.
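
The difficulty can be shown with a small sketch. Both function names are hypothetical. The first decodes escapes of unreserved characters and ignores stray `%` signs; the second additionally assumes that a `%` not followed by two hex digits should be escaped as `%25`, which is one possible reading of the "presumably" above and not settled behavior.

```javascript
const UNRESERVED_CHAR = /^[A-Za-z0-9\-._~]$/;

// Naive pass: decodes "%33" to "3" and leaves the stray "%" alone,
// so "%%33a" turns into "%3a", which now reads as an encoded ":".
function naiveNormalize(component) {
  return component.replace(/%([0-9A-Fa-f]{2})/g, (match, hex) => {
    const char = String.fromCharCode(parseInt(hex, 16));
    return UNRESERVED_CHAR.test(char) ? char : "%" + hex.toUpperCase();
  });
}

// Assumed alternative: escape a stray "%" as "%25" in the same pass.
function normalizeEscapingStrayPercent(component) {
  return component.replace(/%([0-9A-Fa-f]{2})|%/g, (match, hex) => {
    if (hex === undefined) return "%25"; // "%" without two hex digits
    const char = String.fromCharCode(parseInt(hex, 16));
    return UNRESERVED_CHAR.test(char) ? char : "%" + hex.toUpperCase();
  });
}

console.log(naiveNormalize("%%33a"));                // "%3a"
console.log(normalizeEscapingStrayPercent("%%33a")); // "%253a"
```

Under this assumption the result is stable: running the second function on `%253a` returns `%253a` again.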

And yeah, assuming application/x-www-form-urlencoded for query and normalizing that could make sense to offer as an option, though you could also do this yourself quite easily with url.searchParams.sort().
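
For reference, a short sketch of the do-it-yourself route mentioned above. The wrapper name `sortedQueryHref` and the example URL are made up for illustration; `URLSearchParams.prototype.sort()` is the existing API.

```javascript
// Hypothetical wrapper around the existing searchParams.sort() method.
function sortedQueryHref(input) {
  const url = new URL(input);
  // Stable sort by parameter name; values of duplicate names keep their order.
  url.searchParams.sort();
  return url.href;
}

console.log(sortedQueryHref("https://example.com/?b=2&a=1&a=0"));
// "https://example.com/?a=1&a=0&b=2"
```

Note that sorting re-serializes the query as application/x-www-form-urlencoded, so it can also change how individual characters are encoded (for example, `%20` becomes `+`).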

aarongable pushed a commit to chromium/chromium that referenced this issue Sep 4, 2023
This CL is part of the URL interop 2023 effort. "Intent to Implement
and Ship" is [1].

Currently, when Chrome parses a URL, it decodes percent-encoded ASCII
characters in the URL's path. However, this behavior doesn't align with
the URL Standard [2]. The CL fixes this behavior to retain
percent-encoded ASCII characters in the URL's path.

Before:

> const url = new URL("http://example.com/%41");
> url.href
"http://example.com/A"

After:

> const url = new URL("http://example.com/%41");
> url.href
"http://example.com/%41"

Interoperability:

- Chrome isn't compliant, while Firefox and Safari are compliant.
- I've tested URL APIs in non-browser environments and libraries,
  such as Deno's `URL` implementation [3] and Rust's `url` crate [4],
  both of which are standard-compliant.

Background:

The existing behavior seems to be a result of past decisions. The
comment in `url_canon_path.cc` states:

> // This table was used to be designed to match exactly what IE did
> // with the characters.

Impact:

Regarding implementation, web-exported URL APIs, GURL, and KURL share
the same URL parser and canonicalization backend. Given that these URL
classes are widely used both internally and externally, predicting all
possible consequences and risks is challenging.

Given the very low user metrics [5], we received approval to land [1],
but with a kill switch in place.

UMA:

Usage: 0.000071% (URL.Path.UnescapeEscapedChar [5], as of Aug 2023)

This number isn't specific to any particular use case and represents
an upper bound. The actual impact is likely lower.

Interaction with web servers:

Before:

When a user enters "https://example.com/%41" in the address bar or
clicks a link like <a href="https://example.com/%41">, Chrome sends
"/A" to the server.

After:

Chrome now sends "/%41" to the server, without decoding, similar to
Safari and Firefox. Note that Chrome's address bar will still display
"https://example.com/A" because the address bar formats URLs in its
own way.

For websites, how to handle percent-encoded characters in a URL's path
is up to each website. Since they can receive such URLs from various
clients, not just Chrome, this isn't a new issue for most websites.
They typically decode a URL's path before processing.

Another concern relates to Chromium's internal code or developers who
rely on the current behavior, intentionally or not.

For example, this CL might lead to issues in cases like:

```
const hash = {};

const url1 = new URL("http://example.com/%41");
hash[url1.href] = "v1";
// ...
const url2 = new URL("http://example.com/A");
hash[url2.href]  // Assumed that "v1" is retrieved, but this is no longer true.
```

According to the URL Standard, `url1` and `url2` are not equivalent
[6], but some clients might depend on Chrome's current behavior as a
feature. This presents a risk.
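
Code that wants the old lookup behavior can normalize keys explicitly instead of relying on the parser. The following is a sketch only: `cacheKey` is a hypothetical helper, and it assumes that decoding percent-encoded unreserved characters in the path is the equivalence the caller wants.

```javascript
// Hypothetical key function; not part of Chromium or the URL Standard.
function cacheKey(input) {
  const url = new URL(input);
  url.pathname = url.pathname.replace(/%([0-9A-Fa-f]{2})/g, (match, hex) => {
    const char = String.fromCharCode(parseInt(hex, 16));
    // Decode unreserved characters, keep every other escape (uppercased).
    return /^[A-Za-z0-9\-._~]$/.test(char) ? char : "%" + hex.toUpperCase();
  });
  return url.href;
}

const cache = new Map();
cache.set(cacheKey("http://example.com/%41"), "v1");
console.log(cache.get(cacheKey("http://example.com/A"))); // "v1"
```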

Additional notes:

- This change only affects the URL's path. Other parts like the host
  are not impacted.
- There was a discussion about Chrome's behavior [7]. The consensus is
  that Chrome's behavior should be fixed for better interoperability.
- There's a proposal to add a normalization interface [8] to URL.

- [1] https://groups.google.com/a/chromium.org/g/blink-dev/c/1L8vW_Xo8eY/m/3Otq2TkvAwAJ
- [2] https://url.spec.whatwg.org/#url-parsing
- [3] https://deno.land/[email protected]?s=URL
- [4] https://docs.rs/url/latest/url/
- [5] https://uma.googleplex.com/p/chrome/timeline_v2/?sid=1bb9e227dc4889fd2efbf5755d256c62
- [6] https://url.spec.whatwg.org/#url-equivalence
- [7] whatwg/url#606
- [8] whatwg/url#729

Bug: 1252531
Change-Id: I135b4efbe76bc58ba5b6c5ce652ed0aa72002249
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4607744
Reviewed-by: Daniel Cheng <[email protected]>
Reviewed-by: James Lee <[email protected]>
Reviewed-by: Avi Drissman <[email protected]>
Reviewed-by: Emily Stark <[email protected]>
Commit-Queue: Hayato Ito <[email protected]>
Cr-Commit-Position: refs/heads/main@{#1191900}
ubonass pushed a commit to ubonass/google_url that referenced this issue Nov 10, 2024
NOKEYCHECK=True
GitOrigin-RevId: 21947c4c384cd129c20862475364ec6d430998ea