Skip to content

Commit

Permalink
Prevent decoding of percent-encoded ASCII characters in URL's path
Browse files Browse the repository at this point in the history
This CL is part of the URL interop 2023 effort. "Intent to Implement
and Ship" is [1].

Currently, when Chrome parses a URL, it decodes percent-encoded ASCII
characters in URL path. However, this behavior doesn't align with the
URL Standard [2]. The CL fixes this behavior to retain percent-encoded
ASCII characters in URL's path.

Before:

> const url = new URL("http://example.com/%41");
> url.href
"http://example.com/A"

After:

> const url = new URL("http://example.com/%41");
> url.href
"http://example.com/%41"

Interoperability:

- Chrome isn't compliant, while Firefox and Safari are compliant.
- I've tested URL APIs in non-browser environments and libraries,
  such as Deno's `URL` implementation [3] and Rust's `url` crate [4],
  both of which are standard-compliant.

Background:

The existing behavior seems to be a result of past decisions. The
comment in `url_canon_path.cc` states:

> // This table was used to be designed to match exactly what IE did
> // with the characters.

Impact:

Regarding implementation, web-exported URL APIs, GURL, and KURL share
the same URL parser and canonicalization backend. Given that these URL
classes are widely used both internally or externally, predicting all
possible consequences and risks is challenging.

Given the very low user metrics [5], we received approval to land [1],
but with a kill switch in place.

UMA:

Usage: 0.000071% (URL.Path.UnescapeEscapedChar [5], as of Aug 2023)

This number isn't specific to any particular use case and represents a
an upper bound. The actual impact is likely lower.

Interaction with web servers:

Before:

When a user enters "https://example.com/%41" in the address bar or
clicks a link like <a href="https://example.com/%41">, Chrome sends
"/A" to the server.

After::

Chrome now sends "/%41" to the server, without decoding, similar to
Safari and Firefox. Note that Chrome's address bar will still display
"https://example.com/A" because the address bar formats URLs in their
own way.

For websites, how to handle percent-encoded characters in a URL's path
is up to each website. Since they can receive such URLs from various
clients, not just Chrome, this isn't a new issue for most websites.
They typically decode a URL's path before processing.

Another concern relates to Chromium's internal code or developers who
rely on the current behavior, intentionally or not.

For example, this CL might lead to issues in cases like:

```
const hash = {};

const url1 = new URL("http://example.com/%41");
hash[url1.href] = "v1";
// ...
const url2 = new URL("http://example.com/A");
hash[url2.href]  // Assumed that "v1" is retrieved, but this is no longer true.
```

According to the URL Standard, `url1` and `url2` are not equivalent
[6], but some clients might depend on Chrome's current behavior as a
feature. This presents a risk.

Additional notes:

- This change only affects the URL's path. Other parts like the host
  are not impacted.
- There was a discussion about Chrome's behavior [7]. The consensus is
  that Chrome's behavior should be fixed for better interoperability.
- There's a proposal to add a normalization interface [8] to URL.

- [1] https://groups.google.com/a/chromium.org/g/blink-dev/c/1L8vW_Xo8eY/m/3Otq2TkvAwAJ
- [2] https://url.spec.whatwg.org/#url-parsing
- [3] https://deno.land/[email protected]?s=URL
- [4] https://docs.rs/url/latest/url/
- [5] https://uma.googleplex.com/p/chrome/timeline_v2/?sid=1bb9e227dc4889fd2efbf5755d256c62
- [6] https://url.spec.whatwg.org/#url-equivalence
- [7] whatwg/url#606
- [8] whatwg/url#729

Bug: 1252531
Change-Id: I135b4efbe76bc58ba5b6c5ce652ed0aa72002249
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4607744
Reviewed-by: Daniel Cheng <[email protected]>
Reviewed-by: James Lee <[email protected]>
Reviewed-by: Avi Drissman <[email protected]>
Reviewed-by: Emily Stark <[email protected]>
Commit-Queue: Hayato Ito <[email protected]>
Cr-Commit-Position: refs/heads/main@{#1191900}
NOKEYCHECK=True
GitOrigin-RevId: 21947c4c384cd129c20862475364ec6d430998ea
  • Loading branch information
hayatoito authored and copybara-github committed Sep 4, 2023
1 parent 99efc80 commit cc2a8e4
Show file tree
Hide file tree
Showing 4 changed files with 58 additions and 17 deletions.
7 changes: 6 additions & 1 deletion url_canon_path.cc
Original file line number Diff line number Diff line change
Expand Up @@ -6,10 +6,12 @@

#include "base/check.h"
#include "base/check_op.h"
#include "base/feature_list.h"
#include "base/metrics/histogram_functions.h"
#include "third_party/abseil-cpp/absl/types/optional.h"
#include "url/url_canon.h"
#include "url/url_canon_internal.h"
#include "url/url_features.h"
#include "url/url_parse_internal.h"

namespace url {
Expand Down Expand Up @@ -37,6 +39,7 @@ enum CharacterFlags {
// This character must be unescaped in canonical output. Not valid with
// ESCAPE or PASS. We DON'T set the SPECIAL flag since if we encounter these
// characters unescaped, they should just be copied.
// TODO(https://crbug.com/1252531): This is guarded by a feature flag.
UNESCAPE = 4,
};

Expand Down Expand Up @@ -333,7 +336,9 @@ bool DoPartialPathInternal(const CHAR* spec,
// the last character of the escape sequence.
char unescaped_flags = kPathCharLookup[unescaped_value];

if (unescaped_flags & UNESCAPE) {
if (!base::FeatureList::IsEnabled(
url::kDontDecodeAsciiPercentEncodedURLPath) &&
(unescaped_flags & UNESCAPE)) {
// This escaped value shouldn't be escaped. Try to copy it.
unescape_escaped_char = true;

Expand Down
58 changes: 42 additions & 16 deletions url_canon_unittest.cc
Original file line number Diff line number Diff line change
Expand Up @@ -1275,13 +1275,13 @@ DualComponentCase kCommonPathCases[] = {
{"/foo%2\xc2\xa9zbar", nullptr, "/foo%2%C2%A9zbar", Component(0, 16), true},
{nullptr, L"/foo%2\xc2\xa9zbar", "/foo%2%C3%82%C2%A9zbar", Component(0, 22),
true},
// Regular characters that are escaped should be unescaped
{"/foo%41%7a", L"/foo%41%7a", "/fooAz", Component(0, 6), true},
// Regular characters that are escaped should remain escaped
{"/foo%41%7a", L"/foo%41%7a", "/foo%41%7a", Component(0, 10), true},
// Funny characters that are unescaped should be escaped
{"/foo\x09\x91%91", nullptr, "/foo%09%91%91", Component(0, 13), true},
{nullptr, L"/foo\x09\x91%91", "/foo%09%C2%91%91", Component(0, 16), true},
// %00 should not cause failures.
{"/foo%00%51", L"/foo%00%51", "/foo%00Q", Component(0, 8), true},
{"/foo%00%51", L"/foo%00%51", "/foo%00%51", Component(0, 10), true},
// Some characters should be passed through unchanged regardless of esc.
{"/(%28:%3A%29)", L"/(%28:%3A%29)", "/(%28:%3A%29)", Component(0, 13),
true},
Expand All @@ -1302,21 +1302,20 @@ DualComponentCase kCommonPathCases[] = {
"/%7Ffp3%3Eju%3Dduvgw%3Dd", Component(0, 24), true},
// @ should be passed through unchanged (escaped or unescaped).
{"/@asdf%40", L"/@asdf%40", "/@asdf%40", Component(0, 9), true},
// Nested escape sequences should result in escaping the leading '%' if
// unescaping would result in a new escape sequence.
{"/%A%42", L"/%A%42", "/%25AB", Component(0, 6), true},
{"/%%41B", L"/%%41B", "/%25AB", Component(0, 6), true},
{"/%%41%42", L"/%%41%42", "/%25AB", Component(0, 6), true},
// Nested escape sequences no longer happen. See https://crbug.com/1252531.
{"/%A%42", L"/%A%42", "/%A%42", Component(0, 6), true},
{"/%%41B", L"/%%41B", "/%%41B", Component(0, 6), true},
{"/%%41%42", L"/%%41%42", "/%%41%42", Component(0, 8), true},
// Make sure truncated "nested" escapes don't result in reading off the
// string end.
{"/%%41", L"/%%41", "/%A", Component(0, 3), true},
{"/%%41", L"/%%41", "/%%41", Component(0, 5), true},
// Don't unescape the leading '%' if unescaping doesn't result in a valid
// new escape sequence.
{"/%%470", L"/%%470", "/%G0", Component(0, 4), true},
{"/%%2D%41", L"/%%2D%41", "/%-A", Component(0, 4), true},
{"/%%470", L"/%%470", "/%%470", Component(0, 6), true},
{"/%%2D%41", L"/%%2D%41", "/%%2D%41", Component(0, 8), true},
// Don't erroneously downcast a UTF-16 character in a way that makes it
// look like part of an escape sequence.
{nullptr, L"/%%41\x0130", "/%A%C4%B0", Component(0, 9), true},
{nullptr, L"/%%41\x0130", "/%%41%C4%B0", Component(0, 11), true},

// ----- encoding tests -----
// Basic conversions
Expand Down Expand Up @@ -1622,7 +1621,7 @@ TEST(URLCanonTest, CanonicalizeStandardURL) {
"ws://%29w%1ew%81/", false},
// Regression test for the last_invalid_percent_index bug described in
// https://crbug.com/1080890#c10.
{R"(HTTP:S/5%\../>%41)", "http://s/%3EA", true},
{R"(HTTP:S/5%\../>%41)", "http://s/%3E%41", true},
};

for (size_t i = 0; i < std::size(cases); i++) {
Expand Down Expand Up @@ -2742,7 +2741,29 @@ TEST(URLCanonTest, IDNToASCII) {
output.set_length(0);
}

TEST(URLCanonTest, UnescapePathCharHistogram) {
class URLCanonAsciiPercentEncodePathTest
: public ::testing::Test,
public ::testing::WithParamInterface<bool> {
public:
URLCanonAsciiPercentEncodePathTest() {
if (GetParam()) {
scoped_feature_list_.InitAndEnableFeature(
url::kDontDecodeAsciiPercentEncodedURLPath);
} else {
scoped_feature_list_.InitAndDisableFeature(
url::kDontDecodeAsciiPercentEncodedURLPath);
}
}

private:
base::test::ScopedFeatureList scoped_feature_list_;
};

INSTANTIATE_TEST_SUITE_P(All,
URLCanonAsciiPercentEncodePathTest,
::testing::Bool());

TEST_P(URLCanonAsciiPercentEncodePathTest, UnescapePathCharHistogram) {
struct TestCase {
base::StringPiece path;
base::HistogramBase::Count cnt;
Expand All @@ -2760,8 +2781,13 @@ TEST(URLCanonTest, UnescapePathCharHistogram) {
StdStringCanonOutput output(&out_str);
bool success = CanonicalizePath(c.path.data(), in_comp, &output, &out_comp);
ASSERT_TRUE(success);
histogram_tester.ExpectBucketCount("URL.Path.UnescapeEscapedChar", 1,
c.cnt);
if (base::FeatureList::IsEnabled(
url::kDontDecodeAsciiPercentEncodedURLPath)) {
histogram_tester.ExpectBucketCount("URL.Path.UnescapeEscapedChar", 1, 0);
} else {
histogram_tester.ExpectBucketCount("URL.Path.UnescapeEscapedChar", 1,
c.cnt);
}
}
}

Expand Down
5 changes: 5 additions & 0 deletions url_features.cc
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,11 @@ BASE_FEATURE(kResolveBareFragmentWithColonOnNonHierarchical,
"ResolveBareFragmentWithColonOnNonHierarchical",
base::FEATURE_ENABLED_BY_DEFAULT);

// Kill switch for crbug.com/1252531.
BASE_FEATURE(kDontDecodeAsciiPercentEncodedURLPath,
"DontDecodeAsciiPercentEncodedURLPath",
base::FEATURE_ENABLED_BY_DEFAULT);

bool IsUsingIDNA2008NonTransitional() {
// If the FeatureList isn't available yet, fall back to the feature's default
// state. This may happen during early startup, see crbug.com/1441956.
Expand Down
5 changes: 5 additions & 0 deletions url_features.h
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,11 @@ BASE_DECLARE_FEATURE(kStrictIPv4EmbeddedIPv6AddressParsing);
COMPONENT_EXPORT(URL)
BASE_DECLARE_FEATURE(kResolveBareFragmentWithColonOnNonHierarchical);

// When enabled, percent-encoded ASCII characters in URL path are not decoded
// automatically. See https://crbug.com/125231.
COMPONENT_EXPORT(URL)
BASE_DECLARE_FEATURE(kDontDecodeAsciiPercentEncodedURLPath);

} // namespace url

#endif // URL_URL_FEATURES_H_

0 comments on commit cc2a8e4

Please sign in to comment.