Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent validation errors for URLs with leading or trailing spaces #4166

Merged
merged 9 commits into from
Jan 24, 2020
18 changes: 14 additions & 4 deletions includes/sanitizers/class-amp-tag-and-attribute-sanitizer.php
Original file line number Diff line number Diff line change
Expand Up @@ -1626,7 +1626,7 @@ private function check_attr_spec_rule_value_regex_casei( DOMElement $node, $attr
private function check_attr_spec_rule_valid_url( DOMElement $node, $attr_name, $attr_spec_rule ) {
if ( isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ] ) && $node->hasAttribute( $attr_name ) ) {
foreach ( $this->extract_attribute_urls( $node->getAttributeNode( $attr_name ) ) as $url ) {
$url = urldecode( $url );
westonruter marked this conversation as resolved.
Show resolved Hide resolved
$url = $this->normalize_url_from_attribute_value( $url );

// Check whether the URL is parsable.
$parts = wp_parse_url( $url );
Expand All @@ -1644,7 +1644,7 @@ private function check_attr_spec_rule_valid_url( DOMElement $node, $attr_name, $
}

// Check if the host contains invalid chars (hostCharIsValid: https://github.com/ampproject/amphtml/blob/af1e3a550feeafd732226202b8d1f26dcefefa18/validator/engine/parse-url.js#L62-L103).
$host = wp_parse_url( $url, PHP_URL_HOST );
$host = wp_parse_url( urldecode( $url ), PHP_URL_HOST );
if ( $host && preg_match( '/[!"#$%&\'()*+,\/:;<=>?@[\]^`{|}~\s]/', $host ) ) {
return AMP_Rule_Spec::FAIL;
}
Expand Down Expand Up @@ -1672,6 +1672,16 @@ private function parse_protocol( $url ) {
return null;
}

/**
* Normalize a URL that appeared as a tag attribute.
*
* @param string $url The URL to normalize.
* @return string The normalized URL.
*/
private function normalize_url_from_attribute_value( $url ) {
return preg_replace( '/\s/', '', $url );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't quite right.

Only tabs and newlines should be stripped. See https://github.com/ampproject/amphtml/blob/af1e3a550feeafd732226202b8d1f26dcefefa18/validator/engine/parse-url.js#L217-L218

So this, along with the trimming, should be:

Suggested change
return preg_replace( '/\s/', '', $url );
return preg_replace( '/[\t\r\n]/', '', trim( $url ) );

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll make that change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's applied in f193019

}

/**
* Check if attribute has a protocol value rule determine if it matches.
*
Expand All @@ -1689,7 +1699,7 @@ private function check_attr_spec_rule_allowed_protocol( DOMElement $node, $attr_
if ( isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ][ AMP_Rule_Spec::ALLOWED_PROTOCOL ] ) ) {
if ( $node->hasAttribute( $attr_name ) ) {
foreach ( $this->extract_attribute_urls( $node->getAttributeNode( $attr_name ) ) as $url ) {
$url_scheme = $this->parse_protocol( $url );
$url_scheme = $this->parse_protocol( $this->normalize_url_from_attribute_value( $url ) );
if ( isset( $url_scheme ) && ! in_array( strtolower( $url_scheme ), $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ][ AMP_Rule_Spec::ALLOWED_PROTOCOL ], true ) ) {
return AMP_Rule_Spec::FAIL;
}
Expand All @@ -1701,7 +1711,7 @@ private function check_attr_spec_rule_allowed_protocol( DOMElement $node, $attr_
foreach ( $attr_spec_rule[ AMP_Rule_Spec::ALTERNATIVE_NAMES ] as $alternative_name ) {
if ( $node->hasAttribute( $alternative_name ) ) {
foreach ( $this->extract_attribute_urls( $node->getAttributeNode( $alternative_name ), $attr_name ) as $url ) {
$url_scheme = $this->parse_protocol( $url );
$url_scheme = $this->parse_protocol( $this->normalize_url_from_attribute_value( $url ) );
if ( isset( $url_scheme ) && ! in_array( strtolower( $url_scheme ), $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ][ AMP_Rule_Spec::ALLOWED_PROTOCOL ], true ) ) {
return AMP_Rule_Spec::FAIL;
}
Expand Down
226 changes: 219 additions & 7 deletions tests/php/test-amp-tag-and-attribute-sanitizer-private-methods.php
Original file line number Diff line number Diff line change
Expand Up @@ -1413,6 +1413,66 @@ public function get_check_attr_spec_rule_blacklisted_value_regex() {
];
}

/**
* Gets the test data for test_check_attr_spec_rule_valid_url().
*
* @return array The test data.
*/
public function get_check_attr_spec_rule_valid_url() {
return [
'no_attribute' => [
'<a></a>',
AMP_Rule_Spec::NOT_APPLICABLE,
],
'correct_url' => [
'<a baz="https://wp.org"></a>',
AMP_Rule_Spec::PASS,
],
'correct_url_leading_space' => [
'<a baz=" https://wp.org"></a>',
AMP_Rule_Spec::PASS,
],
'non_parseable_url' => [
'<a baz="//"></a>',
AMP_Rule_Spec::FAIL,
],
'wrong_protocol' => [
'<a baz="@:wp.org"></a>',
AMP_Rule_Spec::FAIL,
],
'wrong_host' => [
'<a baz="https://wp$camp.org"></a>',
AMP_Rule_Spec::FAIL,
],
];
}

/**
* Tests check_attr_spec_rule_valid_url.
*
* @dataProvider get_check_attr_spec_rule_valid_url
* @group allowed-tags-private-methods
* @covers AMP_Tag_And_Attribute_Sanitizer::check_attr_spec_rule_valid_url()
*
* @param array $source The HTML source to test.
* @param string $expected The expected return value.
* @throws ReflectionException If it's not possible to create a reflection to call the private method.
*/
public function test_check_attr_spec_rule_valid_url( $source, $expected ) {
$node_tag_name = 'a';
$dom = AMP_DOM_Utils::get_dom_from_content( $source );
$sanitizer = new AMP_Tag_And_Attribute_Sanitizer( $dom );
$node = $dom->getElementsByTagName( $node_tag_name )->item( 0 );
$attr_name = 'baz';
$attr_spec_rule = [ 'value_url' => [] ];

$this->assertEquals(
$expected,
$this->call_private_method( $sanitizer, 'check_attr_spec_rule_valid_url', [ $node, $attr_name, $attr_spec_rule ] ),
sprintf( 'using source: %s', $source )
);
}

/**
* @dataProvider get_check_attr_spec_rule_blacklisted_value_regex
* @group allowed-tags-private-methods
Expand All @@ -1429,7 +1489,7 @@ public function test_check_attr_spec_rule_blacklisted_value_regex( $data, $expec

public function get_check_attr_spec_rule_allowed_protocol() {
return [
'no_attributes' => [
'no_attributes' => [
[
'source' => '<div></div>',
'node_tag_name' => 'div',
Expand All @@ -1438,7 +1498,7 @@ public function get_check_attr_spec_rule_allowed_protocol() {
],
AMP_Rule_Spec::NOT_APPLICABLE,
],
'protocol_pass' => [
'protocol_pass' => [
[
'source' => '<div attribute1="http://example.com"></div>',
'node_tag_name' => 'div',
Expand All @@ -1454,7 +1514,23 @@ public function get_check_attr_spec_rule_allowed_protocol() {
],
AMP_Rule_Spec::PASS,
],
'protocol_multiple_pass' => [
'protocol_pass_leading_space' => [
[
'source' => '<div attribute1=" http://example.com"></div>',
'node_tag_name' => 'div',
'attr_name' => 'attribute1',
'attr_spec_rule' => [
'value_url' => [
'protocol' => [
'http',
'https',
],
],
],
],
AMP_Rule_Spec::PASS,
],
'protocol_multiple_pass' => [
[
'source' => '<div attribute1="http://example.com, https://domain.com"></div>',
'node_tag_name' => 'div',
Expand All @@ -1470,7 +1546,7 @@ public function get_check_attr_spec_rule_allowed_protocol() {
],
AMP_Rule_Spec::PASS,
],
'protocol_fail' => [
'protocol_fail' => [
[
'source' => '<div attribute1="data://example.com"></div>',
'node_tag_name' => 'div',
Expand All @@ -1486,7 +1562,7 @@ public function get_check_attr_spec_rule_allowed_protocol() {
],
AMP_Rule_Spec::FAIL,
],
'protocol_multiple_fail' => [
'protocol_multiple_fail' => [
[
'source' => '<img srcset="http://example.com, data://domain.com">',
'node_tag_name' => 'img',
Expand All @@ -1502,7 +1578,7 @@ public function get_check_attr_spec_rule_allowed_protocol() {
],
AMP_Rule_Spec::FAIL,
],
'protocol_alternative_pass' => [
'protocol_alternative_pass' => [
[
'source' => '<div attribute1_alternative1="http://example.com"></div>',
'node_tag_name' => 'div',
Expand All @@ -1521,7 +1597,7 @@ public function get_check_attr_spec_rule_allowed_protocol() {
],
AMP_Rule_Spec::PASS,
],
'protocol_alternative_fail' => [
'protocol_alternative_fail' => [
[
'source' => '<div attribute1_alternative1="data://example.com"></div>',
'node_tag_name' => 'div',
Expand All @@ -1543,6 +1619,142 @@ public function get_check_attr_spec_rule_allowed_protocol() {
];
}

/**
* Gets the test data for test_parse_protocol().
*
* @return array The test data.
*/
public function get_parse_protocol_data() {
return [
'empty_string' => [
'',
false,
],
'only_space' => [
' ',
false,
],
'traditional_https' => [
'https://example.com',
'https',
],
'trailing_space' => [
'https://foo.com ',
'https',
],
'no_colon' => [
'//image.png ',
false,
],
'two_colons' => [
'foo:baz://image.png ',
'foo:baz',
],
];
}

/**
* Tests parse_protocol.
*
* @dataProvider get_parse_protocol_data
* @group allowed-tags-private-methods
* @covers AMP_Tag_And_Attribute_Sanitizer::parse_protocol()
*
* @param array $url The URL to parse.
* @param string $expected The expected return value.
* @throws ReflectionException If it's not possible to create a reflection to call the private method.
*/
public function test_parse_protocol( $url, $expected ) {
$dom = AMP_DOM_Utils::get_dom_from_content( '' );
$sanitizer = new AMP_Tag_And_Attribute_Sanitizer( $dom );

$this->assertEquals(
$expected,
$this->call_private_method( $sanitizer, 'parse_protocol', [ $url ] )
);
}

/**
* Gets the test data for test_normalize_url_from_attribute_value().
*
* @return array The test data.
*/
public function get_normalize_url_data() {
$normalized_url = 'https://example.com';

return [
'nothing_to_remove' => [
'https://example.com',
],
'empty_string' => [
'',
],
'only_space' => [
' ',
'',
],
'leading_space' => [
' https://example.com',
$normalized_url,
],
'leading_tab' => [
"\thttps://example.com",
$normalized_url,
],
'trailing_linefeed' => [
"https://example.com \n",
$normalized_url,
],
'trailing_space' => [
'https://example.com ',
$normalized_url,
],
'enclosed_in_spaces' => [
' https://example.com ',
$normalized_url,
],
'space_inside' => [
' https: //exam ple.com ',
$normalized_url,
],
'tabs_inside' => [
"https:\t//exam\tple.com ",
$normalized_url,
],
'leading_slashes' => [
'//example.com',
],
'url_encoded_space_not_removed' => [
'https://example.com?foo=++baz',
],
];
}

/**
* Tests normalize_url_from_attribute_value.
*
* @dataProvider get_normalize_url_data
* @group allowed-tags-private-methods
* @covers AMP_Tag_And_Attribute_Sanitizer::normalize_url_from_attribute_value()
*
* @param array $url The URL to normalize.
* @param string|null $expected The expected return value.
* @throws ReflectionException If it's not possible to create a reflection to call the private method.
*/
public function test_normalize_url_from_attribute_value( $url, $expected = null ) {
if ( null === $expected ) {
$expected = $url;
}

$dom = AMP_DOM_Utils::get_dom_from_content( '' );
$sanitizer = new AMP_Tag_And_Attribute_Sanitizer( $dom );

$this->assertEquals(
$expected,
$this->call_private_method( $sanitizer, 'normalize_url_from_attribute_value', [ $url ] )
);
}

/**
* @dataProvider get_check_attr_spec_rule_allowed_protocol
* @group allowed-tags-private-methods
Expand Down
22 changes: 9 additions & 13 deletions tests/php/test-tag-and-attribute-sanitizer.php
Original file line number Diff line number Diff line change
Expand Up @@ -1256,27 +1256,23 @@ static function() {
[ AMP_Tag_And_Attribute_Sanitizer::INVALID_URL_PROTOCOL ],
],

'a_with_wrong_host' => [
'a_with_space_in_url' => [
'<a class="foo" href="http://foo bar">value</a>',
'<a class="foo">value</a>',
[],
[ AMP_Tag_And_Attribute_Sanitizer::INVALID_URL ],
'<a class="foo" href="http://foo%20bar">value</a>',
],
'a_with_encoded_host' => [
'<a class="foo" href="http://%65%78%61%6d%70%6c%65%2e%63%6f%6d/foo/">value</a>',
null,
],
'a_with_wrong_schemeless_host' => [
'<a class="foo" href="//bad domain with a space.com/foo">value</a>',
'<a class="foo">value</a>',
[],
[ AMP_Tag_And_Attribute_Sanitizer::INVALID_URL ],
'a_with_spaces_inside' => [
'<a class="foo" href="//domain with space.com/foo">value</a>',
'<a class="foo" href="//domain%20with%20space.com/foo">value</a>',
],
'a_with_schemaless_host' => [
'<a class="foo" href="//domain.om/foo">value</a>',
],
'a_with_mail_host' => [
'<a class="foo" href="mail to:[email protected]">value</a>',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the space in:

...is what caused this be invalid.

Passing <a class="foo" href="mailto:[email protected]"> to the AMP validator shows that it's valid.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be resolved by implementing my change here: #4166 (comment)

The mail to protocol should still be invalid, so this chunk should be reverted.

This code: <a class="foo" href="mail to:[email protected]"> should be invalid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, good point. That's applied in f193019

'<a class="foo">value</a>',
[],
[ AMP_Tag_And_Attribute_Sanitizer::INVALID_URL_PROTOCOL ],
'<a class="foo" href="mailto:[email protected]">value</a>',
],

// font is removed so we should check that other elements are checked as well.
Expand Down