-
Notifications
You must be signed in to change notification settings - Fork 384
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent validation errors for URLs with leading or trailing spaces #4166
Conversation
As Weston found, something like: <a href=" https://wp.org"> would have a validation error.
Those trim( $url ) calls were passed to parse_protocol(). So do something similar inside parse_protocol(), don't match leading space.
There is a matching group of ([^/]), so there probably isn't a need for the lookahead.
Instead of changing this, simply trim() the URL before passing it to parse_protocol()
@@ -1626,7 +1626,7 @@ private function check_attr_spec_rule_value_regex_casei( DOMElement $node, $attr | |||
private function check_attr_spec_rule_valid_url( DOMElement $node, $attr_name, $attr_spec_rule ) { | |||
if ( isset( $attr_spec_rule[ AMP_Rule_Spec::VALUE_URL ] ) && $node->hasAttribute( $attr_name ) ) { | |||
foreach ( $this->extract_attribute_urls( $node->getAttributeNode( $attr_name ) ) as $url ) { | |||
$url = urldecode( $url ); | |||
$url = urldecode( trim( $url ) ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the AMP validator does this same trimming up front: https://github.com/ampproject/amphtml/blob/af1e3a550feeafd732226202b8d1f26dcefefa18/validator/engine/parse-url.js#L214-L215
However, it is also doing one more thing:
// Browsers remove Tab/CR/LF from the entire URL, so we do too.
unparsed = unparsed.replace(/[\t\r\n]/g, '');
I think that should be incorporated here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This being the case, instead of calling trim()
directly, let's create a new private method like normalize_url_from_attribute_value()
which does both the trimming and the tab/newline removal and then use that method everywhere that you are currently using trim()
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, it'll probably be several hours before I could do that, if that's alright.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
911602d adds a method to normalize the URLs, sorry for the delay.
As Weston mentioned, there might not be a need for it to run earlier.
parse_protocol() no longer trims space on its own.
Use this instead of trim() for the URLs in attributes.
@kienstra as opposed to adding empty commits, why not just restart the failed jobs in Travis? In any case, please rebase the branch to remove the throwaway commits when done. |
Yeah, that's a good idea. I had restarted the jobs several times, but pushed an empty commit just to see.
Sure. |
0a02e69
to
911602d
Compare
This build passed, but I'll trigger the |
Another external-https build passed. It's still a little troubling that they failed a few times with this PR. I'll trigger it again. |
I've noticed the |
], | ||
'a_with_mail_host' => [ | ||
'<a class="foo" href="mail to:[email protected]">value</a>', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the space in:
mail to:[email protected]
...is what caused this be invalid.
Passing <a class="foo" href="mailto:[email protected]">
to the AMP validator shows that it's valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be resolved by implementing my change here: #4166 (comment)
The mail to
protocol should still be invalid, so this chunk should be reverted.
This code: <a class="foo" href="mail to:[email protected]">
should be invalid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, good point. That's applied in f193019
* @return string The normalized URL. | ||
*/ | ||
private function normalize_url_from_attribute_value( $url ) { | ||
return preg_replace( '/\s/', '', $url ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't quite right.
Only tabs and newlines should be stripped. See https://github.com/ampproject/amphtml/blob/af1e3a550feeafd732226202b8d1f26dcefefa18/validator/engine/parse-url.js#L217-L218
So this, along with the trimming, should be:
return preg_replace( '/\s/', '', $url ); | |
return preg_replace( '/[\t\r\n]/', '', trim( $url ) ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll make that change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's applied in f193019
Use this from the AMP validator, instead of a simple regex, also use trim().
The behavior of the new parse_url_ method is changed, so these can change back.
Thanks! 🎉 |
Interestingly, the change here suppresses a validation error from occurring. In particular, consider a theme that is outputting invalid HTML like this: <h4 class="name">
<a href="<a href='https://wpthemetestdata.wordpress.com/' rel='external nofollow ugc' class='url'>themedemos</a>">
themedemos
</a>
</h4> Before the change here, this results in an AMP validation error: After the change here, no validation error is raised and the HTML markup output by the sanitizer is: <h4 class="name">
<a href="<a_href=%27https%3A%2F%2Fwpthemetestdata.wordpress.com%2F%27+rel%3D%27external+nofollow+ugc%27+class%3D%27url%27%3Ethemedemos%3C%2Fa%3E&amp" rel="amphtml">
themedemos
</a>
</h4> It turns out this is valid AMP, so this is not a bug. But it shows some difference in handling, however. |
Summary
Prevents a validation error if a URL has a leading or trailing space.
Fixes #4117
Background
As Weston found, a leading space in a URL can cause a validation error:
Preventing that error with uncovered an
INVALID_URL
error, which neededtrim()
here:amp-wp/includes/sanitizers/class-amp-tag-and-attribute-sanitizer.php
Line 1629 in bede7e3
Checklist