-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parse document correctly when AMP emoji is used #79
Conversation
1cdecfb
to
2f6221c
Compare
Codecov Report
@@ Coverage Diff @@
## main #79 +/- ##
============================================
+ Coverage 80.19% 80.87% +0.67%
- Complexity 905 928 +23
============================================
Files 48 48
Lines 2252 2311 +59
============================================
+ Hits 1806 1869 +63
+ Misses 446 442 -4
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
src/Dom/Document.php
Outdated
/** | ||
* Convert AMP bind-attributes back to their original syntax. | ||
* | ||
* This is not guaranteed to produce the exact same result as the initial markup, as it is more of a best guess. | ||
* It can end up replacing the wrong attributes if the initial markup had inconsistent styling, mixing both syntaxes | ||
* for the same attribute. In either case, it will always produce working markup, so this is not that big of a deal. | ||
* | ||
* @see convertAmpBindAttributes() Reciprocal function. | ||
* @link https://www.ampproject.org/docs/reference/components/amp-bind | ||
* | ||
* @param string $html HTML with amp-bind attributes converted. | ||
* @return string HTML with amp-bind attributes restored. | ||
*/ | ||
public function restoreAmpBindAttributes($html) | ||
{ | ||
if (empty($this->convertedAmpBindAttributes)) { | ||
return $html; | ||
} | ||
|
||
$pattern = sprintf( | ||
'#%s(%s)#i', | ||
self::AMP_BIND_DATA_ATTR_PREFIX, | ||
implode('|', array_unique($this->convertedAmpBindAttributes)) | ||
); | ||
$replacement = '[$1]'; | ||
$limit = count($this->convertedAmpBindAttributes); | ||
|
||
$restored = preg_replace($pattern, $replacement, $html, $limit); | ||
|
||
return (null !== $restored) ? $restored : $html; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't believe that restoring is the right approach here since attributes can be added/removed after the DOM has been parsed and before it has been serialized, resulting in a incorrect restoration as the counts will change: bb607f8#commitcomment-47328593
As mentioned in bb607f8#commitcomment-47328668:
Therefore, I think the only reasonable path is to introduce a
useBracketedAmpBindSyntax
flag which controls whether all attributes are serialized using the bracketed syntax or thedata-amp-bind-
syntax. By default it would befalse
, but in unit tests you could make ittrue
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, what I did now is to provide a way to pass options around the document (with BC fallback that mimic the built-in DOMDocument
do its syntax still works) and I added an option (Document::OPTION_AMP_BIND_SYNTAX
) to configure how the conversion should be handled for amp-bind
attributes. There are three available options:
Document::AMP_BIND_SYNTAX_AUTO
=> try a best guess to keep the individual attributes in the syntax they were in originally.Document::AMP_BIND_SYNTAX_DATA_ATTRIBUTE
=> convert allamp-bind
syntax into thedata-amp-bind-*
syntax.Document::AMP_BIND_SYNTAX_SQUARE_BRACKETS
=> convert allamp-bind
attributes into the[*]
syntax.
We can then set the option to Document::AMP_BIND_SYNTAX_DATA_ATTRIBUTE
in the WordPress plugin to keep the current behavior for WP unchanged.
The default is Document::AMP_BIND_SYNTAX_AUTO
, as I think this is the expected behavior, as third-party users of the library would not even be aware the DOMDocument
code is broken and requires such a behind-the-scenes conversion in the first place.
src/Dom/Document.php
Outdated
$htmlTag = $matches[0]; | ||
|
||
return preg_replace( | ||
self::AMP_EMOJI_ATTRIBUTE_PATTERN, | ||
'\1' . self::EMOJI_AMP_ATTRIBUTE_PLACEHOLDER . '="\3"', | ||
$source, | ||
1 | ||
// Extract attributes. | ||
if (!preg_match('#^(<html)(\s[^>]+)>$#i', $htmlTag, $matches)) { | ||
return $source; | ||
} | ||
|
||
// Split into individual attributes. | ||
$attributes = array_map( | ||
'trim', | ||
array_filter( | ||
preg_split( | ||
'#(\s+[^"\'\s=]+(?:=(?:"[^"]+"|\'[^\']+\'|[^"\'\s]+))?)#', | ||
$matches[2], | ||
-1, | ||
PREG_SPLIT_DELIM_CAPTURE | ||
) | ||
) | ||
); | ||
|
||
foreach ($attributes as $index => $attribute) { | ||
$attributeMatches = []; | ||
if ( | ||
preg_match( | ||
'/^(' . Attribute::AMP_EMOJI_ALT . '|' . Attribute::AMP_EMOJI . ')(4(?:ads|email))?$/i', | ||
$attribute, | ||
$attributeMatches | ||
) | ||
) { | ||
$this->usedAmpEmoji = $attributeMatches[1]; | ||
$variant = ! empty($attributeMatches[2]) ? $attributeMatches[2] : ''; | ||
$attributes[$index] = self::EMOJI_AMP_ATTRIBUTE_PLACEHOLDER . "=\"{$variant}\""; | ||
|
||
$source = preg_replace( | ||
self::AMP_EMOJI_ATTRIBUTE_PATTERN, | ||
'<html ' . implode(' ', $attributes) . '>', | ||
$source | ||
); | ||
break; | ||
} | ||
} | ||
|
||
return $source; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be made more efficient by adapting into a preg_replace_callback
where the following code is put inside the callback. In that way, there would not be a need to do an additional preg_replace()
on the $source
. It could be combined with teh preg_match()
call above as well. Something like this I believe:
private function convertAmpEmojiAttribute($source)
{
$this->usedAmpEmoji = '';
return preg_replace_callback(
self::AMP_EMOJI_ATTRIBUTE_PATTERN,
static function ( $matches ) {
// Split into individual attributes.
$attributes = array_map(
'trim',
array_filter(
preg_split(
'#(\s+[^"\'\s=]+(?:=(?:"[^"]+"|\'[^\']+\'|[^"\'\s]+))?)#',
$matches[2],
-1,
PREG_SPLIT_DELIM_CAPTURE
)
)
);
foreach ($attributes as $index => $attribute) {
$attributeMatches = [];
if (
preg_match(
'/^(' . Attribute::AMP_EMOJI_ALT . '|' . Attribute::AMP_EMOJI . ')(4(?:ads|email))?$/i',
$attribute,
$attributeMatches
)
) {
$this->usedAmpEmoji = $attributeMatches[1];
$variant = ! empty($attributeMatches[2]) ? $attributeMatches[2] : '';
$attributes[$index] = self::EMOJI_AMP_ATTRIBUTE_PLACEHOLDER . "=\"{$variant}\"";
break;
}
}
return '<html ' . implode(' ', $attributes) . '>';
},
$source,
1
);
}
Note also the added limit of 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I adapted to code to use a single preg_replace_callback()
. Your code was already pretty close, but it needed a bit more fiddling with the regex pattern(s).
Just to be sure, I profiled the difference across the unit tests we have that cover emojis, and this is the result:
So it's indeed a clear improvement overall.
While looking into the few lines Codecov marked as not being tested, I noticed bugs in the encoding code and fixed them in here as well (and added a few tests for them too). |
Co-authored-by: Weston Ruter <[email protected]>
Fixes #75