Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse document correctly when AMP emoji is used #79

Merged
merged 25 commits into from
Mar 16, 2021

Conversation

schlessera
Copy link
Collaborator

Fixes #75

@schlessera schlessera force-pushed the fix/75-parsing-fails-on-amp-emoji branch from 1cdecfb to 2f6221c Compare March 4, 2021 16:01
@schlessera schlessera requested a review from westonruter March 4, 2021 17:51
@schlessera schlessera added this to the 0.2.0 milestone Mar 4, 2021
@codecov
Copy link

codecov bot commented Mar 4, 2021

Codecov Report

Merging #79 (f8e0bb0) into main (42f0634) will increase coverage by 0.67%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##               main      #79      +/-   ##
============================================
+ Coverage     80.19%   80.87%   +0.67%     
- Complexity      905      928      +23     
============================================
  Files            48       48              
  Lines          2252     2311      +59     
============================================
+ Hits           1806     1869      +63     
+ Misses          446      442       -4     
Flag Coverage Δ Complexity Δ
php 80.87% <100.00%> (+0.67%) 0.00 <32.00> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ Complexity Δ
src/Dom/Document.php 83.22% <100.00%> (+2.47%) 219.00 <32.00> (+23.00)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 42f0634...f8e0bb0. Read the comment docs.

@schlessera schlessera marked this pull request as ready for review March 4, 2021 17:52
Comment on lines 1113 to 1143
/**
* Convert AMP bind-attributes back to their original syntax.
*
* This is not guaranteed to produce the exact same result as the initial markup, as it is more of a best guess.
* It can end up replacing the wrong attributes if the initial markup had inconsistent styling, mixing both syntaxes
* for the same attribute. In either case, it will always produce working markup, so this is not that big of a deal.
*
* @see convertAmpBindAttributes() Reciprocal function.
* @link https://www.ampproject.org/docs/reference/components/amp-bind
*
* @param string $html HTML with amp-bind attributes converted.
* @return string HTML with amp-bind attributes restored.
*/
public function restoreAmpBindAttributes($html)
{
if (empty($this->convertedAmpBindAttributes)) {
return $html;
}

$pattern = sprintf(
'#%s(%s)#i',
self::AMP_BIND_DATA_ATTR_PREFIX,
implode('|', array_unique($this->convertedAmpBindAttributes))
);
$replacement = '[$1]';
$limit = count($this->convertedAmpBindAttributes);

$restored = preg_replace($pattern, $replacement, $html, $limit);

return (null !== $restored) ? $restored : $html;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't believe that restoring is the right approach here since attributes can be added/removed after the DOM has been parsed and before it has been serialized, resulting in a incorrect restoration as the counts will change: bb607f8#commitcomment-47328593

As mentioned in bb607f8#commitcomment-47328668:

Therefore, I think the only reasonable path is to introduce a useBracketedAmpBindSyntax flag which controls whether all attributes are serialized using the bracketed syntax or the data-amp-bind- syntax. By default it would be false, but in unit tests you could make it true.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, what I did now is to provide a way to pass options around the document (with BC fallback that mimic the built-in DOMDocument do its syntax still works) and I added an option (Document::OPTION_AMP_BIND_SYNTAX) to configure how the conversion should be handled for amp-bind attributes. There are three available options:

  • Document::AMP_BIND_SYNTAX_AUTO => try a best guess to keep the individual attributes in the syntax they were in originally.
  • Document::AMP_BIND_SYNTAX_DATA_ATTRIBUTE => convert all amp-bind syntax into the data-amp-bind-* syntax.
  • Document::AMP_BIND_SYNTAX_SQUARE_BRACKETS => convert all amp-bind attributes into the [*] syntax.

We can then set the option to Document::AMP_BIND_SYNTAX_DATA_ATTRIBUTE in the WordPress plugin to keep the current behavior for WP unchanged.

The default is Document::AMP_BIND_SYNTAX_AUTO, as I think this is the expected behavior, as third-party users of the library would not even be aware the DOMDocument code is broken and requires such a behind-the-scenes conversion in the first place.

@schlessera schlessera requested a review from westonruter March 5, 2021 18:31
src/Dom/Document.php Outdated Show resolved Hide resolved
src/Dom/Document.php Outdated Show resolved Hide resolved
src/Dom/Document.php Outdated Show resolved Hide resolved
@schlessera schlessera requested a review from westonruter March 8, 2021 15:43
src/Dom/Document.php Outdated Show resolved Hide resolved
src/Dom/Document.php Show resolved Hide resolved
src/Dom/Document.php Outdated Show resolved Hide resolved
src/Dom/Document.php Outdated Show resolved Hide resolved
Comment on lines 1605 to 1647
$htmlTag = $matches[0];

return preg_replace(
self::AMP_EMOJI_ATTRIBUTE_PATTERN,
'\1' . self::EMOJI_AMP_ATTRIBUTE_PLACEHOLDER . '="\3"',
$source,
1
// Extract attributes.
if (!preg_match('#^(<html)(\s[^>]+)>$#i', $htmlTag, $matches)) {
return $source;
}

// Split into individual attributes.
$attributes = array_map(
'trim',
array_filter(
preg_split(
'#(\s+[^"\'\s=]+(?:=(?:"[^"]+"|\'[^\']+\'|[^"\'\s]+))?)#',
$matches[2],
-1,
PREG_SPLIT_DELIM_CAPTURE
)
)
);

foreach ($attributes as $index => $attribute) {
$attributeMatches = [];
if (
preg_match(
'/^(' . Attribute::AMP_EMOJI_ALT . '|' . Attribute::AMP_EMOJI . ')(4(?:ads|email))?$/i',
$attribute,
$attributeMatches
)
) {
$this->usedAmpEmoji = $attributeMatches[1];
$variant = ! empty($attributeMatches[2]) ? $attributeMatches[2] : '';
$attributes[$index] = self::EMOJI_AMP_ATTRIBUTE_PLACEHOLDER . "=\"{$variant}\"";

$source = preg_replace(
self::AMP_EMOJI_ATTRIBUTE_PATTERN,
'<html ' . implode(' ', $attributes) . '>',
$source
);
break;
}
}

return $source;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be made more efficient by adapting into a preg_replace_callback where the following code is put inside the callback. In that way, there would not be a need to do an additional preg_replace() on the $source. It could be combined with teh preg_match() call above as well. Something like this I believe:

    private function convertAmpEmojiAttribute($source)
    {
        $this->usedAmpEmoji = '';

        return preg_replace_callback(
            self::AMP_EMOJI_ATTRIBUTE_PATTERN,
            static function ( $matches ) {
                // Split into individual attributes.
                $attributes = array_map(
                    'trim',
                    array_filter(
                        preg_split(
                            '#(\s+[^"\'\s=]+(?:=(?:"[^"]+"|\'[^\']+\'|[^"\'\s]+))?)#',
                            $matches[2],
                            -1,
                            PREG_SPLIT_DELIM_CAPTURE
                        )
                    )
                );

                foreach ($attributes as $index => $attribute) {
                    $attributeMatches = [];
                    if (
                        preg_match(
                            '/^(' . Attribute::AMP_EMOJI_ALT . '|' . Attribute::AMP_EMOJI . ')(4(?:ads|email))?$/i',
                            $attribute,
                            $attributeMatches
                        )
                    ) {
                        $this->usedAmpEmoji = $attributeMatches[1];
                        $variant            = ! empty($attributeMatches[2]) ? $attributeMatches[2] : '';
                        $attributes[$index] = self::EMOJI_AMP_ATTRIBUTE_PLACEHOLDER . "=\"{$variant}\"";
                        break;
                    }
                }

                return '<html ' . implode(' ', $attributes) . '>';
            },
            $source,
            1
        );
    }

Note also the added limit of 1.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I adapted to code to use a single preg_replace_callback(). Your code was already pretty close, but it needed a bit more fiddling with the regex pattern(s).

Just to be sure, I profiled the difference across the unit tests we have that cover emojis, and this is the result:

Comparison from Untitled to Untitled - Blackfire - Google Chrome 2021-03-15 at 11 22 19 AM

So it's indeed a clear improvement overall.

@schlessera schlessera requested a review from westonruter March 15, 2021 12:06
@schlessera
Copy link
Collaborator Author

While looking into the few lines Codecov marked as not being tested, I noticed bugs in the encoding code and fixed them in here as well (and added a few tests for them too).

src/Dom/Document.php Outdated Show resolved Hide resolved
@schlessera schlessera merged commit 7d6402e into main Mar 16, 2021
@schlessera schlessera deleted the fix/75-parsing-fails-on-amp-emoji branch March 16, 2021 12:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something isn't working DOM
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Document parsing fails when HTML start tag contains ⚡
2 participants