Parse document correctly when AMP emoji is used #79

schlessera · 2021-02-23T16:20:59Z

Fixes #75

codecov · 2021-03-04T17:51:56Z

Codecov Report

Merging #79 (f8e0bb0) into main (42f0634) will increase coverage by 0.67%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##               main      #79      +/-   ##
============================================
+ Coverage     80.19%   80.87%   +0.67%     
- Complexity      905      928      +23     
============================================
  Files            48       48              
  Lines          2252     2311      +59     
============================================
+ Hits           1806     1869      +63     
+ Misses          446      442       -4

Flag	Coverage Δ	Complexity Δ
php	`80.87% <100.00%> (+0.67%)`	`0.00 <32.00> (ø)`

Impacted Files	Coverage Δ	Complexity Δ
src/Dom/Document.php	`83.22% <100.00%> (+2.47%)`	`219.00 <32.00> (+23.00)`

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 42f0634...f8e0bb0.

westonruter · 2021-03-04T21:31:09Z

src/Dom/Document.php

+    /**
+     * Convert AMP bind-attributes back to their original syntax.
+     *
+     * This is not guaranteed to produce the exact same result as the initial markup, as it is more of a best guess.
+     * It can end up replacing the wrong attributes if the initial markup had inconsistent styling, mixing both syntaxes
+     * for the same attribute. In either case, it will always produce working markup, so this is not that big of a deal.
+     *
+     * @see convertAmpBindAttributes() Reciprocal function.
+     * @link https://www.ampproject.org/docs/reference/components/amp-bind
+     *
+     * @param string $html HTML with amp-bind attributes converted.
+     * @return string HTML with amp-bind attributes restored.
+     */
+    public function restoreAmpBindAttributes($html)
+    {
+        if (empty($this->convertedAmpBindAttributes)) {
+            return $html;
+        }
+
+        $pattern     = sprintf(
+            '#%s(%s)#i',
+            self::AMP_BIND_DATA_ATTR_PREFIX,
+            implode('|', array_unique($this->convertedAmpBindAttributes))
+        );
+        $replacement = '[$1]';
+        $limit       = count($this->convertedAmpBindAttributes);
+
+        $restored = preg_replace($pattern, $replacement, $html, $limit);
+
+        return (null !== $restored) ? $restored : $html;
+    }


I still don't believe that restoring is the right approach here, since attributes can be added or removed after the DOM has been parsed and before it has been serialized. The counts will then change, resulting in an incorrect restoration: bb607f8#commitcomment-47328593

As mentioned in bb607f8#commitcomment-47328668:

Therefore, I think the only reasonable path is to introduce a useBracketedAmpBindSyntax flag which controls whether all attributes are serialized using the bracketed syntax or the data-amp-bind- syntax. By default it would be false, but in unit tests you could make it true.
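
The count problem raised in this comment can be reproduced with a small standalone sketch. It is a simplified stand-in for the restore logic in the diff above, not the library's code: the function name `restoreBindAttributes()`, the free-standing constant, and the sample markup are assumptions made for illustration.

```php
<?php
// Hypothetical stand-in for the class constant used in the diff.
const AMP_BIND_DATA_ATTR_PREFIX = 'data-amp-bind-';

/**
 * Simplified, standalone version of the count-limited restore logic.
 *
 * @param string   $html      Serialized HTML using the data-amp-bind-* syntax.
 * @param string[] $converted Attribute names recorded as converted at parse time.
 * @return string HTML with at most count($converted) attributes put back in brackets.
 */
function restoreBindAttributes($html, array $converted)
{
    if (empty($converted)) {
        return $html;
    }

    $pattern = sprintf(
        '#%s(%s)#i',
        preg_quote(AMP_BIND_DATA_ATTR_PREFIX, '#'),
        implode('|', array_map('preg_quote', array_unique($converted)))
    );

    // The limit is the number of conversions seen at parse time.
    $restored = preg_replace($pattern, '[$1]', $html, count($converted));

    return (null !== $restored) ? $restored : $html;
}

// At parse time, exactly one bracketed attribute ([text] on the <p>) was converted.
$converted = ['text'];

// Before serialization, other code prepended a <span> that deliberately
// uses the data-amp-bind-text syntax.
$serialized = '<span data-amp-bind-text="a"></span><p data-amp-bind-text="b"></p>';

echo restoreBindAttributes($serialized, $converted), "\n";
// prints: <span [text]="a"></span><p data-amp-bind-text="b"></p>
```

The first match in document order is restored, so the newly added `<span>` gets the bracketed syntax while the `<p>` that originally used it does not.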

schlessera · 2021-03-05

So, what I did now is provide a way to pass options to the document (with a BC fallback that mimics the built-in DOMDocument so its syntax still works), and I added an option (Document::OPTION_AMP_BIND_SYNTAX) to configure how the conversion should be handled for amp-bind attributes. There are three available options:

Document::AMP_BIND_SYNTAX_AUTO => try a best guess to keep the individual attributes in the syntax they were in originally.

Document::AMP_BIND_SYNTAX_DATA_ATTRIBUTE => convert all amp-bind syntax into the data-amp-bind-* syntax.

Document::AMP_BIND_SYNTAX_SQUARE_BRACKETS => convert all amp-bind attributes into the [*] syntax.

We can then set the option to Document::AMP_BIND_SYNTAX_DATA_ATTRIBUTE in the WordPress plugin to keep the current behavior for WP unchanged.

The default is Document::AMP_BIND_SYNTAX_AUTO, which I think is the expected behavior: third-party users of the library would not even be aware that the DOMDocument code is broken and requires such a behind-the-scenes conversion in the first place.
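
To make the three modes concrete, here is a minimal sketch of how such an option could drive the serialized attribute name. The `SYNTAX_*` constants and the `serializeBindAttributeName()` helper are hypothetical and only mirror the behavior described above; the library's actual constants, values, and implementation may differ.

```php
<?php
// Hypothetical constants for illustration; not the library's actual values.
const SYNTAX_AUTO            = 'auto';
const SYNTAX_DATA_ATTRIBUTE  = 'data-attribute';
const SYNTAX_SQUARE_BRACKETS = 'square-brackets';

/**
 * Decide how a single bound attribute name is written on serialization.
 *
 * @param string $name         Bare attribute name, e.g. "text".
 * @param bool   $wasBracketed Whether the source markup used [name] for it.
 * @param string $mode         One of the SYNTAX_* constants.
 * @return string Attribute name in the chosen syntax.
 */
function serializeBindAttributeName($name, $wasBracketed, $mode)
{
    switch ($mode) {
        case SYNTAX_SQUARE_BRACKETS:
            // Force the bracketed syntax everywhere.
            return "[{$name}]";
        case SYNTAX_DATA_ATTRIBUTE:
            // Force the data attribute syntax everywhere.
            return "data-amp-bind-{$name}";
        case SYNTAX_AUTO:
        default:
            // Best guess: keep whatever the source markup used.
            return $wasBracketed ? "[{$name}]" : "data-amp-bind-{$name}";
    }
}

echo serializeBindAttributeName('text', true, SYNTAX_AUTO), "\n";               // [text]
echo serializeBindAttributeName('text', true, SYNTAX_DATA_ATTRIBUTE), "\n";     // data-amp-bind-text
echo serializeBindAttributeName('hidden', false, SYNTAX_SQUARE_BRACKETS), "\n"; // [hidden]
```

Only the auto mode needs to remember the original syntax per attribute; the two forced modes are independent of what was parsed, which is why they are unaffected by attributes being added or removed between parsing and serialization.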

westonruter · 2021-03-11T19:29:12Z

src/Dom/Document.php

+        $htmlTag = $matches[0];

-        return preg_replace(
-            self::AMP_EMOJI_ATTRIBUTE_PATTERN,
-            '\1' . self::EMOJI_AMP_ATTRIBUTE_PLACEHOLDER . '="\3"',
-            $source,
-            1
+        // Extract attributes.
+        if (!preg_match('#^(<html)(\s[^>]+)>$#i', $htmlTag, $matches)) {
+            return $source;
+        }
+
+        // Split into individual attributes.
+        $attributes = array_map(
+            'trim',
+            array_filter(
+                preg_split(
+                    '#(\s+[^"\'\s=]+(?:=(?:"[^"]+"|\'[^\']+\'|[^"\'\s]+))?)#',
+                    $matches[2],
+                    -1,
+                    PREG_SPLIT_DELIM_CAPTURE
+                )
+            )
        );
+
+        foreach ($attributes as $index => $attribute) {
+            $attributeMatches = [];
+            if (
+                preg_match(
+                    '/^(' . Attribute::AMP_EMOJI_ALT . '|' . Attribute::AMP_EMOJI . ')(4(?:ads|email))?$/i',
+                    $attribute,
+                    $attributeMatches
+                )
+            ) {
+                $this->usedAmpEmoji = $attributeMatches[1];
+                $variant            = ! empty($attributeMatches[2]) ? $attributeMatches[2] : '';
+                $attributes[$index] = self::EMOJI_AMP_ATTRIBUTE_PLACEHOLDER . "=\"{$variant}\"";
+
+                $source = preg_replace(
+                    self::AMP_EMOJI_ATTRIBUTE_PATTERN,
+                    '<html ' . implode(' ', $attributes) . '>',
+                    $source
+                );
+                break;
+            }
+        }
+
+        return $source;


This could be made more efficient by adapting it into a preg_replace_callback() where the following code is put inside the callback. That way, there would be no need for an additional preg_replace() on the $source. It could be combined with the preg_match() call above as well. Something like this, I believe:

    private function convertAmpEmojiAttribute($source)
    {
        $this->usedAmpEmoji = '';

        return preg_replace_callback(
            self::AMP_EMOJI_ATTRIBUTE_PATTERN,
            // Must not be a static closure, as it assigns to $this->usedAmpEmoji.
            function ($matches) {
                // Split into individual attributes.
                $attributes = array_map(
                    'trim',
                    array_filter(
                        preg_split(
                            '#(\s+[^"\'\s=]+(?:=(?:"[^"]+"|\'[^\']+\'|[^"\'\s]+))?)#',
                            $matches[2],
                            -1,
                            PREG_SPLIT_DELIM_CAPTURE
                        )
                    )
                );

                foreach ($attributes as $index => $attribute) {
                    $attributeMatches = [];
                    if (
                        preg_match(
                            '/^(' . Attribute::AMP_EMOJI_ALT . '|' . Attribute::AMP_EMOJI . ')(4(?:ads|email))?$/i',
                            $attribute,
                            $attributeMatches
                        )
                    ) {
                        $this->usedAmpEmoji = $attributeMatches[1];
                        $variant            = ! empty($attributeMatches[2]) ? $attributeMatches[2] : '';
                        $attributes[$index] = self::EMOJI_AMP_ATTRIBUTE_PLACEHOLDER . "=\"{$variant}\"";
                        break;
                    }
                }

                return '<html ' . implode(' ', $attributes) . '>';
            },
            $source,
            1
        );
    }

Note also the added limit of 1.

schlessera · 2021-03-15

I adapted the code to use a single preg_replace_callback(). Your code was already pretty close, but it needed a bit more fiddling with the regex pattern(s).

Just to be sure, I profiled the difference across the unit tests we have that cover emojis, and this is the result:

So it's indeed a clear improvement overall.
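
For readers without the library at hand, the single-pass approach discussed in this thread can be sketched as a standalone function. This is not the merged implementation: the placeholder name, the `<html>` tag pattern, and the attribute tokenizer are assumptions chosen so that the example runs on its own, and the sketch does not handle a `>` inside an attribute value.

```php
<?php
// Hypothetical placeholder name; the library defines its own constant.
const EMOJI_AMP_ATTRIBUTE_PLACEHOLDER = 'emoji-amp';

/**
 * Replace a bare AMP emoji attribute (optionally suffixed with 4ads or 4email)
 * on the <html> start tag with a placeholder attribute, in one regex pass.
 *
 * @param string $source    HTML source.
 * @param string $usedEmoji Receives the emoji that was found, or '' if none.
 * @return string Adapted HTML source.
 */
function convertAmpEmojiAttribute($source, &$usedEmoji)
{
    $usedEmoji = '';

    $result = preg_replace_callback(
        '#<html(\s[^>]*)>#iu',
        function ($matches) use (&$usedEmoji) {
            // Tokenize the attribute string: a name, optionally followed by a value.
            preg_match_all(
                '#[^\s"\'=]+(?:=(?:"[^"]*"|\'[^\']*\'|[^\s"\']+))?#u',
                $matches[1],
                $found
            );
            $attributes = $found[0];

            foreach ($attributes as $index => $attribute) {
                // U+26A1 with or without the U+FE0F variation selector.
                if (preg_match('/^(\x{26A1}\x{FE0F}|\x{26A1})(4(?:ads|email))?$/iu', $attribute, $m)) {
                    $usedEmoji          = $m[1];
                    $variant            = isset($m[2]) ? $m[2] : '';
                    $attributes[$index] = EMOJI_AMP_ATTRIBUTE_PLACEHOLDER . '="' . $variant . '"';
                    break;
                }
            }

            // Rebuild the start tag; whitespace between attributes is normalized.
            return '<html ' . implode(' ', $attributes) . '>';
        },
        $source,
        1 // Only the first <html> start tag is considered.
    );

    return (null !== $result) ? $result : $source;
}

$html = "<!DOCTYPE html><html \u{26A1}4email lang=\"en\"><head></head><body></body></html>";
echo convertAmpEmojiAttribute($html, $emoji), "\n";
// prints: <!DOCTYPE html><html emoji-amp="4email" lang="en"><head></head><body></body></html>
```

Because the callback receives the matched start tag directly, no second replacement over the whole source is needed, and the limit of 1 stops the search after the first `<html>` tag.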

schlessera · 2021-03-15T12:57:06Z

While looking into the few lines Codecov marked as not being tested, I noticed bugs in the encoding code and fixed them in here as well (and added a few tests for them too).

schlessera added the Bug and DOM labels Feb 23, 2021

schlessera mentioned this pull request Feb 23, 2021

Document parsing fails when HTML start tag contains ⚡ #75

Closed

Add breaking test cases

2f6221c

schlessera force-pushed the fix/75-parsing-fails-on-amp-emoji branch from 1cdecfb to 2f6221c Compare March 4, 2021 16:01

schlessera added 5 commits March 4, 2021 16:11

Add missing emoji in test case

1bd3177

Make emoji replacement more robust

6efcf72

Try to restore amp-bind attributes faithfully so as to preserve bind attribute syntax

e6227be

Adapt tests for restored bind attributes

fdaf878

Change order of conversions to ensure the AMP emoji does not break the bind compat

27e90e0

schlessera requested a review from westonruter March 4, 2021 17:51

schlessera added this to the 0.2.0 milestone Mar 4, 2021

schlessera marked this pull request as ready for review March 4, 2021 17:52

westonruter requested changes Mar 4, 2021

schlessera added 3 commits March 5, 2021 17:58

Add option mechanism to Dom\Document

0a766d4

Add logic to configure amp-bind conversions

ab4e380

Test different amp-bind conversion options

083cd3f

schlessera requested a review from westonruter March 5, 2021 18:31

westonruter requested changes Mar 5, 2021

schlessera added 2 commits March 6, 2021 15:04

Add test case to ensure amp-bind syntax within content remains untouched

0735110

Use tag & attribute traversal to avoid replacing amp-bind syntax in content

8b92e2a

schlessera requested a review from westonruter March 8, 2021 15:43

Correct type hint for option default values array

5b21890

westonruter requested changes Mar 11, 2021

schlessera added 5 commits March 15, 2021 11:29

Optimize AMP emoji conversion algorithm

bd58f49

Extract option constants into their own interface

0cd3a32

Extract encoding constants into a separate interface

f497bec

Always add a value on bound attributes

60130e5

Import Encoding & Option interfaces

7719483

schlessera requested a review from westonruter March 15, 2021 12:06

schlessera added 4 commits March 15, 2021 12:39

Fix bug in encoding sanitization

bde0ee8

Improve encoding auto-detection on bad charset markup

ca995d0

Fix bug with ignored originalEncoding

c392699

Test whether upper case encodings work

07f12ac

schlessera added 3 commits March 15, 2021 14:18

Test options parsing

0346a71

Add coverage hints

187bdb9

Add test case for latin-1 mapping

61b04d2

westonruter approved these changes Mar 15, 2021

Ignore code coverage for untestable edge case

f8e0bb0

Co-authored-by: Weston Ruter

schlessera merged commit 7d6402e into main Mar 16, 2021

schlessera deleted the fix/75-parsing-fails-on-amp-emoji branch March 16, 2021 12:50
