Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse document correctly when AMP emoji is used #79

Merged
merged 25 commits into from
Mar 16, 2021
Merged
Show file tree
Hide file tree
Changes from 9 commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
2f6221c
Add breaking test cases
schlessera Feb 23, 2021
1bd3177
Add missing emoji in test case
schlessera Mar 4, 2021
6efcf72
Make emoji replacement more robust
schlessera Mar 4, 2021
e6227be
Try to restore amp-bind attributes faithfully so as to preserve bind …
schlessera Feb 19, 2021
fdaf878
Adapt tests for restored bind attributes
schlessera Mar 4, 2021
27e90e0
Change order of conversions to ensure the AMP emoji does not break th…
schlessera Mar 4, 2021
0a766d4
Add option mechanism to Dom\Document
schlessera Mar 5, 2021
ab4e380
Add logic to configure amp-bind conversions
schlessera Mar 5, 2021
083cd3f
Test different amp-bind conversion options
schlessera Mar 5, 2021
0735110
Add test case to ensure amp-bind syntax within content remains untouched
schlessera Mar 6, 2021
8b92e2a
Use tag & attribute traversal to avoid replacing amp-bind syntax in c…
schlessera Mar 8, 2021
5b21890
Correct type hint for option default values array
schlessera Mar 9, 2021
bd58f49
Optimize AMP emoji conversion algorithm
schlessera Mar 15, 2021
0cd3a32
Extract option constants into their own interface
schlessera Mar 15, 2021
f497bec
Extract encoding constants into a separate interface
schlessera Mar 15, 2021
60130e5
Always add a value on bound attributes
schlessera Mar 15, 2021
7719483
Import Encoding & Option interfaces
schlessera Mar 15, 2021
bde0ee8
Fix bug in encoding sanitization
schlessera Mar 15, 2021
ca995d0
Improve encoding auto-detection on bad charset markup
schlessera Mar 15, 2021
c392699
Fix bug with ignored originalEncoding
schlessera Mar 15, 2021
07f12ac
Test whether upper case encodings work
schlessera Mar 15, 2021
0346a71
Test options parsing
schlessera Mar 15, 2021
187bdb9
Add coverage hints
schlessera Mar 15, 2021
61b04d2
Add test case for latin-1 mapping
schlessera Mar 15, 2021
f8e0bb0
Ignore code coverage for untestable edge case
schlessera Mar 16, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
223 changes: 195 additions & 28 deletions src/Dom/Document.php
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,42 @@
*/
final class Document extends DOMDocument
{
/**
* Option to configure the preferred amp-bind syntax.
*
* @var string
*/
const OPTION_AMP_BIND_SYNTAX = 'amp_bind_syntax';

// Different flags for the amp-bind syntax option.
schlessera marked this conversation as resolved.
Show resolved Hide resolved
const AMP_BIND_SYNTAX_AUTO = 'auto';
const AMP_BIND_SYNTAX_SQUARE_BRACKETS = 'square_brackets';
const AMP_BIND_SYNTAX_DATA_ATTRIBUTE = 'data_attribute';

/**
* Option to provide the encoding of the document.
*
* @var string
*/
const OPTION_ENCODING = 'encoding';

/**
* Option to provide additional libxml flags to configure parsing of the document.
*
* @var string
*/
const OPTION_LIBXML_FLAGS = 'libxml_flags';

/**
* Associative array of known options and their respective default value.
*
* @var array<string>
*/
const KNOWN_OPTIONS = [
self::OPTION_AMP_BIND_SYNTAX => self::AMP_BIND_SYNTAX_AUTO,
self::OPTION_ENCODING => null,
self::OPTION_LIBXML_FLAGS => 0,
];

/**
* AMP requires the HTML markup to be encoded in UTF-8.
Expand Down Expand Up @@ -176,11 +212,11 @@ final class Document extends DOMDocument
*
* @var string
*/
const AMP_EMOJI_ATTRIBUTE_PATTERN = '/(<html\s[^>]*?)('
const AMP_EMOJI_ATTRIBUTE_PATTERN = '/<html\s[^>]*?(?:'
. Attribute::AMP_EMOJI_ALT
. '|'
. Attribute::AMP_EMOJI
. ')([^\s^>]*)/iu';
. ')[^>]*?>/i';

// Attribute to use as a placeholder to move the emoji AMP symbol (⚡) over to DOM.
const EMOJI_AMP_ATTRIBUTE_PLACEHOLDER = 'emoji-amp';
Expand Down Expand Up @@ -224,6 +260,15 @@ final class Document extends DOMDocument
*/
const XPATH_INLINE_STYLE_ATTRIBUTES_QUERY = './/@style';

/**
* Associative array of options to configure the behavior of the DOM document abstraction.
*
* @see Document::KNOWN_OPTIONS For a list of available options.
*
* @var array
*/
private $options;

/**
* Whether `data-ampdevmode` was initially set on the the document element.
*
Expand Down Expand Up @@ -309,6 +354,13 @@ final class Document extends DOMDocument
*/
private $cssMaxByteCountEnforced = -1;

/**
* Store the names of the amp-bind attributes that were converted so that we can restore them later on.
*
* @var array<string>
*/
private $convertedAmpBindAttributes = [];

/**
* Creates a new AmpProject\Dom\Document object
*
Expand All @@ -327,15 +379,23 @@ public function __construct($version = '', $encoding = null)
/**
* Named constructor to provide convenient way of transforming HTML into DOM.
*
* @param string $html HTML to turn into a DOM.
* @param string $encoding Optional. Encoding of the provided HTML string.
* @param string $html HTML to turn into a DOM.
* @param array|string $options Optional. Array of options to configure the document. Used as encoding if a string
* is passed. Defaults to an empty array.
* @return Document|false DOM generated from provided HTML, or false if the transformation failed.
*/
public static function fromHtml($html, $encoding = null)
public static function fromHtml($html, $options = [])
{
// Assume options are the encoding if a string is passed, for BC reasons.
if (is_string($options)) {
$options = [self::OPTION_ENCODING => $options];
}

$encoding = isset($options[self::OPTION_ENCODING]) ? $options[self::OPTION_ENCODING] : null;

$dom = new self('', $encoding);
schlessera marked this conversation as resolved.
Show resolved Hide resolved

if (! $dom->loadHTML($html)) {
if (! $dom->loadHTML($html, $options)) {
return false;
}

Expand All @@ -347,15 +407,23 @@ public static function fromHtml($html, $encoding = null)
*
* The difference to Document::fromHtml() is that fragments are not normalized as to their structure.
*
* @param string $html HTML to turn into a DOM.
* @param string $encoding Optional. Encoding of the provided HTML string.
* @param string $html HTML to turn into a DOM.
* @param array|string $options Optional. Array of options to configure the document. Used as encoding if a string
* is passed. Defaults to an empty array.
* @return Document|false DOM generated from provided HTML, or false if the transformation failed.
*/
public static function fromHtmlFragment($html, $encoding = null)
public static function fromHtmlFragment($html, $options = [])
{
// Assume options are the encoding if a string is passed, for BC reasons.
if (is_string($options)) {
$options = [self::OPTION_ENCODING => $options];
}

$encoding = isset($options[self::OPTION_ENCODING]) ? $options[self::OPTION_ENCODING] : null;

$dom = new self('', $encoding);
schlessera marked this conversation as resolved.
Show resolved Hide resolved

if (! $dom->loadHTMLFragment($html)) {
if (! $dom->loadHTMLFragment($html, $options)) {
return false;
}

Expand Down Expand Up @@ -418,11 +486,12 @@ private function reset()
*
* @link https://php.net/manual/domdocument.loadhtml.php
*
* @param string $source The HTML string.
* @param int|string $options Optional. Specify additional Libxml parameters.
* @param string $source The HTML string.
* @param array|int|string $options Optional. Array of options to configure the document. Used as additional Libxml
* parameters if an int or string is passed. Defaults to an empty array.
* @return bool true on success or false on failure.
*/
public function loadHTML($source, $options = 0)
public function loadHTML($source, $options = [])
{
$source = $this->normalizeDocumentStructure($source);
$success = $this->loadHTMLFragment($source, $options);
Expand All @@ -444,20 +513,31 @@ public function loadHTML($source, $options = 0)
/**
* Load a HTML fragment from a string.
*
* @param string $source The HTML fragment string.
* @param int|string $options Optional. Specify additional Libxml parameters.
* @param string $source The HTML fragment string.
* @param array|int|string $options Optional. Array of options to configure the document. Used as additional Libxml
* parameters if an int or string is passed. Defaults to an empty array.
* @return bool true on success or false on failure.
*/
public function loadHTMLFragment($source, $options = 0)
public function loadHTMLFragment($source, $options = [])
{
// Assume options are the additional libxml flags if a string or int is passed, for BC reasons.
if (is_string($options)) {
$options = (int) $options;
}
if (is_int($options)) {
$options = [self::OPTION_LIBXML_FLAGS => $options];
}

$this->options = array_merge(self::KNOWN_OPTIONS, $options);

$this->reset();

$source = $this->convertAmpEmojiAttribute($source);
$source = $this->convertAmpBindAttributes($source);
$source = $this->replaceSelfClosingTags($source);
$source = $this->maybeReplaceNoscriptElements($source);
$source = $this->secureMustacheScriptTemplates($source);
$source = $this->secureDoctypeNode($source);
$source = $this->convertAmpEmojiAttribute($source);

list($source, $this->originalEncoding) = $this->detectAndStripEncoding($source);

Expand All @@ -476,18 +556,18 @@ public function loadHTMLFragment($source, $options = 0)

$libxml_previous_state = libxml_use_internal_errors(true);

$options |= LIBXML_COMPACT;
$this->options[self::OPTION_LIBXML_FLAGS] |= LIBXML_COMPACT;

/*
* LIBXML_HTML_NODEFDTD is only available for libxml 2.7.8+.
* This should be the case for PHP 5.4+, but some systems seem to compile against a custom libxml version that
* is lower than expected.
*/
if (defined('LIBXML_HTML_NODEFDTD')) {
$options |= constant('LIBXML_HTML_NODEFDTD');
$this->options[self::OPTION_LIBXML_FLAGS] |= constant('LIBXML_HTML_NODEFDTD');
}

$success = parent::loadHTML($source, $options);
$success = parent::loadHTML($source, $this->options[self::OPTION_LIBXML_FLAGS]);

libxml_clear_errors();
libxml_use_internal_errors($libxml_previous_state);
Expand Down Expand Up @@ -564,6 +644,7 @@ public function saveHTMLFragment(DOMNode $node = null)
$html = $this->restoreDoctypeNode($html);
$html = $this->restoreMustacheTemplateTokens($html);
$html = $this->restoreSelfClosingTags($html);
$html = $this->restoreAmpBindAttributes($html);
$html = $this->restoreAmpEmojiAttribute($html);
$html = $this->fixSvgSourceAttributeEncoding($html);

Expand Down Expand Up @@ -1040,6 +1121,7 @@ private function restoreMustacheScriptTemplates()
* get dropped with an error raised:
* > Warning: DOMDocument::loadHTML(): error parsing attribute name
*
* @see restoreAmpBindAttributes() Reciprocal function.
* @link https://www.ampproject.org/docs/reference/components/amp-bind
*
* @param string $html HTML containing amp-bind attributes.
Expand All @@ -1053,7 +1135,7 @@ private function convertAmpBindAttributes($html)
* @param array $tagMatches Tag matches.
* @return string Replacement.
*/
$replaceCallback = static function ($tagMatches) {
$replaceCallback = function ($tagMatches) {

// Strip the self-closing slash as long as it is not an attribute value, like for the href attribute.
$oldAttrs = rtrim(preg_replace('#(?<!=)/$#', '', $tagMatches['attrs']));
Expand All @@ -1064,10 +1146,12 @@ private function convertAmpBindAttributes($html)
$offset += strlen($attrMatches[0]);

if ('[' === $attrMatches['name'][0]) {
$newAttrs .= ' ' . self::AMP_BIND_DATA_ATTR_PREFIX . trim($attrMatches['name'], '[]');
$attrName = trim($attrMatches['name'], '[]');
$newAttrs .= ' ' . self::AMP_BIND_DATA_ATTR_PREFIX . $attrName;
if (isset($attrMatches['value'])) {
$newAttrs .= $attrMatches['value'];
}
$this->convertedAmpBindAttributes[] = $attrName;
} else {
$newAttrs .= $attrMatches[0];
}
Expand Down Expand Up @@ -1099,6 +1183,54 @@ private function convertAmpBindAttributes($html)
return (null !== $converted) ? $converted : $html;
}

/**
* Convert AMP bind-attributes back to their original syntax.
*
* This is not guaranteed to produce the exact same result as the initial markup, as it is more of a best guess.
* It can end up replacing the wrong attributes if the initial markup had inconsistent styling, mixing both syntaxes
* for the same attribute. In either case, it will always produce working markup, so this is not that big of a deal.
*
* @see convertAmpBindAttributes() Reciprocal function.
* @link https://www.ampproject.org/docs/reference/components/amp-bind
*
* @param string $html HTML with amp-bind attributes converted.
* @return string HTML with amp-bind attributes restored.
*/
public function restoreAmpBindAttributes($html)
{
if ($this->options[self::OPTION_AMP_BIND_SYNTAX] === self::AMP_BIND_SYNTAX_DATA_ATTRIBUTE) {
// All amp-bind attributes should remain in their converted data attribute form.
return $html;
}

if (
$this->options[self::OPTION_AMP_BIND_SYNTAX] === self::AMP_BIND_SYNTAX_AUTO
&&
empty($this->convertedAmpBindAttributes)
) {
// Only previously converted amp-bind attributes should be restored, but none were converted.
return $html;
}

// Depending on options, restore previously converted or all amp-bind attributes to square bracket syntax.
$pattern = $this->options[self::OPTION_AMP_BIND_SYNTAX] === self::AMP_BIND_SYNTAX_AUTO
? sprintf(
'#%s(%s)#i',
self::AMP_BIND_DATA_ATTR_PREFIX,
implode('|', array_unique($this->convertedAmpBindAttributes))
)
: sprintf(
'#%s([a-z0-9-_]+)#i', // TODO: Check what exact regex makes sense here.
self::AMP_BIND_DATA_ATTR_PREFIX
);
schlessera marked this conversation as resolved.
Show resolved Hide resolved

$replacement = '[$1]';

$restored = preg_replace($pattern, $replacement, $html);

return (null !== $restored) ? $restored : $html;
}

/**
* Adapt the encoding of the content.
*
Expand Down Expand Up @@ -1397,14 +1529,49 @@ private function convertAmpEmojiAttribute($source)
return $source;
}

$this->usedAmpEmoji = $matches[2];
$htmlTag = $matches[0];

return preg_replace(
self::AMP_EMOJI_ATTRIBUTE_PATTERN,
'\1' . self::EMOJI_AMP_ATTRIBUTE_PLACEHOLDER . '="\3"',
$source,
1
// Extract attributes.
if (!preg_match('#^(<html)(\s[^>]+)>$#i', $htmlTag, $matches)) {
return $source;
}

// Split into individual attributes.
$attributes = array_map(
'trim',
array_filter(
preg_split(
'#(\s+[^"\'\s=]+(?:=(?:"[^"]+"|\'[^\']+\'|[^"\'\s]+))?)#',
$matches[2],
-1,
PREG_SPLIT_DELIM_CAPTURE
)
)
);

foreach ($attributes as $index => $attribute) {
$attributeMatches = [];
if (
preg_match(
'/^(' . Attribute::AMP_EMOJI_ALT . '|' . Attribute::AMP_EMOJI . ')(4(?:ads|email))?$/i',
$attribute,
$attributeMatches
)
) {
$this->usedAmpEmoji = $attributeMatches[1];
$variant = ! empty($attributeMatches[2]) ? $attributeMatches[2] : '';
$attributes[$index] = self::EMOJI_AMP_ATTRIBUTE_PLACEHOLDER . "=\"{$variant}\"";

$source = preg_replace(
self::AMP_EMOJI_ATTRIBUTE_PATTERN,
'<html ' . implode(' ', $attributes) . '>',
$source
);
break;
}
}

return $source;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be made more efficient by adapting into a preg_replace_callback where the following code is put inside the callback. In that way, there would not be a need to do an additional preg_replace() on the $source. It could be combined with teh preg_match() call above as well. Something like this I believe:

    private function convertAmpEmojiAttribute($source)
    {
        $this->usedAmpEmoji = '';

        return preg_replace_callback(
            self::AMP_EMOJI_ATTRIBUTE_PATTERN,
            static function ( $matches ) {
                // Split into individual attributes.
                $attributes = array_map(
                    'trim',
                    array_filter(
                        preg_split(
                            '#(\s+[^"\'\s=]+(?:=(?:"[^"]+"|\'[^\']+\'|[^"\'\s]+))?)#',
                            $matches[2],
                            -1,
                            PREG_SPLIT_DELIM_CAPTURE
                        )
                    )
                );

                foreach ($attributes as $index => $attribute) {
                    $attributeMatches = [];
                    if (
                        preg_match(
                            '/^(' . Attribute::AMP_EMOJI_ALT . '|' . Attribute::AMP_EMOJI . ')(4(?:ads|email))?$/i',
                            $attribute,
                            $attributeMatches
                        )
                    ) {
                        $this->usedAmpEmoji = $attributeMatches[1];
                        $variant            = ! empty($attributeMatches[2]) ? $attributeMatches[2] : '';
                        $attributes[$index] = self::EMOJI_AMP_ATTRIBUTE_PLACEHOLDER . "=\"{$variant}\"";
                        break;
                    }
                }

                return '<html ' . implode(' ', $attributes) . '>';
            },
            $source,
            1
        );
    }

Note also the added limit of 1.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I adapted to code to use a single preg_replace_callback(). Your code was already pretty close, but it needed a bit more fiddling with the regex pattern(s).

Just to be sure, I profiled the difference across the unit tests we have that cover emojis, and this is the result:

Comparison from Untitled to Untitled - Blackfire - Google Chrome 2021-03-15 at 11 22 19 AM

So it's indeed a clear improvement overall.

}

/**
Expand Down
Loading