Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In paired mode, sanitization replaces certain punctuation with "â"€™ #1067

Closed
douglyuckling opened this issue Apr 11, 2018 · 6 comments
Closed
Milestone

Comments

@douglyuckling
Copy link
Contributor

douglyuckling commented Apr 11, 2018

I'm in the early stages of updating my theme to add AMP support in "paired mode" per 0.7. Something is replacing certain punctuation marks (curly quotes, curly apostrophes, and em-dashes at least) with "â".

I suspect this may relate to the quote issues mentioned in #855, however my blog is already fully UTF-8 so I would think no encoding conversions would be necessary.

It looks like this is happening only to special characters that are part of the raw post content or the template itself. In cases where the raw post content has plain (straight) quotes and apostrophes, they are ultimately rendered correctly (with their "curly" versions, as presumably replaced by wptexturize).

If I remove AMP support from my theme and just use the legacy templates, these characters render fine. These characters also display fine on my non-AMP endpoints. So I think it must be caused by the full-HTML sanitization being done per #888.

Details:

  • My database and all of its tables are in utf8, and my config.php also has DB_CHARSET set to utf8.
  • In all cases, the rendered page has <meta charset="UTF-8"> and the Content-Type header also has charset=UTF-8.
  • The PHP version is 5.6.32.
  • The version of this plugin I'm using is commit 5ba9f5b from the 0.7 branch.
@westonruter
Copy link
Member

Interesting. Thanks for the report. I'm trying to duplicate the issue but I can't seem to. I've tried creating a post that contains the following in post_content:

Check out ‘this’ and “that” and—other things.

Check out 'this' and "that" and---other things.

Check out &lsquo;this&rsquo; and &ldquo;that&rdquo; and&mdash;other things.

In both non-AMP and AMP alike I see:

image

The HTML serializer normalizes all three paragraphs to be identical:

<p>Check out ‘this’ and “that” and—other things.</p>
<p>Check out ‘this’ and “that” and—other things.</p>
<p>Check out ‘this’ and “that” and—other things.</p>

If you create a new post with the same post_content can you reproduce the issue with that post as well? Or is it only for an existing post?

@douglyuckling
Copy link
Contributor Author

When I create a new post with that content, I see the same issue:

screen shot 2018-04-11 at 7 58 25 am

Interestingly, here's what I see if I do "edit HTML" on the parent element in Chrome's dev tools:

screen shot 2018-04-11 at 8 17 35 am

I did a quick check in phpMyAdmin to make sure that the content is correct in post_content:

screen shot 2018-04-11 at 8 16 29 am

I also had all other plugins disabled for this test, just in case. My WordPress version is 4.9.5.

I'm sure there are plenty of other differences between your environment and mine, but I don't know enough about WP or this plugin to know what they might be or which ones might make a difference. I'm happy to do other tests.

@westonruter
Copy link
Member

I am confounded. So non-ASCII characters defined as entities and characters that are transformed by wptexturize are successfully coming through unscathed. When looking at the same content rendered without AMP I see:

<p>Check out ‘this’ and “that” and—other things.</p>
<p>Check out &#8216;this&#8217; and &#8220;that&#8221; and&#8212;other things.</p>
<p>Check out &lsquo;this&rsquo; and &ldquo;that&rdquo; and&mdash;other things.</p>

So in your case the this means that HTML entities are being parsed and serialized properly, but when literal UTF-8 characters are read they get corrupted.

If you would, please do some digging around here:

https://github.com/Automattic/amp-wp/blob/5ba9f5bbf21ea0dfdf44d08899940c3e7d228afa/includes/class-amp-theme-support.php#L1050-L1063

And:

https://github.com/Automattic/amp-wp/blob/5ba9f5bbf21ea0dfdf44d08899940c3e7d228afa/includes/utils/class-amp-dom-utils.php#L63-L120

In particular, can we confirm that DOMDocument is truly getting instantiated with UTF-8 as the encoding? What does $document->encoding contain before and after loadHTML is done? Does it make a difference if $document is instantiated via new DOMDocument( 1.0, 'utf-8' )? The default is supposed to be UTF-8 per the PHP docs, but maybe this isn't the case on your env?

@douglyuckling
Copy link
Contributor Author

Okay, I'll try to hack around in those files tonight.

I found a few leads I may try to investigate:

  • https://stackoverflow.com/a/20675396:

    The problem is with saveHTML() and saveXML(), both of them do not work correctly in Unix. They do not save UTF-8 characters correctly when used in Unix, but they work in Windows.

  • http://php.net/manual/en/domdocument.construct.php#98209:

    The constuctor arguments are useful if you want to build a new document using createElement, appendChild etc.

    By contrast, these arguments are overriden as soon as you load a document from source by calling load() or loadXML().

  • https://stackoverflow.com/a/8218649:

    DOMDocument::loadHTML will treat your string as being in ISO-8859-1 unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly.

@douglyuckling
Copy link
Contributor Author

Phew! Finally found the issue. The version of libxml2 installed on my server is old (2.7.6) and doesn't recognize <meta charset="utf-8"> as a declaration of the encoding. I found that adding an older http-equiv-style meta tag got it to correctly detect the encoding, which resolved the issue. See https://bugzilla.gnome.org/show_bug.cgi?id=655218.

I'll create a pull request.

@westonruter
Copy link
Member

@douglyuckling amazing! Thanks so much. I'll review tomorrow.

@westonruter westonruter added this to the v0.7 milestone Apr 12, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants