In paired mode, sanitization replaces certain punctuation with "â" #1067
Comments
Interesting. Thanks for the report. I'm trying to duplicate the issue but I can't seem to. I've tried creating a post that contains the following in
Check out ‘this’ and “that” and—other things.
Check out 'this' and "that" and---other things.
Check out ‘this’ and “that” and—other things.
In both non-AMP and AMP alike I see:
The HTML serializer normalizes all three paragraphs to be identical:
<p>Check out ‘this’ and “that” and—other things.</p>
<p>Check out ‘this’ and “that” and—other things.</p>
<p>Check out ‘this’ and “that” and—other things.</p>
If you create a new post with the same
When I create a new post with that content, I see the same issue:
Interestingly, here's what I see if I do "edit HTML" on the parent element in Chrome's dev tools:
I did a quick check in phpMyAdmin to make sure that the content is correct in
I also had all other plugins disabled for this test, just in case. My WordPress version is 4.9.5. I'm sure there are plenty of other differences between your environment and mine, but I don't know enough about WP or this plugin to know what they might be or which ones might make a difference. I'm happy to do other tests.
I am confounded. So non-ASCII characters defined as entities and characters that are transformed by
<p>Check out ‘this’ and “that” and—other things.</p>
<p>Check out ‘this’ and “that” and—other things.</p>
<p>Check out ‘this’ and “that” and—other things.</p>
So in your case this means that HTML entities are being parsed and serialized properly, but when literal UTF-8 characters are read they get corrupted. If you would, please do some digging around here:
And:
In particular, can we confirm that
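The distinction drawn above, entities surviving while literal UTF-8 characters get corrupted, can be illustrated with a minimal Python sketch. This is not the plugin's PHP code. It assumes the parser falls back to a single-byte Latin-1-style charset when reading the bytes, which the thread does not state in those terms. The variable names are illustrative only.

```python
import html

text = "Check out ‘this’ and “that” and—other things."

# Assumption: the UTF-8 bytes are read as if they were Latin-1.
utf8_bytes = text.encode("utf-8")
misread = utf8_bytes.decode("latin-1")

# Each curly quote and the em-dash is a three-byte UTF-8 sequence starting
# with 0xE2. In Latin-1, 0xE2 is "â" and the two following bytes are
# invisible C1 control characters, so only "â" shows up on screen.
visible = "".join(ch for ch in misread if ch.isprintable())
print(visible)

# The same text written with numeric entities is pure ASCII, so the wrong
# charset does not damage it before the entities are resolved.
entity_form = "Check out &#8216;this&#8217; and &#8220;that&#8221; and&#8212;other things."
survived = html.unescape(entity_form.encode("utf-8").decode("latin-1"))
print(survived == text)
```

Under that assumption, every affected punctuation mark collapses to a lone "â", which matches the symptom in the issue title, while the entity form round-trips unchanged.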
Okay, I'll try to hack around in those files tonight. I found a few leads I may try to investigate:
Phew! Finally found the issue. The version of libxml2 installed on my server is old (2.7.6) and doesn't recognize I'll create a pull request.
@douglyuckling amazing! Thanks so much. I'll review tomorrow.
I'm in the early stages of updating my theme to add AMP support in "paired mode" per 0.7. Something is replacing certain punctuation marks (curly quotes, curly apostrophes, and em-dashes at least) with "â".
I suspect this may relate to the quote issues mentioned in #855, however my blog is already fully UTF-8 so I would think no encoding conversions would be necessary.
It looks like this is happening only to special characters that are part of the raw post content or the template itself. In cases where the raw post content has plain (straight) quotes and apostrophes, they are ultimately rendered correctly (with their "curly" versions, as presumably replaced by wptexturize).
If I remove AMP support from my theme and just use the legacy templates, these characters render fine. These characters also display fine on my non-AMP endpoints. So I think it must be caused by the full-HTML sanitization being done per #888.
Details:
utf8, and my config.php also has DB_CHARSET set to utf8.
<meta charset="UTF-8"> and the Content-Type header also has charset=UTF-8.
5ba9f5b from the 0.7 branch.