Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unescaped HTML entities within <pre> tags break DOM parsing #3492

Closed
schlessera opened this issue Oct 14, 2019 · 3 comments
Closed

Unescaped HTML entities within <pre> tags break DOM parsing #3492

schlessera opened this issue Oct 14, 2019 · 3 comments
Labels
Bug Something isn't working P2 Low priority Sanitizers WS:Perf Work stream for Metrics, Performance and Optimizer

Comments

@schlessera
Copy link
Collaborator

Bug Description

As seen in #3403 (comment) , having </> characters inside of a <pre> tag (which seems to be valid WordPress output) trips up the DOMDocument parser.

I think this should probably be solved by escaping (at least some) HTML entities within <pre> tags.

Expected Behaviour

The <pre> should work as expected and not result in broken markup after our transformations.

Steps to reproduce

I could replicate it here: https://3v4l.org/kSM6Q

Screenshots

Markup:
Image 2019-10-12 at 11 42 05 AM

Result:
Image 2019-10-12 at 11 42 12 AM


Do not alter or remove anything below. The following sections will be managed by moderators only.

Acceptance criteria

Implementation brief

QA testing instructions

Demo

Changelog entry

@westonruter
Copy link
Member

Is this actually considered valid HTML? If you try putting it into a Gutenberg block will it not get the entities encoded when serialized?

Related: perhaps the Masterminds HTML5 parser (#3293) doesn't have this bug. Perhaps another reason to consider it.

Otherwise, the only thing I can think of to fix this is to have some regex replacement run on the HTML before parsing, similar to how we are handling the AMP bind bracketed attribute syntax.

@schlessera
Copy link
Collaborator Author

schlessera commented Oct 15, 2019

I don't think this is valid HTML, however:

  • WordPress can produce it; and
  • Browsers support it

Right now, our additional processing for the AMP transformations turns this "sketchy" HTML into broken HTML that renders the page unusable.

So, while I'm not saying we should turn invalid HTML into valid HTML by default (out of scope for the plugin), I do think we shouldn't make bad things worse, as people will blame the plugin, not the HTML.

@westonruter
Copy link
Member

If I try creating the markup in Gutenberg via the Preformatted block I get this saved in the DB, with HTML entities:

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">Two roads diverged in a yellow wood,
   And sorry I could not travel both          (_/)
   And be one traveler, long I stood         (='.'=)
   And looked down one as far as I could     (")_(")
   To where it bent in the undergrowth;
 Then took the other, as just as fair,
   And having perhaps the better claim,          |_/|
   Because it was grassy and wanted wear;       / @ @ \
   Though as for that the passing there        ( > º &lt; )   Had worn them really about the same,         `>>x&lt;&lt;´
                                                /  O  \
   And both that morning equally lay
   In leaves no step had trodden black.
   Oh, I kept the first for another day!
   Yet knowing how way leads on to way,
   I doubted if I should ever come back.
 I shall be telling this with a sigh
   Somewhere ages and ages hence:
   Two roads diverged in a wood, and I—
   I took the one less traveled by,
   And that has made all the difference.
 and here's a line of some really, really, really, really long text, just to see how it is handled and to find out how it overflows;</pre>
<!-- /wp:preformatted -->

And the AMP and non-AMP versions are the same with no validation errors:

image

So I think that this should be addressed, but since it is not something which can be reproduced with normal WordPress usage, I'd deem it low priority. I don't think the level of effort to hack support in PHP's dom extension may be worth it, and it should rather be fixed perhaps by using a proper HTML5 parser, like in #3293.

@westonruter westonruter added the P2 Low priority label Oct 15, 2019
@amedina amedina removed the P2 Low priority label Mar 31, 2020
@westonruter westonruter added the P2 Low priority label Apr 12, 2020
@kmyram kmyram added the WS:Perf Work stream for Metrics, Performance and Optimizer label Aug 5, 2020
@westonruter westonruter closed this as not planned Won't fix, can't repro, duplicate, stale Nov 30, 2023
@github-project-automation github-project-automation bot moved this to Backlog in Ongoing Aug 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something isn't working P2 Low priority Sanitizers WS:Perf Work stream for Metrics, Performance and Optimizer
Projects
Archived in project
Development

No branches or pull requests

4 participants