Unescaped HTML entities within `<pre>` tags break DOM parsing #3492

schlessera · 2019-10-14T08:34:51Z

Bug Description

As seen in #3403 (comment) , having </> characters inside of a <pre> tag (which seems to be valid WordPress output) trips up the DOMDocument parser.

I think this should probably be solved by escaping (at least some) HTML entities within <pre> tags.

Expected Behaviour

The <pre> should work as expected and not result in broken markup after our transformations.

Steps to reproduce

I could replicate it here: https://3v4l.org/kSM6Q

Screenshots

Markup:

Result:

Do not alter or remove anything below. The following sections will be managed by moderators only.

Acceptance criteria

Implementation brief

QA testing instructions

Demo

Changelog entry

The text was updated successfully, but these errors were encountered:

westonruter · 2019-10-14T14:07:54Z

Is this actually considered valid HTML? If you try putting it into a Gutenberg block will it not get the entities encoded when serialized?

Related: perhaps the Masterminds HTML5 parser (#3293) doesn't have this bug. Perhaps another reason to consider it.

Otherwise, the only thing I can think of to fix this is to have some regex replacement run on the HTML before parsing, similar to how we are handling the AMP bind bracketed attribute syntax.

schlessera · 2019-10-15T07:08:22Z

I don't think this is valid HTML, however:

WordPress can produce it; and
Browsers support it

Right now, our additional processing for the AMP transformations turns this "sketchy" HTML into broken HTML that renders the page unusable.

So, while I'm not saying we should turn invalid HTML into valid HTML by default (out of scope for the plugin), I do think we shouldn't make bad things worse, as people will blame the plugin, not the HTML.

westonruter · 2019-10-15T21:09:36Z

If I try creating the markup in Gutenberg via the Preformatted block I get this saved in the DB, with HTML entities:

<!-- wp:preformatted -->
<pre class="wp-block-preformatted">Two roads diverged in a yellow wood,
   And sorry I could not travel both          (_/)
   And be one traveler, long I stood         (='.'=)
   And looked down one as far as I could     (")_(")
   To where it bent in the undergrowth;
 Then took the other, as just as fair,
   And having perhaps the better claim,          |_/|
   Because it was grassy and wanted wear;       / @ @ \
   Though as for that the passing there        ( > º &lt; )   Had worn them really about the same,         `>>x&lt;&lt;´
                                                /  O  \
   And both that morning equally lay
   In leaves no step had trodden black.
   Oh, I kept the first for another day!
   Yet knowing how way leads on to way,
   I doubted if I should ever come back.
 I shall be telling this with a sigh
   Somewhere ages and ages hence:
   Two roads diverged in a wood, and I—
   I took the one less traveled by,
   And that has made all the difference.
 and here's a line of some really, really, really, really long text, just to see how it is handled and to find out how it overflows;</pre>
<!-- /wp:preformatted -->

And the AMP and non-AMP versions are the same with no validation errors:

So I think that this should be addressed, but since it is not something which can be reproduced with normal WordPress usage, I'd deem it low priority. I don't think the level of effort to hack support in PHP's dom extension may be worth it, and it should rather be fixed perhaps by using a proper HTML5 parser, like in #3293.

schlessera added Bug Something isn't working Sanitizers labels Oct 14, 2019

schlessera mentioned this issue Oct 14, 2019

Add compat support for twentytwenty core theme #3403

Merged

23 tasks

westonruter added the P2 Low priority label Oct 15, 2019

amedina removed the P2 Low priority label Mar 31, 2020

westonruter added the P2 Low priority label Apr 12, 2020

kmyram added the WS:Perf Work stream for Metrics, Performance and Optimizer label Aug 5, 2020

westonruter closed this as not planned Won't fix, can't repro, duplicate, stale Nov 30, 2023

github-project-automation bot added this to Ongoing Aug 28, 2024

github-project-automation bot moved this to Backlog in Ongoing Aug 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unescaped HTML entities within `<pre>` tags break DOM parsing #3492

Unescaped HTML entities within `<pre>` tags break DOM parsing #3492

schlessera commented Oct 14, 2019

westonruter commented Oct 14, 2019

schlessera commented Oct 15, 2019 •

edited

Loading

westonruter commented Oct 15, 2019

Unescaped HTML entities within <pre> tags break DOM parsing #3492

Unescaped HTML entities within <pre> tags break DOM parsing #3492

Comments

schlessera commented Oct 14, 2019

Bug Description

Expected Behaviour

Steps to reproduce

Screenshots

Acceptance criteria

Implementation brief

QA testing instructions

Demo

Changelog entry

westonruter commented Oct 14, 2019

schlessera commented Oct 15, 2019 • edited Loading

westonruter commented Oct 15, 2019

Unescaped HTML entities within `<pre>` tags break DOM parsing #3492

Unescaped HTML entities within `<pre>` tags break DOM parsing #3492

schlessera commented Oct 15, 2019 •

edited

Loading