HTML parser: don't alter raw HTML #10551

johngodley · 2018-10-12T12:09:50Z

Currently if you use an HTML block it will be parsed and reformatted when saved. This has the effect of transforming valid HTML such as:

<path d="M0,0h24v24H0V0z M0,0h24v24H0V0z" fill="none" />

Into:

<path d="M0,0h24v24H0V0z M0,0h24v24H0V0z" fill="none"></path>

Although technically correct it does mean the HTML changes on save, which is unexpected.

The reason for this is the HTML goes into hpq, which uses the browser and createHTMLDocument. This converts the user HTML into a valid DOM tree, modifying as appropriate, expanding self-closing tags, and removing unnecessary whitespace.

When the HTML is then returned it is the browser's version, and the original version is discarded. For most blocks this is fine as the HTML is generated by Gutenberg, but for the HTML block the HTML is generated by the user.

This is another follow-on from #10066, and in conjunction with #10474.

How has this been tested?

Additional unit test added. Manually test by creating a custom HTML block and adding:

<path d="M0,0h24v24H0V0z M0,0h24v24H0V0z" fill="none" />

Verify that on saving the post and reloading the page the block HTML remains the same.

Checklist:

My code is tested.
My code follows the WordPress code style.
My code follows the accessibility standards.
My code has proper inline documentation.

packages/block-library/src/html/index.js

aduth · 2018-10-12T16:37:47Z

Aside from proposed workaround, this feels like it could be classified as an upstream issue of hpq, where the following might not be expected:

⇒ node                     
> global.document = new ( require( 'jsdom' ).JSDOM )().window.document
Document { location: [Getter/Setter] }
> var hpq = require( 'hpq' );
undefined
> hpq.parse( '<div />', { html: hpq.html() } )
{ html: '<div></div>' }

bobbingwide · 2018-10-23T11:58:20Z

This solution looks like it goes someway towards achieving my requirements in #5123

johngodley · 2018-11-06T10:08:40Z

Circling back, I looked at both the suggestions above.

Using source = html and selector = undefined is a possible trigger and neatly removes the need for a raw source type.

However, it does mean the same effect is applied to the core/freeform block though. I'm not sure if this is a desired effect?

Putting the change in hpq requires keeping a copy of the original HTML and changing the parse flow a bit to return the original. Gutenberg uses a custom html matcher, so a change is needed there too. core/freeform is still affected as described above.

I'm not sure if hpq is the best place though. Although not explicitly stated, in all situations hpq returns a browser normalised version of the source HTML and the selector, making this change an exception to the standard behaviour, and one that only applies if there is no selector.

That is, if I could expect:

hpq.parse( '<div />', { html: hpq.html() } )

To return <div />

Then I could equally expect:

hpq.parse( '<div class="test"><div /></div>', { html: hpq.html( '.test' ) } )

To return <div />, but it wouldn't because a selector is passed.

Changing hpq to return the expected result in all situations seems like a much bigger change.

aduth · 2018-11-12T16:02:59Z

However, it does mean the same effect is applied to the core/freeform block though. I'm not sure if this is a desired effect?

I think it would be fine; preferred even. core/freeform should be left as-is. I'd not be surprised if there were another issue already for reporting unintentional effects of the hpq normalization.

I don't love that the behavior becomes a bit less predictable in having the condition atop delegating to hpq, but I'd prefer it to a new selector in the interim of deciding whether to handle this internal to hpq.

Although not explicitly stated, in all situations hpq returns a browser normalised version of the source HTML and the selector, making this change an exception to the standard behaviour

I wouldn't say this is explicitly intentional, more an artifact of how it's internally implemented.

johngodley · 2018-11-26T11:03:19Z

Based on the feedback I've updated this PR to just check for source = html and an empty selector, removing the raw block type.

As mentioned this does affect core/freeform and I've updated existing tests to reflect this.

If this PR still makes sense to include I can also create a separate issue (here or in hpq) summarising the discussion above about implementing it in hpq.

youknowriad · 2018-11-26T11:06:46Z

packages/blocks/src/api/parser.js

@@ -226,6 +226,10 @@ export function matcherFromSource( sourceConfig ) {
 * @return {*} Attribute value.
 */
 export function parseWithAttributeSchema( innerHTML, attributeSchema ) {
+	if ( attributeSchema.source === 'html' && ! attributeSchema.selector ) {


Minor: this seems better in the matcherFromSource function because there's already a switch case there.

matcherFromSource always gets fed into hpq and will then get modified. The idea with the change is to bypass hpq entirely, for which parseWithAttributeSchema seems a good place:

export function parseWithAttributeSchema( innerHTML, attributeSchema ) { if ( attributeSchema.source === 'html' && ! attributeSchema.selector ) { return innerHTML; } return hpqParse( innerHTML, matcherFromSource( attributeSchema ) ); }

test/integration/full-content/fixtures/core__freeform.json

aduth · 2018-11-26T17:29:30Z

I'm still conflicted on this one, and am yet to reach some happy conclusion in my mind. The main hang-up for me would be the introduced unpredictability of the 'html' selector in considering whether there's a selector to evaluate, when in truth we'd just want the parse to not normalize the whitespace at all. In some additional exploration there doesn't seem to be much of a way around this in DOM APIs (I'd explored ideas with assigning outerHTML or using different DOMParser MIME types). We could probably retain the original information if we wrote our own parser (maybe with the same simple-html-tokenizer used in block validation), but this seems like it'd be both hefty and a huge burden to maintain.

In order to keep this moving along, I think I'll be content with what's proposed here now. We should probably consider both of the prior feedback comments before approval though.

This allows the HTML to be retained in the original format without processing and reformatting

johngodley · 2019-01-21T15:26:35Z

A bit delayed, but I've rebased and #12610 removes the need for changes to the integration tests.

Agreed that there doesn't appear a way around this using DOM apis, and building a parser for this doesn't seem like fun.

youknowriad · 2020-03-18T10:41:48Z

@aduth What are your current thoughts here? Trying to see what's the best path forward for this PR? close/merge?

aduth · 2020-05-27T14:26:23Z

@aduth What are your current thoughts here? Trying to see what's the best path forward for this PR? close/merge?

I think it comes down to the question of whether we're happy to take a fix for one case, at the expense of introducing some inconsistency where it remains an issue for anything not falling into that condition (i.e. html can still alter if selector is specified).

Personally I tend to rather avoid those in favor of a universally consistent fix, depending on the severity of the issue. Given how long this problem has existed, I don't know that it's an especially urgent one.

With that in mind, I'd be more inclined to close, at least as it stands today. There's no great alternative, though it doesn't seem like an impossible problem to solve (mentioned in #10551 (comment)).

(Given how I've contradicted my previous assessment in #10551 (comment), it should be clear that I'm quite on the fence with this one!)

youknowriad · 2020-05-28T09:53:36Z

Sounds reasonable, and yeah no easy decision here. Let's close for the moment.

johngodley added the [Feature] Blocks Overall functionality of blocks label Oct 12, 2018

youknowriad reviewed Oct 12, 2018

View reviewed changes

packages/block-library/src/html/index.js Outdated Show resolved Hide resolved

This was referenced Oct 12, 2018

Add self-closing tag support when comparing HTML #10474

Merged

Allow block HTML validation to be disabled #10066

Closed

This was referenced Oct 17, 2018

Block validation error for 'viewbox' attribute #9898

Closed

Don't show 'convert to HTML' on a HTML block #10889

Closed

johngodley mentioned this pull request Oct 23, 2018

Resolve block Convert to HTML doesn't create the After Conversion HTML #10848

Closed

johngodley force-pushed the update/raw-html-parsing-type branch from 7d46c89 to 7502650 Compare November 26, 2018 10:36

youknowriad reviewed Nov 26, 2018

View reviewed changes

aduth reviewed Nov 26, 2018

View reviewed changes

test/integration/full-content/fixtures/core__freeform.json Outdated Show resolved Hide resolved

Exclude blocks of type 'html' from processed by hpq

ba9fd5c

This allows the HTML to be retained in the original format without processing and reformatting

johngodley force-pushed the update/raw-html-parsing-type branch from 7502650 to ba9fd5c Compare January 21, 2019 14:08

johngodley changed the title ~~Add a ‘raw’ block attribute type~~ HTML parser: don't alter raw HTML Jan 21, 2019

gziolo added this to the 5.1 (Gutenberg) milestone Feb 1, 2019

noisysocks modified the milestones: 5.1 (Gutenberg), 5.2 (Gutenberg) Feb 18, 2019

gziolo modified the milestones: 5.2 (Gutenberg), 5.3 (Gutenberg) Mar 4, 2019

youknowriad removed this from the 5.3 (Gutenberg) milestone Mar 18, 2019

youknowriad closed this May 28, 2020

youknowriad deleted the update/raw-html-parsing-type branch May 28, 2020 09:53

youknowriad mentioned this pull request Nov 25, 2020

HTML block: fix parsing #27268

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HTML parser: don't alter raw HTML #10551

HTML parser: don't alter raw HTML #10551

johngodley commented Oct 12, 2018 •

edited

Loading

aduth commented Oct 12, 2018

bobbingwide commented Oct 23, 2018

johngodley commented Nov 6, 2018

aduth commented Nov 12, 2018

johngodley commented Nov 26, 2018

youknowriad Nov 26, 2018

johngodley Dec 5, 2018

aduth commented Nov 26, 2018

johngodley commented Jan 21, 2019

youknowriad commented Mar 18, 2020

aduth commented May 27, 2020

youknowriad commented May 28, 2020

HTML parser: don't alter raw HTML #10551

HTML parser: don't alter raw HTML #10551

Conversation

johngodley commented Oct 12, 2018 • edited Loading

How has this been tested?

Checklist:

aduth commented Oct 12, 2018

bobbingwide commented Oct 23, 2018

johngodley commented Nov 6, 2018

aduth commented Nov 12, 2018

johngodley commented Nov 26, 2018

youknowriad Nov 26, 2018

Choose a reason for hiding this comment

johngodley Dec 5, 2018

Choose a reason for hiding this comment

aduth commented Nov 26, 2018

johngodley commented Jan 21, 2019

youknowriad commented Mar 18, 2020

aduth commented May 27, 2020

youknowriad commented May 28, 2020

johngodley commented Oct 12, 2018 •

edited

Loading