Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML parser: don't alter raw HTML #10551

Closed
wants to merge 1 commit into from

Conversation

johngodley
Copy link
Contributor

@johngodley johngodley commented Oct 12, 2018

Currently if you use an HTML block it will be parsed and reformatted when saved. This has the effect of transforming valid HTML such as:

<path d="M0,0h24v24H0V0z M0,0h24v24H0V0z" fill="none" />

Into:

<path d="M0,0h24v24H0V0z M0,0h24v24H0V0z" fill="none"></path>

Although technically correct it does mean the HTML changes on save, which is unexpected.

The reason for this is the HTML goes into hpq, which uses the browser and createHTMLDocument. This converts the user HTML into a valid DOM tree, modifying as appropriate, expanding self-closing tags, and removing unnecessary whitespace.

When the HTML is then returned it is the browser's version, and the original version is discarded. For most blocks this is fine as the HTML is generated by Gutenberg, but for the HTML block the HTML is generated by the user.

This is another follow-on from #10066, and in conjunction with #10474.

How has this been tested?

Additional unit test added. Manually test by creating a custom HTML block and adding:

<path d="M0,0h24v24H0V0z M0,0h24v24H0V0z" fill="none" />

Verify that on saving the post and reloading the page the block HTML remains the same.

Checklist:

  • My code is tested.
  • My code follows the WordPress code style.
  • My code follows the accessibility standards.
  • My code has proper inline documentation.

@johngodley johngodley added the [Feature] Blocks Overall functionality of blocks label Oct 12, 2018
@aduth
Copy link
Member

aduth commented Oct 12, 2018

Aside from proposed workaround, this feels like it could be classified as an upstream issue of hpq, where the following might not be expected:

⇒ node                     
> global.document = new ( require( 'jsdom' ).JSDOM )().window.document
Document { location: [Getter/Setter] }
> var hpq = require( 'hpq' );
undefined
> hpq.parse( '<div />', { html: hpq.html() } )
{ html: '<div></div>' }

@bobbingwide
Copy link
Contributor

This solution looks like it goes someway towards achieving my requirements in #5123

@johngodley
Copy link
Contributor Author

Circling back, I looked at both the suggestions above.

Using source = html and selector = undefined is a possible trigger and neatly removes the need for a raw source type.

However, it does mean the same effect is applied to the core/freeform block though. I'm not sure if this is a desired effect?

Putting the change in hpq requires keeping a copy of the original HTML and changing the parse flow a bit to return the original. Gutenberg uses a custom html matcher, so a change is needed there too. core/freeform is still affected as described above.

I'm not sure if hpq is the best place though. Although not explicitly stated, in all situations hpq returns a browser normalised version of the source HTML and the selector, making this change an exception to the standard behaviour, and one that only applies if there is no selector.

That is, if I could expect:

hpq.parse( '<div />', { html: hpq.html() } )

To return <div />

Then I could equally expect:

hpq.parse( '<div class="test"><div /></div>', { html: hpq.html( '.test' ) } )

To return <div />, but it wouldn't because a selector is passed.

Changing hpq to return the expected result in all situations seems like a much bigger change.

@aduth
Copy link
Member

aduth commented Nov 12, 2018

However, it does mean the same effect is applied to the core/freeform block though. I'm not sure if this is a desired effect?

I think it would be fine; preferred even. core/freeform should be left as-is. I'd not be surprised if there were another issue already for reporting unintentional effects of the hpq normalization.

I don't love that the behavior becomes a bit less predictable in having the condition atop delegating to hpq, but I'd prefer it to a new selector in the interim of deciding whether to handle this internal to hpq.

Although not explicitly stated, in all situations hpq returns a browser normalised version of the source HTML and the selector, making this change an exception to the standard behaviour

I wouldn't say this is explicitly intentional, more an artifact of how it's internally implemented.

@johngodley johngodley force-pushed the update/raw-html-parsing-type branch from 7d46c89 to 7502650 Compare November 26, 2018 10:36
@johngodley
Copy link
Contributor Author

Based on the feedback I've updated this PR to just check for source = html and an empty selector, removing the raw block type.

As mentioned this does affect core/freeform and I've updated existing tests to reflect this.

If this PR still makes sense to include I can also create a separate issue (here or in hpq) summarising the discussion above about implementing it in hpq.

@@ -226,6 +226,10 @@ export function matcherFromSource( sourceConfig ) {
* @return {*} Attribute value.
*/
export function parseWithAttributeSchema( innerHTML, attributeSchema ) {
if ( attributeSchema.source === 'html' && ! attributeSchema.selector ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: this seems better in the matcherFromSource function because there's already a switch case there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

matcherFromSource always gets fed into hpq and will then get modified. The idea with the change is to bypass hpq entirely, for which parseWithAttributeSchema seems a good place:

export function parseWithAttributeSchema( innerHTML, attributeSchema ) {
	if ( attributeSchema.source === 'html' && ! attributeSchema.selector ) {
		return innerHTML;
	}

	return hpqParse( innerHTML, matcherFromSource( attributeSchema ) );
}

@aduth
Copy link
Member

aduth commented Nov 26, 2018

I'm still conflicted on this one, and am yet to reach some happy conclusion in my mind. The main hang-up for me would be the introduced unpredictability of the 'html' selector in considering whether there's a selector to evaluate, when in truth we'd just want the parse to not normalize the whitespace at all. In some additional exploration there doesn't seem to be much of a way around this in DOM APIs (I'd explored ideas with assigning outerHTML or using different DOMParser MIME types). We could probably retain the original information if we wrote our own parser (maybe with the same simple-html-tokenizer used in block validation), but this seems like it'd be both hefty and a huge burden to maintain.

In order to keep this moving along, I think I'll be content with what's proposed here now. We should probably consider both of the prior feedback comments before approval though.

This allows the HTML to be retained in the original format without processing and reformatting
@johngodley johngodley force-pushed the update/raw-html-parsing-type branch from 7502650 to ba9fd5c Compare January 21, 2019 14:08
@johngodley johngodley changed the title Add a ‘raw’ block attribute type HTML parser: don't alter raw HTML Jan 21, 2019
@johngodley
Copy link
Contributor Author

A bit delayed, but I've rebased and #12610 removes the need for changes to the integration tests.

Agreed that there doesn't appear a way around this using DOM apis, and building a parser for this doesn't seem like fun.

@gziolo gziolo added this to the 5.1 (Gutenberg) milestone Feb 1, 2019
@youknowriad youknowriad removed this from the 5.3 (Gutenberg) milestone Mar 18, 2019
@youknowriad
Copy link
Contributor

@aduth What are your current thoughts here? Trying to see what's the best path forward for this PR? close/merge?

@aduth
Copy link
Member

aduth commented May 27, 2020

@aduth What are your current thoughts here? Trying to see what's the best path forward for this PR? close/merge?

I think it comes down to the question of whether we're happy to take a fix for one case, at the expense of introducing some inconsistency where it remains an issue for anything not falling into that condition (i.e. html can still alter if selector is specified).

Personally I tend to rather avoid those in favor of a universally consistent fix, depending on the severity of the issue. Given how long this problem has existed, I don't know that it's an especially urgent one.

With that in mind, I'd be more inclined to close, at least as it stands today. There's no great alternative, though it doesn't seem like an impossible problem to solve (mentioned in #10551 (comment)).

(Given how I've contradicted my previous assessment in #10551 (comment), it should be clear that I'm quite on the fence with this one!)

@youknowriad
Copy link
Contributor

Sounds reasonable, and yeah no easy decision here. Let's close for the moment.

@youknowriad youknowriad deleted the update/raw-html-parsing-type branch May 28, 2020 09:53
@youknowriad youknowriad mentioned this pull request Nov 25, 2020
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Feature] Blocks Overall functionality of blocks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants