Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix document.body not including content from DOMParser parsed <body> #108

Conversation

MadLittleMods
Copy link
Contributor

Fix document.body not including content from DOMParser parsed <body>

Fix #106

Before:

new DOMParser().parseFromString(`<html><body><div>asdf</div></body></html>`, "text/html").body.outerHTML
// '<body></body>'

After:

new DOMParser().parseFromString(`<html><body><div>asdf</div></body></html>`, "text/html").body.outerHTML
// '<body><div>asdf</div></body>'

@@ -48,6 +48,11 @@ assert(document.all[1], document.querySelector('head'));
assert(document.all[2], document.querySelector('title'));
assert(document.all[3], document.querySelector('body'));

document = (new DOMParser).parseFromString('<!DOCTYPE html><html><body><div>foo</div></body></html>', 'text/html');
assert(document.body.tagName, 'BODY');
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making sure document.body isn't null

@@ -48,6 +48,11 @@ assert(document.all[1], document.querySelector('head'));
assert(document.all[2], document.querySelector('title'));
assert(document.all[3], document.querySelector('body'));

document = (new DOMParser).parseFromString('<!DOCTYPE html><html><body><div>foo</div></body></html>', 'text/html');
assert(document.body.tagName, 'BODY');
assert(document.body, document.querySelector('body'));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making sure document.body resolves to the same Node as document.querySelector('body')

document = (new DOMParser).parseFromString('<!DOCTYPE html><html><body><div>foo</div></body></html>', 'text/html');
assert(document.body.tagName, 'BODY');
assert(document.body, document.querySelector('body'));
assert(document.body.textContent, 'foo');
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making sure the document.body includes the content we parsed.

// Accessing the document.body property will create the <body>
node = document.body;
create = false;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HTML spec says "The body element of a document is the first of the html element's children that is either a body element or a frameset element, or null if there is no such element." -- https://html.spec.whatwg.org/multipage/dom.html#dom-document-body-dev

But in terms of multiple <body> and how it behaves, the browser seems to resolve to the same Node.

new DOMParser().parseFromString('<html><body>foo</body><body>bar</body></html>', "text/html").documentElement.outerHTML
// '<html><head></head><body>foobar</body></html>'

For reference, linkedom and the browser both behave the same here ✅

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still it's no goal of this project to care about these shenanigans ... but good to know, thanks

Copy link
Contributor Author

@MadLittleMods MadLittleMods Feb 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. In this case, it's the simplest way to support and turns out to behave identically 👌

@WebReflection
Copy link
Owner

Fix #106 - ... is the right commit message, so it relates to the issue, thanks!

@MadLittleMods MadLittleMods force-pushed the 106-fix-parsed-document-body-not-including-children branch from 068a70d to 8edc4e3 Compare February 9, 2022 20:26
@MadLittleMods
Copy link
Contributor Author

@WebReflection Updated ✅

@WebReflection
Copy link
Owner

Build failing. This MR is ignoring SVG and XML constraints, plus it's not considering that the current logic works already for document.body so it's kinda patching it in the wrong place.

The body getter is likely the place to look for a change, not the parse-from-string one, as that file has zero responsibility on how that document or its getters should react .. and bear in mind document.head is another one to eventually fix in the original issue.

…parsed <body>

Fix WebReflection#106

Before:
```js
new DOMParser().parseFromString(`<html><body><div>asdf</div></body></html>`, "text/html").body.outerHTML
// '<body></body>'
```

After:
```js
new DOMParser().parseFromString(`<html><body><div>asdf</div></body></html>`, "text/html").body.outerHTML
// '<body><div>asdf</div></body>'
```
@MadLittleMods MadLittleMods force-pushed the 106-fix-parsed-document-body-not-including-children branch from 8edc4e3 to 176bec3 Compare February 9, 2022 21:13
@@ -59,7 +59,12 @@ const parseFromString = (document, isHTML, markupLanguage) => {
onopentag(name, attributes) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Build failing.

Build fixed.

@@ -59,7 +59,12 @@ const parseFromString = (document, isHTML, markupLanguage) => {
onopentag(name, attributes) {
Copy link
Contributor Author

@MadLittleMods MadLittleMods Feb 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This MR is ignoring SVG and XML constraints

I'm not sure what you mean (can you explain this more obviously)? The logic was under the isHTML flag and doesn't apply to SVG and XML as far as I know.

@@ -16692,7 +16692,7 @@ class HTMLDocument extends Document$1 {
get head() {
const {documentElement} = this;
let {firstElementChild} = documentElement;
if (!firstElementChild) {
if (!firstElementChild || firstElementChild.tagName !== 'HEAD') {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[...] plus it's not considering that the current logic works already for document.body so it's kinda patching it in the wrong place.

The body getter is likely the place to look for a change, not the parse-from-string one, as that file has zero responsibility on how that document or its getters should react .. and bear in mind document.head is another one to eventually fix in the original issue.

Is this more what you mean?


Previously, the code would always assume the that first child was a <head> element but in the case of <!DOCTYPE html><html><body><div>foo</div></body></html>, it's a <body>. This means that when the body getter calls the head getter, it returned the existing <body> and created another <body> for itself after the supposed <head>.

The new code makes sure that the first child is a <head> element otherwise creates one.

This current assumptions about node order here has the downside that if someone defines stuff out of order, it will have 2 <head> when you try to access document.head

// notice <head> after <body>
const document = new DOMParser().parseFromString(`<!DOCTYPE html><html><body></body><head><title>"hello"</title></head></html>`, "text/html").head.outerHTML
// '<head></head>'

We could fix this up better in parse-from-string if we enforce <head> as the first child and <body> after when documents are initially parsed and created. But I'd go back to the document.body creation pattern like I had previously to keep things simple.

@WebReflection
Copy link
Owner

Fixed with latest.

@MadLittleMods
Copy link
Contributor Author

Thanks for incorporating the changes @WebReflection 😀

Looks pretty similar: 535c5bc

MadLittleMods added a commit to element-hq/hydrogen-web that referenced this pull request Feb 12, 2022
MadLittleMods added a commit to element-hq/hydrogen-web that referenced this pull request Feb 12, 2022
…er and linkedom (for SSR)

Before:
```js
// Browser (Chrome, Firefox)
new DOMParser().parseFromString(`<div>foo</div>`, "text/html").body.outerHTML;
// '<body><div>foo</div></body>'

// `linkedom` ❌
new DOMParser().parseFromString(`<div>foo</div>`, "text/html").body.outerHTML;
// '<body></body>'
```

After (consistent matching output):

```js
// Browser (Chrome, Firefox)
new DOMParser().parseFromString(`<!DOCTYPE html><html><body><div>foo</div></body></html>`, "text/html").body.outerHTML;
// '<body><div>foo</div></body>'

// `linkedom`
new DOMParser().parseFromString(`<!DOCTYPE html><html><body><div>foo</div></body></html>`, "text/html").body.outerHTML;
// '<body><div>foo</div></body>'
```

---

`linkedom` goal is to be close to the current DOM standard, but [not too close](https://github.com/WebReflection/linkedom#faq). Focused on the streamlined cases for server-side rendering (SSR).

Here is some context around getting `DOMParser` to interpret things better. The conclusion was to only support the explicit standard cases with a `<html><body></body></html>` specified instead of adding the magic HTML document creation and massaging that the browser does.

 - WebReflection/linkedom#106
 - WebReflection/linkedom#108
MadLittleMods added a commit to element-hq/hydrogen-web that referenced this pull request Feb 12, 2022
…dom`](https://github.com/WebReflection/linkedom) (SSR).

 - [`linkedom`](https://github.com/WebReflection/linkedom) is being used https://github.com/matrix-org/matrix-public-archive to server-side render (SSR) Hydrogen (`hydrogen-view-sdk`)
 - This is being fixed by using a explicit HTML wrapper boilerplate with `DOMParser` to get a matching result in the browser and `linkedom`.

Currently `parseHTML` is only used for HTML content bodies in events. Events with replies have content bodies that look like `<mx-reply>Hello</mx-reply> What's up` so they're parsed as HTML to strip out the `<mx-reply>` part.

Before | After
---  |  ---
![](https://user-images.githubusercontent.com/558581/153692011-2f0e7114-fcb4-481f-b217-49f461b1740a.png) | ![](https://user-images.githubusercontent.com/558581/153692016-52582fdb-abd9-439d-9dce-3f04da6959db.png)

Before:
```js
// Browser (Chrome, Firefox)
new DOMParser().parseFromString(`<div>foo</div>`, "text/html").body.outerHTML;
// '<body><div>foo</div></body>'

// `linkedom` ❌
new DOMParser().parseFromString(`<div>foo</div>`, "text/html").body.outerHTML;
// '<body></body>'
```

After (consistent matching output):

```js
// Browser (Chrome, Firefox)
new DOMParser().parseFromString(`<!DOCTYPE html><html><body><div>foo</div></body></html>`, "text/html").body.outerHTML;
// '<body><div>foo</div></body>'

// `linkedom`
new DOMParser().parseFromString(`<!DOCTYPE html><html><body><div>foo</div></body></html>`, "text/html").body.outerHTML;
// '<body><div>foo</div></body>'
```

`linkedom` goal is to be close to the current DOM standard, but [not too close](https://github.com/WebReflection/linkedom#faq). Focused on the streamlined cases for server-side rendering (SSR).

Here is some context around getting `DOMParser` to interpret things better. The conclusion was to only support the explicit standard cases with a `<html><body></body></html>` specified instead of adding the magic HTML document creation and massaging that the browser does.

 - WebReflection/linkedom#106
 - WebReflection/linkedom#108

 ---

Part of #653 to support server-side rendering Hydrogen for the [`matrix-public-archive`](https://github.com/matrix-org/matrix-public-archive) project.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DOMParser.parseFromString(...) does not insert HTML into <html><body> document boilerplate
2 participants