[Bug] `CheerioAPI#html()` removes the `<html>`, `<head>`, and `<body>` elements when `isDocument` is set to `false` #4188

TheCSDev · 2024-10-31T07:26:09Z

What happened?

Errata

As described in the issue title, what I believe to be a bug is that CheerioAPI's .html() function outputs an HTML string that excludes the <html>, <head>, and <body> tags when the input is loaded with isDocument: false. The contents of those elements remain intact, but the enclosing tags themselves are removed.

My general assumption regarding isDocument: false was that Cheerio would simply give you more free control over the document, without enforcing any HTML-like rules, such as how head and body and their children should be defined. I never expected Cheerio to outright remove elements relating to HTML documents when isDocument: false.

Lack of duplicate issues

(Note that GitHub's search function doesn't just search "word for word", as it also looks for issues containing your key words.)

Reproducing the issue

Below is a simple .js script I have written to demonstrate how this issue is easily reproduced:

// Required modules
const mCheerio = require("cheerio");

// A complete HTML document with all the important tags and the DOCTYPE declaration
const oHtml = `
<!DOCTYPE html>
<html>
	<head>
		<title>Hello World!</title>
	</head>
	<body>This is a 'hello world' document!</body>
</html>
`;

console.log("==================================================");
console.log("                   INPUT HTML");
console.log("==================================================");
console.log(oHtml);


console.log("==================================================");
console.log("      cheerio(oHtml, undefined, undefined)");
console.log("==================================================");
console.log(mCheerio.load(oHtml).html()); //isDocument is true by default here


console.log("==================================================");
console.log("       cheerio(oHtml, undefined, false)");
console.log("==================================================");
console.log(mCheerio.load(oHtml, undefined, false).html()); //isDocument is set to false intentionally

console.log("==================================================");

Once executed, we get the following output:

R:\cheerio-test>test.js
==================================================
                   INPUT HTML
==================================================

<!DOCTYPE html>
<html>
        <head>
                <title>Hello World!</title>
        </head>
        <body>This is a 'hello world' document!</body>
</html>

==================================================
      cheerio(oHtml, undefined, undefined)
==================================================
<!DOCTYPE html><html><head>
                <title>Hello World!</title>
        </head>
        <body>This is a 'hello world' document!

</body></html>
==================================================
       cheerio(oHtml, undefined, false)
==================================================




                <title>Hello World!</title>

        This is a 'hello world' document!


==================================================

R:\cheerio-test>

In the second Cheerio result, where isDocument is explicitly set to false, I expected the <html>, <head>, and <body> elements to remain intact.
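The comparison can also be expressed as a check rather than read off the console. The following is only a minimal sketch: `missingStructuralTags` and `compareModes` are hypothetical helper names made up for illustration (they are not part of Cheerio), and `compareModes` assumes it is handed the module returned by `require("cheerio")`.

```javascript
// Hypothetical helper (not part of Cheerio): lists which of the structural
// tags <html>, <head>, and <body> are absent from a serialized HTML string.
function missingStructuralTags(serialized) {
  return ["html", "head", "body"].filter(
    (tag) => !new RegExp(`<${tag}[\\s>]`, "i").test(serialized)
  );
}

// Hypothetical helper: serializes the same input with the default
// isDocument (true) and with isDocument = false, then reports the missing
// tags for each mode. `cheerio` is the module from require("cheerio").
function compareModes(cheerio, html) {
  return {
    document: missingStructuralTags(cheerio.load(html).html()),
    fragment: missingStructuralTags(cheerio.load(html, undefined, false).html()),
  };
}
```

Going by the output shown above, calling `compareModes(mCheerio, oHtml)` in the reproduction script should report nothing missing for `document` and all three tags missing for `fragment`.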

Why don't you simply use `isDocument: true`, and avoid this issue?

As mentioned earlier, I would like to have much more granular control over the contents of the HTML before outputting it via .html().

In my case, I use Cheerio to dynamically render web pages, using custom "server-side-only" elements at various points in my HTML documents that then get parsed, handled, and rendered. Those elements often need to be placed outside of the body, or sometimes even inside the head. Using isDocument: true would interfere with this, as it would take all "inappropriately placed" elements and move them into body, which is undesirable in my case.

Consider the following HTML document that I just made up, with made-up custom elements:

<!DOCTYPE html>
<html lang="en">
	<head>
		<x-generate-title someMetaData="blah blah blah"></x-generate-title>
		<x-generate-meta-tag context="you get the point..."></x-generate-meta-tag>
	</head>
	<body>
		<x-insert-body from="somewhere specific"></x-insert-body>
	</body>
</html>

<x-execute-after whatToExecute="thing"></x-execute-after>

In this example, I can then do something like:

let $ = mCheerio.load(html, undefined, false);
$("x-generate-title").replaceWith(genTitle($("x-generate-title").attr("someMetaData")));
return $.html();

This would have been impossible to do properly with isDocument: true, as Cheerio would then take all those custom elements and move them into body, thus messing up the order and placement of the dynamically generated elements.

And lastly, what if I simply wanted to work with a snippet of a full HTML document, like for example <head>...</head>, without the entire HTML schema, and I output the results via .html(), only to find that the head is missing?

Edit: All edits are typo corrections.
