[Bug] CheerioAPI#html() removes the <html>, <head>, and <body> elements when isDocument is set to false #4188

Open
TheCSDev opened this issue Oct 31, 2024 · 0 comments

What happened?

Errata

As described in the issue title, what I believe to be a bug is that CheerioAPI's .html() function outputs an HTML string that excludes the <html>, <head>, and <body> tags when the document is loaded with isDocument: false. The contents of those elements remain intact, but the enclosing tags themselves are removed.

My assumption was that isDocument: false would simply give you freer control over the document, without enforcing HTML document rules such as where head and body and their children must appear. I never expected Cheerio to outright remove document-level elements when isDocument: false.

Lack of duplicate issues

(Note that GitHub's search function doesn't just search "word for word", as it also looks for issues containing your key words.)

Reproducing the issue

Below is a simple .js script I have written to demonstrate how this issue is easily reproduced:

// Required modules
const mCheerio = require("cheerio");

// A complete HTML document with all the structural tags and the DOCTYPE declaration
const oHtml = `
<!DOCTYPE html>
<html>
	<head>
		<title>Hello World!</title>
	</head>
	<body>This is a 'hello world' document!</body>
</html>
`;

console.log("==================================================");
console.log("                   INPUT HTML");
console.log("==================================================");
console.log(oHtml);


console.log("==================================================");
console.log("      cheerio(oHtml, undefined, undefined)");
console.log("==================================================");
console.log(mCheerio.load(oHtml).html()); //isDocument is true by default here


console.log("==================================================");
console.log("       cheerio(oHtml, undefined, false)");
console.log("==================================================");
console.log(mCheerio.load(oHtml, undefined, false).html()); //isDocument is set to false intentionally

console.log("==================================================");

Once executed, we get the following output:

R:\cheerio-test>test.js
==================================================
                   INPUT HTML
==================================================

<!DOCTYPE html>
<html>
        <head>
                <title>Hello World!</title>
        </head>
        <body>This is a 'hello world' document!</body>
</html>

==================================================
      cheerio(oHtml, undefined, undefined)
==================================================
<!DOCTYPE html><html><head>
                <title>Hello World!</title>
        </head>
        <body>This is a 'hello world' document!

</body></html>
==================================================
       cheerio(oHtml, undefined, false)
==================================================




                <title>Hello World!</title>

        This is a 'hello world' document!


==================================================

R:\cheerio-test>

In the second result, where isDocument is explicitly set to false, I expected the <html>, <head>, and <body> elements to remain intact.
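To make that expectation concrete, here is a minimal sketch of a check that could be run against the two outputs quoted above. The helper missingStructuralTags is hypothetical and not part of Cheerio; the two strings are copied from the console output in this report with the whitespace condensed, so the sketch runs without Cheerio installed.

```javascript
// Hypothetical helper (not part of Cheerio): returns the structural tags
// that do not appear as opening tags in a serialized HTML string.
function missingStructuralTags(html) {
  return ["html", "head", "body"].filter(
    (tag) => !new RegExp(`<${tag}[\\s>]`, "i").test(html)
  );
}

// Copied from the isDocument: true run above (whitespace condensed).
const documentModeOutput =
  "<!DOCTYPE html><html><head><title>Hello World!</title></head>" +
  "<body>This is a 'hello world' document!</body></html>";

// Copied from the isDocument: false run above (whitespace condensed).
const fragmentModeOutput =
  "<title>Hello World!</title>This is a 'hello world' document!";

console.log(missingStructuralTags(documentModeOutput)); // []
console.log(missingStructuralTags(fragmentModeOutput)); // [ 'html', 'head', 'body' ]
```

The expectation in this report is that the second call would also return an empty array.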

Why don't you simply use isDocument: true, and avoid this issue?

As mentioned earlier, I would like to have much more granular control over the contents of the HTML before outputting it via .html().

In my case, I use Cheerio to dynamically render web pages, using custom "server-side-only" elements at various points in my HTML documents, which then get parsed, handled, and rendered. Often those elements need to be placed outside of the body, or sometimes inside the head. Using isDocument: true would interfere with this, as it would take all "inappropriately placed" elements and move them into body, which is undesirable in my case.

Consider the following HTML document that I just made up, with made-up custom elements:

<!DOCTYPE html>
<html lang="en">
	<head>
		<x-generate-title someMetaData="blah blah blah"></x-generate-title>
		<x-generate-meta-tag context="you get the point..."></x-generate-meta-tag>
	</head>
	<body>
		<x-insert-body from="somewhere specific"></x-insert-body>
	</body>
</html>

<x-execute-after whatToExecute="thing"></x-execute-after>

In this example, I can then do something like:

const $ = mCheerio.load(html, undefined, false);
$("x-generate-title").replaceWith(genTitle($("x-generate-title").attr("someMetaData")));
return $.html();

This would have been impossible to do properly with isDocument: true, as Cheerio would then take all those custom elements and move them into body, thus breaking the order and placement of the dynamically generated elements.

And lastly, what if I simply wanted to work with a snippet of a full HTML document, like for example <head>...</head>, without the entire HTML schema, and I output the results via .html(), only to find that the head is missing?

Edit: All edits are typo corrections.
