As described in the issue title, what I believe to be a bug is that CheerioAPI's .html() function outputs an HTML string that excludes the <html>, <head>, and <body> tags when the document is loaded with isDocument: false. The contents of those elements remain intact, but the enclosing tags themselves are removed.
My general assumption regarding isDocument: false was that Cheerio would simply give you more free control over the document, without enforcing any HTML-like rules, such as how head and body and their children should be defined. I never expected Cheerio to outright remove elements relating to HTML documents when isDocument: false.
(Note that GitHub's search function doesn't just search "word for word", as it also looks for issues containing your key words.)
Reproducing the issue
Below is a simple .js script I have written to demonstrate how this issue is easily reproduced:
// Required modules
const mCheerio = require("cheerio");

// A proper HTML document, including all important tags, including the DOCTYPE declaration
const oHtml = `<!DOCTYPE html>
<html>
    <head>
        <title>Hello World!</title>
    </head>
    <body>This is a 'hello world' document!</body>
</html>`;

console.log("==================================================");
console.log(" INPUT HTML");
console.log("==================================================");
console.log(oHtml);
console.log("==================================================");
console.log(" cheerio(oHtml, undefined, undefined)");
console.log("==================================================");
console.log(mCheerio.load(oHtml).html()); // isDocument is true by default here
console.log("==================================================");
console.log(" cheerio(oHtml, undefined, false)");
console.log("==================================================");
console.log(mCheerio.load(oHtml, undefined, false).html()); // isDocument is set to false intentionally
console.log("==================================================");
Once executed, we get the following output:
R:\cheerio-test>test.js
==================================================
INPUT HTML
==================================================
<!DOCTYPE html>
<html>
<head>
<title>Hello World!</title>
</head>
<body>This is a 'hello world' document!</body>
</html>
==================================================
cheerio(oHtml, undefined, undefined)
==================================================
<!DOCTYPE html><html><head>
<title>Hello World!</title>
</head>
<body>This is a 'hello world' document!
</body></html>
==================================================
cheerio(oHtml, undefined, false)
==================================================
<title>Hello World!</title>
This is a 'hello world' document!
==================================================
R:\cheerio-test>
In the 2nd result, where isDocument is explicitly set to false, the expectation is that the <html>, <head>, and <body> elements would remain intact.
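To make that expectation checkable, here is a minimal sketch of a helper that lists which structural tags are present in the input but missing from the output. The name `droppedStructuralTags` is made up for this illustration and is not part of Cheerio. The two strings are copied from the run above, so the snippet needs no dependencies.

```javascript
// Hypothetical helper (not a Cheerio API): report which of <html>, <head>,
// and <body> occur as opening tags in `input` but not in `output`.
function droppedStructuralTags(input, output) {
    // Matches "<name>" or "<name attr=...>", but not "<header>" or "<!DOCTYPE html>".
    const hasTag = (html, name) => new RegExp(`<${name}(\\s[^>]*)?>`, "i").test(html);
    return ["html", "head", "body"].filter(
        (name) => hasTag(input, name) && !hasTag(output, name)
    );
}

// Input document and isDocument: false output, copied from the run above.
const input = `<!DOCTYPE html>
<html>
<head>
<title>Hello World!</title>
</head>
<body>This is a 'hello world' document!</body>
</html>`;
const fragmentOutput = `<title>Hello World!</title>
This is a 'hello world' document!`;

console.log(droppedStructuralTags(input, fragmentOutput)); // [ 'html', 'head', 'body' ]
```

If the behaviour matched my expectation, the returned array would be empty for the isDocument: false output as well.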
Why don't you simply use isDocument: true, and avoid this issue?
As mentioned earlier, I would like to have much more granular control over the contents of the HTML before outputting it via .html().
In my case, I use Cheerio to dynamically render web pages, using custom "server-side-only" elements at various points in my HTML documents, which then get parsed, handled, and rendered. Often, those elements need to be placed at points in the HTML document that are outside of the body, or sometimes even inside the head. Using isDocument: true would interfere with this, as it would take all "inappropriately placed" elements and move them into body, which is undesirable in my case.
Consider the following HTML document that I just made up, with made-up custom elements:
<!DOCTYPE html>
<html lang="en">
    <head>
        <x-generate-title someMetaData="blah blah blah"></x-generate-title>
        <x-generate-meta-tag context="you get the point..."></x-generate-meta-tag>
    </head>
    <body>
        <x-insert-body from="somewhere specific"></x-insert-body>
    </body>
</html>
<x-execute-after whatToExecute="thing"></x-execute-after>
This would have been impossible to do properly with isDocument: true, as Cheerio would then take all those custom elements and move them into body, thus messing up the order and placement of dynamically generated elements.
And lastly, what if I simply wanted to work with a snippet of a full HTML document, like for example <head>...</head>, without the entire HTML schema, and I output the results via .html(), only to find that the head is missing?
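For comparison, here is a rough sketch of the behaviour I am after, using Cheerio's documented option to parse HTML with htmlparser2 instead of parse5 (`xml: { xmlMode: false }`). As far as I understand, htmlparser2 does not apply the HTML tree-construction rules, so structural tags and out-of-place elements should stay where they were written. `renderLoosely` and its `transform` callback are names I made up for this illustration, and I have not checked this against every Cheerio version. One caveat I would expect: attribute names such as someMetaData may come back lowercased.

```javascript
// Sketch only: parse without HTML tree-construction rules, apply a
// caller-supplied transformation, and serialize the result.
// `renderLoosely` and `transform` are hypothetical names, not Cheerio APIs.
function renderLoosely(html, transform = () => {}) {
    // Third-party dependency (npm install cheerio); required here so that
    // loading this file on its own does not need the package.
    const cheerio = require("cheerio");

    // `xml: { xmlMode: false }` selects htmlparser2 in HTML mode instead of parse5.
    const $ = cheerio.load(html, { xml: { xmlMode: false } });

    // Let the caller handle the custom "server-side-only" elements.
    transform($);

    // Serialize everything under the root, including any content after </html>.
    return $.html();
}
```

With the made-up document above, a call such as `renderLoosely(html, ($) => $("x-generate-title").replaceWith("<title>Generated title</title>"))` should leave <html>, <head>, and <body> in place and keep x-execute-after after the closing </html>. Passing only a <head>...</head> snippet should likewise return it with the head tags present.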
Edit: All edits are typo corrections.