PRE block child nodes not parsed & DOCTYPE not recognized #78
Comments
|
That will work. Thanks for such a fast response!
I understand if you don't want to ignore it. That makes sense. I should have probably clarified the problem better. There are three types of nodes that this parser is creating during parsing.
It turns out that The difficulty here is that this parser is treating DOCTYPE as a In this case, if you'd like to preserve it, it would likely be best to create a proper Otherwise, anything which assumes a This should be doable without hurting performance. If you'd like help, let me know and I can look into it and make sure perf numbers aren't affected.
In terms of performance, you'd likely see little to no difference. The reason I'd argue for changing it is because an HTML parser should comply with HTML spec. In the case of But I want to clarify that I'm not trying to nit-pick. Your I decided that I'd leave the comment here because due to the difference in this and HTML spec, there is a chance that duplicate issues may be opened. So if in the future this is something you want to look into, I'll be interested in following the conversation. |
As a side note, if you'd ever like help in maintaining this, feel free to send a ping any time. I do a lot of work with parsers and compilers. Thanks again for the fast response and fix! |
Hi all! I'm using node-html-parser 2.1.0 with the blockTextElements to get the nodes from
|
@sebromero Yes. If you want to get the code inside pre, remove pre from blockTextElements:
parser.parse(this.rawHTML, {
  blockTextElements: {
    script: true,
    noscript: true,
    style: true,
    code: false
  }
}); |
@taoqf That worked perfectly, thank you! It wasn't fully clear to me from the documentation how to use the config object. I thought those were boolean flags on whether or not to parse those elements as blockTextElements. 😅 |
Hi, from my tests on using this library
Would I be correct to assume this is the intended behavior? If there's anything I'm missing I'd appreciate any heads up! |
@dlemfh It looks like the logic is that the presence of the key indicates that it's block text, and the value (true or false) indicates whether its contents should be ignored or preserved. In other words, it seems that it's behaving as expected. If you want to not treat |
@nonara Seems that way. Thanks!! |
But it seems you must still pass an empty object as the blockTextElements, otherwise the pre elements will not be parsed. If this was intended, then it's too confusing, otherwise it's too buggy. |
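For anyone landing here later, the behaviour described in the last few comments can be sketched as a small helper that builds the blockTextElements object with selected tags left out, so their children get parsed as regular nodes. This is only an illustration: the helper name blockTextElementsWithout is hypothetical, and the default list below (script, noscript, style, pre) is an assumption, so please check it against the README of the version you have installed.

```javascript
// Assumed defaults; verify against the README of your installed version.
const ASSUMED_DEFAULT_BLOCK_TEXT = {
  script: true,
  noscript: true,
  style: true,
  pre: true,
};

// Hypothetical helper: copies the defaults but drops the listed tags.
// Per the thread, the *presence* of a key marks a tag as block text,
// so setting `pre: false` is not the same as omitting `pre`.
function blockTextElementsWithout(tagsToParse, defaults = ASSUMED_DEFAULT_BLOCK_TEXT) {
  const result = {};
  for (const [tag, value] of Object.entries(defaults)) {
    if (!tagsToParse.includes(tag)) {
      result[tag] = value;
    }
  }
  return result;
}

// Options object that would be passed as the second argument to parse().
const options = { blockTextElements: blockTextElementsWithout(['pre']) };
console.log(options);
```

Passing the object explicitly also sidesteps the question of what happens when blockTextElements is not supplied at all.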
Hi! First, thanks for the great work on this library! This is truly the fastest and best out there.
I needed a lightning fast html-to-markdown library, and everything out there was extremely slow. With the help of node-html-parser, I was able to write something that blazes through it! (I'll be writing a readme soon, but it is a full release)
In the process, I discovered two issues, which I'll detail below.
PRE blocks don't parse children
In pre-formatted elements, contents are always being treated as text, so child-nodes do not get parsed.
This causes an issue for multi-line code blocks. Per spec, a multi-line code block is <code> wrapped in <pre>. (see: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/code#Notes)
The following treats <code ... > as a text node, where spec behaviour is to allow HTML child nodes within pre
You can see how we work around this here: https://github.com/crosstype/node-html-markdown/blob/master/src/config.ts#L79-L89
DOCTYPE is parsed as a text node
<!DOCTYPE ...> nodes are being parsed as text nodes. I think the desired behaviour here would be to simply ignore the node.
To work around this, we're temporarily replacing them ahead of time. (see: https://github.com/crosstype/node-html-markdown/blob/master/src/main.ts#L40-L42)
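A minimal sketch of that kind of pre-processing, for illustration only. It is not the code at the linked lines; the function name stripDoctype and the regular expression are assumptions. It removes simple doctype declarations from the input string before it is handed to the parser.

```javascript
// Hypothetical helper: removes <!DOCTYPE ...> declarations (any letter case)
// so the parser never sees them. Assumes the declaration contains no '>'
// before its end, so a doctype with an internal subset ([...]) is not handled.
function stripDoctype(html) {
  return html.replace(/<!DOCTYPE[^>]*>/gi, '');
}

console.log(stripDoctype('<!DOCTYPE html><p>hi</p>')); // <p>hi</p>
```

The cleaned string can then be passed to the parser as usual. If the doctype needs to survive a round trip, it would have to be captured separately rather than discarded.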
Side-note
This and another of my libraries broke after a recent yarn install, due to the fact that the options we passed were no longer valid. We were specifying { pre: true, style: false }
Not a major issue, but if you're following semver, changing public API options is generally considered a breaking change, as it could cause projects which use it as a dependency to fail to compile.
Thanks again for the great work, and I hope you have a good day!