xml support? #124

Closed
paulcarroty opened this issue Jun 1, 2021 · 7 comments

@paulcarroty

Worth mentioning in the README.
Tested on an Atom feed: works fine.

@milahu

milahu commented Jun 4, 2021

fails to query docx XML, which uses namespaced tags like <w:t>hello</w:t>

const fs = require("fs");
const JSZip = require("jszip");
const { parse } = require("node-html-parser");

// path to the .docx file, passed as the first CLI argument
const docxPath = process.argv[2];

async function main() {
  // a .docx file is a zip archive; the document body is in word/document.xml
  const data = fs.readFileSync(docxPath);
  const zip = await JSZip.loadAsync(data);
  const xml = await zip.files["word/document.xml"].async("text");
  const doc = parse(xml);

  //console.dir(doc.querySelectorAll('w:t')); // Error: unmatched pseudo-class :t

  console.dir(doc.querySelectorAll('w\\:t')); // == [] (empty result)
}

main();

alternatives: xml2js, ...

@taoqf
Owner

taoqf commented Jun 7, 2021

I believe the exception is thrown because we cannot select a node whose tag name contains a colon (:), not because we cannot parse it.

@milahu

milahu commented Jun 7, 2021

yes, sorry ... it's a parser bug in fb55/css-what#512

@fb55

fb55 commented Jun 7, 2021

Not a parser bug, but CSS requires the colon to be escaped here.
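
A minimal sketch of the escaping described above. The helper name escapeTagName is hypothetical and is not part of css-what or node-html-parser; it only backslash-escapes colons so that a namespaced tag name is read as a tag selector rather than as a tag followed by a pseudo-class.

```javascript
// Hypothetical helper, for illustration only (not a library API):
// backslash-escape every colon in a tag name for use in a CSS selector.
function escapeTagName(tagName) {
  return tagName.replace(/:/g, '\\:');
}

// In JavaScript source the backslash itself must be escaped in string
// literals, so the selector text w\:t is written as 'w\\:t'.
const selector = escapeTagName('w:t');
console.log(selector); // w\:t
```

The resulting string can then be passed to querySelectorAll in place of the raw tag name.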

@milahu
Copy link

milahu commented Jun 7, 2021

> CSS requires the colon to be escaped here

aah, thanks!

fixed my sample code, now css-what gives

{ rules: [ { type: 'tag', name: 'w:t', namespace: null } ] }

and querySelectorAll returns an empty array ...

new problem seems to be in node-html-parser:
only a few xml tags are parsed, and the rest is parsed as a TextNode
including </w:body></w:document></documentfragmentcontainer>

sample input docx, generated by libreoffice writer

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<w:document
 xmlns:o="urn:schemas-microsoft-com:office:office"
 xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
 xmlns:v="urn:schemas-microsoft-com:vml"
 xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
 xmlns:w10="urn:schemas-microsoft-com:office:word"
 xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
 xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
 xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
 xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
 xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
 xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
 mc:Ignorable="w14 wp14"
>

<w:body>
<w:p>
<w:pPr>

<w:pStyle w:val="Normal"/>

<w:bidi w:val="0"/>
<w:jc w:val="left"/>
<w:rPr>
<w:rFonts w:ascii="Liberation Sans" w:hAnsi="Liberation Sans"/>
<w:sz w:val="20"/>
<w:szCs w:val="20"/>
</w:rPr>
</w:pPr>
<w:r>
<w:rPr>
<w:rFonts w:ascii="Liberation Sans" w:hAnsi="Liberation Sans"/>
<w:sz w:val="20"/>
<w:szCs w:val="20"/>
</w:rPr>
<w:t>hello</w:t>
<w:tab/>
<w:tab/>

the textnode starts at <w:bidi w:val="0"/>

<w:body><w:p><w:pPr> are missing

@bamadesigner

It also won't parse <link> elements in XML. It's not returning the value. I'm digging through the code now but I'm guessing the code is written to assume <link>s would never have innerText because in usual HTML they do not. I'm going to have to find another parser because this is a requirement for my project. But would love to use your faster library if/when it supports XML. Thanks for the great work!

@nonara
Collaborator

nonara commented Sep 29, 2021

> It also won't parse <link> elements in XML. It's not returning the value.

I had a look into this. In the HTML5 spec, link is a void element (meaning it is self-closing). Because we don't have an XML mode, this unfortunately can't be addressed, as it's beyond the scope of the library.
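
To illustrate the void-element point, here is a rough sketch, not the library's code. The set lists the HTML void elements; the function canHaveContent and its xmlMode flag are hypothetical and only show how an HTML-only parser would decide that link takes no content, while an XML-aware one would not apply that rule.

```javascript
// HTML void elements: they never have content or an end tag.
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

// Hypothetical decision function. xmlMode is an assumed flag for
// illustration; the library discussed here has no such mode.
function canHaveContent(tagName, xmlMode = false) {
  if (xmlMode) return true; // XML has no void elements
  return !VOID_ELEMENTS.has(tagName.toLowerCase());
}

console.log(canHaveContent('link'));       // false
console.log(canHaveContent('link', true)); // true
```

Under HTML rules, the text inside an RSS or Atom style link element is therefore not treated as the element's content.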

> only a few xml tags are parsed, and the rest is parsed as a TextNode

@milahu I actually think you've run into the same issue as this: #156

I believe it matched w:pStyle as style and treated it as a block-text element. I'll try to get a fix out for this quickly.
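
As an illustration of this class of bug (an assumed sketch, not the library's actual matching code), compare a loose substring check against an exact tag-name comparison. The names BLOCK_TEXT_ELEMENTS, looseMatch and exactMatch are hypothetical.

```javascript
// Hypothetical list of elements whose content is kept as raw text.
const BLOCK_TEXT_ELEMENTS = ['script', 'noscript', 'style', 'pre'];

// Loose check: true if any listed name occurs anywhere in the tag name.
function looseMatch(tagName) {
  const lower = tagName.toLowerCase();
  return BLOCK_TEXT_ELEMENTS.some((name) => lower.includes(name));
}

// Exact check: true only if the whole tag name is a listed name.
function exactMatch(tagName) {
  return BLOCK_TEXT_ELEMENTS.includes(tagName.toLowerCase());
}

console.log(looseMatch('w:pStyle'), exactMatch('w:pStyle')); // true false
console.log(looseMatch('premises'), exactMatch('premises')); // true false
console.log(looseMatch('style'), exactMatch('style'));       // true true
```

With a loose check, everything after w:pStyle would be swallowed as raw text, which matches the symptom reported above.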

A temporary workaround is to use the following config:

{ blockTextElements: { script: true, noscript: true } }

I'm going to go ahead and close this issue for housekeeping, but you can track the applicable bug here:

@nonara nonara closed this as completed Sep 29, 2021
nonara added a commit to nonara/node-html-parser that referenced this issue Oct 3, 2021
taoqf#124)

Tags 'premises' is matched as 'pre', 'pstyle' as 'style', etc.
nonara added a commit to nonara/node-html-parser that referenced this issue Oct 3, 2021
@nonara nonara mentioned this issue Oct 3, 2021
nonara added a commit to nonara/node-html-parser that referenced this issue Oct 3, 2021
taoqf#156 fixes taoqf#124)

Tags 'premises' is matched as 'pre', 'pstyle' as 'style', etc.
nonara added a commit that referenced this issue Oct 10, 2021
#156 fixes #124)

Tags 'premises' is matched as 'pre', 'pstyle' as 'style', etc.