xml support? #124

paulcarroty · 2021-06-01T19:44:19Z

Worth mentioning in the README.
Tested on an Atom feed: works fine.

milahu · 2021-06-04T11:25:54Z

fails to ~~parse~~ query docx, which uses namespaced tags like <w:t>hello</w:t>

const fs = require("fs");
const JSZip = require("jszip");
const { parse } = require("node-html-parser");

const docxPath = process.argv[2];

async function main() {
  // a .docx file is a zip archive; the document body is in word/document.xml
  const data = fs.readFileSync(docxPath);
  const zip = await JSZip.loadAsync(data);
  const xml = await zip.files["word/document.xml"].async("text");
  const doc = parse(xml);

  //console.dir(doc.querySelectorAll('w:t')); // Error: unmatched pseudo-class :t

  console.dir(doc.querySelectorAll('w\\:t')); // == [] (empty result)
}

main();

alternatives: xml2js, ...

taoqf · 2021-06-07T03:26:17Z

I believe the exception is thrown because we cannot select a node whose tag name contains ":", not because we cannot parse it.

milahu · 2021-06-07T09:54:17Z

yes, sorry ... it's a parser bug in fb55/css-what#512

fb55 · 2021-06-07T19:05:22Z

Not a parser bug, but CSS requires the colon to be escaped here.
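A minimal sketch of that escaping, assuming the input is a raw tag name with no escapes already applied. `escapeTagSelector` is a hypothetical helper written for illustration; it is not part of css-what or node-html-parser.

```javascript
// Hypothetical helper: escape every colon in a tag name so a CSS selector
// parser reads it as part of the name instead of as a pseudo-class.
function escapeTagSelector(tagName) {
  return tagName.replace(/:/g, '\\:');
}

// The returned string contains a single backslash before the colon.
console.log(escapeTagSelector('w:t')); // prints w\:t
console.log(escapeTagSelector('div')); // prints div
```

Inside a JavaScript string literal the backslash itself has to be doubled, which is why the selector is written as 'w\\:t' in source code.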

milahu · 2021-06-07T21:33:07Z

> CSS requires the colon to be escaped here

aah, thanks!

fixed my sample code, now css-what gives

{ rules: [ { type: 'tag', name: 'w:t', namespace: null } ] }

and querySelectorAll returns an empty array ...

new problem seems to be in node-html-parser:
only a few xml tags are parsed, and the rest is parsed as a TextNode
including </w:body></w:document></documentfragmentcontainer>

sample input docx, generated by libreoffice writer

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<w:document
 xmlns:o="urn:schemas-microsoft-com:office:office"
 xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
 xmlns:v="urn:schemas-microsoft-com:vml"
 xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
 xmlns:w10="urn:schemas-microsoft-com:office:word"
 xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
 xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
 xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
 xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
 xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
 xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
 mc:Ignorable="w14 wp14"
>

<w:body>
<w:p>
<w:pPr>

<w:pStyle w:val="Normal"/>

<w:bidi w:val="0"/>
<w:jc w:val="left"/>
<w:rPr>
<w:rFonts w:ascii="Liberation Sans" w:hAnsi="Liberation Sans"/>
<w:sz w:val="20"/>
<w:szCs w:val="20"/>
</w:rPr>
</w:pPr>
<w:r>
<w:rPr>
<w:rFonts w:ascii="Liberation Sans" w:hAnsi="Liberation Sans"/>
<w:sz w:val="20"/>
<w:szCs w:val="20"/>
</w:rPr>
<w:t>hello</w:t>
<w:tab/>
<w:tab/>

the textnode starts at <w:bidi w:val="0"/>

<w:body><w:p><w:pPr> are missing

bamadesigner · 2021-06-21T15:32:11Z

It also won't parse <link> elements in XML. It's not returning the value. I'm digging through the code now but I'm guessing the code is written to assume <link>s would never have innerText because in usual HTML they do not. I'm going to have to find another parser because this is a requirement for my project. But would love to use your faster library if/when it supports XML. Thanks for the great work!

nonara · 2021-09-29T00:06:49Z

> It also won't parse <link> elements in XML. It's not returning the value.

I had a look into this. In the HTML5 spec, link is a void element (meaning it is self-closing and cannot have content). Because we don't have a mode for the XML spec, this unfortunately can't be addressed, as it's beyond the scope of the library.
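To illustrate why the text is lost, here is a rough sketch of how an HTML-mode parser might decide that a tag cannot have children. The list is the set of void elements from the HTML Living Standard; `isVoidElement` is a hypothetical name, and this is not the library's actual source.

```javascript
// Void elements per the HTML Living Standard: they never have content.
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

// Hypothetical check: HTML tag names are case-insensitive.
function isVoidElement(tagName) {
  return VOID_ELEMENTS.has(tagName.toLowerCase());
}

console.log(isVoidElement('link'));  // true
console.log(isVoidElement('title')); // false
```

Under a rule like this, the text in `<link>https://example.org</link>` would not be attached to the link element, which matches the behaviour reported for feeds.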

> only a few xml tags are parsed, and the rest is parsed as a TextNode

@milahu I actually think you've run into the same issue as this: #156

I believe it matched w:pStyle as style and treated it as a block-text element. I'll try to get a fix out for this quickly.
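A hypothetical illustration of that kind of partial match, not the library's actual code. It assumes the block-text names are script, noscript, style and pre, and compares a substring check with an exact comparison; both function names are made up for this sketch.

```javascript
// Assumed list of block-text element names (illustrative only).
const BLOCK_TEXT_NAMES = ['script', 'noscript', 'style', 'pre'];

// Loose check: true when a block-text name occurs anywhere in the tag name.
function looseIsBlockText(tagName) {
  const lower = tagName.toLowerCase();
  return BLOCK_TEXT_NAMES.some((name) => lower.includes(name));
}

// Strict check: true only when the whole tag name equals a block-text name.
function strictIsBlockText(tagName) {
  return BLOCK_TEXT_NAMES.includes(tagName.toLowerCase());
}

console.log(looseIsBlockText('w:pStyle'));  // true
console.log(strictIsBlockText('w:pStyle')); // false
console.log(looseIsBlockText('premises'));  // true
console.log(strictIsBlockText('pre'));      // true
```

With the loose check, everything after `<w:pStyle>` would be swallowed as raw text until a matching close tag is found, which would explain the large TextNode in the docx sample.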

A temporary workaround is to use the following config:

{ blockTextElements: { script: true, noscript: true } }
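A sketch of how that config could be passed to the parser. The `parse` function is handed in as an argument so the snippet does not depend on how the library is loaded; `parseWithWorkaround` and `xmlFriendlyOptions` are hypothetical names.

```javascript
// Options from the workaround: 'style' and 'pre' are left out so that
// tags such as w:pStyle are not treated as block-text elements.
const xmlFriendlyOptions = { blockTextElements: { script: true, noscript: true } };

// Hypothetical wrapper: forwards the source and the workaround options.
function parseWithWorkaround(parse, source) {
  return parse(source, xmlFriendlyOptions);
}

// Usage, assuming node-html-parser is installed:
// const { parse } = require('node-html-parser');
// const doc = parseWithWorkaround(parse, xml);
// console.dir(doc.querySelectorAll('w\\:t'));
```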

I'm going to go ahead and close this issue for housekeeping, but you can track the applicable bug here:

Something strange with premises tag and querySelector() / querySelectorAll() #156


taoqf added the enhancement label Jun 2, 2021

nonara closed this as completed Sep 29, 2021

nonara added a commit to nonara/node-html-parser that referenced this issue Oct 3, 2021

2593be5 fix: blockTextElements incorrectly matching partial tag (detail) (fixes taoqf#124) Tags 'premises' is matched as 'pre', 'pstyle' as 'style', etc.

nonara added a commit to nonara/node-html-parser that referenced this issue Oct 3, 2021

eb50555 fix: Add null to return type for HTMLElement#querySelector (closes taoqf#124)

nonara mentioned this issue Oct 3, 2021

Multiple fixes #161 (Merged)

nonara added a commit to nonara/node-html-parser that referenced this issue Oct 3, 2021

5a2dd0c fix: blockTextElements incorrectly matching partial tag (detail) (fixes taoqf#156 fixes taoqf#124) Tags 'premises' is matched as 'pre', 'pstyle' as 'style', etc.

nonara added a commit that referenced this issue Oct 10, 2021

6823349 fix: blockTextElements incorrectly matching partial tag (detail) (fixes #156 fixes #124) Tags 'premises' is matched as 'pre', 'pstyle' as 'style', etc.
