Parsing fails and there is raw html code in rendered html #200

Appress · 2023-03-20T17:08:30Z

Hi there,
Parsing fails for some pages (e.g. this article).

To reproduce, parse the article's HTML as follows and open the generated HTML in a browser:

    const document = (new DOMParser).parseFromString(htmlFromTheArticle, 'text/html');
    const html = document.body.innerHTML;

Instead of the original page, the result now includes raw HTML code, for example:

Διαβάστε το πλήρες κείμενο του σημειώματος του CEO της UBS στο <a href="https://www.newmoney.gr/roh/bloomberg/to-esoteriko-simioma-tou-ceo-tis-ubs-pros-tous-ergazomenous-meta-tin-exagora-tis-credit-suisse/" target="_blank" rel="noopener noreferrer">newmoney.gr</a> <a href="https://www.protothema.gr/oles-oi-eidiseis/" target="_blank" rel="noopener noreferrer">Ειδήσεις σήμερα:</a> <a href="https://www.protothema.gr/greece/article/1351532/xanthi-ston-eisaggelea-simera-o-36hronos-pou-skotose-ton-45hrono-epeidi-ton-theorise-roufiano/" target="_blank" rel="noopener noreferrer">...
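One way to flag affected documents automatically is to scan the serialized output for escaped tags. This is only a sketch: `findEscapedTags` is a hypothetical helper, not part of any library mentioned here, and it assumes that markup wrongly parsed as text ends up in a text node, so its angle brackets are escaped as `&lt;` and `&gt;` when read back through `innerHTML`.

```typescript
// Hypothetical helper, not part of any library mentioned in this issue.
// Assumption: markup that was wrongly parsed as text is serialized with
// its angle brackets escaped ("&lt;" and "&gt;").
function findEscapedTags(serialized: string): string[] {
  // Matches escaped opening or closing tags, e.g. "&lt;a href=...&gt;" or "&lt;/a&gt;".
  const pattern = /&lt;\/?([a-zA-Z][a-zA-Z0-9-]*)\b[^<>]*?&gt;/g;
  const names: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(serialized)) !== null) {
    // Collect the tag name of every escaped tag that was found.
    names.push(match[1].toLowerCase());
  }
  return names;
}

// Leaked markup: the anchor survived only as escaped text.
const leakedTags = findEscapedTags(
  'Read &lt;a href="https://example.com/" target="_blank"&gt;more&lt;/a&gt;'
);

// Correctly parsed markup: the anchor is a real element, nothing is escaped.
const cleanTags = findEscapedTags('Read <a href="https://example.com/">more</a>');
```

A non-empty result is only a hint, since a page that legitimately displays HTML source (a tutorial, for instance) would also match.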

This has already happened with many HTML documents. The culprit is htmlparser2: if I downgrade to v6.1.0, it works properly.

I tried to debug it, and the problem originates in Tokenizer.ts. When I simply replace these lines

            if (this.isSpecial) {
                this.state = State.InSpecialTag;
                this.sequenceIndex = 0;
            } else {
                this.state = State.Text;
            }

with

    this.state = State.Text;

it works properly. I'm not sure what the proper fix is, one that will not hurt the performance of htmlparser2, so I opened this issue instead.
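To illustrate why a tokenizer that stays in a raw-text state would produce exactly this symptom, here is a deliberately simplified toy model. It is not htmlparser2's Tokenizer: `tokenize`, `Token`, `RAW_TEXT_TAGS`, and the `initialRawTag` parameter are all hypothetical names, and `initialRawTag` merely simulates a stale "special" flag without claiming to know what sets it in the real library.

```typescript
// Toy model only. This is NOT htmlparser2's Tokenizer; every name is hypothetical.
type Token = { kind: "open" | "close" | "text"; value: string };

// Elements whose content this toy treats as raw text (an assumption for the demo).
const RAW_TEXT_TAGS = new Set(["script", "style", "title"]);

function tokenize(html: string, initialRawTag: string | null = null): Token[] {
  const tokens: Token[] = [];
  const lower = html.toLowerCase();
  // Name of the raw-text element we believe we are inside, if any.
  let rawTag: string | null = initialRawTag;
  let i = 0;

  while (i < html.length) {
    if (rawTag !== null) {
      // Raw-text mode: everything up to the matching closing tag is text.
      const end = lower.indexOf(`</${rawTag}`, i);
      const stop = end === -1 ? html.length : end;
      if (stop > i) tokens.push({ kind: "text", value: html.slice(i, stop) });
      if (end === -1) break;
      const gt = html.indexOf(">", end);
      tokens.push({ kind: "close", value: rawTag });
      i = gt === -1 ? html.length : gt + 1;
      rawTag = null;
    } else if (html[i] === "<") {
      const gt = html.indexOf(">", i);
      if (gt === -1) {
        // Unterminated tag: treat the remainder as text.
        tokens.push({ kind: "text", value: html.slice(i) });
        break;
      }
      const inner = html.slice(i + 1, gt).trim();
      if (inner.startsWith("/")) {
        tokens.push({ kind: "close", value: inner.slice(1).trim().toLowerCase() });
      } else {
        const name = (inner.split(/\s/)[0] ?? "").toLowerCase();
        tokens.push({ kind: "open", value: name });
        // Entering a raw-text element switches the tokenizer into raw-text mode.
        if (RAW_TEXT_TAGS.has(name)) rawTag = name;
      }
      i = gt + 1;
    } else {
      // Ordinary text runs up to the next "<".
      const lt = html.indexOf("<", i);
      const stop = lt === -1 ? html.length : lt;
      tokens.push({ kind: "text", value: html.slice(i, stop) });
      i = stop;
    }
  }
  return tokens;
}

const sample = '<p>Read <a href="x">more</a></p>';
const healthy = tokenize(sample);        // markup is split into tag and text tokens
const stuck = tokenize(sample, "title"); // simulated stale flag: all markup becomes text
```

In the `stuck` case the whole input comes back as a single text token, which a serializer would then escape and a browser would display as raw HTML. Whether the real regression follows this path is an assumption based on the workaround described above.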


WebReflection · 2023-03-20T18:06:22Z

> The culprit is htmlparser2, if I downgrade to v6.1.0

So ... this bug is in a library used by this repository? If that's the case, what are you expecting me to do here? 🤔

WebReflection added the 3rd Party Issue from external dependencies label Mar 20, 2023
