Hi there, parsing fails for some pages (e.g. this article).
To replicate, open the generated HTML in a browser:
const document = (new DOMParser).parseFromString(htmlFromTheArticle, 'text/html'); const html = document.body.innerHTML;
Instead of rendering the original page, the output now includes raw HTML code as text:
Διαβάστε το πλήρες κείμενο του σημειώματος του CEO της UBS στο <a href="https://www.newmoney.gr/roh/bloomberg/to-esoteriko-simioma-tou-ceo-tis-ubs-pros-tous-ergazomenous-meta-tin-exagora-tis-credit-suisse/" target="_blank" rel="noopener noreferrer">newmoney.gr</a><br> <br> <strong><a href="https://www.protothema.gr/oles-oi-eidiseis/" target="_blank" rel="noopener noreferrer">Ειδήσεις σήμερα:</a><br> <br> <a href="https://www.protothema.gr/greece/article/1351532/xanthi-ston-eisaggelea-simera-o-36hronos-pou-skotose-ton-45hrono-epeidi-ton-theorise-roufiano/" target="_blank" rel="noopener noreferrer">...
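For anyone trying to find other affected pages, a rough check like the one below might help. It is only a heuristic sketch: `looksLikeLeakedMarkup` is a hypothetical helper, not part of htmlparser2 or any library in this thread, and it simply tests whether a string that should be plain text contains tag-like sequences such as the ones in the excerpt above.

```javascript
// Hypothetical helper (an assumption for illustration, not a library API).
// Returns true when `text` contains something shaped like an HTML tag,
// e.g. <a href="...">, <br>, or </a>.
function looksLikeLeakedMarkup(text) {
  // "<", optional "/", a tag name, optional quoted attributes, optional "/", ">".
  const tagLike =
    /<\/?[a-zA-Z][a-zA-Z0-9]*(\s+[a-zA-Z-]+\s*=\s*("[^"]*"|'[^']*'))*\s*\/?>/;
  return tagLike.test(text);
}

// Example inputs: one modeled on the broken output above, one plain sentence.
const leaked = 'στο <a href="https://www.newmoney.gr/" target="_blank">newmoney.gr</a><br>';
const plain = "Plain text with 1 < 2 and 3 > 2";
console.log(looksLikeLeakedMarkup(leaked), looksLikeLeakedMarkup(plain));
```

In a browser, the input could be the text content of the parsed body (for instance `document.body.textContent`), though that wiring is an assumption, and inline scripts that contain markup in strings would produce false positives.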
This has already happened with many HTML documents. The culprit is htmlparser2, if I downgrade to v6.1.0, it works properly.
I tried to debug it, and the problem seems to originate in Tokenizer.ts. When I replace these lines in Tokenizer.ts
if (this.isSpecial) { this.state = State.InSpecialTag; this.sequenceIndex = 0; } else { this.state = State.Text; }
with
this.state = State.Text;
it works properly. I'm not sure what the proper fix is, one that will not affect the performance of htmlparser2, so I opened this issue instead.
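To illustrate why that branch could matter, here is a toy model of a tokenizer that switches into a raw-text state after certain tags. It is not htmlparser2's code: `SPECIAL_TAGS` and `tokenize` are hypothetical names, the tag list is an assumption, and whether this is what actually happens for the reported page is not confirmed in this thread. The sketch only shows that if the closing sequence for a special tag is never found, all following markup comes out as one text token.

```javascript
// Toy model only: a drastically simplified tokenizer, NOT htmlparser2's code.
// SPECIAL_TAGS and tokenize() are hypothetical names used for illustration.
const SPECIAL_TAGS = new Set(["script", "style", "title"]);

function tokenize(html) {
  const tokens = [];
  let pos = 0;
  let rawUntil = null; // when set, input up to this closing sequence is text

  while (pos < html.length) {
    if (rawUntil !== null) {
      // Raw-text state: markup is not interpreted; only the closer ends it.
      // Matching is case-sensitive here for brevity.
      const end = html.indexOf(rawUntil, pos);
      const stop = end === -1 ? html.length : end;
      if (stop > pos) tokens.push({ type: "text", value: html.slice(pos, stop) });
      pos = stop;
      rawUntil = null;
      continue;
    }
    if (html[pos] === "<") {
      const close = html.indexOf(">", pos);
      if (close === -1) {
        // Unterminated tag: treat the remainder as text.
        tokens.push({ type: "text", value: html.slice(pos) });
        break;
      }
      const tag = html.slice(pos + 1, close);
      tokens.push({ type: "tag", value: tag });
      const name = tag.split(/\s/)[0];
      if (SPECIAL_TAGS.has(name)) rawUntil = "</" + name;
      pos = close + 1;
    } else {
      const next = html.indexOf("<", pos);
      const stop = next === -1 ? html.length : next;
      tokens.push({ type: "text", value: html.slice(pos, stop) });
      pos = stop;
    }
  }
  return tokens;
}

// A special tag whose closer never appears swallows the markup after it.
console.log(tokenize('<title>x<a href="y">z</a>'));
// The same markup inside an ordinary tag is tokenized tag by tag.
console.log(tokenize('<p>x<a href="y">z</a></p>'));
```

In this model, always returning to the text state (as in the one-line replacement above) would avoid the swallowed markup, but it would also tokenize the contents of script and style elements as markup, which is presumably why the special-tag state exists in the first place.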
The culprit is htmlparser2, if I downgrade to v6.1.0
So ... this bug is in a library used by this repository? If that's the case, what are you expecting me to do here? 🤔