AngleSharp.Dom.DomException: Invalid character detected. #54

Closed
iansmirlis opened this issue Feb 3, 2023 · 3 comments

Comments

@iansmirlis

Exception on URL https://www.capital.gr/story/14138/to-keimeno-tis-kataggelias-kouri-gia-tin-altec

This looks like an AngleSharp bug. Maybe it's worth updating to the latest AngleSharp version?

AngleSharp.Dom.DomException: Invalid character detected.
at AngleSharp.Dom.Element.SetAttribute(String name, String value)
at SmartReader.NodeUtility.SetNodeTag(IElement node, String tag)
at SmartReader.Reader.b__125_0(IElement br)
at SmartReader.NodeUtility.ForEachElement(IHtmlCollection`1 nodeList, Action`1 fn)
at SmartReader.Reader.PrepDocument()
at SmartReader.Reader.Parse()
at SmartReader.Reader.GetArticle()

@gabriele-tomassetti
Member

Thanks for the bug report and the example. It seems related to issue #42, so it is certainly possible, but I cannot reproduce it myself. Are you seeing this with the latest version of the library?

In any case, I introduced a method in the latest commit that should prevent the issue from happening. Let me know if this fixes it for you.

It is not strictly an AngleSharp issue. It seems to be an issue with the HTML standard that will probably never be solved because it would be too much work. It is a good idea to update AngleSharp anyway, so I did that in one of the recent commits.
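
To illustrate the kind of mismatch described above: an HTML parser can accept attribute names that a later `SetAttribute` call rejects as invalid. The sketch below is only an assumption about how such names could be filtered before they are copied to a new element; it is not necessarily what the commit does. `AttributeNames`, `IsSettable` and `Filter` are hypothetical names, and the check uses `XmlConvert.VerifyName` from the .NET base library as a stand-in for the DOM name validation.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class AttributeNames
{
    // Returns true when the name is a valid XML name.
    // Assumption: this approximates the check that makes SetAttribute throw.
    public static bool IsSettable(string name)
    {
        try
        {
            XmlConvert.VerifyName(name);
            return true;
        }
        catch (XmlException)
        {
            return false; // contains a character that is not allowed in a name
        }
        catch (ArgumentException)
        {
            return false; // null or empty name
        }
    }

    // Keeps only the attributes whose names pass the check above.
    public static Dictionary<string, string> Filter(
        IEnumerable<KeyValuePair<string, string>> attributes)
    {
        var safe = new Dictionary<string, string>();
        foreach (var attribute in attributes)
        {
            if (IsSettable(attribute.Key))
            {
                safe[attribute.Key] = attribute.Value;
            }
        }
        return safe;
    }
}
```

Dropping the offending attributes silently is a design choice made for this sketch. Logging the skipped names would make it easier to trace which pages produce them.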

@iansmirlis
Author

Thank you for your detailed explanation. You are right, I cannot reproduce this on this URL anymore, either.

I took it directly from the log files of a spider I wrote using your library. Since there are a few of these exceptions, all very close in time and from the same website, it seems that it was a temporary issue with the responses from that site's web server in the first place. So not a bug but a feature.

Feel free to close this, unless you think it's a good idea to wrap such exceptions inside a more meaningful exception with a more detailed error, e.g. MalformedHtmlException: Invalid html syntax near token [...], line xxx. In any case, I will also log the HTTP response in such cases to give better feedback in the future.
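
A minimal sketch of the wrapping idea, done on the caller's side. `MalformedHtmlException` is the hypothetical type suggested above and does not exist in SmartReader. `SafeParse.Run` is likewise an illustrative helper. It takes any parsing delegate, so the example does not depend on a particular library API, and it catches every exception for brevity.

```csharp
using System;

// Hypothetical exception type, not part of SmartReader or AngleSharp.
public class MalformedHtmlException : Exception
{
    public string SourceUrl { get; }

    public MalformedHtmlException(string sourceUrl, Exception inner)
        : base("Could not process HTML from " + sourceUrl + ": " + inner.Message, inner)
    {
        SourceUrl = sourceUrl;
    }
}

public static class SafeParse
{
    // Runs the parsing delegate and rewraps any failure with the source URL attached.
    // The original exception is preserved as InnerException for the stack trace.
    public static T Run<T>(string sourceUrl, Func<T> parse)
    {
        try
        {
            return parse();
        }
        catch (Exception ex) when (!(ex is MalformedHtmlException))
        {
            throw new MalformedHtmlException(sourceUrl, ex);
        }
    }
}
```

In a spider, the delegate would be the call that currently throws, such as the `GetArticle()` call from the stack trace, and the catch filter could be narrowed to `AngleSharp.Dom.DomException`.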

Thanks

@iansmirlis
Author

OK, I caught a similar case of buggy HTML. I am uploading it for further investigation.

Source URL: https://www.capital.gr/story/72655/tin-kataskeui-neou-agogou-petrelaiou-enekrinan-oi-tourkikes-arxes
Problematic html: error.html.gz

Exception:
AngleSharp.Dom.DomException: Invalid character detected.
at AngleSharp.Dom.Element.SetAttribute(String name, String value)
at SmartReader.NodeUtility.SetNodeTag(IElement node, String tag)
at SmartReader.NodeUtility.ReplaceNodeTags(IHtmlCollection`1 nodeList, String newTagName)
at SmartReader.Reader.Parse()
at SmartReader.Reader.GetArticle()
