AngleSharp.Dom.DomException: Invalid character detected. #54
Comments
Thanks for the bug report and the example. It seems related to issue #42, so it is certainly possible, but I cannot reproduce it myself. Are you seeing this with the latest version of the library? In any case, I introduced a method in the latest commit that should prevent the issue from happening. Let me know if this fixes it for you. It is not strictly an AngleSharp issue; it seems to be an issue with the HTML standard that will probably never be solved because it would be too much work. Updating AngleSharp is a good idea anyway, so I did that in one of the recent commits.
Thank you for your detailed explanation. You are right, I cannot reproduce this on that URL anymore either. I took it directly from the log files of a spider I wrote using your library. Since there are only a few of these exceptions, all very close together in time and all from the same website, it seems to have been a temporary issue with the responses from that site's web server in the first place. So not a bug but a feature. Feel free to close this, unless you think it's a good idea to wrap such exceptions inside a more meaningful exception with a more detailed error, for example MalformedHtmlException: Invalid html syntax near token [...], line xxx. In any case, I will also log the HTTP response in such cases to give better feedback in the future. Thanks
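The wrapping idea suggested above could look roughly like the sketch below. It is only an illustration: `MalformedHtmlException`, its `SourceUrl` property, and the `SafeParsing.Run` helper are hypothetical names that are not part of AngleSharp or SmartReader, and the lambda in `Main` stands in for a real parsing call. It uses only the .NET base class library, so a real caller would probably narrow the catch to `AngleSharp.Dom.DomException`.

```csharp
using System;

// Hypothetical exception type, named after the suggestion in this thread.
// It is NOT part of AngleSharp or SmartReader.
public sealed class MalformedHtmlException : Exception
{
    // URL of the page that failed, so log entries can be traced to a response.
    public string SourceUrl { get; }

    public MalformedHtmlException(string sourceUrl, Exception inner)
        : base($"Could not process HTML from {sourceUrl}: {inner.Message}", inner)
    {
        SourceUrl = sourceUrl;
    }
}

public static class SafeParsing
{
    // Runs a parsing delegate and rethrows any failure with the URL attached.
    // The original exception is kept as InnerException, so no detail is lost.
    public static T Run<T>(string sourceUrl, Func<T> parse)
    {
        try
        {
            return parse();
        }
        catch (Exception ex) when (!(ex is MalformedHtmlException))
        {
            throw new MalformedHtmlException(sourceUrl, ex);
        }
    }
}

public static class Program
{
    public static void Main()
    {
        try
        {
            // Stand-in for a real parsing call; it simply simulates a failure.
            SafeParsing.Run<string>(
                "https://example.com/page",
                () => throw new InvalidOperationException("Invalid character detected."));
        }
        catch (MalformedHtmlException ex)
        {
            Console.WriteLine(ex.Message);
            Console.WriteLine(ex.SourceUrl);
        }
    }
}
```

Keeping the original exception as the inner exception means the existing stack trace remains available, while the outer message carries the URL needed to match the failure to a logged HTTP response.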
OK, I caught a similar case of buggy HTML and am uploading it for further investigation.
Source URL: https://www.capital.gr/story/72655/tin-kataskeui-neou-agogou-petrelaiou-enekrinan-oi-tourkikes-arxes
Exception:
Exception on URL https://www.capital.gr/story/14138/to-keimeno-tis-kataggelias-kouri-gia-tin-altec
This seems like an AngleSharp bug; maybe it's worth updating to the latest AngleSharp version?
AngleSharp.Dom.DomException: Invalid character detected.
at AngleSharp.Dom.Element.SetAttribute(String name, String value)
at SmartReader.NodeUtility.SetNodeTag(IElement node, String tag)
at SmartReader.Reader.b__125_0(IElement br)
at SmartReader.NodeUtility.ForEachElement(IHtmlCollection`1 nodeList, Action`1 fn)
at SmartReader.Reader.PrepDocument()
at SmartReader.Reader.Parse()
at SmartReader.Reader.GetArticle()