Angle Sharp parsing xml attributes #42

Closed
prestonkell opened this issue Jun 8, 2022 · 3 comments

Comments

@prestonkell

Hi there,
I'm seeing a number of pages that throw an AngleSharp.Dom.DomException ("Invalid character detected.") because AngleSharp validates attribute names with its IsXmlName and IsXmlNameStart methods: https://github.com/AngleSharp/AngleSharp/blob/cdc88b1c0e71476f35fe9405d38b66e33fa1969c/src/AngleSharp/Text/XmlExtensions.cs#L47

The validation is triggered from the SimplifyNestedElements method in Readability when it sets attributes: https://github.com/Strumenta/SmartReader/blob/master/src/SmartReader/Readability.cs#L190

Could this validation be made optional, or could the attribute names be cleaned up beforehand, to avoid these errors? I've seen this with attribute names that start with '@' or other characters that are invalid in XML but accepted in HTML.
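
To illustrate the kind of check involved, here is a minimal Python sketch. It assumes the validation follows the Name production of the XML 1.0 specification; it is not AngleSharp's or SmartReader's code, and the names is_xml_name and keep_xml_safe_attributes are hypothetical. It shows why a name such as '@click' is rejected, and one possible way to skip such attributes instead of throwing.

```python
import re

# Character ranges from the XML 1.0 (Fifth Edition) "NameStartChar" production.
_NAME_START = (
    ":A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF"
    "\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    "\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD"
    "\U00010000-\U000EFFFF"
)

# Extra characters allowed after the first one ("NameChar" production).
_NAME_EXTRA = "\\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

_XML_NAME = re.compile(f"[{_NAME_START}][{_NAME_START}{_NAME_EXTRA}]*")


def is_xml_name(name: str) -> bool:
    """Return True if `name` is a valid XML 1.0 Name (hypothetical helper)."""
    return _XML_NAME.fullmatch(name) is not None


def keep_xml_safe_attributes(attributes: dict) -> dict:
    """Drop attributes whose names would fail XML name validation.

    Hypothetical workaround: filter before copying attributes onto another
    element, rather than letting the setter raise an error.
    """
    return {name: value for name, value in attributes.items() if is_xml_name(name)}


if __name__ == "__main__":
    # Attribute names as an HTML parser might report them for a tag.
    attrs = {"class": "post", "data-id": "7", "@click": "open()", "2col": ""}
    print(keep_xml_safe_attributes(attrs))
```

Dropping the offending attributes is only one option; catching the exception per attribute, or renaming the attribute, would be alternatives with different trade-offs.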

@gabriele-tomassetti
Member

Thanks for the bug report and the investigation regarding the cause. I am going to try solving the issue this weekend.

@gabriele-tomassetti
Member

@prestonkell I cannot reproduce this issue. Could you provide an example HTML page that generates this error?

@prestonkell
Author

Yeah no problem - I get the error with this site: https://www.mumsnet.com/talk/parenting/4301360-second-child-looks-more-like-mum
