Site is not parsed properly #1
Comments
Hello @ivanicin, Thank you for reporting. We will try to investigate this issue during the weekend and come back with a solution next Monday. Best Regards, Jonathan
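Hello @ivanicin,
We started to investigate the issue.
The current problem is that the "p" tag is not handled like a normal tag. See: https://github.com/zzzprojects/html-agility-pack/blob/master/src/HtmlAgilityPack/HtmlNode.cs#L118
Something that bothers me a lot is the following comment:
<p>bla<p>bla will be transformed into <p>bla<p>bla and not <p>bla></p><p>bla</p> or <p>bla<p>bla</p></p>
I would have expected the result <p>bla</p><p>bla</p>, certainly not for the "p" tags to remain unchanged!
A p tag cannot contain another p tag, so it should automatically close the previous one.
Anyway, since the learn tag is unclosed and the p tag doesn't work like other tags, the learn tag is closed at the closest parent, which is the div element instead of the p tag.
That's what is currently happening.
We are continuing our investigation.
Best Regards,
Jonathan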
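The expected behavior described in this comment, where a new p tag closes the one that is still open, can be sketched independently of HtmlAgilityPack. The following is a minimal illustration in Python using only the standard library `html.parser` module; the names `ParagraphCloser` and `close_paragraphs` are hypothetical, and this is not how HtmlAgilityPack is implemented. It only covers the "p follows p" case and drops comments and doctype declarations.

```python
from html.parser import HTMLParser


class ParagraphCloser(HTMLParser):
    """Re-emit HTML, closing an open <p> before the next <p> starts.

    Illustrative sketch only: comments and doctypes are dropped, and
    character references in text are decoded by the parser.
    """

    def __init__(self):
        super().__init__()
        self.out = []        # pieces of the rewritten document
        self.p_open = False  # True while a <p> is waiting for its </p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self.p_open:
                # A p cannot contain another p: close the previous one.
                self.out.append("</p>")
            self.p_open = True
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "p":
            if not self.p_open:
                return  # ignore a stray </p>
            self.p_open = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def close(self):
        super().close()  # flush any buffered text first
        if self.p_open:
            self.out.append("</p>")
            self.p_open = False


def close_paragraphs(html):
    parser = ParagraphCloser()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Under these assumptions, `close_paragraphs("<p>bla<p>bla")` produces two sibling paragraphs rather than leaving the input unchanged.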
Hi,
thanks for your feedback. For now I have found a workaround (detecting the problem and using a web browser to get the properly parsed content in that case), which isn't perfect: for example, Apple TV has no browser, and this also slows everything down.
Best regards,
Ivan Icin
Labsii
www.labsii.com
From: Jonathan Magnan
Sent: Saturday, May 20, 2017 4:33 PM
To: zzzprojects/html-agility-pack
Cc: ivanicin ; Mention
Subject: Re: [zzzprojects/html-agility-pack] Site is not parsed properly (#1)
Hello @ivanicin ,
We started to investigate the issue.
The current problem is the "p" is not handled like a normal tag. See: https://github.com/zzzprojects/html-agility-pack/blob/master/src/HtmlAgilityPack/HtmlNode.cs#L118
Something that bothers me a lot is the following comment:
<p>bla<p>bla will be transformed into <p>bla<p>bla and not <p>bla></p><p>bla</p> or <p>bla<p>bla</p></p>
I would have expected the result <p>bla</p><p>bla</p>, certainly not for the "p" tags to remain unchanged!
A p tag cannot contain another p tag, so it should automatically close the previous one.
Anyway, since the learn tag is unclosed and the p tag doesn't work like other tags, the learn tag is closed at the closest parent, which is the div element instead of the p tag.
That's what is currently happening.
We are continuing our investigation.
Best Regards,
Jonathan
Hello @ivanicin, We haven't successfully fixed the issue yet, but today we will release a new version. In this version, an option will be added to make the […]. Until we rewrite the parser, I believe this will be a good alternative for you. This version should be available in a few hours. Best Regards, Jonathan
Hello @ivanicin, v1.5.0-beta3 has been released: […]. We added a temporary fix; you can read about it in more detail here: […]. Could you confirm whether this temporary fix is working? Best Regards, Jonathan
Closing comment: No confirmation, feel free to re-open if you still have the issue.
Hi @JonathanMagnan, I believe I'm seeing a related issue in 1.5.1: when encountering a P block with no content, HAP removes the closing P. For example: […]. Using HtmlAgilityPack.HtmlDocument.DisableBehavaiorTagP = true; seems to resolve this issue. Should we expect to continue to use that option in future releases?
Hello @chrisnelsondotca, I'm currently looking at how to fix all issues of this kind. People report them with […]. The current library doesn't seem to work well when the HTML is badly formatted with certain kinds of tags. By the way, please report an issue or ask a question in a new thread; it makes it easier for me to follow them. For example, since this thread is closed, there is little chance that I will reply here once it's fixed. Best Regards, Jonathan
You can easily reference an issue by simply adding the issue number... Ex: Issue #1
Hi @JonathanMagnan thank you, I appreciate the response. Should I create a new issue referencing this one then?
Please yes ;) So this issue could stay closed.
This page is not parsed the way browsers parse it: http://chandra.harvard.edu/photo/2015/ngc6388/
I believe the cause is the 'learn' tags that are not closed (which then causes their parent tags to be removed).
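To check whether a page contains unclosed tags like the ones suspected here, a rough detector could look like the sketch below. It is written in Python with the standard library `html.parser` module and is not part of HtmlAgilityPack; the names `UnclosedTagFinder` and `find_unclosed_tags` and the sample markup in the usage note are hypothetical, not taken from the reported page. Note that it also reports elements whose end tag is legitimately optional in HTML (such as p or li), so its output is only a hint.

```python
from html.parser import HTMLParser

# Elements that never take an end tag in HTML.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class UnclosedTagFinder(HTMLParser):
    """Collect start tags that never receive a matching end tag."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements, innermost last
        self.unclosed = []  # elements skipped over by an outer end tag

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing syntax, nothing left open

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag, ignore it
        # Everything opened after the matching start tag was left unclosed.
        while self.stack:
            top = self.stack.pop()
            if top == tag:
                break
            self.unclosed.append(top)


def find_unclosed_tags(html):
    finder = UnclosedTagFinder()
    finder.feed(html)
    finder.close()
    # Elements still open at end of input are unclosed as well.
    return finder.unclosed + finder.stack[::-1]
```

For a made-up fragment such as `<div><p>text <learn>more</p></div>`, the function reports the learn element, which is the kind of signal a workaround could use to decide when to fall back to another parser.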