Site is not parsed properly #1

Closed
ivanlabsii opened this issue May 14, 2017 · 11 comments

@ivanlabsii

ivanlabsii commented May 14, 2017

This page is not parsed like in browsers: http://chandra.harvard.edu/photo/2015/ngc6388/

I believe the cause is the 'learn' tags that are not closed, which in turn causes their parent tags to be removed.

@JonathanMagnan JonathanMagnan self-assigned this May 15, 2017
@JonathanMagnan
Member

Hello @ivanicin ,

Thank you for reporting.

We will try to investigate this issue over the weekend and come back with a solution next Monday.

Best Regards,

Jonathan

@JonathanMagnan
Member

Hello @ivanicin ,

We started to investigate the issue.

The current problem is that the "p" tag is not handled like a normal tag. See: https://github.com/zzzprojects/html-agility-pack/blob/master/src/HtmlAgilityPack/HtmlNode.cs#L118

Something that bothers me a lot is the following comment in the source:

<p>bla<p>bla will be transformed into <p>bla<p>bla and not <p>bla></p><p>bla</p> or <p>bla<p>bla</p></p>

I would have expected the result to be <p>bla</p><p>bla</p>, certainly not for the "p" tags to remain unchanged!

A p tag cannot contain another p tag, so a new p tag should automatically close the previous one.
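As a rough illustration of the behaviour expected here (in HTML, a p element's end tag may be omitted when another p follows, so the new start tag implies the missing end tag), below is a minimal sketch. It is not HtmlAgilityPack code: it is written in Python with only the standard library so it can run on its own, the names `PAutoCloser` and `normalize_p` are hypothetical, and it only models the p-followed-by-p case (comments are dropped, and other elements that would also close a p are ignored).

```python
from html import escape
from html.parser import HTMLParser


class PAutoCloser(HTMLParser):
    """Toy re-serializer: closes an open <p> when a new <p> starts."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.p_open = False  # True while a <p> is still waiting for </p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self.p_open:
                self.out.append("</p>")  # implied end tag for the previous <p>
            self.p_open = True
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/> is passed through untouched.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "p":
            if not self.p_open:
                return  # stray </p>: dropped in this toy model
            self.p_open = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again on the way out.
        self.out.append(escape(data, quote=False))

    def close(self):
        super().close()  # flush any buffered text first
        if self.p_open:
            self.out.append("</p>")  # close a <p> left open at end of input
            self.p_open = False


def normalize_p(html: str) -> str:
    parser = PAutoCloser()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(normalize_p("<p>bla<p>bla"))
```

With the input from the source comment, `<p>bla<p>bla`, this sketch produces two sibling paragraphs rather than leaving the markup unchanged or nesting one p inside the other.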

Anyway, since the learn tag is unclosed and the p tag doesn't work like other tags, the learn tag gets closed against its closest parent, which is the div element instead of the p tag.

That's what is currently happening.

We are continuing our investigation.

Best Regards,

Jonathan

@ivanlabsii
Author

ivanlabsii commented May 21, 2017 via email

@JonathanMagnan
Member

Hello @ivanicin ,

We haven't successfully fixed the issue yet, but we will release a new version today.

In this version, an option will be added to make the p tag act like a div or span tag.

Until we rewrite the parser, I believe this will be a good alternative for you.

This version should be available in a few hours.

Best Regards,

Jonathan

JonathanMagnan pushed a commit that referenced this issue May 21, 2017
Issue #1 && #3
@JonathanMagnan
Member

Hello @ivanicin ,

Version v1.5.0-beta3 has been released:
https://www.nuget.org/packages/HtmlAgilityPack/1.5.0-beta3

We added a temporary fix; you can read about it in more detail here:
https://github.com/zzzprojects/html-agility-pack/releases/tag/v1.5.0-beta3

Could you confirm whether this temporary fix is working?

Best Regards,

Jonathan

@JonathanMagnan
Member

Closing comment: No confirmation, feel free to re-open if you still have the issue.

@ghost

ghost commented Jul 28, 2017

Hi @JonathanMagnan,

I believe I'm seeing a related issue in 1.5.1: when it encounters a p element with no content, HAP removes the closing p tag. For example:
Starting HTML: <p style="font-size:1px;"></p><p>test</p>
HTML after HAP: <p style="font-size:1px;"><p>test</p>

Using HtmlAgilityPack.HtmlDocument.DisableBehavaiorTagP = true; seems to resolve this issue. Should we expect to keep using that option in future releases?
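One library-independent way to spot this kind of regression is to compare end-tag counts before and after a round trip. The sketch below is only an illustration in Python's standard library, not part of HtmlAgilityPack; `EndTagCounter` and `lost_end_tags` are hypothetical names, and the two strings are the ones quoted above.

```python
from collections import Counter
from html.parser import HTMLParser


class EndTagCounter(HTMLParser):
    """Counts how many times each end tag appears in a document."""

    def __init__(self):
        super().__init__()
        self.ends = Counter()

    def handle_endtag(self, tag):
        self.ends[tag] += 1


def count_end_tags(html: str) -> Counter:
    counter = EndTagCounter()
    counter.feed(html)
    counter.close()
    return counter.ends


def lost_end_tags(before: str, after: str) -> Counter:
    # Counter subtraction keeps only tags whose count dropped.
    return count_end_tags(before) - count_end_tags(after)


before = '<p style="font-size:1px;"></p><p>test</p>'
after = '<p style="font-size:1px;"><p>test</p>'
print(dict(lost_end_tags(before, after)))
```

A non-empty result means the output has fewer closing tags of some kind than the input. That is a cheap signal worth attaching to a bug report, though it says nothing about where in the document the tag was lost.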

@JonathanMagnan
Copy link
Member

Hello @chrisnelsondotca ,

I'm currently looking at how to fix all issues of this kind. People report them with p, a, etc.

The current library doesn't seem to work well when the HTML is badly formed around certain kinds of tags.

By the way, please report an issue or ask a question in a new thread; it makes it easier on my side to follow them. For example, since this thread is closed, there is little chance that I will reply here once it is fixed.

Best Regards,

Jonathan

@JonathanMagnan
Member

You can easily reference an issue by simply adding the issue number...

Ex: Issue #1

@ghost

ghost commented Jul 28, 2017

Hi @JonathanMagnan thank you, I appreciate the response. Should I create a new issue referencing this one then?

@JonathanMagnan
Member

Yes, please ;) That way this issue can stay closed.

JonathanMagnan pushed a commit that referenced this issue Aug 1, 2023
XPathAttribute.NodeReturnType related tests and minor adjustments in HtmlNode.Encapsulator.cs
JonathanMagnan pushed a commit that referenced this issue Dec 21, 2023
HtmlDocument.CreateComment without spaces