Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

After applying HAP1.11.57, InnerText cannot be obtained correctly. #531

Closed
104kazu opened this issue Jan 12, 2024 · 11 comments
Closed

After applying HAP1.11.57, InnerText cannot be obtained correctly. #531

104kazu opened this issue Jan 12, 2024 · 11 comments
Assignees

Comments

@104kazu
Copy link

104kazu commented Jan 12, 2024

I wrote the following program

Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
    Dim objDocument As HtmlDocument = New HtmlDocument()
    Using objFileStream = New FileStream(".\TestHtml.txt", FileMode.Open, FileAccess.Read)
        objDocument.Load(objFileStream)
    End Using
    Dim strInnerText = objDocument.DocumentNode.SelectSingleNode("//*[@id='Price']").InnerText
    Dim strInnerText2 = objDocument.DocumentNode.SelectSingleNode("//*[@id='Price']/parent::node()").InnerText
End Sub

TestHtml.txt contains the following.

96,000
送料無料

I ran the program and checked the InnerText variable, but it only has newlines and spaces, and I am not getting 96,000 yen.
If you retrieve from the parent node, as in the variable InnerText2, you will get 96,000 yen.
Also, if you revert the HAP to 1.11.56, you will get 96,000 yen.

Thank you in advance for your investigation.

@104kazu
Copy link
Author

104kazu commented Jan 12, 2024

It is difficult to understand the HTML content, so I will paste in an image.
無題

@elgonzo
Copy link
Contributor

elgonzo commented Jan 12, 2024

FYI, it would be easier for anyone attempting to troubleshoot if you would put the HTML content as text inside a code block using triple back-ticks. By doing so, the HTML code will appear as (copy&paste-able) text like this:

<table>
  <tbody>
    <tr>
      Hello world!
    </tr>
  </tbody>
</table>

@JonathanMagnan JonathanMagnan self-assigned this Jan 12, 2024
@JonathanMagnan
Copy link
Member

Hello @104kazu ,

Could you provide the HTML text in the format that @elgonzo suggested?

So we will make sure that we work on the right issue.

If you could create a Fiddle with it, that will be even better: https://dotnetfiddle.net/25Vjsn

Best Regards,

Jon


Sponsorship
Help us improve this library

Performance Libraries
context.BulkInsert(list, options => options.BatchSize = 1000);
Entity Framework ExtensionsDapper Plus

Runtime Evaluation
Eval.Execute("x + y", new {x = 1, y = 2}); // return 3
C# Eval Function

@JonathanMagnan
Copy link
Member

JonathanMagnan commented Jan 12, 2024

After reviewing your image, you currently have a div in a span.

Indeed starting with the v1.11.57, we automatically close the span if a div is found: d643b0a

However I believe we did indeed the wrong fix on this version. Even if it doesn't make sense, a div look to be able to be in a span tag. We should have instead make a logic that a div cannot be in a p tag (from the original issue).

We will look at it.

Best Regards,

Jon

@104kazu
Copy link
Author

104kazu commented Jan 13, 2024

Hello @elgonzo
Thank you for all the information you have given us.

Hello @jon
Thank you for your research.
Please keep us posted.

@JonathanMagnan
Copy link
Member

Hello @104kazu ,

Just to let you know, it will take more time to fix this issue (well, more of the fix in v1.11.57).

The primary reason is that we only explicitly end a tag depending on the direct parent. However, to make it work, we need to check parents recursively.

The fix by itself is not hard (already coded); however due to HAP having only a few unit tests, it makes it more dangerous, so we need to take our time to make sure that's the direction we want to take for the div and p tag from he issue #529

Best Regards,

Jon

@104kazu
Copy link
Author

104kazu commented Jan 15, 2024

Hello, @jon
Thank you for your research and response.
I understand about the correction taking time.
If there is anything we can do to assist you here, please let us know.

@Dagger55555
Copy link

Dagger55555 commented Jan 17, 2024

As a start, this is the first time I've every had to revert a version of this package so I really can't complain.
That being said, this is a "wow, I can't believe they did this".
I understand that you want to handle badly formed html, but in our case we have lots of "generators" that build the html and we end up with totally balanced tags. However, it also means span, div, and p tags are all nested within each other (style="display=???") making things work properly (and yes I know spans shouldn't contain divs but it works). Giving a higher priority to "bad" html over "valid" wouldn't be my vote as the default operation.

The result is that innerText is now totally different than the previous version.
This upgrade breaks a cardinal rule: don't break anything.

May I suggest, this is a setting that defaults to what it did before so no code changes required.
You could deprecate the current functions, forcing a new parameter to slowly make people switch to the new model and/or think about whether they might be dealing with bad html.

As another option, you could at least count open and close tags and assume if net zero that you can assume things are correct.

@104kazu
Copy link
Author

104kazu commented Jan 19, 2024

Hi, @Dagger55555
As you say, there is the issue of how far to guarantee behavior for inappropriate HTML.
However, I think it is problematic that what worked until at least HAP 1.11.56 will no longer work.
Since the response to HAP 1.11.57 is to deal with inappropriate HTML, I feel that it would be better to revert this response and make this specification work only if some parameters are set at the time of the HAP call.

@JonathanMagnan
Copy link
Member

Hello @104kazu , @Dagger55555 ,

The v1.11.58 has been released (which reverted the change made in v1.11.57)

The v1.11.57 is really a mistake on my part. We try to be very careful in changes we made for not break anything but unfortunately, I misread something that I thought was invalid but was instead valid
(even if that doesn't make sense to me).

Let me know if everything works on this new version.

Best Regards,

Jon

@104kazu
Copy link
Author

104kazu commented Jan 30, 2024

Hello, @jon
We have applied HAP 1.11.58 and have confirmed that the InnerText, which was the problem this time, can be obtained in the same way as before 1.11.56.
Thank you very much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

4 participants