[java] Fix errors by collapsed white spaces and <br> #367

Merged (1 commit) on Nov 15, 2023

@kojiishi kojiishi commented Nov 15, 2023

This patch fixes HTMLProcessor raising exceptions in two cases:

  1. The source has collapsible spaces (i.e., consecutive spaces, or leading/trailing spaces).
  2. The source has <br> elements.

The jsoup Element.text() collapses white spaces. This is fixed by replacing it with wholeText() (https://jsoup.org/apidocs/org/jsoup/nodes/Element.html#wholeText()), which doesn't normalize spaces.

wholeText() also returns \n for a <br> element. This is fixed by incrementing scanIndex for <br>.

Fixes #366.
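
For readers who want to see why collapsed whitespace matters, here is a minimal, stdlib-only sketch. It does not call jsoup: `normalized()` is a hypothetical stand-in that collapses whitespace runs and trims, roughly what a normalizing text getter does. The class and method names are illustrative and are not part of BudouX. It shows that an offset computed on the normalized string no longer points at the same character in the raw text, which is one plausible way index-based processing can go out of range.

```java
import java.util.regex.Pattern;

/** Sketch: offsets in whitespace-normalized text drift from offsets in the raw text. */
final class CollapsedTextDemo {
  private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

  /** Hypothetical stand-in for a normalizing getter: collapse runs to one space, then trim. */
  static String normalized(String raw) {
    return WHITESPACE_RUN.matcher(raw).replaceAll(" ").trim();
  }

  public static void main(String[] args) {
    String raw = "  abc   def ";          // what the text nodes actually contain
    String collapsed = normalized(raw);   // "abc def"

    // The same word sits at different offsets in the two strings.
    System.out.println(raw.indexOf("def"));        // 8
    System.out.println(collapsed.indexOf("def"));  // 4

    // The lengths differ too, so a scan that walks the raw text using
    // boundaries computed on the collapsed text ends up misaligned.
    System.out.println(raw.length() + " vs " + collapsed.length()); // 12 vs 7
  }
}
```

A getter that preserves whitespace keeps both views the same length, so offsets computed on one remain valid on the other.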

@kojiishi kojiishi force-pushed the gettext branch 2 times, most recently from 2a2d782 to adfeffa Compare November 15, 2023 11:45
@kojiishi kojiishi marked this pull request as ready for review November 15, 2023 11:49
@kojiishi

PTAL.


@tushuhei tushuhei left a comment

Thanks for the fix!

@tushuhei tushuhei merged commit 152683e into google:main Nov 15, 2023
13 checks passed
@kojiishi kojiishi deleted the gettext branch November 15, 2023 12:26
kojiishi added a commit to kojiishi/budoux that referenced this pull request Nov 15, 2023
This patch replaces the `wholeText()` call introduced in google#367 with a subclass of `NodeVisitor`.

Whether `wholeText()` emits `\n` for `<br>` depends on the jsoup version. To ensure that `getText()` always matches what `resolve()` does, this patch switches to its own text-extraction logic.
tushuhei pushed a commit that referenced this pull request Nov 15, 2023
This patch replaces the `wholeText()` call introduced in #367 with a subclass of `NodeVisitor`.

Whether `wholeText()` emits `\n` for `<br>` depends on the jsoup version. To ensure that `getText()` always matches what `resolve()` does, this patch switches to its own text-extraction logic.
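
The visitor idea in the follow-up commit can be sketched without depending on any particular jsoup version. The code below is not the actual BudouX implementation and does not use jsoup: `Node`, `Visitor`, and `traverse` are hypothetical stand-ins that only mirror the general shape of a `head()`-style visitor. Emitting exactly one `\n` per `<br>` is an assumption made for this sketch. The point is that the extraction logic itself decides what each node contributes, so the result does not vary with library behavior.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of visitor-based text extraction over a tiny stand-in DOM (not jsoup). */
final class TextizeSketch {
  /** Hypothetical minimal node: an element (tag set) or a text node (text set). */
  static final class Node {
    final String tag;   // null for text nodes
    final String text;  // null for elements
    final List<Node> children = new ArrayList<>();

    private Node(String tag, String text) {
      this.tag = tag;
      this.text = text;
    }

    static Node element(String tag, Node... kids) {
      Node n = new Node(tag, null);
      for (Node k : kids) {
        n.children.add(k);
      }
      return n;
    }

    static Node text(String text) {
      return new Node(null, text);
    }
  }

  /** Callback invoked when a node is first entered during traversal. */
  interface Visitor {
    void head(Node node, int depth);
  }

  /** Depth-first traversal in document order. */
  static void traverse(Visitor visitor, Node root, int depth) {
    visitor.head(root, depth);
    for (Node child : root.children) {
      traverse(visitor, child, depth + 1);
    }
  }

  /** Collects text verbatim; emits one '\n' per br element (an assumption of this sketch). */
  static String getText(Node root) {
    StringBuilder out = new StringBuilder();
    traverse((node, depth) -> {
      if (node.text != null) {
        out.append(node.text);       // no whitespace normalization
      } else if ("br".equals(node.tag)) {
        out.append('\n');            // fixed, version-independent choice
      }
    }, root, 0);
    return out.toString();
  }

  public static void main(String[] args) {
    Node p = Node.element("p", Node.text("  a  b"), Node.element("br"), Node.text("c "));
    System.out.println(getText(p).length()); // 9: six characters, one '\n', two characters
  }
}
```

Because each `<br>` contributes a known number of characters, a scanner that walks the same tree can advance its index by the same amount and stay aligned with the extracted text.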
Successfully merging this pull request may close these issues.

[java] HTMLProcessor.getText() collapses whitespaces