
Should wholeText() introduce newlines between block elements? #2083

Open
h920526 opened this issue Dec 14, 2023 · 5 comments

Comments


h920526 commented Dec 14, 2023

Hi team,

Jsoup v1.16.1

<div><p>Hello</p><p>World</p></div>

After calling wholeText():

Expected:
Hello
World

Actual:
HelloWorld

The text of the two paragraphs is not separated by a newline.
Thanks

jhy (Owner) commented Dec 14, 2023

This is "as designed" currently - wholeText gets only the non-normalized text values from the elements.

I have considered changing it to emit a newline when encountering a new block tag as that seems more useful.

text() will give you normalized text with a space (not a newline) between the nodes. That's designed for e.g. indexing / searching / extracting.

Would be good to hear opinions from folks on this. It seems safe and information preserving.
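
For anyone who needs this behaviour today, here is a minimal sketch of the idea discussed above: walk the tree, copy each text node unnormalized, and emit a newline when a new block element starts. It is not jsoup's implementation and not a proposed patch. `BlockAwareText` and `blockAwareText` are hypothetical names, and the rule for when to emit the newline (only if the output is non-empty and does not already end in one) is an assumption.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

public class BlockAwareText {

    // Hypothetical helper (not part of jsoup): collects unnormalized text
    // and starts a new line whenever a block element begins.
    public static String blockAwareText(Element root) {
        StringBuilder out = new StringBuilder();
        root.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // Same raw text that wholeText() works from
                    out.append(((TextNode) node).getWholeText());
                } else if (node instanceof Element && ((Element) node).isBlock()) {
                    // Assumed rule: avoid leading and doubled newlines
                    if (out.length() > 0 && out.charAt(out.length() - 1) != '\n') {
                        out.append('\n');
                    }
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        Element body = Jsoup.parse("<div><p>Hello</p><p>World</p></div>").body();
        System.out.println(body.wholeText());      // HelloWorld
        System.out.println(body.text());           // Hello World
        System.out.println(blockAwareText(body));  // Hello, then World on the next line
    }
}
```

Because the sketch only reacts to block elements, inline elements such as span or b do not introduce line breaks, and whitespace inside text nodes is kept exactly as it appears in the source.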

jhy changed the title from "Neighboring <p> elements does not wrap with new line" to "Should wholeText() introduce newlines between block elements?" on Dec 14, 2023
@akashsahu25

Use a <br> tag.

@andyrozman

I tried to use wholeText() as a way to convert HTML to text, but it doesn't really work:
\n characters in the source are not ignored (they should be),
and the resulting text had some odd indentation.

text() is even worse.

Is there another method that converts HTML content to text with better results?

@andyrozman

@h920526 For your case, I think you need to wrap your markup in html and body tags. I needed to do that, so something like this:

<html><body><div><p>Hello</p><p>World</p></div></body></html>

@andyrozman

@jhy It might be useful to have a dedicated method for converting HTML to text. At the moment wholeText() comes closest, but it has problems; see my first comment.

4 participants