Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fix Issue #1070] Jsoup parse string not skipping BOM character correctly #1073

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

zuozhiw
Copy link

@zuozhiw zuozhiw commented Jun 12, 2018

Issue #1070 shows a case where Jsoup not parsing an HTML page correctly and putting all the information in the body instead of head. The problem can be reproducible only if use Jsoup.parse(response.body()). Using another method Jsoup.connect(url).get() won't reproduce the same problem.

After digging around for a while, it turns out that the HTML page has the BOM character 65279 in its first character. The reason that the second method, Jsoup.connect(url).get(), internally uses DataUtil.parseInputStream(), which handles the BOM character correctly. However, Jsoup.parse(string) constructs a HTMLTreeBuilder to parse the document directly, therefore the BOM handling process is not there.

This PR fixes this issue by introducing the same handling process in TreeBuilder constructor. Before it passes the input reader to the tokenizer, it reuses the same helper function DataUtil.detectCharsetFromBom to detect the BOM character, and skip the BOM character if needed.

This PR also adds a new test case in DataUtilTest to test if Jsoup.parse(string) can work correctly with BOM. After adding the fix, the problem in issue #1070 can produce the correct result.

There are already several efforts to fix the problem caused by the BOM character:
Issue #348 is the first issue mentioning the problem, and it is fixed in commit 3f9f33d , this commit only fixes the issue in parseByteData (which later becomes parseInputStream).

Later there are several commits to refactor the BOM handling code, such as c3cbe1b , 4eb4f2b

The latest issue mentioning the problem is Issue #1003, and the corresponding fix is in 0f7e0cc , however, this commit still only fixes the function parseInputStream, which is not used by Jsoup.parse(string).

@jhy I'm not sure if TreeBuilder class is the best place to put in the fix code. I am willing to devote more efforts to improve it if you could give some more advice. Thanks :)

@panthony
Copy link

Any idea when this could be merged ?

@panthony
Copy link

Using this fix I eventually encountered a nasty issue that made me lost a fair amount of time.

Long story short, I had a piece of code that sometimes where running on a machine that had US-ASCII as default charset instead of UTF-8.

Because of this, the BOM was badly handled was it relies on Charset.defaultCharset().

This is often bad practice to rely on default Locale/Charset, maybe JSoup could take a look at https://github.com/policeman-tools/forbidden-apis (story behind this project: http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html)

@jhy
Copy link
Owner

jhy commented Dec 27, 2018

Hi, thanks for this. I am worried that this only handles the UTF8 BOM, and probably only works if the string was actually decoded as a UTF8 string. It doesn't help (afaict) for UTF16 and 32 BOMs.

My thought is that we should implement so that we expect any input String to have been correctly decoded.

The methods in Connection and Jsoup.load (for file and input stream) correctly handle the different BOMs. The user is being impacted for getting the request body string and that's where the BOM is not handles (and the string is invalid effectively). How about we add the BOM routine to the body() method?

@panthony
Copy link

panthony commented Dec 27, 2018

Not sure which method we are talking about but you cannot convert back the given string to bytes without taking which charset it was since you cannot rely on defaultCharset.

In my case, if JSoup.parse were to take an InputStream instead of a String I could just wrap it inside a DOMInputStream (Apache commons-io).

Maybe this could just be documented:

  • This is the connection layer that handles the BOM implicitly
  • The String input must be cleared from the BOM markup (only the user can do this since he's the only one to know how it is encoded)
  • The JSoup could provide an helper that takes a the necessary information (string, charset) to cleanup the string if necessary

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants