Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standardize reading from IO like objects, including StringIO #1897

Merged
merged 11 commits into from
Apr 23, 2019

Conversation

jvshahid
Copy link
Member

The main objective of this PR is to fix #1888, i.e. handling of StringIO when parsing HTML documents. There are other commits to refactor the JRuby code and remove any duplicate code. The two areas were we had duplicate code were:

  1. Determining if the object being parsed was an IO-like or String object. This is already known since the entry points from Ruby is either read_memory or read_io.
  2. Determining the encoding of the object. We were doing this in multiple places and had multiple helper methods to find the encoding. Most of that code is now consolidated in ParserContext.java
  3. Letting the parser know about the encoding. We were doing this in multiple places. This is now consolidated into one place ParserContext.java.

flavorjones and others added 9 commits April 7, 2019 11:31
because implementing read like this can result in nondeterministic
behavior. see related #1821 and #1888.
related to incomplete application of fix from #1124
We don't have to figure out the encoding again.  This was already figured out
in the Ruby code.

fixes #1888
We know data can only be a String or IO like object.
The tests are passing without it.
This commit also consolidates some of the encoding handling logic that was
repeated in multiple places.
ext/java/nokogiri/HtmlDocument.java Show resolved Hide resolved
ext/java/nokogiri/HtmlDocument.java Show resolved Hide resolved
ext/java/nokogiri/XmlDocument.java Show resolved Hide resolved
ext/java/nokogiri/XmlDocument.java Show resolved Hide resolved
@codeclimate
Copy link

codeclimate bot commented Apr 20, 2019

Code Climate has analyzed commit bf2c547 and detected 6 issues on this pull request.

Here's the issue category breakdown:

Category Count
Duplication 6

The test coverage on the diff in this pull request is 100.0% (80% is the threshold).

This pull request will bring the total coverage in the repository to 93.5%.

View more on Code Climate.

@jvshahid
Copy link
Member Author

FYI, the failures in CI are reproducible on my local dev machine after running the following to change the locale:

export LANG=C.UTF-8
export LANGUAGE=
export LC_CTYPE="C.UTF-8"
export LC_NUMERIC="C.UTF-8"
export LC_TIME="C.UTF-8"
export LC_COLLATE="C.UTF-8"
export LC_MONETARY="C.UTF-8"
export LC_MESSAGES="C.UTF-8"
export LC_PAPER="C.UTF-8"
export LC_NAME="C.UTF-8"
export LC_ADDRESS="C.UTF-8"
export LC_TELEPHONE="C.UTF-8"
export LC_MEASUREMENT="C.UTF-8"
export LC_IDENTIFICATION="C.UTF-8"
export LC_ALL=

Prior to setting those env vars, locale output was:

locale
LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
LC_NUMERIC=en_US.UTF-8
LC_TIME=en_US.UTF-8
LC_COLLATE=en_US.UTF-8
LC_MONETARY=en_US.UTF-8
LC_MESSAGES=en_US.UTF-8
LC_PAPER=en_US.UTF-8
LC_NAME=en_US.UTF-8
LC_ADDRESS=en_US.UTF-8
LC_TELEPHONE=en_US.UTF-8
LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=en_US.UTF-8
LC_ALL=

and after:

locale
/usr/bin/locale: Cannot set LC_CTYPE to default locale: No such file or directory
/usr/bin/locale: Cannot set LC_MESSAGES to default locale: No such file or directory
/usr/bin/locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=C.UTF-8
LC_CTYPE=C.UTF-8
LC_NUMERIC=C.UTF-8
LC_TIME=C.UTF-8
LC_COLLATE=C.UTF-8
LC_MONETARY=C.UTF-8
LC_MESSAGES=C.UTF-8
LC_PAPER=C.UTF-8
LC_NAME=C.UTF-8
LC_ADDRESS=C.UTF-8
LC_TELEPHONE=C.UTF-8
LC_MEASUREMENT=C.UTF-8
LC_IDENTIFICATION=C.UTF-8
LC_ALL=

The file.encoding property could have an invalid encoding name because it is
not normalized, as opposed to Charset.defaultCharset which is.
@flavorjones
Copy link
Member

@jvshahid This changeset is fabulous. Thank you for doing all this work. Merging and queueing up for v1.11.0.

@flavorjones flavorjones merged commit a224c71 into master Apr 23, 2019
@flavorjones flavorjones deleted the 1888-jruby-html-read-io branch April 23, 2019 01:36
@flavorjones flavorjones added this to the v1.11.0 milestone Apr 23, 2019
flavorjones added a commit that referenced this pull request Apr 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Need help parsing a standard nginx directory listing. Different results with ruby and jruby.
2 participants