[PRED-2644] Fix decoding error in case of dialect detection #161

falkerson · 2019-07-05T13:18:06Z

BATCH SCORING PULL REQUEST

This is a pull request into the public repository for the Batch Scoring script maintained by DataRobot.

RATIONALE

We open the file in binary mode and read the first N bytes to detect its encoding. Later we use these bytes to detect the dialect, but first we decode() them into a string (unicode). Because N is a constant, the read may cut the last multibyte character apart, and decode() then fails on that incomplete character.

CHANGES

As a solution, we read the first N characters instead of N bytes to detect the dialect. Because this happens only once, no significant performance degradation is expected.
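
The failure mode and the fix can be illustrated with a minimal sketch. The sample text, the value of N, and the use of `io.TextIOWrapper` over an in-memory buffer are assumptions made for illustration; they are not taken from `reader.py`.

```python
import io

# Hypothetical sample: each Japanese character takes 3 bytes in UTF-8.
text = u"日本語,データ\n1,2\n"
raw = text.encode("utf-8")

N = 4  # hypothetical sample size; 4 bytes ends inside the second character

# Reading N bytes and decoding them fails when the cut lands mid-character.
try:
    raw[:N].decode("utf-8")
    torn = False
except UnicodeDecodeError:
    torn = True

# Reading N characters from a text-mode stream never splits a character,
# because the decoder consumes as many bytes as each character needs.
with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8") as stream:
    sample = stream.read(N)
```

Here `torn` ends up `True` because the byte slice stops one byte into the second character, while `sample` holds four complete characters.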

TESTING

devexp-slackbot · 2019-07-05T13:18:08Z

JIRA

PRED-2644 - BatchScoring: Error with 'utf8' decoding for Japanese data sets

Jarvis

All executed builds

ikalnytskyi · 2019-07-05T13:59:35Z

datarobot_batch_scoring/reader.py

@@ -432,6 +432,19 @@ def sniff_dialect(sample, encoding, sep, skip_dialect, ui):
    return dialect


+def get_opener_and_mode(is_gz, text=False):


I'm not sure it's going to work for a gzipped Japanese dataset. I think we need some sort of combination of is_gz and text.

In PY3, gzip.open has an encoding parameter. For PY2, please take a look here: https://github.com/datarobot/batch-scoring/pull/161/files#diff-6c240e73b54162e5fce6a481761017a2R527
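
For the PY3 case, a minimal sketch of reading the first characters of a gzipped file in text mode is shown below. The temporary file, the sample text, and the sample size of 5 are assumptions for illustration; the PY2 code path behind the link above is not reproduced here.

```python
import gzip
import os
import tempfile

# Hypothetical gzipped dataset containing multibyte characters.
text = u"日本語,レビュー\n1,2\n"

fd, path = tempfile.mkstemp(suffix=".csv.gz")
os.close(fd)
try:
    with gzip.open(path, "wb") as f:
        f.write(text.encode("utf-8"))

    # Python 3 only: 'rt' mode plus an explicit encoding makes read(n)
    # count characters rather than bytes.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        sample = f.read(5)
finally:
    os.remove(path)
```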

coveralls · 2019-07-15T17:57:18Z

Coverage increased (+0.2%) to 84.24% when pulling 1b56f0e on andriy-popovych/PRED-2644 into 13da21e on master.

ikalnytskyi · 2019-07-17T08:02:00Z

datarobot_batch_scoring/reader.py

+        else:
+            mode = 'rt' if text else 'rb'
+            return (gzip.open, mode)
+    else:


It should work; however, I think the whole else can be written using io:

    mode = 'rt' if text else 'rb'
    return (io.open, mode)

because:

io.open uses universal newlines by default (no need to pass U in the opening mode)

io.open is open in Python 3
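
A minimal sketch of how the helper could look with this suggestion applied. Only the signature `get_opener_and_mode(is_gz, text=False)` and the two return expressions come from the diff; the rest of the body is an assumption, and the PY2-specific gzip branch of the real function is deliberately left out.

```python
import gzip
import io


def get_opener_and_mode(is_gz, text=False):
    """Sketch only: return an (opener, mode) pair for the input file.

    Assumption: Python 3 behaviour; the PY2 gzip handling from the
    actual pull request is not reproduced here.
    """
    mode = 'rt' if text else 'rb'
    if is_gz:
        return (gzip.open, mode)
    # io.open applies universal newlines in text mode by default,
    # so no 'U' flag is needed in the mode string.
    return (io.open, mode)
```

A caller would then do something like `opener, mode = get_opener_and_mode(is_gz, text=True)` followed by `opener(path, mode)`, passing an encoding where text mode is used.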

ikalnytskyi

LGTM. I'd just add a bunch of tests to cover gzipped versions of datasets, ensure that encoding and encoding sniffing work fine with them, and maybe add a test Japanese dataset too.

ikalnytskyi · 2019-07-19T11:26:13Z

tests/test_functional.py

+    actual = out.read_text('utf-8')
+    with open('tests/fixtures/jpReview_books_reg_out.csv', 'rU') as f:
+        expected = f.read()
+    assert str(actual) == str(expected), expected


How does str work on Python 2? I mean, .read_text should return unicode, and casting unicode to str should probably fail o_O I'd expect to see six.text_type here instead of str.

It doesn't matter in the case of the output. Here we have:
row_id,0.0,1.0
0,1.0,0.0
1,1.0,0.0
2,1.0,0.0

so it can be converted as ASCII.
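
The point of this exchange can be sketched as follows. On Python 2, calling str() on a unicode object encodes it with the default ASCII codec; the sketch performs that encoding explicitly so that it also runs on Python 3. The Japanese string is a hypothetical input added only for contrast.

```python
# Output like the one quoted above contains only ASCII characters.
ascii_only = u"row_id,0.0,1.0\n0,1.0,0.0\n"
# Hypothetical non-ASCII text, for contrast.
japanese = u"日本語\n"

# ASCII-only text survives the conversion.
ascii_bytes = ascii_only.encode("ascii")

# Non-ASCII text does not, which is the failure the reviewer expected.
try:
    japanese.encode("ascii")
    japanese_ok = True
except UnicodeEncodeError:
    japanese_ok = False
```

This is why the comparison in the test works for the scored output, and why six.text_type would be the safer choice if the output ever contained non-ASCII text.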

falkerson force-pushed the andriy-popovych/PRED-2644 branch from fc417b9 to 4a676da on July 5, 2019 13:20

ikalnytskyi reviewed Jul 5, 2019

falkerson force-pushed the andriy-popovych/PRED-2644 branch 3 times, most recently from 78b621c to bd1eb21 on July 15, 2019 11:51

falkerson force-pushed the andriy-popovych/PRED-2644 branch from bd1eb21 to f38605f on July 15, 2019 14:33

fix tests

7679fcd

falkerson added the 00 - Ready for Review label Jul 15, 2019

devexp-slackbot bot added the Needs Review: Predictions label Jul 15, 2019

datarobotspy requested a review from tsh July 16, 2019 15:55

devexp-slackbot bot removed the Needs Review: Predictions label Jul 16, 2019

tsh previously approved these changes Jul 16, 2019

ikalnytskyi reviewed Jul 17, 2019

ikalnytskyi previously approved these changes Jul 17, 2019

falkerson dismissed stale reviews from ikalnytskyi and tsh via f0be1b8 July 19, 2019 10:15

add tests

1b56f0e

falkerson force-pushed the andriy-popovych/PRED-2644 branch from f0be1b8 to 1b56f0e on July 19, 2019 10:46

ikalnytskyi reviewed Jul 19, 2019

ikalnytskyi approved these changes Jul 19, 2019

falkerson merged commit 95ff508 into master Jul 19, 2019

devexp-slackbot bot removed the 00 - Ready for Review label Jul 19, 2019

falkerson deleted the andriy-popovych/PRED-2644 branch July 19, 2019 15:27
