[PRED-2644] Fix decoding error in case of dialect detection #161
Conversation
Force-pushed fc417b9 to 4a676da (Compare)
datarobot_batch_scoring/reader.py (Outdated)
@@ -432,6 +432,19 @@ def sniff_dialect(sample, encoding, sep, skip_dialect, ui):
    return dialect


def get_opener_and_mode(is_gz, text=False):
I'm not sure it's going to work for a gzipped Japanese dataset. I think we need some sort of combination of is_gz and text.
In PY3, gzip.open has an encoding parameter. For PY2, please take a look here: https://github.com/datarobot/batch-scoring/pull/161/files#diff-6c240e73b54162e5fce6a481761017a2R527
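A hedged sketch of how the two flags might combine, following up on this thread. Assumptions: six is available for the version check, the helper name and return shape mirror the diff above, and the PY2 branch (io.BufferedReader plus io.TextIOWrapper, a common workaround because PY2's GzipFile lacks read1) is illustrative rather than the PR's actual code:

import gzip
import io

import six


def get_opener_and_mode(is_gz, text=False):
    # Sketch only: pick an (opener, mode) pair for the (is_gz, text) combination.
    if is_gz:
        if six.PY2 and text:
            # PY2's gzip.open has no encoding parameter, so layer text
            # decoding on top of the binary stream instead.
            def opener(name, mode, encoding=None):
                binary = io.BufferedReader(gzip.open(name, 'rb'))
                return io.TextIOWrapper(binary, encoding=encoding)
            return (opener, 'rt')
        # PY3's gzip.open accepts 'rt' plus an encoding argument directly.
        return (gzip.open, 'rt' if text else 'rb')
    # Plain files: io.open is portable across PY2/PY3 (see the io comment below).
    return (io.open, 'rt' if text else 'rb')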
Force-pushed 78b621c to bd1eb21 (Compare)
Force-pushed bd1eb21 to f38605f (Compare)
datarobot_batch_scoring/reader.py (Outdated)
        else:
            mode = 'rt' if text else 'rb'
            return (gzip.open, mode)
    else:
It should work; however, I think the whole else branch can be written using io:

mode = 'rt' if text else 'rb'
return (io.open, mode)

because:
- io.open uses universal newlines by default (no need to pass U in the opening mode)
- io.open is open in Python 3
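Both bullet points can be checked with a short, self-contained snippet (the file name is made up):

import io

# Create a file with Windows line endings.
with io.open('example.csv', 'wb') as f:
    f.write(b'a,b\r\n1,2\r\n')

# newline=None (the default) enables universal newlines: '\r\n' and '\r'
# are translated to '\n' on read, so the legacy 'U' mode flag is not needed.
# On Python 3 the builtin open *is* io.open.
with io.open('example.csv', 'rt', encoding='utf-8') as f:
    assert f.read() == u'a,b\n1,2\n'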
LGTM. I'd just add a bunch of tests to cover gzipped versions of the datasets, ensure encoding detection and encoding sniffing work fine with them, and maybe add a Japanese test dataset too.
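One possible shape for such a test, sketched with pytest's tmpdir fixture; the fixture data and assertions are hypothetical and only exercise the gzip round-trip, not the repo's actual reader:

# -*- coding: utf-8 -*-
import gzip


def test_gzipped_japanese_roundtrip(tmpdir):
    rows = u'\u30ec\u30d3\u30e5\u30fc,1.0\n\u672c,0.0\n'  # Japanese sample rows
    path = str(tmpdir.join('jp.csv.gz'))
    with gzip.open(path, 'wb') as f:
        f.write(rows.encode('shift_jis'))
    # A real test would feed `path` through the batch scoring reader and
    # check the sniffed encoding/dialect; here we only assert the fixture
    # decodes back cleanly.
    with gzip.open(path, 'rb') as f:
        assert f.read().decode('shift_jis') == rows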
Force-pushed f0be1b8 to 1b56f0e (Compare)
actual = out.read_text('utf-8')
with open('tests/fixtures/jpReview_books_reg_out.csv', 'rU') as f:
    expected = f.read()
assert str(actual) == str(expected), expected
How does str work on Python 2? I mean, .read_text should return unicode, and casting a unicode to str should probably fail o_O I'd expect to see six.text_type here instead of str.
It doesn't matter in the case of this output. Here we have:
row_id,0.0,1.0
0,1.0,0.0
1,1.0,0.0
2,1.0,0.0
so it can be converted as ASCII.
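Spelling that out (a PY2-specific illustration; the values are copied from the output above):

# On Python 2, str(u'...') implicitly encodes with the ascii codec, so it
# succeeds exactly when every character is ASCII, as in this output file.
value = u'row_id,0.0,1.0\n0,1.0,0.0\n'
assert str(value) == value.encode('ascii')

# A non-ASCII value (e.g. Japanese text) would raise UnicodeEncodeError,
# which is the hazard the six.text_type suggestion guards against.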
BATCH SCORING PULL REQUEST
This is a pull request into a public repository for the Batch Scoring script maintained by DataRobot.
RATIONALE
We open a file in binary mode and read some N bytes to detect the encoding. Later we use these bytes to detect the dialect, but before that we decode() them into a string (unicode). Because N is a fixed number of bytes, the read may tear the last multi-byte character apart, and decode() then cannot identify that incomplete character.
CHANGES
As a solution, we read the first N characters instead of N bytes to detect the dialect. Because this is called only once, no significant performance degradation is expected.
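A small self-contained demonstration of the failure and the fix (the N values here are made up; the real reader uses a much larger sample):

# -*- coding: utf-8 -*-
text = u'\u30ec\u30d3\u30e5\u30fc'   # u'レビュー': 4 characters, 3 bytes each in UTF-8
data = text.encode('utf-8')          # 12 bytes total

# Reading a fixed N bytes can tear the last character apart...
sample = data[:7]                    # 2 whole characters plus 1 stray byte
try:
    sample.decode('utf-8')
except UnicodeDecodeError:
    print('a truncated multi-byte character breaks decode()')

# ...whereas slicing the first N *characters* of the decoded text cannot:
chars = data.decode('utf-8')[:3]
print(chars)                         # -> u'レビュ', always whole characters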
TESTING