fix write method of file requires byte-like object, not str #1750

Merged (3 commits) on Dec 5, 2017
2 changes: 1 addition & 1 deletion gensim/scripts/segment_wiki.py
@@ -111,7 +111,7 @@ def segment_and_write_all_articles(file_path, output_file, min_article_character
     if output_file is None:
         outfile = sys.stdout
     else:
-        outfile = smart_open(output_file, 'wb')
+        outfile = smart_open(output_file, 'w')
Owner

-1: we should always be writing out bytes, in specific encoding (utf8).

What exactly is the problem/error this is trying to fix?

Contributor Author

I tried to segment a wiki dump and write the results to a file, but I got this error:

Traceback (most recent call last):
  File "P:\Python35\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "P:\Python35\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "p:\_projects\gensim\gensim\scripts\segment_wiki.py", line 319, in <module>
    workers=args.workers
  File "p:\_projects\gensim\gensim\scripts\segment_wiki.py", line 125, in segment_and_write_all_articles
    outfile.write(json.dumps(output_data) + "\n")
  File "P:\Python35\lib\gzip.py", line 258, in write
    data = memoryview(data)
TypeError: memoryview: a bytes-like object is required, not 'str'

I could convert the output to bytes, but sys.stdout requires str, not bytes, and I'd like to keep this flexible approach to writing.

I'm on Python 3.5; on Python 2 everything works.
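
For reference, a minimal sketch of the mismatch described above, assuming Python 3. `gzip.open` stands in for `smart_open` on a `.gz` path (the traceback ends in `gzip.py`), `io.StringIO` stands in for a text stream such as `sys.stdout`, and `try_write` is a hypothetical helper used only for illustration.

```python
import gzip
import io
import json
import os
import tempfile


def try_write(stream, payload):
    """Return None on success, or the TypeError message raised by write()."""
    try:
        stream.write(payload)
        return None
    except TypeError as err:
        return str(err)


line = json.dumps({"title": "Example"}) + "\n"  # a str on Python 3

path = os.path.join(tempfile.mkdtemp(), "out.json.gz")
with gzip.open(path, "wb") as binary_out:
    # A binary stream rejects str and accepts bytes.
    str_into_binary = try_write(binary_out, line)
    bytes_into_binary = try_write(binary_out, line.encode("utf-8"))

text_out = io.StringIO()
# A text stream accepts str and rejects bytes.
str_into_text = try_write(text_out, line)
bytes_into_text = try_write(text_out, line.encode("utf-8"))
```

Neither kind of stream accepts both types, which is why the same `write()` call cannot serve a binary file and `sys.stdout` without an explicit decision about where the encoding happens.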

we should always be writing out bytes, in specific encoding (utf8).

Could you explain why?

Owner

@piskvorky, Dec 3, 2017

Because that's what I/O layers understand: bits and bytes. It makes the logic more explicit and simpler to have "unicode inside" and "bytes on I/O".

I didn't look in detail, but to me that json.dumps() + "\n" looks like a bug. @menshikh-iv shouldn't that be encoded into a bytestring (utf8) before writing to a binary file?

@horpto thanks for pointing this out.
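
A rough sketch of the "unicode inside, bytes on I/O" idea applied to that line, not the code that was eventually merged. `write_json_line` is a hypothetical helper name, and `io.BytesIO` stands in for a file opened in 'wb' mode.

```python
import io
import json


def write_json_line(outfile, record, encoding="utf-8"):
    """Serialize record as one JSON line and write it as encoded bytes."""
    text = json.dumps(record) + "\n"      # unicode inside
    outfile.write(text.encode(encoding))  # bytes on I/O


buffer = io.BytesIO()  # stands in for a file opened in 'wb' mode
write_json_line(buffer, {"title": "Anarchism"})
written = buffer.getvalue()
```

The encoding step is the only place where text becomes bytes, so the output encoding is fixed in one visible spot rather than left to the stream's defaults.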

Contributor

As @horpto pointed out, sys.stdout is opened in 'w' mode, not 'wb' (on Python 3, with encoding='UTF-8' set explicitly).

I can convert this line to bytes explicitly, but then we may have problems with sys.stdout (or we'll need to handle the two cases separately).

@horpto can you test your code with 'wb' mode and explicit conversion for json.dumps() + "\n"? (I can only test this on Linux; encoding problems on Windows sometimes behave in non-obvious ways.)
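
One possible way to avoid splitting the two cases, sketched under assumptions: on Python 3 a text stream such as `sys.stdout` usually exposes its underlying binary stream as `.buffer`, so bytes can be written there. `as_binary` is a hypothetical helper, and `io.TextIOWrapper` over `io.BytesIO` stands in for `sys.stdout` so the example can be checked without touching the real console.

```python
import io


def as_binary(stream):
    """Return the binary layer of a text stream, or the stream itself.

    Python 3 text streams expose .buffer; Python 2 file objects do not
    and already accept byte strings.
    """
    return getattr(stream, "buffer", stream)


raw = io.BytesIO()
text_stream = io.TextIOWrapper(raw, encoding="utf-8")  # stand-in for sys.stdout

out = as_binary(text_stream)
out.write(u"caf\u00e9\n".encode("utf-8"))
out.flush()
captured = raw.getvalue()
```

With this, something like `outfile = as_binary(sys.stdout)` would let the rest of the function write encoded bytes regardless of the destination. A replaced or redirected `sys.stdout` may lack `.buffer`, so this is an assumption to verify rather than a guarantee.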

Contributor Author

@menshikh-iv

can you test your code with 'wb' mode and explicit conversion for json.dumps() + "\n"

It's OK and should work fine on both Python 2 and Python 3.

Because that's what I/O layers understand: bits and bytes. It makes the logic more explicit and simpler to have "unicode inside" and "bytes on I/O".

I think it's useless when we are writing to a text file, since the file object already does the encoding internally. When we are reading content from a file it makes sense, because files can contain garbage.

Owner

@piskvorky, Dec 4, 2017

It's not useless, it's a Python best practice.

Newlines are messed up on Windows, and we always want full control over what we write. For this reason, explicit conversions between strings and bytes are preferred at all I/O boundaries.
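
A small sketch of the newline issue, under one stated assumption: `newline="\r\n"` is passed explicitly to reproduce on any platform the translation that text mode applies by default on Windows. The file name is arbitrary.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "line.txt")

# Text mode: "\n" is translated on write. newline="\r\n" forces the
# translation that the default text mode performs on Windows.
with open(path, "w", encoding="utf-8", newline="\r\n") as text_out:
    text_out.write(u"row\n")
with open(path, "rb") as check:
    text_mode_bytes = check.read()

# Binary mode: exactly the bytes we encoded end up in the file.
with open(path, "wb") as binary_out:
    binary_out.write(u"row\n".encode("utf-8"))
with open(path, "rb") as check:
    binary_mode_bytes = check.read()
```

In text mode the file ends with `\r\n` even though the code wrote `\n`; in binary mode the output is byte-for-byte what was encoded, so line-delimited JSON stays identical across platforms.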

Contributor Author

oh, interesting use case.
ok.


     try:
         article_stream = segment_all_articles(file_path, min_article_character, workers=workers)