Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update word2vec.ipynb #2423

Merged
merged 1 commit into from
Apr 15, 2019
Merged

Update word2vec.ipynb #2423

merged 1 commit into from
Apr 15, 2019

Conversation

asyabo
Copy link
Contributor

@asyabo asyabo commented Mar 19, 2019

This code sample doesn't work with 'rb' mode. The error would be "TypeError: can't concat str to bytes".

@mpenkov
Copy link
Collaborator

mpenkov commented Apr 15, 2019

Good catch! Congrats on your first contribution to gensim 👍

@mpenkov mpenkov merged commit ff107d6 into piskvorky:develop Apr 15, 2019
@@ -116,7 +116,7 @@
" \n",
" def __iter__(self):\n",
" for fname in os.listdir(self.dirname):\n",
" for line in smart_open(os.path.join(self.dirname, fname), 'rb'):\n",
" for line in smart_open(os.path.join(self.dirname, fname), 'r'):\n",
Copy link
Owner

@piskvorky piskvorky Apr 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use an explicit encoding. Opening a file in "text mode", without specifying the encoding, should always be considered an error.

There are two ways to do this:

  1. either open in binary mode and then decode the strings; or
  2. open in text mode but provide an explicit encoding parameter.

The latter is only available in Python 3, IIRC.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opening a file in "text mode", without specifying the encoding, should always be considered an error.

Why should it be considered an error? There is a default encoding that works the vast majority of the time.

The latter is only available in Python 3, IIRC.

We're using smart_open here, so it doesn't matter which Python version we're running.

Copy link
Owner

@piskvorky piskvorky Apr 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, we back-ported the encoding parameter into python2, in smart_open? That's cool, I didn't know (or forgot).

Why should it be considered an error? There is a default encoding that works the vast majority of the time.

It's an error because it consistently leads to errors and subtle bugs (in addition to being unnecessarily vague about the code intent). We never want to rely on any "default encodings", nor encourage such engineering patterns in our examples or own code.
If a piece of code is expecting utf8 (or ascii or …), it should say so.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are dozens of places where we invoke smart_open without specifying the encoding to open text. Should we go through the docs and explicitly specify the encoding in each case?

Copy link
Owner

@piskvorky piskvorky Apr 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely! Same goes for Gensim or any other of our code or tutorials. I'm not sure when / how such bugs slipped through, but if you find any, please squash them without mercy!

Is it possible that they're somehow related to the previous behaviour of smart_open as "binary by default", which was changed only recently to match the built-in open?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because we never updated the documentation to use the new open function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants