-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update word2vec.ipynb #2423
Update word2vec.ipynb #2423
Conversation
Good catch! Congrats on your first contribution to gensim 👍 |
@@ -116,7 +116,7 @@ | |||
" \n", | |||
" def __iter__(self):\n", | |||
" for fname in os.listdir(self.dirname):\n", | |||
" for line in smart_open(os.path.join(self.dirname, fname), 'rb'):\n", | |||
" for line in smart_open(os.path.join(self.dirname, fname), 'r'):\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use an explicit encoding. Opening a file in "text mode", without specifying the encoding, should always be considered an error.
There are two ways to do this:
- either open in binary mode and then decode the strings; or
- open in text mode but provide an explicit
encoding
parameter.
The latter is only available in Python 3, IIRC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opening a file in "text mode", without specifying the encoding, should always be considered an error.
Why should it be considered an error? There is a default encoding that works the vast majority of the time.
The latter is only available in Python 3, IIRC.
We're using smart_open here, so it doesn't matter which Python version we're running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, we back-ported the encoding
parameter into python2, in smart_open
? That's cool, I didn't know (or forgot).
Why should it be considered an error? There is a default encoding that works the vast majority of the time.
It's an error because it consistently leads to errors and subtle bugs (in addition to being unnecessarily vague about the code intent). We never want to rely on any "default encodings", nor encourage such engineering patterns in our examples or own code.
If a piece of code is expecting utf8
(or ascii
or …), it should say so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are dozens of places where we invoke smart_open without specifying the encoding to open text. Should we go through the docs and explicitly specify the encoding in each case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely! Same goes for Gensim or any other of our code or tutorials. I'm not sure when / how such bugs slipped through, but if you find any, please squash them without mercy!
Is it possible that they're somehow related to the previous behaviour of smart_open as "binary by default", which was changed only recently to match the built-in open
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because we never updated the documentation to use the new open
function.
This code sample doesn't work with 'rb' mode. The error would be "TypeError: can't concat str to bytes".