Update word2vec.ipynb #2423

asyabo · 2019-03-19T11:29:52Z

This code sample doesn't work with 'rb' mode. The error would be "TypeError: can't concat str to bytes".
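A minimal reproduction of the reported failure on Python 3, assuming the notebook goes on to combine each line with a `str` (the exact statement is not shown in this thread). The file name and the helper `first_line_plus_suffix` are hypothetical, and the built-in `open` stands in for `smart_open`:

```python
import os
import tempfile


def first_line_plus_suffix(path):
    # Hypothetical helper: read the first line in binary mode and append a
    # str suffix, the kind of operation that mixes bytes and str.
    with open(path, "rb") as fin:
        line = next(iter(fin))  # a bytes object, because of 'rb'
    return line + "!"


# Write a small throwaway file to read back.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.txt")
with open(path, "wb") as fout:
    fout.write(b"hello world\n")

try:
    first_line_plus_suffix(path)
    binary_mode_failed = False
except TypeError:
    # Python 3 refuses to concatenate bytes and str.
    binary_mode_failed = True
```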

mpenkov · 2019-04-15T01:36:34Z

Good catch! Congrats on your first contribution to gensim 👍

piskvorky · 2019-04-20T09:43:25Z

docs/notebooks/word2vec.ipynb

@@ -116,7 +116,7 @@
    " \n",
    "    def __iter__(self):\n",
    "        for fname in os.listdir(self.dirname):\n",
-    "            for line in smart_open(os.path.join(self.dirname, fname), 'rb'):\n",
+    "            for line in smart_open(os.path.join(self.dirname, fname), 'r'):\n",


Please use an explicit encoding. Opening a file in "text mode", without specifying the encoding, should always be considered an error.

There are two ways to do this:

- either open in binary mode and then decode the strings; or
- open in text mode but provide an explicit encoding parameter.

The latter is only available in Python 3, IIRC.
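The two options can be sketched as follows. This is an illustration only: it uses `io.open` from the standard library as a stand-in and assumes `smart_open` accepts the same `mode` and `encoding` arguments; the sample file and its contents are made up.

```python
import io
import os
import tempfile

# Create a throwaway UTF-8 file containing non-ASCII characters.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.txt")
with io.open(path, "wb") as fout:
    fout.write(u"na\u00efve caf\u00e9\n".encode("utf8"))

# Option 1: open in binary mode and decode each line explicitly.
with io.open(path, "rb") as fin:
    decoded_lines = [line.decode("utf8") for line in fin]

# Option 2: open in text mode with an explicit encoding.
with io.open(path, "r", encoding="utf8") as fin:
    text_lines = list(fin)
```

Both lists hold the same unicode strings; the difference is only where the encoding is stated.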

> Opening a file in "text mode", without specifying the encoding, should always be considered an error.

Why should it be considered an error? There is a default encoding that works the vast majority of the time.

> The latter is only available in Python 3, IIRC.

We're using smart_open here, so it doesn't matter which Python version we're running.

Oh, we back-ported the encoding parameter into python2, in smart_open? That's cool, I didn't know (or forgot).

> Why should it be considered an error? There is a default encoding that works the vast majority of the time.

It's an error because it consistently leads to errors and subtle bugs (in addition to being unnecessarily vague about the code intent). We never want to rely on any "default encodings", nor encourage such engineering patterns in our examples or own code.
If a piece of code is expecting utf8 (or ascii or …), it should say so.
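A small sketch of the kind of bug meant here, assuming the file holds UTF-8 bytes while the platform default (what `locale.getpreferredencoding(False)` reports) happens to be something else. The sample string is made up:

```python
import locale

# Whatever the current platform would use when no encoding is given.
platform_default = locale.getpreferredencoding(False)

# Bytes as a UTF-8 encoded file would hold them.
data = u"caf\u00e9".encode("utf8")

# Decoding with the intended encoding round-trips cleanly.
right = data.decode("utf8")

# If the default were Latin-1, decoding succeeds but yields mojibake.
wrong = data.decode("latin-1")

# If the default were ASCII, decoding fails outright.
try:
    data.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

The Latin-1 case is the subtle one: nothing raises, and the corrupted text flows on into whatever consumes it.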

There are dozens of places where we invoke smart_open without specifying the encoding to open text. Should we go through the docs and explicitly specify the encoding in each case?

Definitely! Same goes for Gensim or any other of our code or tutorials. I'm not sure when / how such bugs slipped through, but if you find any, please squash them without mercy!
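One possible way to hunt for such calls is a small audit with the standard `ast` module. This is a rough sketch, not an existing Gensim or smart_open tool: the function `calls_missing_encoding` and the set of opener names are hypothetical, it needs Python 3.8+ for `ast.Constant`, and it only understands literal mode strings.

```python
import ast

# Assumed names of the functions worth auditing.
OPENERS = {"open", "smart_open"}


def calls_missing_encoding(source):
    """Return line numbers of opener calls that are not binary and lack encoding=."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Handle both plain names (open) and attributes (io.open).
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name not in OPENERS:
            continue
        # The mode is the second positional argument or the mode= keyword.
        mode = None
        if len(node.args) > 1 and isinstance(node.args[1], ast.Constant):
            mode = node.args[1].value
        for kw in node.keywords:
            if kw.arg == "mode" and isinstance(kw.value, ast.Constant):
                mode = kw.value.value
        # A missing or non-literal mode is conservatively treated as text mode.
        is_binary = isinstance(mode, str) and "b" in mode
        has_encoding = any(kw.arg == "encoding" for kw in node.keywords)
        if not is_binary and not has_encoding:
            missing.append(node.lineno)
    return sorted(missing)


sample = (
    "f = open('a.txt')\n"
    "g = open('b.bin', 'rb')\n"
    "h = smart_open('c.txt', 'r', encoding='utf8')\n"
    "k = smart_open('d.txt', 'r')\n"
)
flagged = calls_missing_encoding(sample)
```

For notebooks, the code cells would first have to be extracted from the `.ipynb` JSON before being passed in as source text.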

Is it possible that they're somehow related to the previous behaviour of smart_open as "binary by default", which was changed only recently to match the built-in open?

No, because we never updated the documentation to use the new open function.

Update word2vec.ipynb

cea658f

mpenkov merged commit ff107d6 into piskvorky:develop Apr 15, 2019

piskvorky reviewed Apr 20, 2019

mpenkov mentioned this pull request Apr 20, 2019: explicitly specify encoding whenever opening text files #2458 (Open)
