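Exporting doc2vec embeddings in plain word2vec format fails when a document title (doctag) contains Unicode characters that cannot be encoded as ASCII. The failure occurs only in Python 2.7, where str(..) on a unicode object implicitly encodes it with the ASCII codec; in Python 3, str is already a Unicode string. The problem is the str(..) call in save_word2vec_format in doc2vec.py.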
Code to Reproduce
Example:
from gensim.models.doc2vec import Doc2Vec
# Load a doc2vec model in which at least one document title is not ASCII-encodable
m = Doc2Vec.load('model')
m.save_word2vec_format('model.out')
Expected Results
Export works
Actual Results
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-3-d12938849014> in <module>()
----> 1 m.save_word2vec_format('model.out')
..../.virtualenvs/gensim27_2.3/local/lib/python2.7/site-packages/gensim-2.3.0-py2.7-linux-x86_64.egg/gensim/models/doc2vec.pyc in save_word2vec_format(self, fname, doctag_vec, word_vec, prefix, fvocab, binary)
848 # store as in input order
849 for i in range(len(self.docvecs)):
--> 850 doctag = prefix + str(self.docvecs.index_to_doctag(i))
851 row = self.docvecs.doctag_syn0[i]
852 if binary:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 8: ordinal not in range(128)
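The failing step can be illustrated without gensim. The sketch below is only an assumption about how the conversion could be made safe, not the library's actual fix: `doctag_to_text` is a hypothetical helper, and the `*dt_` prefix and the sample doctag are illustrative values. It reproduces the implicit ASCII encode that Python 2's `str(..)` performs on a unicode object, then builds the label while keeping the doctag as Unicode text.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import sys

# Unicode text type on both major versions (str on Python 3, unicode on Python 2).
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def doctag_to_text(doctag, encoding="utf-8"):
    """Hypothetical helper: return a doctag as Unicode text.

    Byte strings are decoded (UTF-8 is an assumption); everything else,
    including integer doctags, goes through the Unicode text type, so no
    implicit ASCII encoding takes place.
    """
    if isinstance(doctag, bytes):
        return doctag.decode(encoding)
    return text_type(doctag)


prefix = "*dt_"      # illustrative prefix
doctag = "canción"   # illustrative doctag containing u'\xf3'

# This is what str(u'...') does implicitly on Python 2.7.
try:
    doctag.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Building the label from Unicode text avoids the error.
label = prefix + doctag_to_text(doctag)
```

With this approach the label stays Unicode until it is written out, so the encoding has to be chosen explicitly at write time (for example UTF-8) rather than falling back to ASCII.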
Versions
Linux-4.4.0-59-generic-x86_64-with-Ubuntu-16.04-xenial
('Python', '2.7.12 (default, Nov 19 2016, 06:48:10) \n[GCC 5.4.0 20160609]')
('NumPy', '1.13.1')
('SciPy', '0.19.1')
('gensim', '2.3.0')
('FAST_VERSION', 1)