Unicode encode problem when saving doc2vec models in plain word2vec format. #1543

Closed
englhardt opened this issue Aug 21, 2017 · 0 comments

Description

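Exporting doc2vec embeddings in plain word2vec format fails when a document title (used as a doctag) contains Unicode characters that cannot be encoded as ASCII. The failure occurs only on Python 2.7: there, calling str() on a unicode object implicitly encodes it with the ASCII codec, whereas in Python 3 str is already a Unicode type. The culprit is the str(...) call in save_word2vec_format in doc2vec.py.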
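The mechanism can be illustrated without gensim. The following is a minimal sketch, not the reporter's data: the tag `documentó` is hypothetical, and the helper `implicit_str_py2` is an illustrative stand-in that mimics what Python 2's str() does to a unicode object, so that the snippet behaves the same on Python 3.

```python
# Hypothetical doctag containing a non-ASCII character (u'\xf3' is "ó").
# This value is an assumption for illustration, not taken from the model.
doctag = u"document\xf3"


def implicit_str_py2(value):
    """Mimic Python 2's str(unicode_obj), which encodes with the ASCII codec."""
    return value.encode("ascii")


try:
    implicit_str_py2(doctag)
    error = None
except UnicodeEncodeError as exc:
    # The ASCII codec cannot represent u'\xf3', so the conversion fails.
    error = exc

print(error)
```

Any tag made up solely of ASCII characters passes through the same call unchanged, which is why the problem only surfaces for some models.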

Code to Reproduce

Example:

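from gensim.models.doc2vec import Doc2Vec

# Load a doc2vec model in which at least one document title is not ASCII-encodable
m = Doc2Vec.load('model')
# Export in plain word2vec format; this call raises UnicodeEncodeError on Python 2.7
m.save_word2vec_format('model.out')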

Expected Results

The export succeeds, including for document titles that contain non-ASCII characters.

Actual Results

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-3-d12938849014> in <module>()
----> 1 m.save_word2vec_format('model.out')

..../.virtualenvs/gensim27_2.3/local/lib/python2.7/site-packages/gensim-2.3.0-py2.7-linux-x86_64.egg/gensim/models/doc2vec.pyc in save_word2vec_format(self, fname, doctag_vec, word_vec, prefix, fvocab, binary)
    848                 # store as in input order
    849                 for i in range(len(self.docvecs)):
--> 850                     doctag = prefix + str(self.docvecs.index_to_doctag(i))
    851                     row = self.docvecs.doctag_syn0[i]
    852                     if binary:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 8: ordinal not in range(128)
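One possible direction for a fix is to build the exported tag with Unicode string formatting instead of str(). The sketch below is an assumption about how this could look, not gensim's actual code: `doctag_to_text` is a hypothetical helper, the default prefix `*dt_` is assumed here rather than confirmed by the report, and byte-string tags are assumed to be UTF-8 encoded.

```python
def doctag_to_text(doctag, prefix=u"*dt_"):
    """Return prefix + doctag as text without an implicit ASCII encode.

    Hypothetical helper for illustration; not part of gensim.
    """
    if isinstance(doctag, bytes):
        # Assumption: byte-string tags are UTF-8 encoded.
        doctag = doctag.decode("utf-8")
    # Formatting into a unicode template keeps the result as text
    # on both Python 2 and Python 3, and also handles integer tags.
    return prefix + u"%s" % (doctag,)


print(repr(doctag_to_text(u"document\xf3")))
print(repr(doctag_to_text(7)))
```

Under this approach the text would still have to be encoded explicitly (for example as UTF-8) at the point where it is written to the output file.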

Versions

Linux-4.4.0-59-generic-x86_64-with-Ubuntu-16.04-xenial
('Python', '2.7.12 (default, Nov 19 2016, 06:48:10) \n[GCC 5.4.0 20160609]')
('NumPy', '1.13.1')
('SciPy', '0.19.1')
('gensim', '2.3.0')
('FAST_VERSION', 1)
