Doc2Vec .save_word2vec_format() doesn't save everything. #1110

melvyniandrag · 2017-01-27T18:59:38Z

When I call model.save_word2vec_format() or model.save(), it seems that only the word vector information is saved. The following code is almost identical to the Wikipedia code in the repo.


from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import multiprocessing
import json
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

class MyCorpus(object):
    def __init__(self, file):
        self.file = file
        self.all_text_labels = set()
    def __iter__(self):
        with open(self.file, "r") as fin:
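            for line in fin:
                j = json.loads(line)
                id = j["idx"]
                title = j["title"]
                words = j["words"]
                # Assumes the label is built from the title (the posted snippet referenced an undefined `brand`).
                label = title + "_" + str(id)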
                self.all_text_labels.add(label)
                yield TaggedDocument(words, [label])

documents = MyCorpus("/home/word2vec/sample.json")
pre = Doc2Vec(min_count=0)
pre.scan_vocab(documents)
cores = multiprocessing.cpu_count()
model = Doc2Vec(dm=0, dbow_words=1, size=10, window=8, min_count=1, iter=10, workers=cores)
model.build_vocab(documents)
model.train(documents)
model.save_word2vec_format("d2v_model")
print(model.docvecs.most_similar(positive=[SOME_DOC_TAG_GOES_HERE])) # This works

I can get most_similar() documents in the same script that trained the model, as above. However, I get this error:

  File "<stdin>", line 1, in <module>
  File "/home/word2vec/anaconda3/lib/python3.5/site-packages/gensim/models/doc2vec.py", line 426, in most_similar
    self.init_sims()
  File "/home/word2vec/anaconda3/lib/python3.5/site-packages/gensim/models/doc2vec.py", line 409, in init_sims
    self.doctag_syn0norm = empty(self.doctag_syn0.shape, dtype=REAL)
AttributeError: 'DocvecsArray' object has no attribute 'doctag_syn0'

This happens if I reload the model in a different script, i.e.:

"""
other_script.py
"""
import gensim
model = gensim.models.Doc2Vec.load_word2vec_format("d2v_model")
model.docvecs.most_similar(positive=[SOME_DOC_TAG_GOES_HERE], topn=4)

I can, however, do

model.most_similar(positive=[SOME_WORD_HERE])

In the current directory I see only one file called d2v_model, and when I open it I see the word vectors. I'm thinking there should be another one called d2v_model.doctag_syn0 or something. Help?


melvyniandrag · 2017-01-27T19:31:52Z

As a follow-up, I verified that only the word vectors are saved.


2017-01-27 14:28:17,639 : INFO : training model with 24 workers on 10543 vocabulary and 10 features, using sg=1 hs=0 sample=0.001 negative=5 window=8

And then

[word2vec@centos7]$ wc -l d2v_model
10544 d2v_model

The vocab size is 10543, and the saved model file has the corresponding number of lines (plus the header).
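
The "vocabulary plus one header line" count can be reproduced with a small sketch of the word2vec text layout: a header holding the vocabulary size and vector dimensionality, then one line per word. The writer below is a hypothetical stand-in written for illustration, not gensim's implementation, and the vectors are made up. The point is that document vectors have no place in this layout.

```python
import io


def write_word2vec_text(word_vectors, out):
    """Hypothetical helper: write vectors in the word2vec C text layout."""
    dims = len(next(iter(word_vectors.values())))
    # Header line: "<vocab size> <vector dimensionality>"
    out.write(f"{len(word_vectors)} {dims}\n")
    # One line per word: the word followed by its vector components
    for word, vec in word_vectors.items():
        out.write(word + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")


# Made-up toy data, for illustration only
word_vectors = {"apple": [0.1, 0.2], "car": [0.3, 0.4], "fruit": [0.5, 0.6]}
doc_vectors = {"doc_0": [0.7, 0.8]}  # never passed to the writer

buf = io.StringIO()
write_word2vec_text(word_vectors, buf)
lines = buf.getvalue().splitlines()
print(len(lines))  # vocabulary size plus one header line
```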

tmylk · 2017-01-27T21:00:45Z

Hi @melvyniandrag

For save_word2vec_format this is a known behaviour. There is an open issue #699 and discussion on how to extend this format to include document vectors.

Do you think that changing the docstring to "The word vectors of the model can also be instantiated from an existing file on disk in the word2vec C format. NOTE that it excludes the document vectors::" would make it clearer?

Have you tried the save method? It is expected to save the entire model. We have tests for persistence and I just checked that it works in the latest release.

gojomo · 2017-01-27T23:08:25Z

I, too, believe save() works as intended, saving the whole model. So I've edited the issue title to more specifically describe that it's just save_word2vec_format() that doesn't save the doc-vectors stuff. (If you have an example of save() not working, we can expand the title again.)

melvyniandrag · 2017-01-30T15:13:33Z

Hello tmylk and gojomo,
Sorry about the mistake, I see that save() does work. I don't know what I was thinking.

Also, sorry I didn't see that this was a known issue. I guess this thread can be closed then!

I do like the idea of changing the docstring, and I would change your note to say:

"NOTE document vectors are not saved with .save_word2vec_format(). Use .save() instead"

because this clearly states the functionality and a solution.

tmylk · 2017-01-31T15:09:44Z

Thanks for the comment idea. Fixed in ae04cda

gojomo changed the title ~~Doc2Vec .save() doesn't save everything.~~ Doc2Vec .save_word2vec_format() doesn't save everything. Jan 27, 2017

melvyniandrag closed this as completed Jan 30, 2017
