Doc2Vec .save_word2vec_format() doesn't save everything. #1110

Closed
melvyniandrag opened this issue Jan 27, 2017 · 5 comments

@melvyniandrag

melvyniandrag commented Jan 27, 2017

When I call model.save_word2vec_format() or model.save(), it seems that only the word vectors are saved. The following code is almost identical to the Wikipedia example in the repo.


from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import multiprocessing
import json
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

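class MyCorpus(object):
    """Streams one TaggedDocument per JSON line of the input file."""
    def __init__(self, file):
        self.file = file
        self.all_text_labels = set()

    def __iter__(self):
        with open(self.file, "r") as fin:
            for line in fin:
                j = json.loads(line)
                doc_id = j["idx"]  # renamed from `id` to avoid shadowing the builtin
                title = j["title"]
                words = j["words"]
                # NOTE: the snippet as posted used an undefined name `brand` here;
                # `title` is assumed to be what was meant.
                label = title + "_" + str(doc_id)
                self.all_text_labels.add(label)
                yield TaggedDocument(words, [label])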

documents = MyCorpus("/home/word2vec/sample.json")
pre = Doc2Vec(min_count=0)
pre.scan_vocab(documents)
cores = multiprocessing.cpu_count()
model = Doc2Vec(dm=0, dbow_words=1, size=10, window=8, min_count=1, iter=10, workers=cores)
model.build_vocab(documents)
model.train(documents)
model.save_word2vec_format("d2v_model")
print(model.docvecs.most_similar(positive=[SOME_DOC_TAG_GOES_HERE])) # This works

I can get most_similar() documents in the same script that trained the model, as above. However, if I reload the model in a different script, I get this error:

  File "<stdin>", line 1, in <module>
  File "/home/word2vec/anaconda3/lib/python3.5/site-packages/gensim/models/doc2vec.py", line 426, in most_similar
    self.init_sims()
  File "/home/word2vec/anaconda3/lib/python3.5/site-packages/gensim/models/doc2vec.py", line 409, in init_sims
    self.doctag_syn0norm = empty(self.doctag_syn0.shape, dtype=REAL)
AttributeError: 'DocvecsArray' object has no attribute 'doctag_syn0'

The other script is:

"""
other_script.py
"""
import gensim
model = gensim.models.Doc2Vec.load_word2vec_format("d2v_model")
model.docvecs.most_similar(positive=[SOME_DOC_TAG_GOES_HERE], topn=4)

I can, however, do

model.most_similar(positive=[SOME_WORD_HERE])

In the current directory I see only one file called d2v_model, and when I open it I see the word vectors. I'm thinking there should be another one called d2v_model.doctag_syn0 or something. Help?

@melvyniandrag
Author

As a follow-up, I verified that only the word vectors are saved.


2017-01-27 14:28:17,639 : INFO : training model with 24 workers on 10543 vocabulary and 10 features, using sg=1 hs=0 sample=0.001 negative=5 window=8

And then

[word2vec@centos7]$ wc -l d2v_model
10544 d2v_model

The vocab size is 10543, and the saved model file has the corresponding number of lines (one per word, plus the header line).
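
The line count can also be checked against the file's own header. A minimal sketch, assuming the file is in the plain-text word2vec format (a header line holding the vocabulary size and the vector size, then one line per word). The helper name `summarize_word2vec_text` is hypothetical and not part of gensim:

```python
def summarize_word2vec_text(path):
    """Return (declared_vocab_size, vector_size, data_line_count) for a
    word2vec text-format file. Hypothetical helper, not a gensim function."""
    with open(path, "r", encoding="utf-8") as fin:
        # Header line: "<vocab_size> <vector_size>"
        header = fin.readline().split()
        declared_vocab_size, vector_size = int(header[0]), int(header[1])

        data_line_count = 0
        for line in fin:
            fields = line.split()
            if not fields:
                continue  # tolerate a trailing blank line
            # Each data line: the token followed by vector_size numbers.
            if len(fields) != vector_size + 1:
                raise ValueError(
                    "unexpected field count on data line %d" % (data_line_count + 1)
                )
            data_line_count += 1
    return declared_vocab_size, vector_size, data_line_count
```

If the declared vocabulary size and the data line count for d2v_model both match the vocabulary size in the training log, every data line is a word vector and none is left over for document tags.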

@tmylk
Contributor

tmylk commented Jan 27, 2017

Hi @melvyniandrag

For save_word2vec_format this is known behaviour. There is an open issue, #699, with a discussion on how to extend this format to include document vectors.

Do you think that changing the docstring to "The word vectors of the model can also be instantiated from an existing file on disk in the word2vec C format. NOTE that it excludes the document vectors::" would make it more clear?

Have you tried the save method? It is expected to save the entire model. We have tests for persistence and I just checked that it works in the latest release.
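
A minimal sketch of the round trip through save() and load(), for comparison with the word2vec-format export. It assumes gensim is installed. The helper name `docvecs_survive_save` is hypothetical, and because the attribute that holds the document vectors has been named differently across gensim releases (`docvecs` in the version discussed here, `dv` in later ones), the sketch checks for both:

```python
def docvecs_survive_save(documents, path):
    """Train a small Doc2Vec model, save() it, load() it back, and report
    whether every document vector is unchanged. Hypothetical helper."""
    # Imported here so the rest of the file does not require gensim.
    from gensim.models.doc2vec import Doc2Vec

    def doc_vectors(m):
        # Assumption: newer gensim exposes `dv`, older releases `docvecs`.
        return m.dv if hasattr(m, "dv") else m.docvecs

    # Passing the corpus to the constructor builds the vocab and trains.
    model = Doc2Vec(documents, dm=0, dbow_words=1, min_count=1, workers=1)
    model.save(path)
    loaded = Doc2Vec.load(path)

    # Compare the vector stored under each document tag before and after.
    tags = [tag for doc in documents for tag in doc.tags]
    return all(
        list(doc_vectors(model)[tag]) == list(doc_vectors(loaded)[tag])
        for tag in tags
    )
```

Called with a small list of TaggedDocument objects and a scratch path, this is expected to return True. A model reloaded from the save_word2vec_format() file has no document vectors to compare in the first place.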

gojomo changed the title from "Doc2Vec .save() doesn't save everything." to "Doc2Vec .save_word2vec_format() doesn't save everything." on Jan 27, 2017
@gojomo
Collaborator

gojomo commented Jan 27, 2017

I, too, believe save() works as intended, saving the whole model. So I've edited the issue title to more specifically describe that it's just save_word2vec_format() that doesn't save the doc-vectors stuff. (If you have an example of save() not working, we can expand the title again.)

@melvyniandrag
Author

Hello tmylk and gojomo,
Sorry about the mistake, I see that save() does work. I don't know what I was thinking.

Also, sorry I didn't see that this was a known issue. I guess this thread can be closed then!

I do like the idea of changing the docstring, and I would change your note to say:

"NOTE document vectors are not saved with .save_word2vec_format(). Use .save() instead"

because this clearly states the functionality and a solution.

@tmylk
Contributor

tmylk commented Jan 31, 2017

Thanks for the comment idea. Fixed in ae04cda
