Online word2vec test #778

Closed
wants to merge 31 commits into from
Conversation

isomap
Contributor

@isomap isomap commented Jul 8, 2016

I made small fixes to zachmayer's and rutum's online word2vec code, referring to #700.
I also wrote test code to check that online word2vec works correctly.
My test strategy is as follows:

  1. Check the vocabulary size of the initially trained model.
  2. Check that the vocabulary size increases after online training.
  3. Check that each sampling method (hs, neg) changes the embedded vectors through online training.

On Travis CI, my tests pass except on Python 2.6. Do you have any idea how to solve this?

tmylk and others added 24 commits November 5, 2015 19:07
recovering lost work

updating bug in sentences iterator

vector freeze after each training iteration

cbow update

clean up code

update summarization tutorial image

resolving merge conflicts

resolving merge conflicts

resolving merge conflicts

resolving merge conflicts

resolving merge conflicts

resolving merge conflicts

update_weights function change

updating the numpy copy code

updating the numpy copy code
@zachmayer

You're getting one test failure on Python 2.6: https://travis-ci.org/RaRe-Technologies/gensim/jobs/143268922

======================================================================
FAIL: Test that the algorithm is able to add new words to the
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/RaRe-Technologies/gensim/gensim/test/test_word2vec.py", line 121, in testOnlineLearning
    self.assertTrue(numpy.allclose(model_hs.syn0, orig0hs))
AssertionError

@isomap
Contributor Author

isomap commented Jul 12, 2016

I think it's just a probabilistic issue, so I increased the number of iterations to make the test pass.
The test passes in my local environment and on Travis CI for my fork, but not in this pull request :(
(https://travis-ci.org/isohyt/gensim/builds/144057596)

@isomap
Contributor Author

isomap commented Jul 12, 2016

Wow! I found many mistakes in the online word2vec code.
I will fix them and write additional test code.

@isomap
Contributor Author

isomap commented Jul 12, 2016

My fixes are summarized as follows:

  1. Increment the counts of words already in the vocabulary. The previous code only added new vocabulary during online training and did not continue counting previously seen words.
  2. Fix the loop. The for loop at line 1046 never executes because len(self.vocab) and len(newsyn0) are the same, so I changed this logic.
  3. Correctly initialize syn1 and syn1neg. In the online phase, the previous code changed only the last element of syn1 and syn1neg, so I added code to initialize syn1 and syn1neg for the newly added vectors.

Separately, I decided to relax the similarity check between the words 'war' and 'terrorism'. In most runs this similarity is above 0.8 to 0.9, but it sometimes drops dramatically. I think this happens because of the small sample size.
For a practical evaluation, it's time to train online w2v on big data.
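
The three fixes listed above can be illustrated with a minimal pure-Python sketch. This is not the code from this PR: it assumes a vocabulary stored as a plain dict and weight matrices stored as lists of lists, and the names `update_vocab` and `extend_weights` are hypothetical, not gensim functions. The initialization scheme (small random input vectors, zero output vectors) is also an assumption made for illustration.

```python
import random


def update_vocab(vocab, sentences):
    """Merge word counts from `sentences` into `vocab` and return the new words."""
    added = []
    for sentence in sentences:
        for word in sentence:
            if word in vocab:
                # Fix 1: keep counting words that are already registered.
                vocab[word]["count"] += 1
            else:
                vocab[word] = {"count": 1, "index": len(vocab)}
                added.append(word)
    return added


def extend_weights(syn0, syn1neg, vocab, vector_size, rng):
    """Append one row per new word, leaving the existing rows untouched."""
    old_size = len(syn0)
    # Fix 2: loop from the old matrix size up to the new vocabulary size;
    # looping between two lengths that are already equal would do nothing.
    for _ in range(old_size, len(vocab)):
        syn0.append([(rng.random() - 0.5) / vector_size for _ in range(vector_size)])
        # Fix 3: every new word gets its own zero-initialized output row.
        syn1neg.append([0.0] * vector_size)


rng = random.Random(0)
vocab, syn0, syn1neg = {}, [], []

# Initial vocabulary build (hypothetical toy corpus).
update_vocab(vocab, [["graph", "trees"], ["graph", "minors"]])
extend_weights(syn0, syn1neg, vocab, 4, rng)
rows_before = [row[:] for row in syn0]

# Online update with one known word and one new word.
added = update_vocab(vocab, [["graph", "intelligence"]])
extend_weights(syn0, syn1neg, vocab, 4, rng)

print(vocab["graph"]["count"], added, len(syn0))
```

In this sketch the count of a known word keeps growing across updates, the rows of previously trained words are preserved, and each new word receives its own freshly initialized rows.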

@tmylk tmylk mentioned this pull request Aug 23, 2016
@rutum

rutum commented Aug 23, 2016

I am going to leave this link here: http://rutumulkar.com/blog/2015/word2vec

This is the original code for Online Word2Vec, which does not face any of the vocab issues that you have discussed.

@isomap
Contributor Author

isomap commented Aug 24, 2016

I mean that the original online w2v code does not increment the counts of words that are already in the vocabulary.
In that case, the following test cannot pass.

from gensim.models import word2vec

sentences = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey']
]

new_sentences = [
    ['computer', 'artificial', 'intelligence'],
    ['artificial', 'trees'],
    ['human', 'intelligence'],
    ['artificial', 'graph'],
    ['intelligence'],
    ['artificial', 'intelligence', 'system']
]

# 'graph' occurs 3 times in sentences and once in new_sentences
len(list(filter(lambda x: x == 'graph', sum(sentences, []))))  # == 3
len(list(filter(lambda x: x == 'graph', sum(new_sentences, []))))  # == 1
...

model = word2vec.Word2Vec(sentences, min_count=0)
# assertEqual compares the values; assertTrue(x, 3) would only check that x is truthy
self.assertEqual(model.vocab['graph'].count, 3)
model.build_vocab(new_sentences, update=True)
self.assertEqual(model.vocab['graph'].count, 4)

We expect the count of the word 'graph' to be 4 after the online build_vocab, but the original code doesn't account for that.
My implementation is in lines 607 to 611 of word2vec.py. (22dab54)
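
The expected counts in the test above can be checked independently of gensim. The following is a small standard-library sketch, not part of the PR; the variable names `initial_counts`, `new_counts`, and `expected_counts` are chosen here for illustration.

```python
from collections import Counter
from itertools import chain

sentences = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey'],
]

new_sentences = [
    ['computer', 'artificial', 'intelligence'],
    ['artificial', 'trees'],
    ['human', 'intelligence'],
    ['artificial', 'graph'],
    ['intelligence'],
    ['artificial', 'intelligence', 'system'],
]

# Count every token in each corpus.
initial_counts = Counter(chain.from_iterable(sentences))
new_counts = Counter(chain.from_iterable(new_sentences))

# Adding the two Counters gives the counts an online vocabulary update
# should arrive at if it keeps counting previously seen words.
expected_counts = initial_counts + new_counts

print(initial_counts['graph'], new_counts['graph'], expected_counts['graph'])
```

A word such as 'intelligence', which appears only in `new_sentences`, has a count of zero in `initial_counts` and picks up its full count from the update.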

@isomap
Contributor Author

isomap commented Sep 4, 2016

I've written an online word2vec tutorial using two different Wikipedia dumps.
Please check it out :)

ns = elem.find(ns_path).text
if filter_namespaces and ns not in filter_namespaces:
    text = None
# ns = elem.find(ns_path).text
Contributor

why is this commented out?

@isohyt "I've written online word2vec tutorial using 2 different wikipedia dumps"
Can you please provide the link to the tutorial? Thanks!

@isomap
Contributor Author

isomap commented Sep 7, 2016

@boqrat The link is here.
However, this is a beta version of the doc, so please keep an eye out for updates.

@boghrati

boghrati commented Sep 7, 2016

@isohyt Thank you for the link. Have you tried saving the oldmodel, loading it, and then continuing training? I'm getting a vector of all zeros if I load a previous model; however, if I follow your instructions, I get meaningful results.

@isomap
Contributor Author

isomap commented Sep 8, 2016

@boqrat In my environment, I loaded oldmodel successfully.
online_w2v_oldmodel

@boghrati

boghrati commented Sep 8, 2016

@isohyt I don't have a problem with loading the model. But when I do load a model, I have issues with retraining it. Here is a sample of my code. If I run the following code, the vector representations of the new words in inputtext2 are NOT zero.
import gensim

# Train on the first corpus, then extend the vocabulary and train on the second.
model = gensim.models.Word2Vec()
model.build_vocab(inputtext1)
model.train(inputtext1)
model.build_vocab(inputtext2, update=True)
model.train(inputtext2)

But if I save and load the model, as in the following code, the vector representations of the new words are zero.
# Same steps, but the model is saved and reloaded before the online update.
model = gensim.models.Word2Vec()
model.build_vocab(inputtext1)
model.train(inputtext1)
model.save("model1")
model = gensim.models.Word2Vec.load("model1")
model.build_vocab(inputtext2, update=True)
model.train(inputtext2)

Let me know if you encounter the same problem or if I'm missing anything in my code. Thanks!

@boghrati

boghrati commented Sep 8, 2016

I downloaded the latest version and it's working fine now! Thank you for your help:) This is a really useful feature!

@isomap
Contributor Author

isomap commented Sep 9, 2016

@boqrat Thanks for letting me know. By the way, I will make a small fix later, so please check my future updates :)

@isomap
Contributor Author

isomap commented Sep 9, 2016

@tmylk I've finished fixing the code you pointed out.
Once it passes your review, I will close this PR and open a new PR to consolidate the commits.

@isomap
Contributor Author

isomap commented Sep 9, 2016

In addition, I'll write a new tutorial ASAP.

@tmylk
Contributor

tmylk commented Sep 27, 2016

@isohyt Please fix the merge conflicts, and then I will be happy to merge this useful feature.

@isomap isomap mentioned this pull request Sep 28, 2016
@isomap
Contributor Author

isomap commented Sep 28, 2016

I've updated the tutorial to use delta Wikipedia, and I've rebased this PR to fix the merge conflict.

@isomap isomap closed this Sep 28, 2016
@ghost

ghost commented Oct 7, 2016

Just two quick questions: 1. Are you still using the weight freezing technique that Rutum applied? 2. Is the accuracy still dropping?

@@ -38,7 +38,7 @@ class LeeCorpus(object):
    def __iter__(self):
        with open(datapath('lee_background.cor')) as f:
Owner

@tmylk Should be smart_open (we want to use smart_open consistently across gensim).
