Online word2vec test #778

isomap · 2016-07-08T08:33:41Z

I made a small fix to zachmayer's and rutum's online word2vec code, referring to #700.
I also wrote test code to check that online word2vec works correctly.
My test strategy is as follows:

Check the size of the vocabulary in the initially trained model.
After online training, check that the size of the vocabulary has increased.
Check that each sampling method (hs, neg) changes the embedded vectors through online training.

On Travis CI, my tests pass except on Python 2.6. Do you have any idea how to solve this?

recovering lost work updating bug in sentences iterator vector freeze after each training iteration cbow update clean up code update summarization tutorial image resolving merge conflicts resolving merge conflicts resolving merge conflicts resolving merge conflicts resolving merge conflicts resolving merge conflicts update_weights function change updating the numpy copy code updating the numpy copy code


zachmayer · 2016-07-11T16:26:30Z

You're getting one test failure on python 2.6: https://travis-ci.org/RaRe-Technologies/gensim/jobs/143268922

======================================================================
FAIL: Test that the algorithm is able to add new words to the
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/RaRe-Technologies/gensim/gensim/test/test_word2vec.py", line 121, in testOnlineLearning
    self.assertTrue(numpy.allclose(model_hs.syn0, orig0hs))
AssertionError

isomap · 2016-07-12T03:12:07Z

I think this is just a probabilistic failure, so I increased the number of iterations to make the test pass.
The test passes in my local environment and on Travis CI for my fork, but not in this pull request :(
(https://travis-ci.org/isohyt/gensim/builds/144057596)

isomap · 2016-07-12T04:57:39Z

Wow! I found a lot of mistakes in the online word2vec code.
I will fix them and write additional test code.

isomap · 2016-07-12T10:13:06Z

My fixes are summarized as follows:

Increase the count of words already in the vocabulary. The previous code only added new vocabulary during online training and did not keep counting previously seen words.
Fix the loop. Line 1046 never executes the for loop because len(self.vocab) and len(newsyn0) are the same, so I changed this logic.
Correctly initialize syn1 and syn1neg. In the online phase, the previous code only changed the last element of syn1 and syn1neg, so I added code to initialize syn1 and syn1neg for every newly added vector.

In addition, I decided to lower the threshold used when checking the similarity between the words 'war' and 'terrorism'. In most runs this similarity is above 0.8 to 0.9, but it sometimes drops dramatically. I think this happens because of the small sample size.
For a practical evaluation, it is time to train online w2v on big data.
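The loop and initialization fixes described above can be illustrated with a minimal sketch in plain Python. This is not the code from word2vec.py: `grow_weights` is a hypothetical helper, the weights are plain lists rather than numpy arrays, and the initialization (small random values for the input vectors, zeros for the output weights) is assumed from the usual word2vec convention. The point is that the loop runs over the number of new words explicitly, and every new row gets initialized, not only the last one.

```python
import random


def grow_weights(syn0, syn1neg, n_new, vector_size, seed=0):
    """Append rows for n_new words without touching existing rows (hypothetical helper)."""
    rng = random.Random(seed)
    for _ in range(n_new):
        # Input vectors: small random values scaled by the vector size.
        syn0.append([(rng.random() - 0.5) / vector_size for _ in range(vector_size)])
        # Output weights: a zero row for every new word, not only the last one.
        syn1neg.append([0.0] * vector_size)
    return syn0, syn1neg


syn0 = [[0.1, 0.2], [0.3, 0.4]]        # two words that were already trained
syn1neg = [[0.5, 0.6], [0.7, 0.8]]
old_syn0 = [row[:] for row in syn0]    # copy, to check the old rows are preserved

grow_weights(syn0, syn1neg, n_new=3, vector_size=2)
```

Iterating over `n_new` avoids depending on a length comparison between the vocabulary and the weight matrix, which is what kept the original loop from running.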

rutum · 2016-08-23T16:15:06Z

I am going to leave this link here: http://rutumulkar.com/blog/2015/word2vec

This is the original code for Online Word2Vec, which does not face any of the vocab issues that you have discussed.

isomap · 2016-08-24T06:06:18Z

I mean that the original online w2v code does not increment the counts of words that are already in the vocabulary.
In that case, the following test cannot pass.

sentences = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey']
]

new_sentences = [
    ['computer', 'artificial', 'intelligence'],
    ['artificial', 'trees'],
    ['human', 'intelligence'],
    ['artificial', 'graph'],
    ['intelligence'],
    ['artificial', 'intelligence', 'system']
]

# Occurrences of 'graph' in each corpus
len(list(filter(lambda x: x == 'graph', sum(sentences, []))))  # == 3
len(list(filter(lambda x: x == 'graph', sum(new_sentences, []))))  # == 1
...

model = word2vec.Word2Vec(sentences, min_count=0)
# assertEqual compares the values; assertTrue(x, 3) would treat 3 as the failure message
self.assertEqual(model.vocab['graph'].count, 3)
model.build_vocab(new_sentences, update=True)
self.assertEqual(model.vocab['graph'].count, 4)

We expect the count of the word 'graph' to be 4 after the online build_vocab, but the original code does not handle that.
My implementation is at lines 607 to 611 of word2vec.py. (22dab54)
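The expected behaviour can be sketched independently of gensim. The following is a minimal illustration, not the implementation in word2vec.py: `update_vocab_counts` is a hypothetical helper, and the vocabulary is modelled as a plain dict from word to count. Words that are already known have their counts incremented, and unseen words are added.

```python
from collections import Counter


def update_vocab_counts(vocab_counts, new_sentences):
    """Merge word counts from new_sentences into vocab_counts (hypothetical helper).

    Returns the newly added words in first-seen order.
    """
    new_counts = Counter(word for sentence in new_sentences for word in sentence)
    added = []
    for word, count in new_counts.items():
        if word in vocab_counts:
            vocab_counts[word] += count   # keep counting a known word
        else:
            vocab_counts[word] = count    # register a new word
            added.append(word)
    return added


sentences = [['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]
new_sentences = [['artificial', 'graph'], ['artificial', 'intelligence', 'system']]

vocab_counts = dict(Counter(w for s in sentences for w in s))
added = update_vocab_counts(vocab_counts, new_sentences)
```

With this reduced corpus, `vocab_counts['graph']` goes from 3 to 4, which mirrors the assertion in the test above.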


isomap · 2016-09-04T05:02:24Z

I've written an online word2vec tutorial using two different Wikipedia dumps.
Please check it out :)

tmylk · 2016-09-06T08:17:23Z

gensim/corpora/wikicorpus.py

-            ns = elem.find(ns_path).text
-            if filter_namespaces and ns not in filter_namespaces:
-                text = None
+            # ns = elem.find(ns_path).text


Why is this commented out?

@isohyt "I've written online word2vec tutorial using 2 different wikipedia dumps"
Can you please provide the link to the tutorial? Thanks!

isomap · 2016-09-07T05:11:19Z

@boqrat The link is here.
However, this is a beta version of the doc, so please follow the updates.

boghrati · 2016-09-07T17:44:35Z

@isohyt Thank you for the link. Have you tried saving the oldmodel, loading it, and then continuing training? I'm getting a vector of all zeros if I load a previous model; however, if I follow your instructions it gives me meaningful results.

isomap · 2016-09-08T17:02:11Z

@boqrat In my environment, I loaded oldmodel successfully.

boghrati · 2016-09-08T17:24:51Z

@isohyt I don't have a problem with loading the model. But when I do load a model, I have issues with retraining it. Here is a sample of my code. If I run the following code, the vector representations of the new words in inputtext2 will NOT be zero.

model = gensim.models.Word2Vec()
model.build_vocab(inputtext1)
model.train(inputtext1)
model.build_vocab(inputtext2, update=True)
model.train(inputtext2)

But if I save and load the model, as in the following code, the vector representations of the new words will be zero.

model = gensim.models.Word2Vec()
model.build_vocab(inputtext1)
model.train(inputtext1)
model.save("model1")
model = gensim.models.Word2Vec.load("model1")
model.build_vocab(inputtext2, update=True)
model.train(inputtext2)

Let me know if you encounter the same problem or if I'm missing anything in my code. Thanks!
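A quick way to check for the symptom described above is to look for new words whose vectors are entirely zero after the second round of training. The sketch below is only an illustration: `zero_vector_words` is a hypothetical helper, and the vectors are modelled as a plain dict from word to list of floats. How the vectors are read out of a real model depends on the gensim version and is not shown here.

```python
def zero_vector_words(vectors, words):
    """Return the words whose vectors are entirely zero (hypothetical helper)."""
    return [w for w in words if all(x == 0.0 for x in vectors[w])]


# Toy data standing in for vectors read from a model after online training.
vectors = {
    'human': [0.12, -0.03],
    'artificial': [0.0, 0.0],       # never updated
    'intelligence': [0.05, 0.01],
}
new_words = ['artificial', 'intelligence']

untrained = zero_vector_words(vectors, new_words)
```

An empty result for the words introduced by the second corpus would suggest that the update reached them in both the in-memory and the save/load workflow.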

boghrati · 2016-09-08T19:34:12Z

I downloaded the latest version and it's working fine now! Thank you for your help:) This is a really useful feature!

isomap · 2016-09-09T04:10:24Z

@boqrat Thanks for letting me know. By the way, I will make a small fix later, so please check my future update :)

isomap · 2016-09-09T06:10:04Z

@tmylk I've finished fixing the code you pointed out.
Once it passes your review, I will close this PR and open a new PR to consolidate the commits.

isomap · 2016-09-09T06:12:17Z

In addition, I'll write a new tutorial ASAP.

tmylk · 2016-09-27T06:02:48Z

@isohyt Please fix the merge conflicts, and then I will be happy to merge this useful feature.

isomap · 2016-09-28T22:37:11Z

I've updated the tutorial to use delta Wikipedia, and I've rebased this PR to fix the merge conflict.

ghost · 2016-10-07T08:28:15Z

Just two quick questions: 1. Are you still using the weight freezing technique that Rutum applied? 2. Is the accuracy still dropping?

piskvorky · 2016-10-07T09:49:54Z

gensim/test/test_word2vec.py

@@ -38,7 +38,7 @@ class LeeCorpus(object):
    def __iter__(self):
        with open(datapath('lee_background.cor')) as f:


@tmylk Should be smart_open (we want to use smart_open consistently across gensim).

tmylk and others added 24 commits November 5, 2015 19:07

Merge branch 'release-0.12.3rc1'

1c63c9a

Merge branch 'release-0.12.3'

280a488

Merge branch 'release-0.12.3'

ddeb002

Update CHANGELOG.txt

f2ac3a9

Update CHANGELOG.txt

cf09e8c

resolve merge conflict in Changelog

b61287a

Merge branch 'release-0.12.4' with piskvorky#596

3ade404

fix test?

be1a0f0

dont sort vocab when updating

0848018

Merge branch 'release-0.13.0'

9e6522e

Merge branch 'release-0.13.0'

87c4e9c

Release version typo fix

9c74b40

Merge branch 'release-0.13.0rc1'

7b30025

Merge branch 'onlineW2V' of https://github.com/zachmayer/gensim into …

d18f06f

…online-w2v merge online w2v working repo

fix build_vocab

6228b73

add assert online add word vector value was changed after online trai…

aef1791

…ning

add new_sentences volume to pass the test

b616d4f

Merge branch 'release-0.13.0'

de79c8e

Merge branch 'release-0.13.1'

d4f9cc5

fix test

baaa1d6

Merge remote-tracking branch 'upstram/master' into online-w2v

46fae36

add test code for online w2v

5e30552

small fix

ec95b8f

fix test

48c9f29

isomap added 2 commits July 12, 2016 18:34

fix online w2v code and add sanity checking

22dab54

assertGreater -> assertLess

d21dd2c

tmylk mentioned this pull request Aug 23, 2016

Online word2vec #700

Closed

Add online word2vec tutorial.ipynb and Update wikicorpus.py for old w…

8f70e15

…ikidump

tmylk reviewed Sep 6, 2016

Fix test and wikicorpus

7f7007f

fix indent

2e393de

update online w2v tutorial to use delta wikipedia

adf45ef

isomap mentioned this pull request Sep 28, 2016

online word2vec #900

Merged

isomap closed this Sep 28, 2016

piskvorky reviewed Oct 7, 2016


isomap deleted the online-w2v branch October 10, 2016 09:22

martinpopel mentioned this pull request Nov 30, 2016

Enable and refactor image summaries ufal/neuralmonkey#162

Merged
