Online word2vec test #778
recovering lost work updating bug in sentences iterator vector freeze after each training iteration cbow update clean up code update summarization tutorial image resolving merge conflicts resolving merge conflicts resolving merge conflicts resolving merge conflicts resolving merge conflicts resolving merge conflicts update_weights function change updating the numpy copy code updating the numpy copy code
…online-w2v merge online w2v working repo
You're getting one test failure on Python 2.6: https://travis-ci.org/RaRe-Technologies/gensim/jobs/143268922
======================================================================
FAIL: Test that the algorithm is able to add new words to the
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/RaRe-Technologies/gensim/gensim/test/test_word2vec.py", line 121, in testOnlineLearning
    self.assertTrue(numpy.allclose(model_hs.syn0, orig0hs))
AssertionError
I think it's just a probabilistic problem, so increasing the number of iterations should make the test pass.
Wow! I found many mistakes in the online word2vec code.
My fixes are summarized as follows:
In other words, I decided to relax the check on the similarity between the words 'war' and 'terrorism'. In many cases this similarity is above 0.8 to 0.9, but it sometimes drops dramatically. I think this happens because of the small sample size.
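As a rough illustration of relaxing such a check, here is a minimal sketch in plain Python. It is not the test from this PR: `cosine_similarity` and `similar_enough` are hypothetical helpers, the toy vectors are made up, and the threshold of 0.6 is an assumed value chosen only to show the idea of leaving headroom below the usual 0.8 to 0.9 range.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similar_enough(u, v, threshold=0.6):
    """Relaxed check: pass if similarity reaches an assumed, lower threshold."""
    return cosine_similarity(u, v) >= threshold

# Toy vectors, made up for illustration (not real word2vec output).
war = [0.9, 0.1, 0.3]
terrorism = [0.8, 0.3, 0.2]
passes = similar_enough(war, terrorism)
```

A lower threshold trades strictness for stability, which matters when a small corpus makes the learned similarity vary from run to run.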
I am going to leave this link here: http://rutumulkar.com/blog/2015/word2vec This is the original code for Online Word2Vec, which does not face any of the vocab issues that you have discussed.
I mean that the original online w2v code does not increment the counts of words that are already in the vocabulary.
We expect the count of the word 'graph' to be 4 after the online build_vocab, but the original code doesn't account for that.
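To illustrate the counting issue, here is a minimal sketch in plain Python. It is not gensim's implementation: `build_counts` and `update_counts` are hypothetical helpers and the toy sentences are made up. The point is that an online vocabulary update should add new occurrences to the counts of known words as well as register unseen words.

```python
from collections import Counter

def build_counts(sentences):
    """Count word occurrences in the initial corpus (hypothetical helper)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence)
    return counts

def update_counts(counts, new_sentences):
    """Online update: register new words and increment counts of known words."""
    updated = Counter(counts)  # copy, so the original counts stay untouched
    for sentence in new_sentences:
        updated.update(sentence)
    return updated

# Toy data, made up for illustration: 'graph' appears 3 times, then once more.
old_sentences = [["graph", "trees"], ["graph", "minors"], ["graph", "survey"]]
new_sentences = [["graph", "artificial"]]

counts = build_counts(old_sentences)
counts_after = update_counts(counts, new_sentences)
```

With this toy data the count for 'graph' goes from 3 to 4, and 'artificial' is registered with a count of 1.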
I've written an online word2vec tutorial using two different Wikipedia dumps.
ns = elem.find(ns_path).text
if filter_namespaces and ns not in filter_namespaces:
    text = None
# ns = elem.find(ns_path).text
Why is this commented out?
@isohyt "I've written online word2vec tutorial using 2 different wikipedia dumps"
Can you please provide the link to the tutorial? Thanks!
@boqrat link is here
@isohyt Thank you for the link. Have you tried saving the old model, loading it, and then continuing training? I'm getting a vector of all zeros if I load a previous model; however, if I follow your instructions, it gives me meaningful results.
@isohyt I don't have a problem with loading the model. But when I do load a model, I have issues with retraining it. Here is a sample of my code. If I run the following code, the vector representation of the new words in inputtext2 will NOT be zero. But if I save and load the model, as in the following code, the vector representation of the new words will be zero. Let me know if you encounter the same problem or if I'm missing anything in my code. Thanks!
I downloaded the latest version and it's working fine now! Thank you for your help :) This is a really useful feature!
@boqrat Thanks for letting me know. By the way, I will make a small fix later, so please check my future update :)
@tmylk I've finished fixing the code you pointed out.
In addition, I'll write a new tutorial ASAP.
@isohyt Please fix the merge conflicts, and then I will be happy to merge this useful feature.
I've updated the tutorial to use delta Wikipedia, and I've rebased this PR to fix the merge conflict.
Just two quick questions: 1. Are you still using the weight freezing technique that Rutum applied? 2. Is the accuracy still dropping?
@@ -38,7 +38,7 @@ class LeeCorpus(object):
    def __iter__(self):
        with open(datapath('lee_background.cor')) as f:
@tmylk Should be smart_open (we want to use smart_open consistently across gensim).
I made small fixes to zachmayer's and rutum's online word2vec code, with reference to #700.
I also wrote test code to check that online word2vec works correctly.
My test strategies are as follows:
On Travis CI, my tests pass except on Python 2.6. Do you have any idea how to solve this?