FastText save & callbacks suspicious behavior #2235
Comments
Same thing for me. When I trained my wiki corpus with word2vec, I got 37% on the analogy questions. But when I trained the same corpus with fasttext, the result was 3.3% on the same analogy questions. Is there a problem in fasttext? Gensim version: 3.6.0
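For reference, an analogy score like the one quoted above can be computed with gensim's built-in evaluator (a sketch; the commenter's exact evaluation setup is not shown, so treat this as an assumption — it presumes model is an already-trained Word2Vec or FastText model, and questions-words.txt ships with gensim's test data):

from gensim.test.utils import datapath

# Overall accuracy plus a per-section breakdown on the standard
# Google analogy test set bundled with gensim.
score, sections = model.wv.evaluate_word_analogies(datapath('questions-words.txt'))
print('analogy accuracy: %.1f%%' % (score * 100))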
Thanks for the report @daridar; especially (3) makes me think that we have an issue with …
CC @mpenkov
I have encountered the same problem while trying to train a FastText model on a big dataset. Here is a simplified version of the problem:

from gensim.models.fasttext import FastText
import numpy as np
from tqdm import tqdm
from time import sleep

class list_iter:
    # Restartable iterable that periodically prints a hash of one word's
    # vector and saves the vectors, to watch whether training has any effect.
    def __init__(self, array, model, see=np.nan, only_one_loop=False):
        self.array = array
        self.see = see  # np.nan disables the periodic check (n % nan is never 0)
        self.model = model
        self.only_one_loop = only_one_loop
        self.tqdm_bar = tqdm(desc='iterations')

    def __iter__(self):
        while True:
            for item in self.array:
                self.tqdm_bar.update(1)
                # every `see` items, inspect and checkpoint the vector for 'I'
                if self.tqdm_bar.n % self.see == 0:
                    print('\nvector hash: ' + str(hash(self.model.wv['I'].tobytes())))
                    sleep(2)
                    self.model.wv.save("model")
                yield item
            if self.only_one_loop:
                self.tqdm_bar.close()
                break

with open("tinyshakespeare.txt", 'r') as fp:
    corpus = [line.split() for line in fp.read().split('\n')]

model = FastText(workers=1)
model.build_vocab(list_iter(corpus, model, only_one_loop=True))
# total_examples is deliberately huge so the stream is consumed for the
# requested number of epochs
model.train(list_iter(corpus, model, see=10000),
            total_examples=99999999999999999, epochs=10)

The output of running this code is the same vector hash printed again and again: the vector of the word is not changing, and the model is not learning anything. (By the way, you can use any text file.)
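To localize the problem, it may help to compare hashes of FastText's underlying weight matrices against the composed lookup vectors (a diagnostic sketch; the attribute names vectors_vocab and vectors_ngrams follow gensim 3.x's FastTextKeyedVectors and are an assumption here):

# Run on the same model as above, after some training has happened.
# The trainable weight matrices do get updated...
print(hash(model.wv.vectors_vocab.tobytes()))
print(hash(model.wv.vectors_ngrams.tobytes()))
# ...while the precomputed composed vectors that model.wv['I'] reads
# from can stay stale until they are explicitly recomputed:
print(hash(model.wv['I'].tobytes()))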
I think that I have fixed my problem. It looks like the gensim team is already working on solving it, but the fix is not yet released in the pip version. For my case (saving from inside the iterator), calling adjust_vectors() before saving works:

self.model.wv.adjust_vectors()
self.model.wv.save("model")

This solution is for my case; if you only save after the training function has finished, it is not needed, because the (not yet released) fix already calls it at the end of training:

super(FastText, self).train(
    sentences=sentences, corpus_file=corpus_file, total_examples=total_examples, total_words=total_words,
    epochs=epochs, start_alpha=start_alpha, end_alpha=end_alpha, word_count=word_count,
    queue_factor=queue_factor, report_delay=report_delay, callbacks=callbacks)
self.wv.adjust_vectors()
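A minimal sketch of the same workaround packaged as a proper gensim callback (the class name and save path are illustrative; CallbackAny2Vec is gensim's callback base class):

from gensim.models.callbacks import CallbackAny2Vec

class EpochSaver(CallbackAny2Vec):
    # Save the model after each epoch, refreshing the composed FastText
    # word vectors first so the saved vectors are not stale.
    def __init__(self, path_prefix):
        self.path_prefix = path_prefix
        self.epoch = 0

    def on_epoch_end(self, model):
        model.wv.adjust_vectors()  # workaround: recompute wv.vectors
        model.save('%s_epoch%d.model' % (self.path_prefix, self.epoch))
        self.epoch += 1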
Description
The FastText model does not learn anything from the text corpus.
Steps/Code/Corpus to Reproduce
Expected Results
I expect model.most_similar("word") to return words close in meaning, but I get just trash. I used an open-source dataset from sklearn.datasets: fetch_20newsgroups.
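For context, a minimal sketch of how such a corpus could be prepared and trained on (the author's actual preprocessing is not shown, so the naive whitespace tokenization and epoch count here are assumptions):

from sklearn.datasets import fetch_20newsgroups
from gensim.models.fasttext import FastText

texts = fetch_20newsgroups(subset='train').data
corpus = [doc.split() for doc in texts]  # naive whitespace tokenization

model = FastText(workers=1)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=5)
print(model.wv.most_similar('apple'))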
Actual Results
The output changes only very slightly from epoch to epoch: the order of the words may shift a little, or their similarity scores may change, but essentially nothing changes during training. The model learns nothing.
Also, what is important: if I build a fasttext model from the command line, using the reference implementation (https://github.com/facebookresearch/fastText):

./fasttext skipgram -input data.txt -output model

it shows good results; for example, for "apple" we would receive "apples", "apple's", and so on. Likewise, if I change my model from FastText to Word2Vec, it learns. The results are good.
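The Word2Vec comparison would look like this (a sketch, using the same hypothetical corpus as in the snippet above):

from gensim.models.word2vec import Word2Vec

w2v = Word2Vec(workers=1)
w2v.build_vocab(corpus)
w2v.train(corpus, total_examples=w2v.corpus_count, epochs=5)
print(w2v.wv.most_similar('apple'))  # sensible neighbors, unlike the FastText run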
Also, if I don't use my EpochSaver, but just save the model after each epoch manually and then load it again before the next epoch starts, I also get good results (see the sketch below). So the problem is probably in EpochSaver, but can you please explain why it works in Word2Vec's case but not here?
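A minimal sketch of that manual save/load loop (the checkpoint file name, epoch count, and corpus variable are assumptions):

from gensim.models.fasttext import FastText

model = FastText(workers=1)
model.build_vocab(corpus)
for epoch in range(10):
    # one epoch at a time, then round-trip the model through disk
    model.train(corpus, total_examples=model.corpus_count, epochs=1)
    model.save("model_checkpoint")
    model = FastText.load("model_checkpoint")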
Versions
Linux-4.15.0-24-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]
NumPy 1.14.5
SciPy 1.1.0
gensim 3.6.0
FAST_VERSION 1