FastText save & callbacks suspicious behavior #2235

Closed
darentsia opened this issue Oct 18, 2018 · 7 comments
Labels: bug, difficulty medium, fasttext


darentsia commented Oct 18, 2018

Description

FastText model does not learn anything from the text corpus.

Steps/Code/Corpus to Reproduce

import os
import logging

from gensim.models import FastText
from gensim.models.callbacks import CallbackAny2Vec

class EpochSaver(CallbackAny2Vec):
    '''Callback to save model after each epoch and show training parameters '''

    def __init__(self, savedir):
        self.savedir = savedir
        self.epoch = 0
        os.makedirs(self.savedir, exist_ok=True)

    def on_epoch_end(self, model):
        savepath = os.path.join(self.savedir, "model_fastText_web_kw_sm{}_epoch.gz".format(self.epoch))
        model.save(savepath)
        print(
            "Epoch saved: {}".format(self.epoch + 1),
            "Start next epoch ... ", sep="\n"
            )
        if os.path.isfile(os.path.join(self.savedir, "model_fastText_web_kw_sm{}_epoch.gz".format(self.epoch - 1))):
            print("Previous model deleted ")
            os.remove(os.path.join(self.savedir, "model_fastText_web_kw_sm{}_epoch.gz".format(self.epoch - 1)))
        self.epoch += 1

class SentenceIter:
    def __iter__(self):
        with open("data/eng_tweets/20_news_groups_dataset.txt", "r") as f:
            for line in f:
                yield line[:-1].split(" ")

if __name__ == "__main__":

    logging.basicConfig(
        format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO
    )

    num_workers = os.cpu_count()
    model = FastText(
        SentenceIter(),
        sg=1,
        size=100,
        window=3,
        min_count=5,
        workers=num_workers,
        iter=5,
        negative=20,
        callbacks=[EpochSaver("./checkpoints/fasttext_eng_tweets")]
    )

Expected Results

I expected model.most_similar("word") to return words that are close in meaning, but it returned garbage.
The corpus is an open-source dataset from sklearn.datasets: fetch_20newsgroups.

Actual Results

image

The results change only slightly from epoch to epoch: the order of these words or their similarity scores may shift a little, but nothing meaningful changes during training. The model does not learn.

Also, what is important:

  1. If I train a fastText model from the command line with
    ./fasttext skipgram -input data.txt -output model (https://github.com/facebookresearch/fastText), it gives good results. For example, for apple we get apples, apple's and so on.

  2. If I switch the model from FastText to Word2Vec, it learns and the results are good.

  3. If I don't use my EpochSaver and instead save and load the model manually on each epoch, for example:

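for epoch in range(N_epochs):
    # train for one epoch (model.corpus_count is set once the vocabulary is built)
    model.train(SentenceIter(), total_examples=model.corpus_count, epochs=1)
    # save the model; savepath stands for any checkpoint path
    model.save(savepath)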

and then load the model before the next epoch starts, I also get good results.

So the problem may be in EpochSaver, but can you please explain why it works with Word2Vec but not here?

Versions

Linux-4.15.0-24-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]
NumPy 1.14.5
SciPy 1.1.0
gensim 3.6.0
FAST_VERSION 1

@bunyamink

Same thing for me. When I trained on my wiki corpus with word2vec, I got 37% on the analogy questions. But when I trained on the same corpus with fasttext, the result was 3.3% on the same analogy questions. Is there a problem in fasttext?

Gensim version: 3.6.0
Python Version: 3.6.4
Windows 10

@menshikh-iv changed the title from "FastText does not learn anything" to "FastText save & callbacks suspicious behavior" on Dec 14, 2018
@menshikh-iv added the bug and difficulty medium labels on Dec 14, 2018
@menshikh-iv
Contributor

Thanks for the report @daridar. Point (3) in particular makes me think that we have an issue with the save method (i.e. saving somehow changes the current model).

@menshikh-iv
Contributor

CC @mpenkov

@5cat

5cat commented Jan 17, 2019

I encountered the same problem while trying to train a FastText model on a big dataset.

Here is a simplified version of the problem.

from gensim.models.fasttext import FastText
from gensim.models.word2vec import Word2Vec 
import gensim.downloader as api
import numpy as np
from tqdm import tqdm
from time import sleep
class list_iter:
	def __init__(self,array,model,see=np.nan,only_one_loop=False):
		self.array=array
		self.see=see
		self.model=model
		self.only_one_loop=only_one_loop
		self.tqdm_bar=tqdm(desc='iterations')
	def __iter__(self):
		while True:
			for item in self.array:
				self.tqdm_bar.update(1)
				
				if self.tqdm_bar.n%self.see==0:
					print('\nvector hash:'+str(hash(self.model['I'].tostring())))
					sleep(2)
					self.model.wv.save("model")
				yield item
			if self.only_one_loop:
				self.tqdm_bar.close()
				break


with open("tinyshakespeare.txt", 'r') as fp:
	corpus=[i.split() for i in fp.read().split('\n')]

model=FastText(workers=1)

model.build_vocab(list_iter(corpus,model,only_one_loop=True))

model.train(list_iter(corpus,model,see=10000),total_examples=99999999999999999,epochs=10)

The output of running this code is:

iterations: 40001it [00:00, 359624.54it/s]
iterations: 8178it [00:00, 81174.89it/s]
vector hash:-4933655588363529352
iterations: 17619it [00:02, 5406.74it/s]
vector hash:-4933655588363529352
iterations: 28266it [00:04, 4400.15it/s]
vector hash:-4933655588363529352
iterations: 39611it [00:06, 4396.16it/s]
vector hash:-4933655588363529352

The vector of the word is not changing, so the model is not learning anything.
If I replace FastText(workers=1) with Word2Vec(workers=1), everything works and makes sense, and the vector is updated:

iterations: 40001it [00:00, 359474.28it/s]
iterations: 0it [00:00, ?it/s]
vector hash:-3094244126925185959
iterations: 19618it [00:02, 6651.19it/s]
vector hash:1153644772814581057
iterations: 22603it [00:04, 3228.06it/s]
vector hash:5947032563406220642
iterations: 30001it [00:06, 3326.54it/s]
vector hash:-7484819002721531784

By the way, you can use any text file.
I don't think the problem comes from the save method, because even without saving, the vector is the same after each iteration. When I check the hash of the saved file, it is different each time I save during training, but for some reason I can't see any changes in the vectors.
Even when I went back to gensim 3.1.0, the issue was still there.
Why is that?
gensim==3.6.0
python==3.6.4

@5cat

5cat commented Jan 17, 2019

I think I have fixed my problem. It looks like the gensim team is working on a fix, but it has not been released in the pip version yet?!
This is what I did:
pip3 uninstall gensim
then reinstalled it from this commit:
pip3 install 'git+git://github.com/RaRe-Technologies/gensim.git@b452a5b59f2f474dbbd275d0838c45df4d3c5aac'
Then, before saving the model, I call this function:

self.model.wv.adjust_vectors()
self.model.wv.save("model")

This solution is for my case. If you have already finished running the train function, there is no need to call model.wv.adjust_vectors(), since train() calls model.wv.adjust_vectors() itself at the end:

        super(FastText, self).train(
            sentences=sentences, corpus_file=corpus_file, total_examples=total_examples, total_words=total_words,
            epochs=epochs, start_alpha=start_alpha, end_alpha=end_alpha, word_count=word_count,
            queue_factor=queue_factor, report_delay=report_delay, callbacks=callbacks)
        self.wv.adjust_vectors()
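
To illustrate why the vector could look frozen in the middle of training: as far as I understand the refactored code, the vector returned for an in-vocabulary word is a cached combination of the word's own vector and the vectors of its character n-grams, and adjust_vectors() is what refreshes that cache. The toy class below is not gensim code. ToyFastTextVectors, its attributes, and the numbers are made up for illustration, and it only mimics the caching behaviour with plain Python lists.

```python
class ToyFastTextVectors:
    """Toy stand-in (NOT gensim) for a store that caches composed word vectors."""

    def __init__(self, word_vec, ngram_vecs):
        self.word_vec = list(word_vec)                   # trained per-word vector
        self.ngram_vecs = [list(v) for v in ngram_vecs]  # trained n-gram vectors
        self.cached = [0.0] * len(self.word_vec)         # what lookups return
        self.adjust_vectors()

    def adjust_vectors(self):
        """Recompute the cached vector as the mean of the word and n-gram vectors."""
        parts = [self.word_vec] + self.ngram_vecs
        self.cached = [sum(column) / len(parts) for column in zip(*parts)]

    def train_step(self, delta):
        """Pretend training: update the underlying vectors, but not the cache."""
        self.word_vec = [x + delta for x in self.word_vec]
        self.ngram_vecs = [[x + delta for x in v] for v in self.ngram_vecs]

    def lookup(self):
        """Return a copy of the cached vector, like a word lookup would."""
        return list(self.cached)


vectors = ToyFastTextVectors([1.0, 2.0], [[3.0, 4.0], [5.0, 6.0]])
before = vectors.lookup()
vectors.train_step(0.5)
stale = vectors.lookup()   # same as before, although training updated the parts
vectors.adjust_vectors()
fresh = vectors.lookup()   # now reflects the training step
print(before, stale, fresh)
```

In this toy, the lookup only changes after adjust_vectors() runs, which matches the symptom above: the hash of the word vector stayed the same mid-training even though the saved file kept changing.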

@menshikh-iv
Contributor

I think that I have fixed my problem, it looks like gensim team is working on solving it but its not released yet in the pip version?!

Yes, exactly. Big thanks to @mpenkov, who helped us a lot with fasttext-related issues in #2313.

@menshikh-iv
Contributor

menshikh-iv commented Jan 17, 2019

I guess I can close this issue as fixed by #2313

CC: @mpenkov
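
A rough sketch of how the workaround from the comments above could be folded into the original EpochSaver callback. It assumes a gensim build where model.wv.adjust_vectors() exists (5cat installed from a specific commit to get it), and I have not verified it against any particular release. AdjustingEpochSaver, the prefix argument, and the checkpoint file names are hypothetical. The ImportError fallback is there only so the class can be tried without gensim installed.

```python
import os

try:
    from gensim.models.callbacks import CallbackAny2Vec
except ImportError:
    # Fallback base class so the sketch can be exercised without gensim.
    class CallbackAny2Vec(object):
        def on_epoch_begin(self, model):
            pass

        def on_epoch_end(self, model):
            pass

        def on_train_begin(self, model):
            pass

        def on_train_end(self, model):
            pass


class AdjustingEpochSaver(CallbackAny2Vec):
    """Hypothetical EpochSaver variant that refreshes word vectors before saving."""

    def __init__(self, savedir, prefix="fasttext_epoch"):
        self.savedir = savedir
        self.prefix = prefix  # hypothetical file-name prefix
        self.epoch = 0
        os.makedirs(self.savedir, exist_ok=True)

    def _path(self, epoch):
        return os.path.join(self.savedir, "{}_{}.model".format(self.prefix, epoch))

    def on_epoch_end(self, model):
        # Workaround from this thread: recompute the word vectors before saving.
        # The hasattr check lets the same callback be reused with Word2Vec,
        # whose keyed vectors have no adjust_vectors().
        if hasattr(model.wv, "adjust_vectors"):
            model.wv.adjust_vectors()
        model.save(self._path(self.epoch))
        previous = self._path(self.epoch - 1)
        if os.path.isfile(previous):
            os.remove(previous)  # keep only the latest checkpoint
        self.epoch += 1
```

It would be passed the same way as in the original report, e.g. callbacks=[AdjustingEpochSaver("./checkpoints/fasttext_eng_tweets")].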
