Loading fastText models using only bin file #1341
Conversation
Thanks for the PR @prakhar2b. Related to issue #1261.
@jayantj yes, it was supposed to be only a workaround for now. Though, as for issue #1261, please also look into this comment.
gensim/models/wrappers/fasttext.py
Outdated
from gensim.models.word2vec import Word2Vec

from six import string_types

from numpy import exp, log, dot, zeros, outer, random, dtype, float32 as REAL,\
    double, uint32, seterr, array, uint8, vstack, fromstring, sqrt, newaxis,\
Do we need all these imports?
gensim/models/wrappers/fasttext.py
Outdated
@@ -224,7 +228,7 @@ def load_word2vec_format(cls, *args, **kwargs):
        return FastTextKeyedVectors.load_word2vec_format(*args, **kwargs)

    @classmethod
    def load_fasttext_format(cls, model_file, encoding='utf8'):
    def load_fasttext_format(cls, model_file, bin_only = False, encoding='utf8'):
Please add bin_only to the docstring.
gensim/models/wrappers/fasttext.py
Outdated
if not bin_only:
    assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
    if len(self.wv.vocab) != vocab_size:
        logger.warnings("If you are loading any model other than pretrained vector wiki.fr, ")
logger.warning?
gensim/models/wrappers/fasttext.py
Outdated
self.wv.syn0 = zeros((vocab_size, self.vector_size), dtype=REAL)
logger.info("here?")
# TO-DO : how to update this
ntokens= self.struct_unpack(file_handle, '@1q')
Please take care of code style: space before =
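For example, with the style fix applied the line above would read:

    ntokens = self.struct_unpack(file_handle, '@1q')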
gensim/models/wrappers/fasttext.py
Outdated
logger.warnings("If you are loading any model other than pretrained vector wiki.fr, ") | ||
logger.warnings("Please report to gensim or fastText.") | ||
else: | ||
self.wv.syn0 = zeros((vocab_size, self.vector_size), dtype=REAL) |
This doesn't belong here; we need to be loading the vectors for the in-vocab words after loading the ngram weight matrix (syn0_all).
gensim/models/wrappers/fasttext.py
Outdated
assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
self.struct_unpack(file_handle, '@1q')  # number of tokens
if not bin_only:
    assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
Let's also log vocab_size/nwords/len(self.wv.vocab) in case of mismatches.
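A minimal sketch of what that could look like (assuming the same variables as in the snippet above):

    logger.warning(
        "mismatch between vocab sizes: vocab_size=%s, nwords=%s, len(self.wv.vocab)=%s",
        vocab_size, nwords, len(self.wv.vocab)
    )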
gensim/models/wrappers/fasttext.py
Outdated
@@ -298,8 +312,24 @@ def load_dict(self, file_handle, encoding='utf8'):
            char_byte = file_handle.read(1)
        word = word_bytes.decode(encoding)
        count, _ = self.struct_unpack(file_handle, '@qb')
        assert self.wv.vocab[word].index == i, 'mismatch between gensim word index and fastText word index'
        self.wv.vocab[word].count = count
        if bin_only and word not in self.wv.vocab:
This seems unnecessarily convoluted - can the word already be in self.wv.vocab if we're loading from the bin file only?
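For reference, a hedged sketch of the simpler branch being suggested, assuming the loop variables word, i and count from the snippet above and the Vocab class already used in this module:

    if bin_only:
        # loading from the .bin alone: the vocab starts out empty,
        # so every word read from the dictionary section is new
        self.wv.vocab[word] = Vocab(index=i, count=count)
    else:
        assert self.wv.vocab[word].index == i, 'mismatch between gensim word index and fastText word index'
        self.wv.vocab[word].count = count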
gensim/models/wrappers/fasttext.py
Outdated
if bin_only:
    if self.wv.syn0.shape[0] != len(self.wv.vocab):
        logger.info(
            "duplicate words detected, shrinking matrix size from %i to %i",
Not sure I understand the need for this. Can duplicate words exist?
gensim/test/test_fasttext_wrapper.py
Outdated
""" Test model succesfully loaded from fastText (new format) .bin files only """ | ||
new_model = fasttext.FastText.load_fasttext_format(self.test_new_model_file, bin_only = True) | ||
vocab_size, model_size = 1763, 10 | ||
self.assertEqual(self.test_new_model.wv.syn0.shape, (vocab_size, model_size)) |
Shouldn't we be testing new_model and not self.test_new_model?
gensim/models/wrappers/fasttext.py
Outdated
word_vec = np.zeros(self.wv.syn0.shape[1])

num_word_ngram_vectors = len(word_ngrams)
for word_ngram in word_ngrams:
This seems redundant and possibly error-prone. Instead you could simply use self.word_vec(word) to obtain the syn0 weights for the in-vocab words, after the syn0_all trimming takes place. Could you please try that, and see if it initializes the correct values for self.syn0?
Okay, so from looking at the FastText code, it seems this is more complicated than what we originally thought.

The final in-vocab word vectors (aka the syn0 vectors, the vectors in the .vec file) are obtained by summing up:
- all vectors for its component ngrams
- the vector for the original word itself

The large weight matrix in the .bin file contains:
- vectors for the original words in the first len(vocab) rows
- in the remaining rows, weights for component ngrams

This will require a change in logic then.
Possibly useful reference - https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc#L71
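A rough sketch of the composition described above, assuming syn0_all stores the vocab-word rows first and the ngram rows after them, and a compute_ngrams helper like the one used elsewhere in the wrapper (all names here are illustrative, not the final implementation):

    import numpy as np

    def compose_word_vector(word, word_index, syn0_all, ngram_to_row, min_n, max_n, compute_ngrams):
        # start from the vector stored for the original word itself
        vec = np.array(syn0_all[word_index], dtype=np.float64)
        # add the vectors of all of the word's component ngrams
        ngrams = compute_ngrams(word, min_n, max_n)
        for ngram in ngrams:
            vec += syn0_all[ngram_to_row[ngram]]
        # average over the word itself plus its ngrams, as fastText does
        return vec / (len(ngrams) + 1)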
gensim/models/wrappers/fasttext.py
Outdated
@@ -284,12 +285,12 @@ def load_model_params(self, file_handle):
    def load_dict(self, file_handle, encoding='utf8'):
        vocab_size, nwords, _ = self.struct_unpack(file_handle, '@3i')
        # Vocab stored by [Dictionary::save](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc)
        assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
        assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
        logger.info("loading projection weights")
This message doesn't seem right; we're loading the vocabulary here, not the weights.
ok, I tried to replicate the logging that we were getting earlier.
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py#L204
Yes, but here we aren't loading exactly the same thing as in the case linked above: there, we're loading the vocab words along with their vectors; here, we're simply building a dictionary.
gensim/models/wrappers/fasttext.py
Outdated
assert self.wv.vocab[word].index == i, 'mismatch between gensim word index and fastText word index'
self.wv.vocab[word].count = count

if word != "__label__":
Please add a note briefly clarifying this, since it's a hack-ish fix and likely to confuse any future developers.
gensim/models/wrappers/fasttext.py
Outdated
self.wv.vocab[word].count = count

if word != "__label__":
    self.wv.vocab[word] = Vocab(index=i, count=count)
Is this correct? The word "__label__" doesn't have a corresponding vector in the weight matrix, if I understand our previous discussion correctly. As a result, any of the words after the "__label__" word will have indices shifted.
Actually, you were right about it: this is the last word, but we can't skip reading it, otherwise there will be errors in the subsequent byte reads.
Yes, I understand we have to read the bytes, agreed.
In that case, I'd prefer something more explicit and clear, like:

    if i == nwords and i < vocab_size:
        assert word == "__label__"
        continue  # don't add word to vocab
gensim/models/wrappers/fasttext.py
Outdated
    self.wv.syn0[self.wv.vocab[w].index] += np.array(ngram_weights[self.wv.ngrams[word_ngram]])

self.wv.syn0[self.wv.vocab[w].index] /= (len(word_ngrams) + 1)
logger.info("loaded %s matrix from %s" % (self.wv.syn0.shape))
This message seems inconsistent with self.wv.syn0.shape
Again, I tried to replicate the older logging: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py#L266
Okay, but you're passing in a tuple, so the message is going to look something like loaded 30000 matrix from 100. Please update the message.
Don't use % to format the string at all; instead, pass the arguments as function arguments.
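For example, something along the lines of (the exact variables available at that point are assumed):

    logger.info("loaded %s weight matrix for fastText model from %s.bin", self.wv.syn0.shape, self.file_name)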
gensim/models/wrappers/fasttext.py
Outdated
for w, v in self.wv.vocab.items():
    all_ngrams += self.compute_ngrams(w, self.wv.min_n, self.wv.max_n)
    self.wv.syn0[self.wv.vocab[w].index] += np.array(self.wv.syn0_all[self.wv.vocab[w].index])
v instead of self.wv.vocab[w]? Please rename the variable to something more verbose when making this change.
ok, vocab should be fine?
@jayantj Also, we don't need this anymore.
Good point, let's remove this and any other unused code.
gensim/models/wrappers/fasttext.py
Outdated
assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
if len(self.wv.vocab) != vocab_size:
    logger.warning("If you are loading any model other than pretrained vector wiki.fr, ")
Any particular reason for two separate warning statements? Why not a single one?
gensim/models/wrappers/fasttext.py
Outdated
@@ -363,13 +377,15 @@ def init_ngrams(self):

        ngram_weights = self.wv.syn0_all

        for w, v in self.wv.vocab.items():
        logger.info("loading vocabulary weights")
Try to make the INFO messages more informative: loading how many weights, loading from where etc.
When there's a problem, it really helps when the log contains concrete numbers and messages. The problem often becomes apparent even without debugging.
gensim/models/wrappers/fasttext.py
Outdated
@@ -220,6 +220,10 @@ def save(self, *args, **kwargs):
        super(FastText, self).save(*args, **kwargs)

    @classmethod
    def load_word2vec_format(cls, *args, **kwargs):
        return FastTextKeyedVectors.load_word2vec_format(*args, **kwargs)
I believe that a load using this method only learns the full-word vectors as in the .vec file. If so, isn't it true that the resulting object doesn't have any other capabilities beyond a plain KeyedVectors? In that case, using a specialized class like FastTextKeyedVectors – that maybe is trying to do more, such as ngram-tracking, but inherently is not because that info was lost in the sequence-of-steps used to load it – seems potentially misleading. So unless I'm misunderstanding, I think this load-technique should use a plain KeyedVectors.
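For illustration, loading only the .vec file as a plain KeyedVectors would look roughly like this (the file name is just an example):

    from gensim.models.keyedvectors import KeyedVectors

    wv = KeyedVectors.load_word2vec_format('wiki.fr.vec')
    print(wv.word_vec('bonjour'))  # plain word lookups work; no ngram-based vectors for out-of-vocab words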
Yes, this method is not used now when loading from the bin file only. I removed this unused code, but got a strange flake8 error for Python 3+, so I re-added it for this PR. I'll try removing this unused code later, maybe in a different PR. @gojomo
That is an odd error! I suspect it's not really the presence/absence of that method that triggered it, but something else either random or hidden in the whitespace.
@gojomo ok, the test passed this time after removing this code 😄
For reference, this was a bug in the flake8 script, fixed in cebb3fc
gensim/models/wrappers/fasttext.py
Outdated
    """
    model = cls()
    model.wv = cls.load_word2vec_format('%s.vec' % model_file, encoding=encoding)
    model.file_name = model_file
    model.load_binary_data('%s.bin' % model_file, encoding=encoding)
I wonder if it would be a good idea to allow taking the whole model filename (including the .bin) as valid input - with the latest changes, we're loading from the bin file only, so it makes intuitive sense for the entire filename to be valid input.
The only reason IMO we're still allowing the filename without extension as valid input is backward compatibility.
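A hypothetical sketch of that kind of input handling (the helper name is made up here; the actual change was pushed later in the PR):

    def strip_bin_extension(model_file):
        # accept either "wiki.fr" or "wiki.fr.bin" and return the common prefix,
        # so that '%s.bin' % prefix points at the binary file in both cases
        if model_file.endswith('.bin'):
            return model_file[:-len('.bin')]
        return model_file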
gensim/models/wrappers/fasttext.py
Outdated
@@ -284,12 +285,12 @@ def load_model_params(self, file_handle):
    def load_dict(self, file_handle, encoding='utf8'):
        vocab_size, nwords, _ = self.struct_unpack(file_handle, '@3i')
        # Vocab stored by [Dictionary::save](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc)
        assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
        assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
        logger.info("loading vocabulary words for fastText model from %s.bin", self.file_name)
Adding the number of words would be helpful in this logging statement.
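For instance (a sketch reusing the variables from the snippet above):

    logger.info(
        "loading %s words for fastText model from %s.bin",
        vocab_size, self.file_name
    )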
gensim/models/wrappers/fasttext.py
Outdated
if i == nwords and i < vocab_size:
    """
    To handle the error in pretrained vector wiki.fr (French).
This format -

    """
    Some comments
    """

is generally reserved for docstrings. Regular comments would be preferable.
gensim/models/wrappers/fasttext.py
Outdated
if len(self.wv.vocab) != vocab_size:
    logger.warning("mismatch between vocab sizes")
    logger.warning("If you are loading any model other than pretrained vector wiki.fr, ")
    logger.warning("Please report to Gensim.")
Please change the multiple warning statements to a single concatenated statement.
Also, "Please report to Gensim" is vague, and probably unnecessary.
If people encounter bugs or get exceptions, they'll let us know, don't worry about that.
I'd prefer if the logging message was more concrete instead: what mismatch, what are the mismatched "vocab sizes"?
We want the logs as concrete and useful as possible, for our own sanity (remote debugging via mailing list etc).
@jayantj I remember a few weeks ago I received a review comment from @piskvorky that a concatenated statement would contain whitespace, and that it's therefore better to split it into multiple statements. Correct me if I didn't understand that comment.
I believe @piskvorky meant to split the string itself across multiple lines, like this -

    logger.warning(
        "mismatch between vocab sizes "
        "If you are loading any model other than pretrained vector wiki.fr, "
        "Please report to Gensim.")
He left a comment later in the PR clarifying it too.
ohh, thanks for clarifying 😄
gensim/models/wrappers/fasttext.py
Outdated
@@ -348,6 +370,18 @@ def init_ngrams(self):
            self.wv.ngrams[ngram] = i
        self.wv.syn0_all = self.wv.syn0_all.take(ngram_indices, axis=0)

        ngram_weights = self.wv.syn0_all

        logger.info("loading weights for %s vocabulary words for fastText models from %s.bin", len(self.wv.vocab), self.file_name)
Typo: fastText model (not models)
gensim/test/test_fasttext_wrapper.py
Outdated
@@ -115,7 +114,7 @@ def testNormalizedVectorsNotSaved(self):
        self.assertTrue(loaded_kv.syn0_all_norm is None)

    def testLoadFastTextFormat(self):
        """Test model successfully loaded from fastText .vec and .bin files"""
        """Test model successfully loaded from fastText .bin files"""
Typo here and below: .bin file (not files)
Pushed some changes to model filename handling and logging.
…y#1341)
* french wiki issue resolved
* bin and vec mismatch handled
* added test from bin only loading
* [WIP] loading bin only
* word vec from its ngrams
* [WIP] word vec from ngrams
* [WIP] getting syn0 from all n-grams
* [TDD] test comparing word vector from bin_only and default loading
* cleaned up test code
* added docstring for bin_only
* resolved wiki.fr issue
* pep8 fixes
* default bin file loading only
* logging info modified plus changes a/c review
* removed unused code in fasttext.py
* removed unused codes and vec files from test
* added lee_fasttext vec files again
* re-added removed files and unused codes
* added file name in logging info
* removing unused load_word2vec_format code
* updated logging info and comments
* input file name with or without .bin both accepted
* resolved typo mistake
* test for file name
* minor change to input filename handling in ft wrapper
* changes to logging and assert messages, pep8 fixes
* removes redundant .vec files
* fixes utf8 bug in flake8_diff.sh script
Solves #1236. The mismatch between bin and vec files while loading the wiki.fr pretrained model is resolved in this PR.