Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loading fastText models using only bin file #1341

Merged
merged 34 commits into from
Jun 28, 2017

Conversation

prakhar2b
Copy link
Contributor

solve #1236. Mismatch in bin and vec files while loading wiki.fr pretrained model is resolved in this PR

@prakhar2b
Copy link
Contributor Author

@tmylk error is about whitespace in the vec file that I have uploaded.
@jayantj As of now, this code reads additional word (vocab_size is one more than nwords), but skips loading any information into dictionary as it seems undesirable. Should we consider loading those information?

@jayantj
Copy link
Contributor

jayantj commented May 25, 2017

Thanks for the PR @prakhar2b
As discussed, I believe the ideal fix for this would be loading FastText models from the .bin file only. It wouldn't require us to handle special cases in code, as is the case with the current fix.

Related to issue #1261

@prakhar2b
Copy link
Contributor Author

prakhar2b commented May 25, 2017

@jayantj yes, it was supposed to be only a workaround for now. Though, as for issue 1261, please look into this comment also

@prakhar2b prakhar2b changed the title Loading french pretrained vector issue resolved Loading fastText models using only bin file Jun 1, 2017
from gensim.models.word2vec import Word2Vec

from six import string_types

from numpy import exp, log, dot, zeros, outer, random, dtype, float32 as REAL,\
double, uint32, seterr, array, uint8, vstack, fromstring, sqrt, newaxis,\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need all these imports?

@@ -224,7 +228,7 @@ def load_word2vec_format(cls, *args, **kwargs):
return FastTextKeyedVectors.load_word2vec_format(*args, **kwargs)

@classmethod
def load_fasttext_format(cls, model_file, encoding='utf8'):
def load_fasttext_format(cls, model_file, bin_only = False, encoding='utf8'):
Copy link
Contributor

@jayantj jayantj Jun 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add bin_only to the docstring

if not bin_only:
assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
if len(self.wv.vocab) != vocab_size:
logger.warnings("If you are loading any model other than pretrained vector wiki.fr, ")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logger.warning?

self.wv.syn0 = zeros((vocab_size, self.vector_size), dtype=REAL)
logger.info("here?")
# TO-DO : how to update this
ntokens= self.struct_unpack(file_handle, '@1q')
Copy link
Contributor

@jayantj jayantj Jun 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take care of code style: space before =

logger.warnings("If you are loading any model other than pretrained vector wiki.fr, ")
logger.warnings("Please report to gensim or fastText.")
else:
self.wv.syn0 = zeros((vocab_size, self.vector_size), dtype=REAL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't belong here, we need to be loading the vectors for the in-vocab words after loading the ngram weight matrix (syn0_all)

assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
self.struct_unpack(file_handle, '@1q') # number of tokens
if not bin_only:
assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also log vocab_size/nwords/len(self.wv.vocab) in case of mismatches.

@@ -298,8 +312,24 @@ def load_dict(self, file_handle, encoding='utf8'):
char_byte = file_handle.read(1)
word = word_bytes.decode(encoding)
count, _ = self.struct_unpack(file_handle, '@qb')
assert self.wv.vocab[word].index == i, 'mismatch between gensim word index and fastText word index'
self.wv.vocab[word].count = count
if bin_only and word not in self.wv.vocab:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems unnecessarily convoluted - can the word be in self.wv.vocab if we're loading from bin file only?

if bin_only:
if self.wv.syn0.shape[0] != len(self.wv.vocab):
logger.info(
"duplicate words detected, shrinking matrix size from %i to %i",
Copy link
Contributor

@jayantj jayantj Jun 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand the need for this. Can duplicate words exist?

""" Test model succesfully loaded from fastText (new format) .bin files only """
new_model = fasttext.FastText.load_fasttext_format(self.test_new_model_file, bin_only = True)
vocab_size, model_size = 1763, 10
self.assertEqual(self.test_new_model.wv.syn0.shape, (vocab_size, model_size))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we be testing new_model and not self.test_new_model?

word_vec = np.zeros(self.wv.syn0.shape[1])

num_word_ngram_vectors = len(word_ngrams)
for word_ngram in word_ngrams:
Copy link
Contributor

@jayantj jayantj Jun 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems redundant and possibly error prone.
Instead you could simply use self.word_vec(word) to obtain the syn0 weights for the in-vocab words, after the syn0_all trimming takes place. Could you please try that, and see if it initializes the correct values for self.syn0?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so from looking at the FastText code, it seems this is more complicated than what we originally thought.
The final in-vocab word vectors (aka the syn0 vectors, the vectors in the .vec file) are obtained by summing up -

  1. all vectors for its component ngrams
  2. the vector for the original word itself

The large weight matrix in the .bin file contains -

  1. vectors for the original words in the first len(vocab) rows
  2. in the remaining rows, weights for component ngrams

This will require a change in logic then.

Possibly useful reference - https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc#L71

@@ -284,12 +285,12 @@ def load_model_params(self, file_handle):
def load_dict(self, file_handle, encoding='utf8'):
vocab_size, nwords, _ = self.struct_unpack(file_handle, '@3i')
# Vocab stored by [Dictionary::save](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc)
assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
logger.info("loading projection weights")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message doesn't seem right, we're loading the vocabulary here, not the weights.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I tried to replicate the logging that we were getting earlier.
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py#L204

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but here we aren't loading exactly the same thing as in the case linked above - in that case, we're loading the vocab words along with their vectors, here we're simply building a dictionary.

assert self.wv.vocab[word].index == i, 'mismatch between gensim word index and fastText word index'
self.wv.vocab[word].count = count

if word != "__label__":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a note briefly clarifying this, since it's a hack-ish fix and likely to confuse any future developers.

self.wv.vocab[word].count = count

if word != "__label__":
self.wv.vocab[word] = Vocab(index=i, count=count)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct? The word "__label__" doesn't have a corresponding vector in the weight matrix, if I understand our previous discussion correctly. As a result, any of the words after the "__label__" word will have indices shifted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, you were right about it, this is the last word, but we can't skip reading it otherwise there will be error in further bytes reading.

Copy link
Contributor

@jayantj jayantj Jun 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understand we have to ready the bytes, agreed.
In that case, I'd prefer something more explicit and clear, like -

if i == nwords and i < vocab_size:
    assert word == "__label__"
    continue   # don't add word to vocab
   

self.wv.syn0[self.wv.vocab[w].index] += np.array(ngram_weights[self.wv.ngrams[word_ngram]])

self.wv.syn0[self.wv.vocab[w].index] /= (len(word_ngrams) + 1)
logger.info("loaded %s matrix from %s" % (self.wv.syn0.shape))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message seems inconsistent with self.wv.syn0.shape

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, but you're passing in a tuple, so the message is going to look something like loaded 30000 matrix from 100. Please update the message.

Copy link
Owner

@piskvorky piskvorky Jun 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use % to format the string at all; instead, pass the arguments as function arguments.

for w, v in self.wv.vocab.items():
all_ngrams += self.compute_ngrams(w, self.wv.min_n, self.wv.max_n)
self.wv.syn0[self.wv.vocab[w].index] += np.array(self.wv.syn0_all[self.wv.vocab[w].index])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v instead of self.wv.vocab[w]? Please rename the variable to something more verbose when making this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, vocab should be fine ?

@prakhar2b
Copy link
Contributor Author

@jayantj
Copy link
Contributor

jayantj commented Jun 18, 2017

Good point, let's remove this and any other unused code.
Please also remove the .vec files from the tests, as we only need the .bin files now.


assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
if len(self.wv.vocab) != vocab_size:
logger.warning("If you are loading any model other than pretrained vector wiki.fr, ")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason for two separate warning statements? Why not a single one?

@@ -363,13 +377,15 @@ def init_ngrams(self):

ngram_weights = self.wv.syn0_all

for w, v in self.wv.vocab.items():
logger.info("loading vocabulary weights")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to make the INFO messages more informative: loading how many weights, loading from where etc.

When there's a problem, it really helps when the log contains concrete numbers and messages. The problem often becomes apparent even without debugging.

@@ -220,6 +220,10 @@ def save(self, *args, **kwargs):
super(FastText, self).save(*args, **kwargs)

@classmethod
def load_word2vec_format(cls, *args, **kwargs):
return FastTextKeyedVectors.load_word2vec_format(*args, **kwargs)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that a load using this method only learns the full-word vectors as in the .vec file. If so, isn't it true that the resulting object doesn't have any other capabilities beyond a plain KeyedVectors? In that case, using a specialized class like FastTextKeyedVectors – that maybe is trying to do more, such as ngram-tracking, but inherently is not because that info was lost in the sequence-of-steps used to load it – seems potentially misleading. So unless I'm misunderstanding, I think this load-technique should use a plain KeyedVectors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this method is not used now for loading using bin only. I removed this unused code, but got a strange flake8 error for python 3+, therefore re-added this for this PR. I'll try removing these unused codes later maybe in a different PR. @gojomo

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is an odd error! I suspect it's not really the presence/absence of that method that triggered it, but something else either random or hidden in the whitespace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gojomo ok, test passed this time after removing this code 😄

Copy link
Contributor

@jayantj jayantj Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, this was a bug in the flake8 script, fixed in cebb3fc


"""
model = cls()
model.wv = cls.load_word2vec_format('%s.vec' % model_file, encoding=encoding)
model.file_name = model_file
model.load_binary_data('%s.bin' % model_file, encoding=encoding)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be a good idea to allow taking the whole model filename (including the .bin) as valid input - with the latest changes, we're loading from the bin file only, so it makes intuitive sense for the entire filename to be valid input.
The only reason IMO we're still allowing the filename without extension as valid input is backward compatibility.

@@ -284,12 +285,12 @@ def load_model_params(self, file_handle):
def load_dict(self, file_handle, encoding='utf8'):
vocab_size, nwords, _ = self.struct_unpack(file_handle, '@3i')
# Vocab stored by [Dictionary::save](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc)
assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
logger.info("loading vocabulary words for fastText model from %s.bin", self.file_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding the number of words would be helpful in this logging statement.


if i == nwords and i < vocab_size:
"""
To handle the error in pretrained vector wiki.fr (French).
Copy link
Contributor

@jayantj jayantj Jun 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This format -

"""
Some comments
"""

is generally reserved for docstrings. Regular comments would be preferable.

if len(self.wv.vocab) != vocab_size:
logger.warning("mismatch between vocab sizes")
logger.warning("If you are loading any model other than pretrained vector wiki.fr, ")
logger.warning("Please report to Gensim.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change the multiple warning statements to a single concatenated statement.

Copy link
Owner

@piskvorky piskvorky Jun 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, "Please report to Gensim" is vague, and probably unnecessary.

If people encounter bugs or get exceptions, they'll let us know, don't worry about that.

I'd prefer if the logging message was more concrete instead: what mismatch, what are the mismatched "vocab sizes"?
We want the logs as concrete and useful as possible, for our own sanity (remote debugging via mailing list etc).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jayantj I remember a few weeks ago, I received a review comment from @piskvorky that concatenated statement would contain white space therefore better to split into multiple statements. Correct me if I didn't understand that comment

Copy link
Contributor

@jayantj jayantj Jun 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe @piskvorky meant to split the string itself across multiple lines, like this -

logger.warning(
	"mismatch between vocab sizes "
	"If you are loading any model other than pretrained vector wiki.fr, "
	"Please report to Gensim.")

He left a comment later in the PR clarifying it too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohh, thanks for clarifying 😄

@@ -348,6 +370,18 @@ def init_ngrams(self):
self.wv.ngrams[ngram] = i
self.wv.syn0_all = self.wv.syn0_all.take(ngram_indices, axis=0)

ngram_weights = self.wv.syn0_all

logger.info("loading weights for %s vocabulary words for fastText models from %s.bin", len(self.wv.vocab), self.file_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: fastText model (not models)

@@ -115,7 +114,7 @@ def testNormalizedVectorsNotSaved(self):
self.assertTrue(loaded_kv.syn0_all_norm is None)

def testLoadFastTextFormat(self):
"""Test model successfully loaded from fastText .vec and .bin files"""
"""Test model successfully loaded from fastText .bin files"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo here and below: .bin file (not files)

@jayantj
Copy link
Contributor

jayantj commented Jun 28, 2017

Pushed some changes to model filename handling and logging.
Also, a minor fix to the flake8 diff script that was causing the weird utf8 error in python3+ discussed above.
IMO, this PR can be merged now.

@menshikh-iv menshikh-iv merged commit e1d89c2 into piskvorky:develop Jun 28, 2017
@prakhar2b prakhar2b mentioned this pull request Jul 3, 2017
saparina pushed a commit to saparina/gensim that referenced this pull request Jul 9, 2017
…y#1341)

* french wiki issue resolved

* bin and vec mismatch handled

* added test from bin only loading

* [WIP] loading bin only

* word vec from its ngrams

* [WIP] word vec from ngrams

* [WIP] getting syn0 from all n-grams

* [TDD] test comparing word vector from bin_only and default loading

* cleaned up test code

* added docstring for bin_only

* resolved wiki.fr issue

* pep8 fixes

* default bin file loading only

* logging info modified plus changes a/c review

* removed unused code in fasttext.py

* removed unused codes and vec files from test

* added lee_fasttext vec files again

* re-added removed files and unused codes

* added file name in logging info

* removing unused load_word2vec_format code

* updated logging info and comments

* input file name with or without .bin both accepted

* resolved typo mistake

* test for file name

* minor change to input filename handling in ft wrapper

* changes to logging and assert messages, pep8 fixes

* removes redundant .vec files

* fixes utf8 bug in flake8_diff.sh script
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants