Fix critical issues in FastText #2313

Merged: 136 commits, Jan 11, 2019
Changes from 89 commits
Commits
94a20e9
WIP
mpenkov Dec 16, 2018
fb2b5b0
Handle incompatible float size condition
mpenkov Dec 16, 2018
1a41182
update docstring
mpenkov Dec 16, 2018
cd0b318
move regression test to unit tests
mpenkov Dec 16, 2018
3b31288
WIP
mpenkov Dec 23, 2018
42626a2
introduced Tracker class
mpenkov Dec 23, 2018
12cc3e2
added log examples
mpenkov Dec 23, 2018
64f7f39
initialize trainables weights when loading native model
mpenkov Dec 28, 2018
00b472b
adding script to trigger bug
mpenkov Dec 28, 2018
abfd573
minor documentation changes
mpenkov Dec 29, 2018
4e46062
improve unit test
mpenkov Dec 29, 2018
d3544c7
retrained toy model
mpenkov Dec 29, 2018
b98bc0b
update bucket parameter in unit test
mpenkov Dec 29, 2018
30be5bd
update unit test
mpenkov Dec 29, 2018
ab1eaf6
WIP
mpenkov Dec 29, 2018
e59d1db
retrain model with a smaller dimensionality
mpenkov Dec 30, 2018
392201b
git add docs/fasttext-notes.md
mpenkov Dec 31, 2018
f25607f
adding some comments and fixmes
mpenkov Dec 31, 2018
ef394ed
minor refactoring, update tests
mpenkov Dec 31, 2018
fe10ca7
update notes
mpenkov Dec 31, 2018
4c2223c
update notes
mpenkov Jan 2, 2019
28bf757
initialize wv.vectors_vocab
mpenkov Jan 2, 2019
8e0d04f
init vectors_vocab properly
mpenkov Jan 2, 2019
795fed0
add test_sanity_vectors
mpenkov Jan 2, 2019
671b3c0
no longer segfaulting
mpenkov Jan 2, 2019
9adb532
adding tests for in-vocab out-of-vocab words
mpenkov Jan 2, 2019
6de08de
removing old test
mpenkov Jan 2, 2019
81dd478
fix typo in test, reduce tolerance
mpenkov Jan 2, 2019
5c500f0
update test_continuation, it now fails
mpenkov Jan 2, 2019
cb045de
test continued training with gensim model
mpenkov Jan 2, 2019
a916266
compare vectors_ngrams before and after
mpenkov Jan 2, 2019
cee6311
disable test reruns for now
mpenkov Jan 2, 2019
752cf9b
set min_count=0
mpenkov Jan 2, 2019
ad3342a
initialize wv.buckets_word prior to continuing training
mpenkov Jan 2, 2019
64caa3c
making all tests pass
mpenkov Jan 2, 2019
2c9f2b4
add bucket param to FastTextKeyedVectors constructor
mpenkov Jan 2, 2019
80c8092
minor refactoring: split out _load_vocab function
mpenkov Jan 2, 2019
0d30cae
minor refactoring: split out _load_trainables method
mpenkov Jan 2, 2019
bf1c8b8
removing Tracker class: it was for debugging only
mpenkov Jan 3, 2019
ec92983
remove debugging print statement
mpenkov Jan 3, 2019
1c58119
docstring fixes
mpenkov Jan 3, 2019
91b3599
remove FIXME, leave this function alone
mpenkov Jan 3, 2019
5100335
add newlines at the end of docstrings
mpenkov Jan 3, 2019
aae713d
remove comment
mpenkov Jan 3, 2019
e5ec723
re-enable test reruns in tox.ini
mpenkov Jan 3, 2019
87f655a
remove print statements from tests
mpenkov Jan 3, 2019
76aca9a
git rm trigger.py
mpenkov Jan 3, 2019
8027459
refactor FB model loading code
mpenkov Jan 4, 2019
07f34e2
fix bug with missing ngrams (still need cleanup of hash2index & testing)
menshikh-iv Jan 4, 2019
118cd7f
fix cython implementation of _ft_hash (based on #2233)
menshikh-iv Jan 4, 2019
ef58c7c
decrease tolerances in unit tests
mpenkov Jan 5, 2019
799596d
add test case for native models and hashes
mpenkov Jan 5, 2019
6cf3d1f
add working/broken hash implementations for py/cy and tests
mpenkov Jan 5, 2019
b58a50b
minor fixup around hashes
mpenkov Jan 5, 2019
97baf3c
add oov test
mpenkov Jan 5, 2019
ef90436
adding hash compatibility tests for FastText model
mpenkov Jan 5, 2019
8956530
git rm gensim.xml native.xml
mpenkov Jan 6, 2019
2e10ece
minor fix in comment
mpenkov Jan 6, 2019
cb25448
refactoring: extract _pad_random and _pad_ones functions
mpenkov Jan 7, 2019
901eaeb
refactoring: move matrix init to FastTextKeyedVectors
mpenkov Jan 7, 2019
f0bd22d
refactoring: move init_ngrams_post_load method to FastTextKeyedVectors
mpenkov Jan 7, 2019
fa34d84
refactoring: move trainables.get_vocab_word_vecs to wv.calculate_vectors
mpenkov Jan 7, 2019
f9c1547
refactoring: simplify reset_ngrams_weights method
mpenkov Jan 7, 2019
de7d9ef
deprecate struct_unpack public method
mpenkov Jan 7, 2019
2946896
refactoring: improve separation of concerns between model and vectors
mpenkov Jan 7, 2019
a7c14d0
refactoring: improve separation of concerns between model and vectors
mpenkov Jan 7, 2019
5598e19
refactoring: get rid of init_ngrams_weights method
mpenkov Jan 7, 2019
07c84f5
refactoring: remove unused vectors_vocab_norm attribute
mpenkov Jan 7, 2019
f15094d
review response: update ft_hash_broken comment
mpenkov Jan 7, 2019
5e25a4f
review response: revert changes to broken hash function
mpenkov Jan 7, 2019
b789971
Merge remote-tracking branch 'upstream/develop' into attrs
mpenkov Jan 7, 2019
1ed35ea
review response: handle .bucket backwards compatibility
mpenkov Jan 7, 2019
0f62660
review response: adjust warning text
mpenkov Jan 7, 2019
c461193
tox -e flake8
mpenkov Jan 7, 2019
eeafdec
tox -e flake8-docs
mpenkov Jan 7, 2019
3e0e656
review response: store .compatible_hash in vectors only
mpenkov Jan 7, 2019
262599d
Revert "refactoring: remove unused vectors_vocab_norm attribute"
mpenkov Jan 7, 2019
7d4e60e
review response: remove critical log comments
mpenkov Jan 7, 2019
6cc80de
Merge remote-tracking branch 'upstream/develop' into attrs
mpenkov Jan 7, 2019
069912f
review response: fix docstring in fasttext_bin.py
mpenkov Jan 7, 2019
cc19393
review response: make fasttext_bin an internal module
mpenkov Jan 7, 2019
1661c16
review response: skip cython tests if cython is disabled
mpenkov Jan 7, 2019
72b1d81
review response: use np.allclose instead of array_equals
mpenkov Jan 8, 2019
e467060
refactoring: simplify ngrams_weights matrix init
mpenkov Jan 8, 2019
c2740cd
fixup: remove unused vectors_lockf attribute
mpenkov Jan 8, 2019
daa425a
fixup in _load_fasttext_model function
mpenkov Jan 8, 2019
64844f3
minor refactoring in unit tests
mpenkov Jan 8, 2019
39e85f1
adjust unit test
mpenkov Jan 8, 2019
60d0477
temporarily disabling some assertions in tests
mpenkov Jan 8, 2019
52e2fbe
document vectors_vocab_lockf and vectors_ngrams_lockf
mpenkov Jan 8, 2019
d08500b
refactoring: further simplify growth of _lockf matrices
mpenkov Jan 8, 2019
3a2f93e
remove outdated comments
mpenkov Jan 8, 2019
3159a18
fix deprecation warnings
mpenkov Jan 8, 2019
f262815
improve documentation for FastTextKeyedVectors
mpenkov Jan 9, 2019
b80c329
refactoring: extract L2 norm functions
mpenkov Jan 9, 2019
127a13e
add LoadFastTextFormatTest
mpenkov Jan 9, 2019
2b96550
refactoring: remove old FB I/O code
mpenkov Jan 9, 2019
25ad1ae
refactoring: FastTextKeyedVectors.init_post_load method
mpenkov Jan 9, 2019
09388ec
refactoring: simplify init_post_load method
mpenkov Jan 9, 2019
6054aa8
refactoring: simplify init_post_load method
mpenkov Jan 9, 2019
b92f435
refactoring: simplify calculate_vectors, rename to adjust_vectors
mpenkov Jan 9, 2019
422e3b1
refactoring: simplify _lockf init
mpenkov Jan 9, 2019
553c8e0
remove old tests
mpenkov Jan 9, 2019
d42e506
tox -e flake8
mpenkov Jan 9, 2019
425e942
fixup: introduce OrderedDict to _fasttext_bin.py
mpenkov Jan 9, 2019
802587a
add unicode prefixes to literals for Py2.7 compatibility
mpenkov Jan 9, 2019
ff82b71
more Py2.7 compatibility stuff
mpenkov Jan 9, 2019
dab47f3
refactoring: extract _process_fasttext_vocab function
mpenkov Jan 9, 2019
914aa95
still more Py2.7 compatibility stuff
mpenkov Jan 9, 2019
65abda9
adding additional assertion
mpenkov Jan 9, 2019
01d84d1
re-enable disabled assertions
mpenkov Jan 9, 2019
611cdb2
delete out of date comment
mpenkov Jan 9, 2019
0c959a9
Revert "re-enable disabled assertions"
mpenkov Jan 9, 2019
c196ace
more work on init_post_load function, update unit tests
mpenkov Jan 9, 2019
fb51a6a
update unit tests
mpenkov Jan 9, 2019
768a941
review response: remove FastTextVocab class, keep alias
mpenkov Jan 10, 2019
f4643bb
review response: simplify _l2_norm_inplace function
mpenkov Jan 10, 2019
d802e91
review response: add docstring
mpenkov Jan 10, 2019
92da774
review response: update docstring
mpenkov Jan 10, 2019
e638628
review response: move import
mpenkov Jan 10, 2019
6e47a88
review response: adjust skip message
mpenkov Jan 10, 2019
2cdad39
Merge remote-tracking branch 'upstream/develop' into attrs
mpenkov Jan 10, 2019
fbaf086
review response: add test_hash_native
mpenkov Jan 10, 2019
dc32126
review response: explain how model was generated
mpenkov Jan 10, 2019
39e8844
review response: explain how expected values were generated
mpenkov Jan 10, 2019
08ee7d8
review response: add test for long OOV word
mpenkov Jan 10, 2019
6d8a648
review response: remove unused comments
mpenkov Jan 10, 2019
734a0ac
review response: remove comment
mpenkov Jan 10, 2019
143445e
add test_continuation_load_gensim
mpenkov Jan 10, 2019
250d388
update model using gensim 3.6.0
mpenkov Jan 10, 2019
9fcf35e
review response: get rid of struct_unpack
mpenkov Jan 10, 2019
c1aeb85
review response: implement handling for zero bucket edge case
mpenkov Jan 10, 2019
58c1166
review response: add test_save_load
mpenkov Jan 10, 2019
e5960ed
review response: add test_save_load_native
mpenkov Jan 10, 2019
52230aa
workaround appveyor tempfile issue
mpenkov Jan 11, 2019
14c497d
fix tests
menshikh-iv Jan 11, 2019
152 changes: 152 additions & 0 deletions docs/fasttext-notes.md
@@ -0,0 +1,152 @@
FastText Notes
==============

The implementation is split across several submodules:

- models.fasttext
- models.keyedvectors (includes FastText-specific code, not good)
- models.word2vec (superclasses)
- models.base_any2vec (superclasses)

The implementation consists of several key classes:

1. models.fasttext.FastTextVocab: the vocabulary
2. models.keyedvectors.FastTextKeyedVectors: the vectors
3. models.fasttext.FastTextTrainables: the underlying neural network
4. models.fasttext.FastText: ties everything together

FastTextVocab
-------------

Seems to be an entirely redundant class.
Inherits from models.word2vec.Word2VecVocab, adding no new functionality.

FastTextKeyedVectors
--------------------

Inheritance hierarchy:

1. FastTextKeyedVectors
2. WordEmbeddingsKeyedVectors. Implements word similarity, e.g. cosine similarity, WMD, etc.
3. BaseKeyedVectors (abstract base class)
4. utils.SaveLoad

There are many attributes.

Inherited from BaseKeyedVectors:

- vectors: a 2D numpy array. Flexible number of rows (0 by default). Number of columns equals vector dimensionality.
- vocab: a dictionary. Keys are words. Values are Vocab instances: these are essentially namedtuples that contain an index and a count. The former is the index of the term in the entire vocab; the latter is the number of times the term occurs.
- vector_size (dimensionality)
- index2entity

Inherited from WordEmbeddingsKeyedVectors:

- vectors_norm
- index2word

Added by FastTextKeyedVectors:

- vectors_vocab: 2D array. Rows are vectors. Columns correspond to vector dimensions. Initialized in FastTextTrainables.init_ngrams_weights. Reset in reset_ngrams_weights. Referred to as syn0_vocab in fasttext_inner.pyx. These are vectors for every word in the vocabulary.
- vectors_vocab_norm: looks unused, see _clear_post_train method.
- vectors_ngrams: 2D array. Each row is a bucket. Columns correspond to vector dimensions. Initialized in the init_ngrams_weights function, and in the _load_vectors method when reading from the native FB binary. Modified in the reset_ngrams_weights method. This is the first matrix loaded from the native binary files.
- vectors_ngrams_norm: looks unused, see _clear_post_train method.
- buckets_word: A hashmap. Keyed by the index of a term in the vocab. Each value is an array, where each element is an integer that corresponds to a bucket. Initialized in the init_ngrams_weights function.
- hash2index: A hashmap. Keys are hashes of ngrams. Values look like compact indices into the vectors_ngrams matrix (?). Initialized in the init_ngrams_weights function.
- min_n: minimum ngram length
- max_n: maximum ngram length
- num_ngram_vectors: initialized in the init_ngrams_weights function.
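
A minimal sketch for inspecting these attributes on a trained toy model (assuming the gensim 3.x API current at the time of this PR):

```python
from gensim.models import FastText
from gensim.test.utils import common_texts  # tiny bundled corpus

# Train a toy model; `size` is the vector dimensionality in gensim 3.x.
model = FastText(common_texts, size=4, window=3, min_count=1)
wv = model.wv

print(wv.vectors.shape)         # (vocab size, vector_size)
print(wv.vectors_vocab.shape)   # one row per vocabulary word
print(wv.vectors_ngrams.shape)  # one row per ngram bucket
print(wv.min_n, wv.max_n)       # ngram length bounds
```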

The init_ngrams_weights method looks like an internal method of FastTextTrainables.
It gets called as part of the prepare_weights method, which is effectively part of the FastText constructor.

The above attributes are initialized to None in the FastTextKeyedVectors class constructor.
Unfortunately, their real initialization happens in an entirely different module, models.fasttext - another indication of poor separation of concerns.

Some questions:

- What is the x_lockf stuff? Why is it used only by the fast C implementation?
- How are vectors_vocab and vectors_ngrams different?

vectors_vocab contains vectors for the entire vocabulary.
vectors_ngrams contains vectors for each _bucket_.
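
Schematically, the two matrices combine to produce the final word vectors (a sketch of the averaging the model performs; names here are illustrative, not gensim's API):

```python
def assemble_word_vector(word_index, bucket_ids, vectors_vocab, vectors_ngrams):
    # Start from the word's own vector in vectors_vocab ...
    total = vectors_vocab[word_index].copy()
    # ... add the vector of every ngram bucket the word hashes into ...
    for bucket in bucket_ids:
        total += vectors_ngrams[bucket]
    # ... and average over all contributions (the word plus its ngrams).
    return total / (len(bucket_ids) + 1)
```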


FastTextTrainables
------------------

[Link](https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastTextTrainables)

This is a neural network that learns the vectors for the FastText embedding.
Mostly inherits from its [Word2Vec parent](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2VecTrainables).
Adds logic for calculating and maintaining ngram weights.

Key attributes:

- hashfxn: function for randomly initializing weights. Defaults to the built-in hash().
- layer1_size: The size of the inner layer of the NN. Equal to the vector dimensionality. Set in the Word2VecTrainables constructor.
- seed: The random generator seed used in reset_weights and update_weights
- syn1: The inner layer of the NN. Each row corresponds to a term in the vocabulary. Columns correspond to weights of the inner layer. There are layer1_size such weights. Set in the reset_weights and update_weights methods, only if hierarchical softmax is used.
- syn1neg: Similar to syn1, but only set if negative sampling is used.
- vectors_lockf: A one-dimensional array with one element for each term in the vocab. Set in reset_weights to an array of ones.
- vectors_vocab_lockf: Similar to vectors_lockf, ones(len(model.trainables.vectors), dtype=REAL)
- vectors_ngrams_lockf: ones((self.bucket, wv.vector_size), dtype=REAL)

The lockf stuff looks like it gets used by the fast C implementation.
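
Schematically, a lock factor scales each gradient update, so a value of 0.0 freezes a row of the matrix entirely (an illustrative sketch, not the actual Cython code):

```python
def apply_update(vectors, lockf, index, gradient, alpha):
    # lockf[index] == 1.0 trains this row normally; 0.0 locks it in place.
    vectors[index] += lockf[index] * alpha * gradient
```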

The inheritance hierarchy here is:

1. FastTextTrainables
2. Word2VecTrainables
3. utils.SaveLoad

FastText
--------

Inheritance hierarchy:

1. FastText
2. BaseWordEmbeddingsModel: vocabulary management plus a ton of deprecated attrs
3. BaseAny2VecModel: logging and training functionality
4. utils.SaveLoad: for loading and saving

Lots of attributes (many inherited from superclasses).

From BaseAny2VecModel:

- workers
- vector_size
- epochs
- callbacks
- batch_words
- kv
- vocabulary
- trainables

From BaseWordEmbeddingsModel:

- alpha
- min_alpha
- min_alpha_yet_reached
- window
- random
- hs
- negative
- ns_exponent
- cbow_mean
- compute_loss
- running_training_loss
- corpus_count
- corpus_total_words
- neg_labels

FastText attributes:

- wv: FastTextKeyedVectors. Used instead of .kv

Logging
-------

The logging seems to be inheritance-based.
It may be better to refactor this using aggregation instead of inheritance in the future.
The benefits would be leaner classes with fewer responsibilities and better separation of concerns.
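
For illustration, an aggregation-based alternative could look something like this (hypothetical names, not actual gensim code):

```python
import logging

class ProgressReporter:
    # The model *owns* a reporter rather than inheriting logging methods.
    def __init__(self, name):
        self._logger = logging.getLogger(name)

    def on_epoch_end(self, epoch, loss):
        self._logger.info("epoch %d finished, loss=%f", epoch, loss)

class SomeModel:
    def __init__(self):
        self.reporter = ProgressReporter(__name__)  # aggregation

    def _end_epoch(self, epoch, loss):
        self.reporter.on_epoch_end(epoch, loss)
```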
217 changes: 217 additions & 0 deletions gensim/models/_fasttext_bin.py
@@ -0,0 +1,217 @@
"""Load models from the native binary format released by Facebook.

Examples
--------

Load a model from a binary file:

.. sourcecode:: pycon

>>> from gensim.test.utils import datapath
    >>> from gensim.models._fasttext_bin import load
>>> with open(datapath('crime-and-punishment.bin'), 'rb') as fin:
... model = load(fin)
>>> model.nwords
291
>>> model.vectors_ngrams.shape
(391, 5)
>>> sorted(model.raw_vocab, key=lambda w: len(w), reverse=True)[:5]
['останавливаться', 'изворачиваться,', 'раздражительном', 'exceptionally', 'проскользнуть']

See Also
--------

`FB Implementation <https://github.com/facebookresearch/fastText/blob/master/src/matrix.cc>`_.

"""

import collections
import logging
import struct

import numpy as np

logger = logging.getLogger(__name__)

_FASTTEXT_FILEFORMAT_MAGIC = 793712314

_NEW_HEADER_FORMAT = [
('dim', 'i'),
('ws', 'i'),
('epoch', 'i'),
('min_count', 'i'),
('neg', 'i'),
('_', 'i'),
('loss', 'i'),
('model', 'i'),
('bucket', 'i'),
('minn', 'i'),
('maxn', 'i'),
('_', 'i'),
('t', 'd'),
]

_OLD_HEADER_FORMAT = [
('epoch', 'i'),
('min_count', 'i'),
('neg', 'i'),
('_', 'i'),
('loss', 'i'),
('model', 'i'),
('bucket', 'i'),
('minn', 'i'),
('maxn', 'i'),
('_', 'i'),
('t', 'd'),
]


def _yield_field_names():
for name, _ in _OLD_HEADER_FORMAT + _NEW_HEADER_FORMAT:
if not name.startswith('_'):
yield name
yield 'raw_vocab'
yield 'vocab_size'
yield 'nwords'
yield 'vectors_ngrams'
yield 'hidden_output'


_FIELD_NAMES = sorted(set(_yield_field_names()))
Model = collections.namedtuple('Model', _FIELD_NAMES)


def _struct_unpack(fin, fmt):
num_bytes = struct.calcsize(fmt)
return struct.unpack(fmt, fin.read(num_bytes))


def _load_vocab(fin, new_format, encoding='utf-8'):
"""Load a vocabulary from a FB binary.

Before the vocab is ready for use, call the prepare_vocab function and pass
in the relevant parameters from the model.

Parameters
----------
fin : file
An open file pointer to the binary.
    new_format : bool
True if the binary is of the newer format.
encoding : str
The encoding to use when decoding binary data into words.

Returns
-------
tuple
The loaded vocabulary. Keys are words, values are counts.
The vocabulary size.
The number of words.
"""
vocab_size, nwords, nlabels = _struct_unpack(fin, '@3i')

# Vocab stored by [Dictionary::save](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc)
if nlabels > 0:
raise NotImplementedError("Supervised fastText models are not supported")
logger.info("loading %s words for fastText model from %s", vocab_size, fin.name)

_struct_unpack(fin, '@1q') # number of tokens
if new_format:
pruneidx_size, = _struct_unpack(fin, '@q')

raw_vocab = {}
for i in range(vocab_size):
word_bytes = b''
char_byte = fin.read(1)
# Read vocab word
while char_byte != b'\x00':
word_bytes += char_byte
char_byte = fin.read(1)
word = word_bytes.decode(encoding)
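        # Each entry stores a 64-bit count followed by a one-byte entry type (discarded).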
count, _ = _struct_unpack(fin, '@qb')
raw_vocab[word] = count

if new_format:
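        # Skip the pruned-vocabulary index pairs; gensim does not use them.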
for j in range(pruneidx_size):
_struct_unpack(fin, '@2i')

return raw_vocab, vocab_size, nwords


def _load_matrix(fin, new_format=True):
"""Load a matrix from fastText native format.

Interprets the matrix dimensions and type from the file stream.

Parameters
----------
fin : file
A file handle opened for reading.
new_format : bool, optional
True if the quant_input variable precedes
the matrix declaration. Should be True for newer versions of fastText.

Returns
-------
    :class:`numpy.ndarray`
The vectors as an array.
Each vector will be a row in the array.
The number of columns of the array will correspond to the vector size.

"""
if new_format:
_struct_unpack(fin, '@?') # bool quant_input in fasttext.cc

num_vectors, dim = _struct_unpack(fin, '@2q')

float_size = struct.calcsize('@f')
if float_size == 4:
dtype = np.dtype(np.float32)
elif float_size == 8:
dtype = np.dtype(np.float64)
else:
raise ValueError("Incompatible float size: %r" % float_size)

matrix = np.fromfile(fin, dtype=dtype, count=num_vectors * dim)
matrix = matrix.reshape((num_vectors, dim))
return matrix


def load(fin, encoding='utf-8'):
"""Load a model from a binary stream.

Parameters
----------
    fin : file or str
        The readable binary stream, or a filename to open in binary mode.
    encoding : str, optional
        The encoding to use for decoding text.

Returns
-------
Model
The loaded model.

"""
if isinstance(fin, str):
fin = open(fin, 'rb')

magic, version = _struct_unpack(fin, '@2i')
new_format = magic == _FASTTEXT_FILEFORMAT_MAGIC

header_spec = _NEW_HEADER_FORMAT if new_format else _OLD_HEADER_FORMAT
model = {name: _struct_unpack(fin, fmt)[0] for (name, fmt) in header_spec}
if not new_format:
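        # The old format has no magic/version: the first two ints read are actually dim and ws.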
model.update(dim=magic, ws=version)

raw_vocab, vocab_size, nwords = _load_vocab(fin, new_format, encoding=encoding)
model.update(raw_vocab=raw_vocab, vocab_size=vocab_size, nwords=nwords)

vectors_ngrams = _load_matrix(fin, new_format=new_format)

hidden_output = _load_matrix(fin, new_format=new_format)
model.update(vectors_ngrams=vectors_ngrams, hidden_output=hidden_output)

assert fin.read() == b'', 'expected to reach EOF'

model = {k: v for k, v in model.items() if k in _FIELD_NAMES}
return Model(**model)