Fix critical issues in FastText #2313

Merged: 136 commits, Jan 11, 2019
Changes from 89 commits
Commits
94a20e9
WIP
mpenkov Dec 16, 2018
fb2b5b0
Handle incompatible float size condition
mpenkov Dec 16, 2018
1a41182
update docstring
mpenkov Dec 16, 2018
cd0b318
move regression test to unit tests
mpenkov Dec 16, 2018
3b31288
WIP
mpenkov Dec 23, 2018
42626a2
introduced Tracker class
mpenkov Dec 23, 2018
12cc3e2
added log examples
mpenkov Dec 23, 2018
64f7f39
initialize trainables weights when loading native model
mpenkov Dec 28, 2018
00b472b
adding script to trigger bug
mpenkov Dec 28, 2018
abfd573
minor documentation changes
mpenkov Dec 29, 2018
4e46062
improve unit test
mpenkov Dec 29, 2018
d3544c7
retrained toy model
mpenkov Dec 29, 2018
b98bc0b
update bucket parameter in unit test
mpenkov Dec 29, 2018
30be5bd
update unit test
mpenkov Dec 29, 2018
ab1eaf6
WIP
mpenkov Dec 29, 2018
e59d1db
retrain model with a smaller dimensionality
mpenkov Dec 30, 2018
392201b
git add docs/fasttext-notes.md
mpenkov Dec 31, 2018
f25607f
adding some comments and fixmes
mpenkov Dec 31, 2018
ef394ed
minor refactoring, update tests
mpenkov Dec 31, 2018
fe10ca7
update notes
mpenkov Dec 31, 2018
4c2223c
update notes
mpenkov Jan 2, 2019
28bf757
initialize wv.vectors_vocab
mpenkov Jan 2, 2019
8e0d04f
init vectors_vocab properly
mpenkov Jan 2, 2019
795fed0
add test_sanity_vectors
mpenkov Jan 2, 2019
671b3c0
no longer segfaulting
mpenkov Jan 2, 2019
9adb532
adding tests for in-vocab out-of-vocab words
mpenkov Jan 2, 2019
6de08de
removing old test
mpenkov Jan 2, 2019
81dd478
fix typo in test, reduce tolerance
mpenkov Jan 2, 2019
5c500f0
update test_continuation, it now fails
mpenkov Jan 2, 2019
cb045de
test continued training with gensim model
mpenkov Jan 2, 2019
a916266
compare vectors_ngrams before and after
mpenkov Jan 2, 2019
cee6311
disable test reruns for now
mpenkov Jan 2, 2019
752cf9b
set min_count=0
mpenkov Jan 2, 2019
ad3342a
initialize wv.buckets_word prior to continuing training
mpenkov Jan 2, 2019
64caa3c
making all tests pass
mpenkov Jan 2, 2019
2c9f2b4
add bucket param to FastTextKeyedVectors constructor
mpenkov Jan 2, 2019
80c8092
minor refactoring: split out _load_vocab function
mpenkov Jan 2, 2019
0d30cae
minor refactoring: split out _load_trainables method
mpenkov Jan 2, 2019
bf1c8b8
removing Tracker class: it was for debugging only
mpenkov Jan 3, 2019
ec92983
remove debugging print statement
mpenkov Jan 3, 2019
1c58119
docstring fixes
mpenkov Jan 3, 2019
91b3599
remove FIXME, leave this function alone
mpenkov Jan 3, 2019
5100335
add newlines at the end of docstrings
mpenkov Jan 3, 2019
aae713d
remove comment
mpenkov Jan 3, 2019
e5ec723
re-enable test reruns in tox.ini
mpenkov Jan 3, 2019
87f655a
remove print statements from tests
mpenkov Jan 3, 2019
76aca9a
git rm trigger.py
mpenkov Jan 3, 2019
8027459
refactor FB model loading code
mpenkov Jan 4, 2019
07f34e2
fix bug with missing ngrams (still need cleanup of hash2index & testing)
menshikh-iv Jan 4, 2019
118cd7f
fix cython implementation of _ft_hash (based on #2233)
menshikh-iv Jan 4, 2019
ef58c7c
decrease tolerances in unit tests
mpenkov Jan 5, 2019
799596d
add test case for native models and hashes
mpenkov Jan 5, 2019
6cf3d1f
add working/broken hash implementations for py/cy and tests
mpenkov Jan 5, 2019
b58a50b
minor fixup around hashes
mpenkov Jan 5, 2019
97baf3c
add oov test
mpenkov Jan 5, 2019
ef90436
adding hash compatibility tests for FastText model
mpenkov Jan 5, 2019
8956530
git rm gensim.xml native.xml
mpenkov Jan 6, 2019
2e10ece
minor fix in comment
mpenkov Jan 6, 2019
cb25448
refactoring: extract _pad_random and _pad_ones functions
mpenkov Jan 7, 2019
901eaeb
refactoring: move matrix init to FastTextKeyedVectors
mpenkov Jan 7, 2019
f0bd22d
refactoring: move init_ngrams_post_load method to FastTextKeyedVectors
mpenkov Jan 7, 2019
fa34d84
refactoring: move trainables.get_vocab_word_vecs to wv.calculate_vectors
mpenkov Jan 7, 2019
f9c1547
refactoring: simplify reset_ngrams_weights method
mpenkov Jan 7, 2019
de7d9ef
deprecate struct_unpack public method
mpenkov Jan 7, 2019
2946896
refactoring: improve separation of concerns between model and vectors
mpenkov Jan 7, 2019
a7c14d0
refactoring: improve separation of concerns between model and vectors
mpenkov Jan 7, 2019
5598e19
refactoring: get rid of init_ngrams_weights method
mpenkov Jan 7, 2019
07c84f5
refactoring: remove unused vectors_vocab_norm attribute
mpenkov Jan 7, 2019
f15094d
review response: update ft_hash_broken comment
mpenkov Jan 7, 2019
5e25a4f
review response: revert changes to broken hash function
mpenkov Jan 7, 2019
b789971
Merge remote-tracking branch 'upstream/develop' into attrs
mpenkov Jan 7, 2019
1ed35ea
review response: handle .bucket backwards compatibility
mpenkov Jan 7, 2019
0f62660
review response: adjust warning text
mpenkov Jan 7, 2019
c461193
tox -e flake8
mpenkov Jan 7, 2019
eeafdec
tox -e flake8-docs
mpenkov Jan 7, 2019
3e0e656
review response: store .compatible_hash in vectors only
mpenkov Jan 7, 2019
262599d
Revert "refactoring: remove unused vectors_vocab_norm attribute"
mpenkov Jan 7, 2019
7d4e60e
review response: remove critical log comments
mpenkov Jan 7, 2019
6cc80de
Merge remote-tracking branch 'upstream/develop' into attrs
mpenkov Jan 7, 2019
069912f
review response: fix docstring in fasttext_bin.py
mpenkov Jan 7, 2019
cc19393
review response: make fasttext_bin an internal module
mpenkov Jan 7, 2019
1661c16
review response: skip cython tests if cython is disabled
mpenkov Jan 7, 2019
72b1d81
review response: use np.allclose instead of array_equals
mpenkov Jan 8, 2019
e467060
refactoring: simplify ngrams_weights matrix init
mpenkov Jan 8, 2019
c2740cd
fixup: remove unused vectors_lockf attribute
mpenkov Jan 8, 2019
daa425a
fixup in _load_fasttext_model function
mpenkov Jan 8, 2019
64844f3
minor refactoring in unit tests
mpenkov Jan 8, 2019
39e85f1
adjust unit test
mpenkov Jan 8, 2019
60d0477
temporarily disabling some assertions in tests
mpenkov Jan 8, 2019
52e2fbe
document vectors_vocab_lockf and vectors_ngrams_lockf
mpenkov Jan 8, 2019
d08500b
refactoring: further simplify growth of _lockf matrices
mpenkov Jan 8, 2019
3a2f93e
remove outdated comments
mpenkov Jan 8, 2019
3159a18
fix deprecation warnings
mpenkov Jan 8, 2019
f262815
improve documentation for FastTextKeyedVectors
mpenkov Jan 9, 2019
b80c329
refactoring: extract L2 norm functions
mpenkov Jan 9, 2019
127a13e
add LoadFastTextFormatTest
mpenkov Jan 9, 2019
2b96550
refactoring: remove old FB I/O code
mpenkov Jan 9, 2019
25ad1ae
refactoring: FastTextKeyedVectors.init_post_load method
mpenkov Jan 9, 2019
09388ec
refactoring: simplify init_post_load method
mpenkov Jan 9, 2019
6054aa8
refactoring: simplify init_post_load method
mpenkov Jan 9, 2019
b92f435
refactoring: simplify calculate_vectors, rename to adjust_vectors
mpenkov Jan 9, 2019
422e3b1
refactoring: simplify _lockf init
mpenkov Jan 9, 2019
553c8e0
remove old tests
mpenkov Jan 9, 2019
d42e506
tox -e flake8
mpenkov Jan 9, 2019
425e942
fixup: introduce OrderedDict to _fasttext_bin.py
mpenkov Jan 9, 2019
802587a
add unicode prefixes to literals for Py2.7 compatibility
mpenkov Jan 9, 2019
ff82b71
more Py2.7 compatibility stuff
mpenkov Jan 9, 2019
dab47f3
refactoring: extract _process_fasttext_vocab function
mpenkov Jan 9, 2019
914aa95
still more Py2.7 compatibility stuff
mpenkov Jan 9, 2019
65abda9
adding additional assertion
mpenkov Jan 9, 2019
01d84d1
re-enable disabled assertions
mpenkov Jan 9, 2019
611cdb2
delete out of date comment
mpenkov Jan 9, 2019
0c959a9
Revert "re-enable disabled assertions"
mpenkov Jan 9, 2019
c196ace
more work on init_post_load function, update unit tests
mpenkov Jan 9, 2019
fb51a6a
update unit tests
mpenkov Jan 9, 2019
768a941
review response: remove FastTextVocab class, keep alias
mpenkov Jan 10, 2019
f4643bb
review response: simplify _l2_norm_inplace function
mpenkov Jan 10, 2019
d802e91
review response: add docstring
mpenkov Jan 10, 2019
92da774
review response: update docstring
mpenkov Jan 10, 2019
e638628
review response: move import
mpenkov Jan 10, 2019
6e47a88
review response: adjust skip message
mpenkov Jan 10, 2019
2cdad39
Merge remote-tracking branch 'upstream/develop' into attrs
mpenkov Jan 10, 2019
fbaf086
review response: add test_hash_native
mpenkov Jan 10, 2019
dc32126
review response: explain how model was generated
mpenkov Jan 10, 2019
39e8844
review response: explain how expected values were generated
mpenkov Jan 10, 2019
08ee7d8
review response: add test for long OOV word
mpenkov Jan 10, 2019
6d8a648
review response: remove unused comments
mpenkov Jan 10, 2019
734a0ac
review response: remove comment
mpenkov Jan 10, 2019
143445e
add test_continuation_load_gensim
mpenkov Jan 10, 2019
250d388
update model using gensim 3.6.0
mpenkov Jan 10, 2019
9fcf35e
review response: get rid of struct_unpack
mpenkov Jan 10, 2019
c1aeb85
review response: implement handling for zero bucket edge case
mpenkov Jan 10, 2019
58c1166
review response: add test_save_load
mpenkov Jan 10, 2019
e5960ed
review response: add test_save_load_native
mpenkov Jan 10, 2019
52230aa
workaround appveyor tempfile issue
mpenkov Jan 11, 2019
14c497d
fix tests
menshikh-iv Jan 11, 2019
152 changes: 152 additions & 0 deletions docs/fasttext-notes.md
@@ -0,0 +1,152 @@
FastText Notes
==============

The implementation is split across several submodules:

- models.fasttext
- models.keyedvectors (includes FastText-specific code, not good)
- models.word2vec (superclasses)
- models.base_any2vec (superclasses)

The implementation consists of several key classes:

1. models.fasttext.FastTextVocab: the vocabulary
2. models.keyedvectors.FastTextKeyedVectors: the vectors
3. models.fasttext.FastTextTrainables: the underlying neural network
4. models.fasttext.FastText: ties everything together

FastTextVocab
-------------

Seems to be an entirely redundant class.
Inherits from models.word2vec.Word2VecVocab, adding no new functionality.

FastTextKeyedVectors
--------------------

Inheritance hierarchy:

1. FastTextKeyedVectors
2. WordEmbeddingsKeyedVectors. Implements word similarity, e.g. cosine similarity, WMD, etc.
3. BaseKeyedVectors (abstract base class)
4. utils.SaveLoad

There are many attributes.

Inherited from BaseKeyedVectors:

- vectors: a 2D numpy array. Flexible number of rows (0 by default). Number of columns equals vector dimensionality.
- vocab: a dictionary. Keys are words. Values are Vocab instances: these are essentially namedtuples that contain an index and a count. The former is the index of the term in the entire vocab; the latter is the number of times the term occurs.
- vector_size (dimensionality)
- index2entity

Inherited from WordEmbeddingsKeyedVectors:

- vectors_norm
- index2word

Added by FastTextKeyedVectors:

- vectors_vocab: 2D array. Rows are vectors. Columns correspond to vector dimensions. Initialized in FastTextTrainables.init_ngrams_weights. Reset in reset_ngrams_weights. Referred to as syn0_vocab in fasttext_inner.pyx. These are vectors for every word in the vocabulary.
- vectors_vocab_norm: looks unused, see _clear_post_train method.
- vectors_ngrams: 2D array. Each row is a bucket. Columns correspond to vector dimensions. Initialized in the init_ngrams_weights function, and in the _load_vectors method when reading from the native FB binary. Modified in the reset_ngrams_weights method. This is the first matrix loaded from the native binary files.
- vectors_ngrams_norm: looks unused, see _clear_post_train method.
- buckets_word: A hashmap. Keyed by the index of a term in the vocab. Each value is an array, where each element is an integer that corresponds to a bucket. Initialized in the init_ngrams_weights function.
- hash2index: A hashmap. Keys are hashes of ngrams. Values look like compact indices into the vectors_ngrams matrix (?). Initialized in the init_ngrams_weights function.
- min_n: minimum ngram length
- max_n: maximum ngram length
- num_ngram_vectors: initialized in the init_ngrams_weights function.
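
A minimal sketch for inspecting these attributes on a trained toy model (assuming the gensim 3.x API current at the time of this PR):

```python
from gensim.models import FastText
from gensim.test.utils import common_texts  # tiny bundled corpus

# Train a toy model; `size` is the vector dimensionality in gensim 3.x.
model = FastText(common_texts, size=4, window=3, min_count=1)
wv = model.wv

print(wv.vectors.shape)         # (vocab size, vector_size)
print(wv.vectors_vocab.shape)   # one row per vocabulary word
print(wv.vectors_ngrams.shape)  # one row per ngram bucket
print(wv.min_n, wv.max_n)       # ngram length bounds
```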

The init_ngrams_weights method looks like an internal method of FastTextTrainables.
It gets called as part of the prepare_weights method, which is effectively part of the FastText constructor.

The above attributes are initialized to None in the FastTextKeyedVectors class constructor.
Unfortunately, their real initialization happens in an entirely different module, models.fasttext - another indication of poor separation of concerns.

Some questions:

- What is the x_lockf stuff? Why is it used only by the fast C implementation?
- How are vectors_vocab and vectors_ngrams different?

vectors_vocab contains vectors for the entire vocabulary.
vectors_ngrams contains vectors for each _bucket_.
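
Schematically, the two matrices combine to produce the final word vectors (a sketch of the averaging the model performs; names here are illustrative, not gensim's API):

```python
def assemble_word_vector(word_index, bucket_ids, vectors_vocab, vectors_ngrams):
    # Start from the word's own vector in vectors_vocab ...
    total = vectors_vocab[word_index].copy()
    # ... add the vector of every ngram bucket the word hashes into ...
    for bucket in bucket_ids:
        total += vectors_ngrams[bucket]
    # ... and average over all contributions (the word plus its ngrams).
    return total / (len(bucket_ids) + 1)
```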


FastTextTrainables
------------------

[Link](https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastTextTrainables)

This is a neural network that learns the vectors for the FastText embedding.
Mostly inherits from its [Word2Vec parent](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2VecTrainables).
Adds logic for calculating and maintaining ngram weights.

Key attributes:

- hashfxn: function for randomly initializing weights. Defaults to the built-in hash().
- layer1_size: The size of the inner layer of the NN. Equal to the vector dimensionality. Set in the Word2VecTrainables constructor.
- seed: The random generator seed used in reset_weights and update_weights
- syn1: The inner layer of the NN. Each row corresponds to a term in the vocabulary. Columns correspond to weights of the inner layer. There are layer1_size such weights. Set in the reset_weights and update_weights methods, only if hierarchical softmax is used.
- syn1neg: Similar to syn1, but only set if negative sampling is used.
- vectors_lockf: A one-dimensional array with one element for each term in the vocab. Set in reset_weights to an array of ones.
- vectors_vocab_lockf: Similar to vectors_lockf, ones(len(model.trainables.vectors), dtype=REAL)
- vectors_ngrams_lockf: ones((self.bucket, wv.vector_size), dtype=REAL)

The lockf stuff looks like it gets used by the fast C implementation.
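
Schematically, a lock factor scales each gradient update, so a value of 0.0 freezes a row of the matrix entirely (an illustrative sketch, not the actual Cython code):

```python
def apply_update(vectors, lockf, index, gradient, alpha):
    # lockf[index] == 1.0 trains this row normally; 0.0 locks it in place.
    vectors[index] += lockf[index] * alpha * gradient
```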

The inheritance hierarchy here is:

1. FastTextTrainables
2. Word2VecTrainables
3. utils.SaveLoad

FastText
--------

Inheritance hierarchy:

1. FastText
2. BaseWordEmbeddingsModel: vocabulary management plus a ton of deprecated attrs
3. BaseAny2VecModel: logging and training functionality
4. utils.SaveLoad: for loading and saving

Lots of attributes (many inherited from superclasses).

From BaseAny2VecModel:

- workers
- vector_size
- epochs
- callbacks
- batch_words
- kv
- vocabulary
- trainables

From BaseWordEmbeddingsModel:

- alpha
- min_alpha
- min_alpha_yet_reached
- window
- random
- hs
- negative
- ns_exponent
- cbow_mean
- compute_loss
- running_training_loss
- corpus_count
- corpus_total_words
- neg_labels

FastText attributes:

- wv: FastTextKeyedVectors. Used instead of .kv

Logging
-------

The logging seems to be inheritance-based.
It may be better to refactor this using aggregation instead of inheritance in the future.
The benefits would be leaner classes with fewer responsibilities and better separation of concerns.
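
For illustration, an aggregation-based alternative could look something like this (hypothetical names, not actual gensim code):

```python
import logging

class ProgressReporter:
    # The model *owns* a reporter rather than inheriting logging methods.
    def __init__(self, name):
        self._logger = logging.getLogger(name)

    def on_epoch_end(self, epoch, loss):
        self._logger.info("epoch %d finished, loss=%f", epoch, loss)

class SomeModel:
    def __init__(self):
        self.reporter = ProgressReporter(__name__)  # aggregation

    def _end_epoch(self, epoch, loss):
        self.reporter.on_epoch_end(epoch, loss)
```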
217 changes: 217 additions & 0 deletions gensim/models/_fasttext_bin.py
@@ -0,0 +1,217 @@
"""Load models from the native binary format released by Facebook.

Examples
--------

Load a model from a binary file:

.. sourcecode:: pycon

>>> from gensim.test.utils import datapath
    >>> from gensim.models._fasttext_bin import load
>>> with open(datapath('crime-and-punishment.bin'), 'rb') as fin:
... model = load(fin)
>>> model.nwords
291
>>> model.vectors_ngrams.shape
(391, 5)
>>> sorted(model.raw_vocab, key=lambda w: len(w), reverse=True)[:5]
['останавливаться', 'изворачиваться,', 'раздражительном', 'exceptionally', 'проскользнуть']

See Also
--------

`FB Implementation <https://github.com/facebookresearch/fastText/blob/master/src/matrix.cc>`_.

"""

import collections
import logging
import struct

import numpy as np

logger = logging.getLogger(__name__)

_FASTTEXT_FILEFORMAT_MAGIC = 793712314

_NEW_HEADER_FORMAT = [
('dim', 'i'),
('ws', 'i'),
('epoch', 'i'),
('min_count', 'i'),
('neg', 'i'),
('_', 'i'),
('loss', 'i'),
('model', 'i'),
('bucket', 'i'),
('minn', 'i'),
('maxn', 'i'),
('_', 'i'),
('t', 'd'),
]

_OLD_HEADER_FORMAT = [
('epoch', 'i'),
('min_count', 'i'),
('neg', 'i'),
('_', 'i'),
('loss', 'i'),
('model', 'i'),
('bucket', 'i'),
('minn', 'i'),
('maxn', 'i'),
('_', 'i'),
('t', 'd'),
]


def _yield_field_names():
for name, _ in _OLD_HEADER_FORMAT + _NEW_HEADER_FORMAT:
if not name.startswith('_'):
yield name
yield 'raw_vocab'
yield 'vocab_size'
yield 'nwords'
yield 'vectors_ngrams'
yield 'hidden_output'


_FIELD_NAMES = sorted(set(_yield_field_names()))
Model = collections.namedtuple('Model', _FIELD_NAMES)


def _struct_unpack(fin, fmt):
num_bytes = struct.calcsize(fmt)
return struct.unpack(fmt, fin.read(num_bytes))


def _load_vocab(fin, new_format, encoding='utf-8'):
"""Load a vocabulary from a FB binary.

Before the vocab is ready for use, call the prepare_vocab function and pass
in the relevant parameters from the model.

Parameters
----------
fin : file
An open file pointer to the binary.
    new_format : bool
True if the binary is of the newer format.
encoding : str
The encoding to use when decoding binary data into words.

Returns
-------
tuple
The loaded vocabulary. Keys are words, values are counts.
The vocabulary size.
The number of words.
"""
vocab_size, nwords, nlabels = _struct_unpack(fin, '@3i')

# Vocab stored by [Dictionary::save](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc)
if nlabels > 0:
raise NotImplementedError("Supervised fastText models are not supported")
logger.info("loading %s words for fastText model from %s", vocab_size, fin.name)

_struct_unpack(fin, '@1q') # number of tokens
if new_format:
pruneidx_size, = _struct_unpack(fin, '@q')

raw_vocab = {}
for i in range(vocab_size):
word_bytes = b''
char_byte = fin.read(1)
# Read vocab word
while char_byte != b'\x00':
word_bytes += char_byte
char_byte = fin.read(1)
word = word_bytes.decode(encoding)
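        # Each entry stores a 64-bit count followed by a one-byte entry type (discarded).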
count, _ = _struct_unpack(fin, '@qb')
raw_vocab[word] = count

if new_format:
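        # Skip the pruned-vocabulary index pairs; gensim does not use them.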
for j in range(pruneidx_size):
_struct_unpack(fin, '@2i')

return raw_vocab, vocab_size, nwords


def _load_matrix(fin, new_format=True):
"""Load a matrix from fastText native format.

Interprets the matrix dimensions and type from the file stream.

Parameters
----------
fin : file
A file handle opened for reading.
new_format : bool, optional
True if the quant_input variable precedes
the matrix declaration. Should be True for newer versions of fastText.

Returns
-------
    :class:`numpy.ndarray`
The vectors as an array.
Each vector will be a row in the array.
The number of columns of the array will correspond to the vector size.

"""
if new_format:
_struct_unpack(fin, '@?') # bool quant_input in fasttext.cc

num_vectors, dim = _struct_unpack(fin, '@2q')

float_size = struct.calcsize('@f')
if float_size == 4:
dtype = np.dtype(np.float32)
elif float_size == 8:
dtype = np.dtype(np.float64)
else:
raise ValueError("Incompatible float size: %r" % float_size)

matrix = np.fromfile(fin, dtype=dtype, count=num_vectors * dim)
matrix = matrix.reshape((num_vectors, dim))
return matrix


def load(fin, encoding='utf-8'):
"""Load a model from a binary stream.

Parameters
----------
    fin : file or str
        The readable binary stream, or a filename to open in binary mode.
    encoding : str, optional
        The encoding to use for decoding text.

Returns
-------
Model
The loaded model.

"""
if isinstance(fin, str):
fin = open(fin, 'rb')

magic, version = _struct_unpack(fin, '@2i')
new_format = magic == _FASTTEXT_FILEFORMAT_MAGIC

header_spec = _NEW_HEADER_FORMAT if new_format else _OLD_HEADER_FORMAT
model = {name: _struct_unpack(fin, fmt)[0] for (name, fmt) in header_spec}
if not new_format:
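        # The old format has no magic/version: the first two ints read are actually dim and ws.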
model.update(dim=magic, ws=version)

raw_vocab, vocab_size, nwords = _load_vocab(fin, new_format, encoding=encoding)
model.update(raw_vocab=raw_vocab, vocab_size=vocab_size, nwords=nwords)

vectors_ngrams = _load_matrix(fin, new_format=new_format)

hidden_output = _load_matrix(fin, new_format=new_format)
model.update(vectors_ngrams=vectors_ngrams, hidden_output=hidden_output)

assert fin.read() == b'', 'expected to reach EOF'

model = {k: v for k, v in model.items() if k in _FIELD_NAMES}
return Model(**model)