
Improve FastText loading times #1261

Closed
tmylk opened this issue Apr 4, 2017 · 33 comments · Fixed by #2340
Labels: difficulty medium (Medium issue: required good gensim understanding & python skills), fasttext (Issues related to the FastText model), performance (Issue related to performance (in HW meaning)), wishlist (Feature request)

@tmylk
Contributor

tmylk commented Apr 4, 2017

Consider reading just the .bin file, as suggested in #814 (comment)

Compare to C++ code in https://github.com/salestock/fastText.py/blob/77bdf69ee97a7446e314b342b129c5d46e9e4e29/fasttext/fasttext.pyx#L143

@tmylk tmylk added wishlist Feature request difficulty easy Easy issue: required small fix difficulty medium Medium issue: required good gensim understanding & python skills labels Apr 4, 2017
@prakhar2b
Contributor

@tmylk I would like to work on this

@tmylk
Contributor Author

tmylk commented May 2, 2017

@prakhar2b that would be great

@jayantj
Contributor

jayantj commented May 12, 2017

My thoughts about loading the FastText model using only the .bin file -

  1. Definitely doable, the .bin file contains all information about the model
  2. The .vec file was used mainly to be able to reuse code from KeyedVectors; switching to use only the .bin file would require some architectural changes.
  3. The next step should clearly be to benchmark the FastText loading times for gensim and fastText.py on small and large models, and then to profile the gensim method (a rough sketch follows below). If loading the .vec file turns out to be a significant bottleneck, it would make sense to load from the .bin file only.
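
For reference, a minimal benchmarking/profiling sketch along those lines (the model path is a placeholder, and load_fasttext_format is the loader discussed later in this thread; the exact import location differs across gensim versions):

# benchmark_load.py -- rough sketch, not a definitive harness
import cProfile
import pstats
import time

from gensim.models import FastText  # older releases: gensim.models.wrappers.FastText

path = 'model.bin'  # placeholder: any fastText .bin (with its .vec alongside, for the wrapper)

start = time.time()
model = FastText.load_fasttext_format(path)
print('gensim load time: %.1fs' % (time.time() - start))

# Profile the same load and list the most expensive calls.
cProfile.runctx('FastText.load_fasttext_format(path)', globals(), locals(), 'load.prof')
pstats.Stats('load.prof').sort_stats('cumulative').print_stats(15)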

@manneshiva
Contributor

I was looking to use fastText in python, and the gensim wrapper seems to be considerably slower than salestock/fasttext.py. The gensim wrapper takes around 35 seconds to load a moderately large (~200K vocab) pre-trained model (.bin: ~1 GB, .vec: ~0.2 GB), while fasttext.py takes just over a second to load the same model. Posting the cProfile stats for both loads below:
Gensim Wrapper
[cProfile screenshot: gensim wrapper load]
Fasttext.py
[cProfile screenshot: fasttext.py load]
As seen from the first figure, loading the .bin file alone (by reading it in python) would not bring the load time anywhere close to the salestock/fasttext.py load time. Achieving comparable load times would require reading in C/C++ (like salestock).

@jayantj
Contributor

jayantj commented May 25, 2017

Hi @manneshiva , thanks for the profiling report.

It could be possible to reduce the time required for loading the .bin file further by optimizing the bottlenecks (we could rewrite them in Cython if required), but we should definitely get rid of the call to load_word2vec_format.

Does the profiling report also include what exactly is slow inside the load_vectors function? That would be helpful in determining what the next steps should be.

@manneshiva
Contributor

manneshiva commented May 25, 2017

@jayantj
Here is the profiling report on load_vectors.
[cProfile screenshot: load_vectors breakdown]

The main bottleneck seems to be the ft_hash function. We could also look at ways to optimize compute_ngrams (it takes ~3.2 s compared to the ~11 s taken by ft_hash). Would getting rid of load_word2vec_format and Cythonizing ft_hash and compute_ngrams be sufficient? I am interested in working on this issue.

@tmylk
Contributor Author

tmylk commented May 29, 2017

@manneshiva Your help would be much appreciated here, as it's a much-needed feature. Cython rather than C/C++ is preferred. Happy to mentor it via our incubator programme. CC @menshikh-iv

@gojomo
Collaborator

gojomo commented May 29, 2017

Optimizations of init_ngrams & compute_ngrams & ft_hash short of cythonization might also yield a big speedup. (Avoid the seterr toggling every ngram; avoid reconstructing the constants; maybe finding overflow-oblivious approach; maybe inlining ft_hash; restructuring the loops to avoid the += appends and set(); maybe even reusing available prefix ngram hashes; etc.)

(Separately: the comment for init_ngram() mentioning vectors being discarded to 'save space', and the building of the ngrams-to-indexes dict, seems a little fishy to me. Does the native fasttext code do that too? I would think the usual case is that very few of the hashed-index vectors are completely unused – or else the chosen bucket was too large – and the space overhead of the mapping dict could be significant.)
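
For illustration, a minimal sketch of the "overflow-oblivious"/no-seterr idea from the first paragraph above, in plain Python 3 (the FNV-1a constants are the ones fastText uses and also appear in the Cython snippets later in this thread; this is a sketch, not the gensim implementation):

def ft_hash_masked(s):
    # FNV-1a over the UTF-8 bytes, using plain ints and a 32-bit mask instead of
    # numpy uint32 arithmetic with seterr toggling.
    h = 2166136261
    for b in s.encode('utf-8'):
        b = b - 256 if b >= 128 else b      # reinterpret the byte as a signed int8, as fastText does
        h ^= b & 0xFFFFFFFF                 # XOR with the sign-extended byte, wrapped to 32 bits
        h = (h * 16777619) & 0xFFFFFFFF     # FNV prime multiply, truncated to 32 bits
    return h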

@jayantj
Contributor

jayantj commented May 31, 2017

@gojomo Those ideas sound good to me.

About the unused vectors being discarded - the native FastText code doesn't do that. The default value for bucket size is quite large (2 mil) and according to the original comment from the PR for the fastText wrapper - #847 (comment) - 1.6mil of them are unused even with a mid-sized corpus (text8).

Some more detailed memory profiling could be useful here - to determine the exact gains from dropping the unused vectors, and the space overhead of the mapping dict.

@gojomo
Collaborator

gojomo commented Jun 1, 2017

IMO text8 is a tiny corpus - it'd be most relevant to see the bucket-load in actual vector sets, like those FB pretrained & released.

Also, even though when we do local training we may know for sure that an ngram's vector slot has never been trained, it's theoretically plausible that in a file being read, even slots for ngrams that never appear in the declared vocabulary might have been trained. (Maybe the vocab-for-distribution was trimmed after training on a larger set. Not sure any implementation does this, but if modeling out-of-vocab words was my project priority, I'd consider such a strategy, so that even very obscure words/ngrams get at least a little coverage. An interesting thing to check could be: in FB's pretrained vectors, do any slots which seem to have no ngrams mapped to them ever exceed the magnitude of only-ever-randomly-initialized vectors?)

Definitely agree best course is to evaluate the optimization with actual memory profiling/sizing-analysis.
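
A possible shape for that experiment, as a sketch (everything is passed in explicitly because attribute names differ between versions; it assumes access to the untrimmed, bucket-sized ngram matrix read from the .bin, plus the compute_ngrams/ft_hash helpers discussed in this thread):

import numpy as np

def bucket_norms(ngram_matrix, vocab, bucket, min_n, max_n, compute_ngrams, ft_hash):
    # Split the bucket rows into "mapped" (some vocab ngram hashes there) and
    # "unmapped", and return the vector norms of each group.
    used = set()
    for word in vocab:
        for ngram in compute_ngrams(word, min_n, max_n):
            used.add(ft_hash(ngram) % bucket)
    norms = np.linalg.norm(ngram_matrix, axis=1)
    mapped = np.zeros(bucket, dtype=bool)
    mapped[list(used)] = True
    return norms[mapped], norms[~mapped]

# If the unmapped norms clearly exceed what random initialization alone would
# produce, then some "unused" slots were probably trained after all.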

@jayantj
Contributor

jayantj commented Jun 2, 2017

I agree, that would be an interesting experiment to do (checking if vectors with no ngrams mapped are within the max-range of randomly initialized vectors).

This discussion about keeping/discarding vectors also directly affects the work on improving loading times - in case the impact (memory-wise) of retaining all vectors is not significant, and we decide to keep all vectors, the loading time automatically improves tremendously, since the majority of time spent right now is in computing hashes for all ngrams at load time. Any optimization with ft_hash/compute_ngrams wouldn't be as worthwhile if we do decide to keep all vectors.

In that case, I'd recommend that the next steps be memory profiling for different vocab/bucket sizes to determine whether keeping all vectors is a good idea.

@manneshiva
Contributor

@gojomo @jayantj
The table below summarizes the memory profiling results on the pre-trained fastText English vectors with the default bucket size (2 million), Text9 (with the default bucket size) and Text9 (retrained with bucket size 1 million):
[memory profiling results screenshot]

We have 2 options to solve this issue:

  1. Loading from .bin alone and trimming ngrams:
    • memory profiling shows significantly lower memory consumption for the default/large bucket size
    • would require optimization of the relevant functions
  2. Loading all ngrams:
    • would considerably improve the load time without any optimization
    • only makes sense when the user is allowed to set the bucket size when calling train (no such option in the present code)
    • no way for the user to decide the ideal bucket size
    • a very small bucket size could lead to hash collisions, which could affect the quality of vectors for OOV words (not sure about this point)

It looks like the second option is better provided the concerns mentioned aren't too big a problem. Would like to hear your thoughts on this.

@jayantj
Contributor

jayantj commented Jun 9, 2017

@manneshiva Thanks for the analysis. How exactly has the "Memory saved by only selecting ngrams of vocab words" been calculated?
I'm curious because the difference in the number of unused vectors for wiki.simple and Text9 (with bucket size 2mil) seems low (1.4mil vs 1.16mil), but the difference in memory saved is very high (1100 MB vs 440 MB).

Maybe initially we could load all ngrams and not trim any of them (this will speed up load time), and add an optional method to drop the unused ngrams for someone who is looking to save memory (somewhat similar to init_sims(replace=True) in word2vec).

@gojomo what do you think?

@manneshiva
Contributor

@jayantj

How exactly has the "Memory saved by only selecting ngrams of vocab words" been calculated?

I looked at the memory taken by self.wv.syn0_all before and after reassigning it with self.wv.syn0_all.take(ngram_indices, axis=0).

I'm curious because the difference in the number of unused vectors for wiki.simple and Text9 (with bucket size 2mil) seems low (1.4mil vs 1.16mil), but the difference in memory saved is very high (1100 MB vs 440 MB).

This is because of the difference in embedding size. The word vectors are 300-dimensional in the case of wiki.simple, while they are 100-dimensional for Text9 (I should have mentioned this, apologies for the confusion).
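
As a rough sanity check, the reported savings line up with what float32 matrix sizes predict (approximate Text9 numbers from above):

unused_vectors, dim = 1160000, 100         # ~1.16M unused ngram buckets, 100-dimensional vectors
print(unused_vectors * dim * 4 / 2.0**20)  # float32 = 4 bytes per value -> ~442 MB, close to the ~440 MB reported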

@gojomo
Collaborator

gojomo commented Jun 12, 2017

Text9 is still pretty small, and 100-dimensions (despite being the default) may not be a representative size for serious users. The effects on the pre-trained FB FastText vectors (of Wikipedia in many languages) may be more representative of what real users will experience.

(Are you sure the memory accounting is counting the cost of the dictionary mapping ngrams to remembered, rather than hash-discovered, slots? It's probably only a few tens-of-MB but not sure where it'd appear in the current analysis.)

I'm not sure saving all ngrams will slow loading time - doesn't the load code right now do more to precalculate the ngrams & do extra discarding?

It wouldn't be too hard to add bucket-configurability, or logging/reporting of an (approximate) count of unique ngrams, to help choose optimal bucket size. But also, these savings don't yet seem so big, for serious users.

@menshikh-iv menshikh-iv added performance Issue related to performance (in HW meaning) and removed difficulty easy Easy issue: required small fix labels Oct 2, 2017
@saroufimc1

saroufimc1 commented Dec 11, 2017

Hi all,
Can someone please summarize the current state of this? I posted an issue with regard to the bucket in this code, and I think it is directly related to your discussion here about optimizing the code.

#1779

@jayantj
Contributor

jayantj commented Dec 12, 2017

The tradeoff here is speed v/s memory.

For relatively small vocab sizes (~200k), the steady-state memory usage is 1.1 GB lower than it would be if we chose to keep all ngram vectors as is. (for 300-d vectors). This is at the cost of significantly increased loading time.

Conversely, for large vocab sizes (like for Wikipedia models), we don't reduce memory usage, while also causing much higher load times. (as @gojomo rightly pointed out)

In case the common use case is indeed loading large models, it might make sense to store ngram vectors as is, without trying to discard any unused ones.

@piskvorky @menshikh-iv @manneshiva what do you think?

@manneshiva
Contributor

I feel we should give the user an option -- discard_unused_ngrams -- to save memory, which by default could be False. Since the memory saved for small-vocab models is significant (owing to the smaller number of used ngrams), this should be helpful for a user trying to load a small-vocab model with limited RAM.

The only concern here, as pointed out by @gojomo in this comment: can we confirm that the ngrams of the final vocab are the only ngrams that were trained? We shouldn't be discarding trained ngrams that do not appear in the ngrams of in-vocab words. It is highly unlikely that this is the case, though. @jayantj comments?
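
A rough sketch of what such an opt-in discard_unused_ngrams could look like (the names here are illustrative; the core is the same .take(ngram_indices, axis=0) trimming described earlier in this thread):

def discard_unused_ngrams(wv, compute_ngrams, ft_hash):
    # Keep only the bucket rows actually referenced by ngrams of in-vocab words,
    # analogous in spirit to init_sims(replace=True).
    used = set()
    for word in wv.vocab:
        for ngram in compute_ngrams(word, wv.min_n, wv.max_n):
            used.add(ft_hash(ngram) % wv.bucket)
    ngram_indices = sorted(used)
    wv.hash2index = {old: new for new, old in enumerate(ngram_indices)}  # remap old bucket ids
    wv.syn0_all = wv.syn0_all.take(ngram_indices, axis=0)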

@DomHudson

Hi,
I'm just taking a look at this myself also, just posting some metrics here.
Loading a 1.2 GB .bin file using the python wrapper in the original repository takes 2.078 seconds, compared with gensim's load_fasttext_format taking 68.171 seconds.

The vast majority of time is in init_ngrams_post_load.

         64526433 function calls (64513338 primitive calls) in 68.171 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   45.658   45.658   62.476   62.476 fasttext.py:857(init_ngrams_post_load)
 16220806    7.847    0.000    7.847    0.000 {built-in method numpy.core.multiarray.array}
 31281478    4.118    0.000    4.118    0.000 {built-in method gensim.models._utils_any2vec.ft_hash}
  1159878    3.742    0.000    3.742    0.000 {built-in method gensim.models._utils_any2vec.compute_ngrams}
        1    2.429    2.429    4.705    4.705 fasttext.py:618(_load_dict)
  6100747    0.892    0.000    0.892    0.000 {method 'read' of '_io.BufferedReader' objects}
        1    0.588    0.588    0.588    0.588 {method 'take' of 'numpy.ndarray' objects}
   579939    0.494    0.000    0.616    0.000 keyedvectors.py:97(__init__)
        1    0.411    0.411    0.411    0.411 {built-in method numpy.core.multiarray.fromfile}
3713959/3713525    0.401    0.000    0.401    0.000 {built-in method builtins.len}

(Gensim Version: 3.4.0)

@menshikh-iv
Contributor

menshikh-iv commented Jul 30, 2018

Thanks @DomHudson, I checked myself and confirm that init_ngrams_post_load takes most of the time (~90%).
Note: the original python wrapper re-uses the C++ function for loading: https://github.com/facebookresearch/fastText/blob/master/python/fastText/pybind/fasttext_pybind.cc#L142

Profiling:

# content of t.py
from gensim.models import FastText
import logging


logging.basicConfig(level=logging.INFO)


m = FastText.load_fasttext_format("wiki.en.bin")

run it as python -m cProfile -o p.prof t.py

The result is attached; init_ngrams_post_load definitely consumes most of the time, and this should be improved.
[cProfile screenshot]

p.zip
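
For anyone inspecting the attached profile, the standard-library pstats module prints the hot spots from the saved file:

import pstats

stats = pstats.Stats('p.prof')  # the file produced by `python -m cProfile -o p.prof t.py`
stats.strip_dirs().sort_stats('cumulative').print_stats(15)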

@menshikh-iv
Contributor

menshikh-iv commented Jan 18, 2019

New measurement

[cProfile screenshot]

Raw cProfile output (zip)

So, the slowest method is adjust_vectors (because of _ft_hash):
https://github.com/RaRe-Technologies/gensim/blob/a864e0247dda421d8e4de280fa91f86f474a5691/gensim/models/keyedvectors.py#L2230-L2249

Notes:

  • I guess the reason is the ".encode" call in the hash function for each ngram (we can't see that in the profile because of Cython). More details:
    gensim.models._utils_any2vec.ft_hash
    
    n_calls: 15474957
    cumtime: 645.1 sec
    percall: 4.169e-05 sec
    
    We need to speed it up at least 10x to get a 1-minute load time; this will be hard, I guess.
  • I think we need a special flag (full_model=False) that controls how many matrices we load: 1 or 2. Reason: FB models are really large, and if the user doesn't want to continue training there is no reason to load the full FB model.
  • Get rid of hash2index completely; related: Roll back optimization that trims empty buckets #2329
  • Re-think how we work with vectors at all: do we really need things like model.wv.vocab that duplicate memory and always need to be maintained? If it's possible to use references instead of copies, that will be the simplest and best solution.

@horpto
Contributor

horpto commented Jan 18, 2019

@menshikh-iv why not make _compute_ngrams generate a list of bytes? Then .encode is called only once, in _compute_ngrams. There is one big drawback: _compute_ngrams would have to iterate over unicode characters but generate bytes.

@menshikh-iv
Contributor

@horpto yes, I agree, this is one of the ideas, though it needs proper testing.

@menshikh-iv
Contributor

Optimization attempt №1. Idea (thanks @mpenkov): use caching (an LRU cache in our case).

Apply caching (a very dirty patch, I know, but it shows the main idea):

ivan@h3:~/ft/gensim$ git diff
diff --git a/gensim/models/keyedvectors.py b/gensim/models/keyedvectors.py
index d9dad1cc..ad10f542 100644
--- a/gensim/models/keyedvectors.py
+++ b/gensim/models/keyedvectors.py
@@ -2237,7 +2237,7 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
         if self.bucket == 0:
             return
 
-        hash_fn = _ft_hash if self.compatible_hash else _ft_hash_broken
+        hash_fn = lru_cache(maxsize=100000)(_ft_hash) if self.compatible_hash else _ft_hash_broken
 
         for w, v in self.vocab.items():
             word_vec = np.copy(self.vectors_vocab[v.index])
@@ -2247,8 +2247,9 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
                 word_vec += self.vectors_ngrams[ngram_index]
             word_vec /= len(ngrams) + 1
             self.vectors[v.index] = word_vec
+        print("@@@", hash_fn.cache_info())
 
-
+from functools import lru_cache
 def _process_fasttext_vocab(iterable, min_n, max_n, num_buckets, hash_fn, hash2index):

Result

Loading time for cc.ru.300.bin with the LRU patch (shown above):

| Cache type | Time | Speed | Hits | Misses | Hits % |
| --- | --- | --- | --- | --- | --- |
| No cache (current) | 12m17s | 1x | - | - | - |
| LRU (100k) | 6m14s | 1.97x | 8224568 | 7250389 | 53% |
| LRU (200k) | 5m2s | 2.44x | 9914071 | 5560886 | 64% |
| LRU (500k) | 3m36s | 3.41x | 11787735 | 3687222 | 76% |

@mpenkov
Collaborator

mpenkov commented Jan 19, 2019

@horpto @menshikh-iv

why not make _compute_ngrams generate a list of bytes? Then .encode is called only once, in _compute_ngrams. There is one big drawback: _compute_ngrams would have to iterate over unicode characters but generate bytes.

I think that it's an interesting idea, but there's a problem: there's no guarantee that the length of the string before and after encoding will be the same, because a single character can be encoded to multiple bytes. This means that if we implement the idea, we may get different ngrams.
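
A quick illustration of that length mismatch (character count vs UTF-8 byte count, so character-based and byte-based ngrams can differ):

word = u"журнал"                    # 6 characters
print(len(word))                    # 6 code points
print(len(word.encode("utf-8")))    # 12 bytes: each Cyrillic letter is 2 bytes in UTF-8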

At this stage, the thing to do is go read the FB implementation and documentation. We need to figure out how they're dealing with this and decide what to do based on compatibility.

I have a similar proposal based on @horpto's idea: why don't we store the vocabulary as bytes instead of strings? That would save us the cost of encoding everything each time we compute ngrams. Again, this depends on whether we can do this in a way that is compatible with FB, but I thought it's worth documenting this idea here.

@horpto
Contributor

horpto commented Jan 19, 2019

@mpenkov My idea is actually the same (I realized it later) as the fb/fastText implementation of _compute_ngrams (see https://github.com/facebookresearch/fastText/blob/7842495a4d64c7a3bb4339d45d6e64321d002ed8/src/dictionary.cc#L172). That's first. Second, I didn't express myself clearly about unicode strings, but I did write about this big drawback. It would make the code more complex (but maybe we can get access to the inner str buffer?). Third, according to the code, everywhere we call _compute_ngrams and then hash(ngram) for all ngrams, the ngrams themselves are not used. So maybe make a _compute_ngram_hashes?

@menshikh-iv
Contributor

@horpto @mpenkov I made a small benchmark (but I feel that I'm wrong somewhere):

from six import PY2
import numpy as np
cimport numpy as np


cdef _byte_to_int_py3(b):
    return b

cdef _byte_to_int_py2(b):
    return ord(b)

_byte_to_int = _byte_to_int_py2 if PY2 else _byte_to_int_py3

cpdef ft_hash_current(unicode string):
    cdef unsigned int h = 2166136261
    for c in string.encode("utf-8"):
        h = np.uint32(h ^ np.uint32(np.int8(_byte_to_int(c))))
        h = np.uint32(h * np.uint32(16777619))
    return h

cpdef ft_hash_new(bytes string):
    cdef unsigned int h = 2166136261
    for c in string:  # no more encode here, bytes as input
        h = np.uint32(h ^ np.uint32(np.int8(_byte_to_int(c))))
        h = np.uint32(h * np.uint32(16777619))
    return h

cpdef compute_ngrams_current(word, unsigned int min_n, unsigned int max_n):
    cdef unicode extended_word = f'<{word}>'
    ngrams = []
    for ngram_length in range(min_n, min(len(extended_word), max_n) + 1):
        for i in range(0, len(extended_word) - ngram_length + 1):
            ngrams.append(extended_word[i:i + ngram_length])
    return ngrams


cpdef compute_ngrams_new(word, unsigned int min_n, unsigned int max_n):
    cdef unicode extended_word = f'<{word}>'
    ngrams = []
    for ngram_length in range(min_n, min(len(extended_word), max_n) + 1):
        for i in range(0, len(extended_word) - ngram_length + 1):
            ngrams.append(extended_word[i:i + ngram_length])
    return (" ".join(ngrams)).encode("utf-8").split()  # make encode here for all ngrams in one moment

We have 2 pairs of functions:

  • compute_ngrams_current + ft_hash_current (current master)
  • compute_ngrams_new + ft_hash_new - the idea with a single encoding per word (not per ngram)

The benchmark is really simple:

import gensim.downloader as api
import itertools as it

words = tuple(it.chain.from_iterable(api.load("text8")))
assert len(words) == 17005207  # long enough

words = words[:100000]


def benchmark(words, ngram_func, hash_func):
    for w in words:
        arr = [hash_func(ngram) % 10000 for ngram in ngram_func(w, 3, 6)]

And the result is:

%time benchmark(words, ngram_func=compute_ngrams_current, hash_func=ft_hash_current)
"""
CPU times: user 36.3 s, sys: 417 ms, total: 36.7 s
Wall time: 34.5 s
"""

%time benchmark(words, ngram_func=compute_ngrams_new, hash_func=ft_hash_new)
"""
CPU times: user 37.6 s, sys: 405 ms, total: 38 s
Wall time: 35.6 s
"""

The new variant is even a bit slower than the current one (I guess because of the join + split), but I'm surprised by the result (I still think I'm doing something wrong =/).

@menshikh-iv
Contributor

menshikh-iv commented Jan 19, 2019

In addition, I ported a piece of the FB implementation of byte-based ngram generation:

cpdef compute_ngrams_awesome(word, unsigned int min_n, unsigned int max_n):
    cdef bytes _word = f'<{word}>'.encode("utf-8")
    cdef int _wlen = len(_word)
    cdef int j, i, n
    
    ngrams = []
    for i in range(_wlen): 
        ngram = []

        if _word[i] & 0xC0 == 0x80:  # a UTF-8 continuation byte, not the first byte of a character
            continue
        
        j, n = i, 1
        while j < _wlen and n <= max_n:
            ngram.append(_word[j])
            j += 1
            while j < _wlen and (_word[j] & 0xC0) == 0x80:
                ngram.append(_word[j])
                j += 1
    
            if n >= min_n and not (n == 1 and (i == 0 or j == _wlen)):
                ngrams.append(bytes(ngram))
            
            n += 1
        
    return ngrams

But this still does not help much, hmm..

@DomHudson

Apologies if this has been addressed, but is there any reason we can't just use some of the original C++ with suitable pybind11 bindings to access the loaded data? Is the assumption that the overhead of moving the data from C++ to Python would offset any C++ speed improvements?

@menshikh-iv
Contributor

menshikh-iv commented Jan 19, 2019

@DomHudson

is there any reason we can't just use some of the original C++ with suitable pybind11 bindings to access the loaded data?

We can't, because:

  • we have our own implementation and need to fill it from the FB binary format; we don't want to include the FB C++ implementation in the python package, especially just for this operation
  • we don't want to use pybind11

@menshikh-iv
Contributor

@horpto @mpenkov according to my experiment (computing byte-based character ngrams + no encoding on the hash function side), there is no difference from the current implementation (same 12m17s); it looks like encoding wasn't the issue here. Any other ideas, guys?

@menshikh-iv
Contributor

menshikh-iv commented Jan 19, 2019

I guess I solved this issue: I rewrote the hash function

from libc.stdint cimport uint32_t, int8_t
from libc.string cimport strcpy, strlen
from libc.stdlib cimport malloc, free

cdef char* c_call_returning_a_c_string(bytes string):
    cdef char* c_string = <char *> malloc((len(string) + 1) * sizeof(char))
    if not c_string:
        raise MemoryError()
    strcpy(c_string, string)
    return c_string


cpdef ft_hash_ff(bytes string):
    cdef uint32_t h = 2166136261
    cdef char* ss = c_call_returning_a_c_string(string)

    for i in range(strlen(ss)):
        h = h ^ <uint32_t>(<int8_t>ss[i])  # attention - I dropped the py2 'ord' here, not sure about correctness
        h = h * 16777619
    free(ss)  # release the malloc'ed copy
    return h

Now loading takes 1m1s instead of 12m17s

@mpenkov please

@menshikh-iv
Contributor

menshikh-iv commented Jan 20, 2019

upd: @horpto simplified my code version (getting rid of the malloc & copying); it looks simpler and better:

cpdef ft_hash_ff2(bytes string):
    cdef uint32_t h = 2166136261
    cdef char ch

    for ch in string:
        h = h ^ <uint32_t>(<int8_t>ch)  # attention - I dropped the py2 'ord' here, not sure about correctness
        h = h * 16777619
    return h

CC @mpenkov
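
One way to settle the "dropped ord()" concern above is to cross-check the new bytes-based hash against the current unicode-based one on ngrams containing multi-byte UTF-8 characters (these are what produce negative int8_t values). A sketch, assuming both Cython functions quoted above are compiled into an importable module:

# Sanity check: both hashes should agree on every ngram, including non-ASCII ones.
for ngram in (u"<he", u"llo>", u"<жу", u"рна", u"ал>"):
    assert ft_hash_ff2(ngram.encode("utf-8")) == ft_hash_current(ngram), ngram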
