Improve FastText loading times #1261
@tmylk I would like to work on this.
@prakhar2b that would be great.
My thoughts about loading the FastText model using only the .bin file:
Hi @manneshiva, thanks for the profiling report. It may be possible to reduce the time required for loading the .bin file further by optimizing the bottlenecks (we could rewrite them in Cython if required), but we should definitely get rid of the call to … Does the profiling report also include what exactly is slow inside the …?
@jayantj The main bottleneck seems to be the …
@manneshiva Your help would be much appreciated here, as this is a much-needed feature. Cython rather than C/C++ is preferred. Happy to mentor it via our incubator programme. CC @menshikh-iv
Optimizations of … (Separately: the comment for …)
@gojomo Those ideas sound good to me. About the unused vectors being discarded: the native FastText code doesn't do that. The default bucket size is quite large (2 million), and according to the original comment from the PR for the fastText wrapper - #847 (comment) - 1.6 million of them are unused even with a mid-sized corpus (text8). Some more detailed memory profiling could be useful here, to determine the exact gains from dropping the unused vectors and the space overhead of the mapping dict.
IMO text8 is a tiny corpus - it'd be most relevant to see the bucket load in actual vector sets, like those FB pretrained & released. Also, even though when we do local training we may know for sure that an ngram's vector slot has never been trained, it's theoretically plausible that in a file being read, even slots for ngrams that never appear in the declared vocabulary might have been trained. (Maybe the vocab-for-distribution was trimmed after training on a larger set. I'm not sure any implementation does this, but if modeling out-of-vocab words were my project priority, I'd consider such a strategy, so that even very obscure words/ngrams get at least a little coverage. An interesting thing to check: in FB's pretrained vectors, do any slots which seem to have no ngrams mapped to them ever exceed the magnitude of only-ever-randomly-initialized vectors?) Definitely agree the best course is to evaluate the optimization with actual memory profiling/sizing analysis.
I agree, that would be an interesting experiment to do (checking whether vectors with no ngrams mapped to them stay within the max range of randomly initialized vectors). This discussion about keeping/discarding vectors also directly affects the work on improving loading times: if the memory impact of retaining all vectors is not significant and we decide to keep all vectors, the loading time automatically improves tremendously, since the majority of the time right now is spent computing hashes for all ngrams at load time. Any optimization with … In that case, I'd recommend that the next step be memory profiling for different vocab/bucket sizes to determine whether keeping all vectors is a good idea.
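A minimal sketch of that experiment, assuming only a numpy matrix of ngram vectors and a set of bucket indices known to be referenced by vocabulary ngrams (how that set is computed depends on the hash implementation, so it is taken as an input here); it compares the vector magnitudes of referenced vs. never-referenced buckets:

import numpy as np

def compare_bucket_norms(vectors_ngrams, used_buckets):
    # Compare L2 norms of buckets referenced by vocab ngrams against buckets
    # nothing seems to map to; purely illustrative, not gensim API.
    used = np.array(sorted(used_buckets), dtype=int)
    unused = np.setdiff1d(np.arange(vectors_ngrams.shape[0]), used)
    norms = np.linalg.norm(vectors_ngrams, axis=1)
    print("referenced buckets:   mean=%.4f max=%.4f" % (norms[used].mean(), norms[used].max()))
    print("unreferenced buckets: mean=%.4f max=%.4f" % (norms[unused].mean(), norms[unused].max()))

# Toy demo with random data standing in for a real model's ngram matrix.
rng = np.random.default_rng(0)
demo_matrix = rng.uniform(-0.5, 0.5, size=(1000, 100)).astype(np.float32)
compare_bucket_norms(demo_matrix, used_buckets=set(range(0, 1000, 3)))

If the unreferenced buckets in FB's pretrained files show norms clearly above the random-initialization range, that would suggest they were trained after all.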
@gojomo @jayantj We have 2 options to solve this issue:
It looks like the second option is better, provided the concerns mentioned aren't too big a problem. I'd like to hear your thoughts on this.
@manneshiva Thanks for the analysis. How exactly has the "Memory saved by only selecting ngrams of vocab words" been calculated? Maybe initially we could load all ngrams and not trim any of them (which will speed up load time), and add an optional method to drop the unused ngrams for someone who is looking to save memory (somewhat similarly to …). @gojomo what do you think?
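A rough sketch of what such an optional trimming step could look like (the names, the character-ngram helper and the builtin hash are illustrative stand-ins, not gensim's actual API): keep only the bucket rows that vocab ngrams hash to, and remember the old-bucket-to-new-row mapping in a dict, which is exactly the space overhead mentioned earlier:

import numpy as np

def char_ngrams(word, min_n=3, max_n=6):
    w = '<%s>' % word
    return [w[i:i + n] for n in range(min_n, max_n + 1) for i in range(len(w) - n + 1)]

def trim_unused_ngrams(vectors_ngrams, vocab, num_buckets, hash_fn=hash):
    # Drop ngram bucket rows never referenced by any vocab word; return the
    # compact matrix plus a dict mapping old bucket index -> new row index.
    used = sorted({hash_fn(g) % num_buckets for w in vocab for g in char_ngrams(w)})
    bucket_to_row = {b: i for i, b in enumerate(used)}
    return vectors_ngrams[used], bucket_to_row

# Toy demo: of 1000 buckets, only those hit by a 3-word vocab survive.
matrix = np.zeros((1000, 100), dtype=np.float32)
compact, mapping = trim_unused_ngrams(matrix, ["where", "there", "hereby"], num_buckets=1000)
print(compact.shape, len(mapping))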
I looked at the memory taken by …
This is because of the difference in embedding size. The words are 300-dimensional in the case of …
Text9 is still pretty small, and 100 dimensions (despite being the default) may not be a representative size for serious users. The effects on the pre-trained FB FastText vectors (of Wikipedia in many languages) may be more representative of what real users will experience. (Are you sure the memory accounting counts the cost of the dictionary mapping ngrams to remembered, rather than hash-discovered, slots? It's probably only a few tens of MB, but I'm not sure where it would appear in the current analysis.) I'm not sure saving all ngrams will slow loading time - doesn't the load code right now do more work to precalculate the ngrams and do extra discarding? It wouldn't be too hard to add bucket configurability, or logging/reporting of an (approximate) count of unique ngrams, to help choose an optimal bucket size. But also, these savings don't yet seem so big for serious users.
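As a hedged illustration of the kind of reporting suggested above (illustrative names only; the builtin hash stands in for the FastText hash):

def report_bucket_load(vocab, num_buckets=2000000, min_n=3, max_n=6):
    # Count distinct character ngrams in a vocab and how many buckets they occupy.
    ngrams = set()
    for word in vocab:
        w = '<%s>' % word
        ngrams.update(w[i:i + n] for n in range(min_n, max_n + 1) for i in range(len(w) - n + 1))
    used = {hash(g) % num_buckets for g in ngrams}
    print("%d unique ngrams occupy %d of %d buckets" % (len(ngrams), len(used), num_buckets))

report_bucket_load(["where", "there", "hereby"], num_buckets=1000)

Run over a real model's vocabulary, this would show how far below (or above) the default 2M buckets the actual ngram count sits.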
Hi all, |
The tradeoff here is speed vs. memory. For relatively small vocab sizes (~200k), the steady-state memory usage is 1.1 GB lower than it would be if we chose to keep all ngram vectors as is (for 300-d vectors). This comes at the cost of significantly increased loading time. Conversely, for large vocab sizes (like the Wikipedia models), we don't reduce memory usage, while still causing much higher load times (as @gojomo rightly pointed out). In case the common use case is indeed loading large models, it might make sense to store ngram vectors as is, without trying to discard any unused ones. @piskvorky @menshikh-iv @manneshiva what do you think?
I feel we should give the user an option -- … The only concern here, as pointed out by @gojomo in this comment: can we confirm that …
Hi, the vast majority of the time is in …
(Gensim version: 3.4.0)
Thanks @DomHudson, I checked it myself and confirm the long loading time. Profiling:
# content of t.py
from gensim.models import FastText
import logging
logging.basicConfig(level=logging.INFO)
m = FastText.load_fasttext_format("wiki.en.bin")
Run it as …; the result is in the attachment.
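As a hedged illustration (not necessarily the command used above), the same load can be profiled with the standard-library profiler:

import cProfile
import pstats

from gensim.models import FastText

# Profile the load and print the 20 most expensive calls by cumulative time.
cProfile.run('FastText.load_fasttext_format("wiki.en.bin")', "load_profile.stats")
pstats.Stats("load_profile.stats").sort_stats("cumulative").print_stats(20)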
New measurement
So, the slowest method is … Notes: …
@menshikh-iv why not make …
@horpto yes, I agree, this is one of the ideas, though it needs to be tested properly.
Optimization attempt №1. Idea (thanks @mpenkov): use caching (LRU in our case). Apply caching (very dirty patch, I know, but it shows the main idea):

ivan@h3:~/ft/gensim$ git diff
diff --git a/gensim/models/keyedvectors.py b/gensim/models/keyedvectors.py
index d9dad1cc..ad10f542 100644
--- a/gensim/models/keyedvectors.py
+++ b/gensim/models/keyedvectors.py
@@ -2237,7 +2237,7 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
if self.bucket == 0:
return
- hash_fn = _ft_hash if self.compatible_hash else _ft_hash_broken
+ hash_fn = lru_cache(maxsize=100000)(_ft_hash) if self.compatible_hash else _ft_hash_broken
for w, v in self.vocab.items():
word_vec = np.copy(self.vectors_vocab[v.index])
@@ -2247,8 +2247,9 @@ class FastTextKeyedVectors(WordEmbeddingsKeyedVectors):
word_vec += self.vectors_ngrams[ngram_index]
word_vec /= len(ngrams) + 1
self.vectors[v.index] = word_vec
+ print("@@@", hash_fn.cache_info())
-
+from functools import lru_cache
def _process_fasttext_vocab(iterable, min_n, max_n, num_buckets, hash_fn, hash2index):

Result: the loading time for …
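A minimal self-contained sketch of the same caching idea (the builtin hash stands in for the real ngram hash, and the numbers below come from this toy run, not from the model load):

from functools import lru_cache

@lru_cache(maxsize=100000)
def cached_hash(ngram):
    return hash(ngram)  # stand-in for _ft_hash

for ngram in ["<wh", "whe", "<wh", "her", "<wh"]:
    cached_hash(ngram)
print(cached_hash.cache_info())  # CacheInfo(hits=2, misses=3, maxsize=100000, currsize=3)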
I think it's an interesting idea, but there's a problem: there's no guarantee that the length of the string before and after encoding will be the same, because a single character can be encoded to multiple bytes. This means that by implementing the idea, we may get different ngrams. At this stage, the thing to do is to go read the FB implementation and documentation. We need to figure out how they're dealing with this and decide what to do based on compatibility. I have a similar proposal based on @horpto's idea: why don't we store the vocabulary as bytes instead of strings? That would save us the cost of encoding everything each time we compute ngrams. Again, this depends on whether we can do this in a way that is compatible with FB, but I thought it's worth documenting the idea here.
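A quick illustration of that concern: for non-ASCII text, character count and UTF-8 byte count differ, so character-level and byte-level ngrams disagree:

word = "naïve"
encoded = word.encode("utf-8")
print(len(word), len(encoded))   # 5 characters vs 6 bytes ("ï" needs two bytes)
print(word[1:4], encoded[1:4])   # 'aïv' vs b'a\xc3\xaf' - not the same ngram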
@mpenkov My idea is actually the same (I understood it later) as the fb/fasttext implementation of …
@horpto @mpenkov I made a small benchmark (but I feel that I'm wrong somewhere):

from six import PY2
import numpy as np
cimport numpy as np


cdef _byte_to_int_py3(b):
    return b


cdef _byte_to_int_py2(b):
    return ord(b)


_byte_to_int = _byte_to_int_py2 if PY2 else _byte_to_int_py3


cpdef ft_hash_current(unicode string):
    cdef unsigned int h = 2166136261
    for c in string.encode("utf-8"):
        h = np.uint32(h ^ np.uint32(np.int8(_byte_to_int(c))))
        h = np.uint32(h * np.uint32(16777619))
    return h


cpdef ft_hash_new(bytes string):
    cdef unsigned int h = 2166136261
    for c in string:  # no more encode here, bytes as input
        h = np.uint32(h ^ np.uint32(np.int8(_byte_to_int(c))))
        h = np.uint32(h * np.uint32(16777619))
    return h


cpdef compute_ngrams_current(word, unsigned int min_n, unsigned int max_n):
    cdef unicode extended_word = f'<{word}>'
    ngrams = []
    for ngram_length in range(min_n, min(len(extended_word), max_n) + 1):
        for i in range(0, len(extended_word) - ngram_length + 1):
            ngrams.append(extended_word[i:i + ngram_length])
    return ngrams


cpdef compute_ngrams_new(word, unsigned int min_n, unsigned int max_n):
    cdef unicode extended_word = f'<{word}>'
    ngrams = []
    for ngram_length in range(min_n, min(len(extended_word), max_n) + 1):
        for i in range(0, len(extended_word) - ngram_length + 1):
            ngrams.append(extended_word[i:i + ngram_length])
    return (" ".join(ngrams)).encode("utf-8").split()  # encode all ngrams here in one pass

We have 2 pairs of functions: the current pair (compute_ngrams_current + ft_hash_current) and the new pair (compute_ngrams_new + ft_hash_new).
Benchmark looks really simple:

import gensim.downloader as api
import itertools as it

words = tuple(it.chain.from_iterable(api.load("text8")))
assert len(words) == 17005207  # long enough
words = words[:100000]


def benchmark(words, ngram_func, hash_func):
    for w in words:
        arr = [hash_func(ngram) % 10000 for ngram in ngram_func(w, 3, 6)]

And the result is:

%time benchmark(words, ngram_func=compute_ngrams_current, hash_func=ft_hash_current)
"""
CPU times: user 36.3 s, sys: 417 ms, total: 36.7 s
Wall time: 34.5 s
"""

%time benchmark(words, ngram_func=compute_ngrams_new, hash_func=ft_hash_new)
"""
CPU times: user 37.6 s, sys: 405 ms, total: 38 s
Wall time: 35.6 s
"""

The new variant is even a bit slower than the current one (I guess the reason is the join + split), but I'm surprised by the result (I still think I'm doing something wrong =/).
In addition, I ported a piece of the function from the FB implementation of ngram generation based on bytes:

cpdef compute_ngrams_awesome(word, unsigned int min_n, unsigned int max_n):
    cdef bytes _word = f'<{word}>'.encode("utf-8")
    cdef int _wlen = len(_word)
    cdef int j, i, n

    ngrams = []
    for i in range(_wlen):
        ngram = []
        if _word[i] & 0xC0 == 0x80:  # it's not a first byte of an actual character
            continue
        j, n = i, 1
        while j < _wlen and n <= max_n:
            ngram.append(_word[j])
            j += 1
            while j < _wlen and (_word[j] & 0xC0) == 0x80:
                ngram.append(_word[j])
                j += 1
            if n >= min_n and not (n == 1 and (i == 0 or j == _wlen)):
                ngrams.append(bytes(ngram))
            n += 1
    return ngrams

But this still doesn't help much, hmm..
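For readers unfamiliar with the 0xC0/0x80 masks above: a UTF-8 continuation byte has the bit pattern 10xxxxxx, so (b & 0xC0) == 0x80 is true exactly for bytes that continue a multi-byte character and therefore must not start (or be split off from) an ngram. A tiny check:

data = "<ï>".encode("utf-8")                # b'<\xc3\xaf>'
print([hex(b) for b in data])               # ['0x3c', '0xc3', '0xaf', '0x3e']
print([(b & 0xC0) == 0x80 for b in data])   # [False, False, True, False] - only the continuation byte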
Apologies if this has been addressed, but is there any reason we can't just use some of the original C++ with suitable pybind11 bindings to access the loaded data? Is the assumption that the overhead of moving the data from C++ to Python would offset any C++ speed improvements?
Yes, we can't, because …
I guess I've solved this issue: I rewrote the hash function:

from libc.stdint cimport uint32_t, int8_t
from libc.string cimport strcpy, strlen
from libc.stdlib cimport malloc


cdef char* c_call_returning_a_c_string(bytes string):
    cdef char* c_string = <char *> malloc((len(string) + 1) * sizeof(char))
    if not c_string:
        raise MemoryError()
    strcpy(c_string, string)
    return c_string


cpdef ft_hash_ff(bytes string):
    cdef uint32_t h = 2166136261
    cdef char* ss = c_call_returning_a_c_string(string)
    for i in range(strlen(ss)):
        h = h ^ <uint32_t>(<int8_t>ss[i])  # attention - I dropped 'ord' from py2, not sure about correctness
        h = h * 16777619
    return h

Now loading takes … @mpenkov please …
upd: @horpto simplified my code version (got rid of the malloc & copying), looks simpler and better:

cpdef ft_hash_ff2(bytes string):
    cdef uint32_t h = 2166136261
    cdef char ch
    for ch in string:
        h = h ^ <uint32_t>(<int8_t>ch)  # attention - I dropped 'ord' from py2, not sure about correctness
        h = h * 16777619
    return h

CC @mpenkov
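A pure-Python mirror of the same FNV-1a-style loop can be handy for sanity-checking the Cython versions above (a hedged sketch: the <int8_t> cast is emulated by sign-extending byte values above 127, then masking to 32 bits):

def ft_hash_py(data):
    # 32-bit FNV-1a with the signed-byte twist; for cross-checking only.
    h = 2166136261
    for b in data:
        signed = b - 256 if b > 127 else b        # emulate the int8 cast
        h = (h ^ (signed & 0xFFFFFFFF)) & 0xFFFFFFFF
        h = (h * 16777619) & 0xFFFFFFFF
    return h

print(ft_hash_py("where".encode("utf-8")))
print(ft_hash_py("naïve".encode("utf-8")))    # exercises the sign-extension branch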
Consider reading just the bin file, as suggested in #814 (comment). Compare to the C++ code in https://github.com/salestock/fastText.py/blob/77bdf69ee97a7446e314b342b129c5d46e9e4e29/fasttext/fasttext.pyx#L143