File-based fast training for Any2Vec models #2127

Merged
merged 133 commits into from
Sep 14, 2018

Conversation

persiyanov
Contributor

@persiyanov persiyanov commented Jul 11, 2018

Tutorial explaining the whats & hows: Jupyter notebook

note: all preliminary discussions are in #2048

This PR summarizes all my work during GSoC 2018. To better understand what's going on, follow the links:

Summary

This pull request proposes a new argument, corpus_file, for the Word2Vec, FastText and Doc2Vec models. Use corpus_file instead of the standard sentences argument if you have a preprocessed dataset on disk and want a significant speedup during model training.

On our benchmarks, training Word2Vec on the English Wikipedia dump is 370% faster with corpus_file than with sentences (see the attached Jupyter notebook for the code).

Look at this chart for Word2Vec:
[Chart: word2vec_file_scaling]

Usage

The usage is really simple. I'll provide examples for Word2Vec while the usage for FastText and Doc2Vec is identical. The corpus_file argument is supported for:

Constructor

# Standard way
model = Word2Vec(sentences=my_corpus, <...other arguments...>)

# New way
model = Word2Vec(corpus_file='my_corpus_saved.txt', <...other arguments...>)

# You can save your own corpus using
gensim.utils.save_as_line_sentence(my_corpus, 'my_corpus_saved.txt')

build_vocab

# Create the model without training
model = Word2Vec(<...other arguments...>)

# Standard way
model.build_vocab(sentences=my_corpus, ...)

# New way
model.build_vocab(corpus_file='my_corpus_saved.txt', ...)

train

# Create the model without training
model = Word2Vec(<...other arguments...>)

# Build vocab (with `sentences` or `corpus_file` way, choose what you like)
model.build_vocab(corpus_file='my_corpus_saved.txt')

# Train the model (old way)
model.train(sentences=my_corpus, total_examples=model.corpus_count, ...)

# Train the model (new way)
model.train(corpus_file='my_corpus_saved.txt', total_words=model.corpus_total_words, ...)

That's it! Everything else remains the same as before.

Details

First, let me describe the standard approach to training *2Vec models:

  1. The user provides an input data stream (a Python iterable).
  2. One job_producer Python thread is created. It reads data from the input stream and pushes batches into a Python queue.Queue (job_queue).
  3. Several worker threads pull batches from job_queue and perform model updates. Batches are Python lists of lists of tokens. They are first translated into C structures, and then the model update is performed without the GIL.

This approach lets model updates scale linearly, but producing batches (everything from reading the input to filling the C structures from Python objects) is the bottleneck in this pipeline.
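The queue-based pipeline described above can be sketched in plain Python. This is a toy illustration only, not gensim's training code: the function name train_with_job_queue, the batch_size parameter and the word-counting "model update" are assumptions made up for the example.

```python
import queue
import threading


def train_with_job_queue(sentences, num_workers=4, batch_size=3):
    """Toy version of the queue-based pipeline: one producer, several workers."""
    job_queue = queue.Queue(maxsize=2 * num_workers)
    processed_words = [0] * num_workers  # one slot per worker, so no lock is needed

    def job_producer():
        # Runs under the GIL: iterates the Python stream and builds batches.
        batch = []
        for sentence in sentences:
            batch.append(sentence)
            if len(batch) == batch_size:
                job_queue.put(batch)
                batch = []
        if batch:
            job_queue.put(batch)
        for _ in range(num_workers):
            job_queue.put(None)  # one sentinel per worker signals the end

    def worker(worker_id):
        while True:
            batch = job_queue.get()
            if batch is None:
                break
            # Stand-in for "convert to C structures, then update the model".
            processed_words[worker_id] += sum(len(s) for s in batch)

    threads = [threading.Thread(target=job_producer)]
    threads += [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(processed_words)


corpus = [["the", "quick", "brown", "fox"], ["jumps", "over"], ["the", "lazy", "dog"]] * 10
total_words = train_with_job_queue(corpus)
```

However many workers are started, every batch passes through the single job_producer thread, which is the part of the pipeline that cannot run in parallel.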

Clearly, we can't optimize batch generation for an arbitrary Python stream (with custom user logic). Instead, we optimized only for data stored on disk in the gensim.models.word2vec.LineSentence format (one sentence per line, words separated by whitespace).
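To make the on-disk format concrete, here is a minimal standard-library sketch that writes and reads a corpus with one sentence per line and whitespace-separated tokens. The helpers save_line_sentences and read_line_sentences are hypothetical stand-ins written for this example; in gensim itself, the PR's gensim.utils.save_as_line_sentence is the function to use for saving.

```python
import os
import tempfile


def save_line_sentences(corpus, path):
    """Write one sentence per line, tokens separated by single spaces."""
    with open(path, "w", encoding="utf8") as fout:
        for sentence in corpus:
            fout.write(" ".join(sentence) + "\n")


def read_line_sentences(path):
    """Yield each line back as a list of whitespace-separated tokens."""
    with open(path, encoding="utf8") as fin:
        for line in fin:
            yield line.split()


my_corpus = [["human", "interface", "computer"], ["graph", "minors", "survey"]]
corpus_path = os.path.join(tempfile.mkdtemp(), "my_corpus_saved.txt")
save_line_sentences(my_corpus, corpus_path)
round_trip = list(read_line_sentences(corpus_path))
```

Because the format is this simple, a reader needs no Python objects at all: it only has to split bytes on newlines and whitespace.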

This restriction allows us to read the data directly at the C++ level without the GIL and then immediately perform model updates. As a result, training scales linearly.

@menshikh-iv
Contributor

menshikh-iv commented Jul 14, 2018

Notes:

  • Implement progress logging for multistream cython version
  • Mention info about callbacks + cython version (maybe in documentation only)
  • Fix memory issue with vocab (copy it to each thread)

@persiyanov
Contributor Author

@menshikh-iv @piskvorky @gojomo

Guys, I've encountered some problems with nogil training with python streams (this branch of code in current PR).

Recently, I got results on previous PR which are in this table. They look good, but when I ran the same benchmark on the current branch, I got much worse numbers:

* Workers = 1
* Total epoch time: 892.676775932 sec.
* Processing speed: 203153.482749 words/sec

* Workers = 4
* Total epoch time: 295.960248947 sec.
* Processing speed: 612746.308483 words/sec


* Workers = 8
* CPU usage : 536.20%
* Total epoch time: 238.676858902 sec.
* Processing speed: 759814.520077 words/sec

* Workers = 10
* CPU usage : 455.60%
* Total epoch time: 285.25424099 sec.
* Processing speed: 635751.19995 words/sec

* Workers = 14
* CPU usage : 422.40%
* Total epoch time: 329.26668787 sec.
* Processing speed: 550769.773199 words/sec

I hypothesized that the reason for this degradation is the use of the any2utf8 function here (I didn't use it in the previous PR when I ran the benchmark; I only did x.encode('utf8') and everything was okay). And indeed, when I changed any2utf8 in the current PR to encode('utf8'), I got back to the good numbers.
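A quick way to measure the per-token cost of a helper versus a direct method call is a timeit micro-benchmark like the sketch below. Note that any2utf8_like is a simplified stand-in written for this example, not gensim's actual any2utf8, and that a single-threaded timing like this says nothing about GIL contention between worker threads.

```python
import timeit


def any2utf8_like(text, errors="strict", encoding="utf8"):
    # Simplified stand-in, NOT gensim's code: one type check, then encode.
    if isinstance(text, str):
        return text.encode("utf8")
    # Bytes input: decode and re-encode to make sure the result is valid UTF-8.
    return str(text, encoding, errors=errors).encode("utf8")


tokens = ["word%d" % i for i in range(10000)]

via_helper = [any2utf8_like(t) for t in tokens]
via_method = [t.encode("utf8") for t in tokens]

helper_seconds = timeit.timeit(lambda: [any2utf8_like(t) for t in tokens], number=20)
method_seconds = timeit.timeit(lambda: [t.encode("utf8") for t in tokens], number=20)
print("helper: %.4fs, str.encode: %.4fs" % (helper_seconds, method_seconds))
```

Both paths produce identical bytes, so any difference in the timings comes from the extra Python-level function call and type check.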

After that, I realized that if a user passes any Python stream that is a bit more complicated (in terms of CPU-boundness) than LineSentence, they will see much worse degradation. To check this hypothesis, I ran the benchmark with the any2utf8 -> encode('utf8') change and a LineSentence with preprocess_string iterator (code is here). As I expected, in this benchmark (with 14 input streams and 14 workers) there was only 146% CPU load, and it didn't finish even in 15 minutes, so I didn't wait till the end.

That said, I suggest NOT including nogil training for Python iterators and keeping only the CythonLineSentence part. At least the code will look much cleaner :)

WDYT?

@gojomo
Collaborator

gojomo commented Jul 14, 2018

That every word must be re-encoded as UTF8 inside the cython code, but outside the nogil block, seems problematic whether it's done by encode() or any2utf8(). (Is it really just swapping one for the other that makes the giant difference? If so, maybe there's a problem in any2utf8()'s logic – because it looks like its preferred case just does 1 type-check then encode()? Maybe the contract for the class should require the lists-of-tokens to already be of a format requiring no more re-encoding, at the very least so that on multi-pass training from an all-in-memory dataset the same strings aren't re-encoded every time.)

It seems a little odd for the pure-python function (holding the GIL and using no cython) iterate_batches_from_pystream() to be in the _inner file, but I suppose it belongs near its only places-of-use. That the gil/nogil branches (where iterate_batches_from_pystream() is used) seem to be identical code except for the iteration strikes me as some perhaps hard-to-maintain duplication. I'm also a little surprised that the plain-Python list returned from iterate_batches_from_pystream() is ok as the vector[vector[string]] of prepare_c_structures_for_batch() – might that involve some marshalling/conversion overhead?

Owner

@piskvorky piskvorky left a comment

Is there an algorithm summary of what this version does? I find it hard to decipher what is actually happening, at a conceptual level.

def _train_epoch(self, data_iterable=None, data_iterables=None, cur_epoch=0, total_examples=None,
total_words=None, queue_factor=2, report_delay=1.0):
def _train_epoch_multistream(self, data_iterables, cur_epoch=0, total_examples=None, total_words=None):
assert len(data_iterables) == self.workers, "You have to pass the same amount of input streams as workers, " \
Owner

@piskvorky piskvorky Jul 14, 2018

assert is for checking programmer errors (code invariants), not user input. Exception better.
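As a small illustration of this review point: assert statements are removed entirely when Python runs with the -O flag, so a check on user-supplied arguments written as an assert can silently disappear. A sketch of the exception-based alternative follows; the helper name check_stream_count and the message wording are hypothetical, not code from this PR.

```python
def check_stream_count(data_iterables, workers):
    """Hypothetical helper: validate user input with an exception, not assert."""
    if len(data_iterables) != workers:
        raise ValueError(
            "expected %d input streams (one per worker), got %d"
            % (workers, len(data_iterables))
        )


try:
    check_stream_count([["a"], ["b"]], workers=3)
except ValueError as err:
    message = str(err)
```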

setup.py Outdated
@@ -250,7 +250,8 @@ def finalize_options(self):

 ext_modules=[
 Extension('gensim.models.word2vec_inner',
-sources=['./gensim/models/word2vec_inner.c'],
+sources=['./gensim/models/word2vec_inner.cpp', './gensim/models/fast_line_sentence.cpp'],
+language="c++",
Owner

Strong -1 on any C++ dependencies.

total_examples=None, total_words=None):
thread_private_mem = self._get_thread_working_mem()

examples, tally, raw_tally = self._do_train_epoch(input_stream, thread_private_mem, cur_epoch,
Owner

Hanging indent please (here and everywhere else).

@piskvorky
Owner

piskvorky commented Jul 14, 2018

A more optimized version of LineSentence (with faster recoding / reading) sounds like a nice addition. But I'm -1 on duplicating code paths for some special cases, or restricting important functionality to only special non-Python iterators (CythonLineSentence).

I'm also strongly -1 on introducing any C++ dependencies into Gensim.

Is there a high level summary of the current approach, its limitations and potential?

@piskvorky
Owner

piskvorky commented Jul 14, 2018

@persiyanov I played with the Python C API a bit, to see if we could treat each sentence as "const" and release GIL completely while working on it.

I saw no issues with it; see my gist here. I call this code from Python via Cython:

from cpython.ref cimport PyObject

cdef extern from "sum_c.c":
    long long process_const_sentence(PyObject *lst) nogil;

So basically whatever happens while processing a single sentence can be completely GIL-free. No data copies or dynamic allocations needed. Everything is "static", unchanging (at least for the duration of processing a single sentence).

We're already in highly optimized territory so I'm not sure my intuition is correct, but does this help? Or is producing the sentences (even with no-op processing) already the bottleneck?

I'd love to see some benchmark numbers on a multi-stream version that uses a fast sentence reader Python iterable (a well optimized LineSentence) + GIL-free streamed sentence training (no batching, each sentence processed immediately ala process_const_sentence).

@menshikh-iv
Contributor

But I'm -1 on duplicating code paths for some special cases, or restricting important functionality to only special non-Python iterators (CythonLineSentence).

This is almost inevitable, because for good performance (i.e. linear scaling) we need to release the GIL almost everywhere (and we must know when we can do this). This is "advanced" functionality, and I see no issue with using a special kind of iterator for it (if you don't need maximum performance, pass any iterable; otherwise, pass our optimized version).
Of course, we'll test your approach; if it works as needed, we'll change this behaviour.

I'm also strongly -1 on introducing any C++ dependencies into Gensim.

Any motivation? We need the STL here (vector, map, etc.). Also, the sent2vec PR uses C++ (#1619). I see no additional maintenance problems (the only build difference is the --cplus argument for Cython).

@piskvorky
Owner

piskvorky commented Jul 15, 2018

I agree we need to release GIL as much as possible. There's no question about that. In fact, that's what my previous comment experimented with. It confirms we can release GIL as soon as we have a sentence, since all objects are "static" after that point, no dynamic allocations or objects moving, no reference counting needed.

And since we cannot release GIL any sooner (sentences come from arbitrary Python iterables, need GIL for that), this means this is as good as it gets—unless I missed something, an optimal solution, given the problem constraints.

What I'm -1 on is a special training code path inside word2vec just to handle specific non-Python inputs. That's not worth the added complexity.

What we want is a single training code path that works for any iterable(s)—although possibly works better for optimized iterables, that's fine. The difference should be mainly in the iterable code, not mainly in the training code. We don't want splitting or duplication of training logic for different types of inputs. That's unmaintainable.

Another consideration: how does the code handle the most common use case of streams=1, workers>1? That's what everyone is using now, and we want to keep supporting that without any major regressions (speed, ease of use).

"No C++" is a hard requirement for me. I see no reason to introduce another language into Gensim. C is powerful, simple, elegant, and already plays well with Python, its C API and its scientific ecosystem. What makes you think we need C++?

@persiyanov
Contributor Author

persiyanov commented Jul 15, 2018

@piskvorky Let me describe the current approach with short pseudocode:

if isinstance(input_stream, CythonLineSentence):
    # Cython stream: read a batch (sentences), prepare it and perform train update fully WITHOUT GIL
    with nogil:
        while <input stream is not end>:
            sentences = <get sentences from input_stream>
            prepare_batch(input_stream, sentences, sentence_idx, indexes, reduced_windows)
            <..train update on batch..>
else:
    # Python stream: read a batch (sentences), prepare it WITH GIL, and perform train update WITHOUT GIL
    while <input stream is not end>:
        sentences = <get sentences from input_stream>
        with nogil:
            prepare_batch(input_stream, sentences, sentence_idx, indexes, reduced_windows)
            <..train update on batch..>
  1. In the first branch of the if-else, we release the GIL only once and then perform all phases (reading sentences, preparing the batch, and training) without the GIL. This is the most efficient path, and it can only be run with an input stream that allows reading sentences without the GIL (e.g. CythonLineSentence).
  2. The second branch of the if-else is for Python streams. In this case, we can get sentences only with the GIL, but we then perform the training update without the GIL. The difference between this approach and the one in the develop branch is that we initialize all C structures only once.

In my previous message I said that if we use more complex python streams (with CPU processing), we immediately get a dramatic decrease in performance in the second branch of if-else. So, the problem is not in processing sentences without GIL but in getting sentences from a stream without GIL. And your example which uses Python C API actually doesn't help in this way.

Concerning C++, the approach in this PR requires at least unordered_map in order to look up words in the vocabulary without the GIL. I don't think that writing our own hash table (or using an external C library for it) is a wise decision, because STL hash tables are extremely optimized.

@piskvorky
Owner

piskvorky commented Sep 14, 2018

@persiyanov congrats! ⭐️

Can you please share the raw plot data (gist), in case we need to regen the image in the future? What was the machine for this benchmark (HW specs, BLAS)?

@persiyanov
Contributor Author

Yep, sure:

queue_based = [(1, 166155), (4, 550714), (8, 754872), (10, 776207), (12, 797239), (14, 792528), (24, 747589), (32, 753477)]
file_based = [(1, 185190), (4, 695116), (8, 1311816), (12, 1827262), (16, 2265058), (24, 3378465), (32, 3695980)]
origc = [(1, 74.32*1000), (4, 77.02*1000*4), (8, 74.84*1000*8), (10, 76.26*1000*10), (12, 76.38*1000*12), (14, 80.56*1000*14), (24, 68.88*1000*24), (32, 63.13*1000*32)]
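For anyone reusing the raw numbers, the short sketch below derives speedup ratios from the queue_based and file_based lists, treating each pair as (workers, words per second). The ratio definitions (same worker count, and best-versus-best) are one reasonable reading of the data, not necessarily how the chart or the summary figure was produced.

```python
queue_based = dict([(1, 166155), (4, 550714), (8, 754872), (10, 776207), (12, 797239), (14, 792528), (24, 747589), (32, 753477)])
file_based = dict([(1, 185190), (4, 695116), (8, 1311816), (12, 1827262), (16, 2265058), (24, 3378465), (32, 3695980)])

# Ratio of file-based to queue-based throughput for worker counts measured in both runs.
speedups = {
    workers: file_based[workers] / queue_based[workers]
    for workers in sorted(set(queue_based) & set(file_based))
}

# Best file-based throughput against best queue-based throughput, regardless of worker count.
peak_ratio = max(file_based.values()) / max(queue_based.values())

for workers, ratio in speedups.items():
    print("%2d workers: %.2fx" % (workers, ratio))
print("peak vs peak: %.2fx" % peak_ratio)
```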

About hardware specs it's better to ask Menshikh. I only know that it has 60GB RAM + 32 cores.

@menshikh-iv
Contributor

menshikh-iv commented Sep 20, 2018

For history, benchmark information

  • Dataset: full English Wikipedia
  • Cloud: GCE
  • CPU: Intel(R) Xeon(R) CPU @ 2.30GHz 32 cores
  • BLAS: libblas3 (3.7.1-3ubuntu2) MKL

@piskvorky
Owner

@persiyanov @menshikh-iv libblas3 is not good. This is just a reference implementation: big difference to real BLAS like OpenBLAS or MKL.

BLAS can make a difference of 4x, i.e. more than entire "file-based" improvement. If BLAS wasn't installed, the comparison to C tool in our graph is misleading.

@persiyanov I know you're busy, but when you have time, can you re-run the timings using proper BLAS? Either OpenBLAS compiled for that particular machine (not generic), or MKL (comes pre-installed with Anaconda, IIRC).

@menshikh-iv
Contributor

If BLAS wasn't installed, the comparison to C tool in our graph is misleading.

If BLAS wasn't installed for any of the algorithms (i.e. our code & Mikolov's implementation), how can the comparison be misleading? Can you clarify this, please?

@piskvorky
Owner

piskvorky commented Sep 20, 2018

The C tool doesn't use BLAS, Gensim does. All high-perf users (for whom time matters = audience of this PR) will have optimized BLAS installed. Benchmarks without BLAS are misleading, and fail to show the difference between C and Gensim properly.

@menshikh-iv
Contributor

menshikh-iv commented Sep 20, 2018

@piskvorky aha, this was a misunderstanding on my side: I interpreted "misleading" as errors in the general conclusions (like gensim w2v being faster than the C version, or file-based being faster than queue-based).
In this case, the plot will look the same (in terms of ordering), but gensim should be faster. I understand what you mean, thanks for the clarification.

@piskvorky
Owner

piskvorky commented Sep 20, 2018

Yes, performance of C should not change either way. But Gensim should improve, depending on how bad libblas3 was vs how good the new BLAS will be in comparison.

@piskvorky
Owner

piskvorky commented Sep 20, 2018

(This is all assuming these new file-based implementations use BLAS like the existing algos -- @persiyanov do they? If not, then even the comparison to existing queue-based will be misleading, and the whole project without much effect.)

@persiyanov
Contributor Author

persiyanov commented Sep 20, 2018

@piskvorky How can I get BLAS version which gensim uses? I ran my benchmarks under anaconda3 python, numpy config shows this:

In [1]: import numpy as np
In [2]: np.__config__.show()
mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/persiyanov/anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/persiyanov/anaconda3/include']
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/persiyanov/anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/persiyanov/anaconda3/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/persiyanov/anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/persiyanov/anaconda3/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/persiyanov/anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/persiyanov/anaconda3/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/persiyanov/anaconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/persiyanov/anaconda3/include']


Seems like blas-mkl & lapack are installed

@piskvorky
Owner

piskvorky commented Sep 20, 2018

That looks like MKL is installed (via Anaconda).

Whether a given algorithm uses BLAS depends on the algorithm: if you're using Python with NumPy/SciPy, it gets picked up automatically.

But if you're writing your own C/Cython code, you have to call BLAS routines manually, like the queue-based implementation of word2vec/doc2vec/fasttext do. You can check whether BLAS is being used in the queue-based word2vec with gensim.models.word2vec.FAST_VERSION (see for example here).
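A minimal sketch of that check is shown here, wrapped so it also runs where gensim is not installed. The wrapper word2vec_fast_version is a hypothetical helper for this example. How the individual values of FAST_VERSION should be read is an assumption to confirm against the gensim source for your release; as far as I know, older releases used -1 to signal that the compiled extension was unavailable.

```python
def word2vec_fast_version():
    """Return gensim's FAST_VERSION flag, or None if it cannot be read."""
    try:
        from gensim.models import word2vec
    except Exception:
        return None  # gensim is missing or failed to import in this environment
    return getattr(word2vec, "FAST_VERSION", None)


fast_version = word2vec_fast_version()
if fast_version is None:
    print("gensim (or its FAST_VERSION flag) is not available here")
else:
    print("gensim.models.word2vec.FAST_VERSION =", fast_version)
```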

@persiyanov
Contributor Author

I call the same routines that were written for original queue-based models. So I'm pretty sure it's used in the same way as in queue-based models.

@menshikh-iv
Contributor

So, in this case, our comparison is correct, isn't it, @piskvorky?

@piskvorky
Owner

piskvorky commented Sep 20, 2018

Probably :) @persiyanov can you point me to the optimized part of the code that does the actual training on a sentence / word pair? The inner-most routine, after all the redirecting / data preparation. Probably somewhere deep inside Cython/C++. Since we're on the topic, I'll double check.

@persiyanov
Contributor Author

persiyanov commented Sep 20, 2018

@piskvorky
This is the word2vec example

I import & use the same functions as in queue-based w2v:

from gensim.models.word2vec_inner cimport (
    w2v_fast_sentence_sg_hs,
    w2v_fast_sentence_sg_neg,
    w2v_fast_sentence_cbow_hs,
    w2v_fast_sentence_cbow_neg,
    random_int32,
    init_w2v_config,
    Word2VecConfig
)

The same is true for fasttext/doc2vec. So, basically, I didn't write any inner-most routines, I use existing ones, which use blas. I only wrote wrapper code regarding proper noGIL data preparation & reading.

@piskvorky
Owner

piskvorky commented Sep 20, 2018

Yeah, that works 👍 So, a good BLAS should give us a nice boost, compared to libblas3.

To be honest I thought the benchmark results were already using BLAS, because otherwise I don't understand where the speedup compared to C is coming from. Why should this be faster than Mikolov's C which does the same thing, without BLAS? Strange.

@persiyanov
Contributor Author

persiyanov commented Sep 20, 2018

I suppose that the graph I plotted is already with good BLAS. I've provided an output from numpy, seems that numpy uses good BLAS, therefore so does gensim.

Ivan gathered libblas3 by listing all apt packages installed on the machine, IIRC, but, actually, gensim used good blas.

@piskvorky
Owner

piskvorky commented Sep 20, 2018

Yeah, if it was run with Intel's MKL (not libblas3), that's a very good BLAS. That explains it.

@persiyanov
Contributor Author

Then, we don't have to rerun all the benchmarks :)

@akutuzov
Contributor

@persiyanov Does it work with gzipped corpus files?

@menshikh-iv
Contributor

@akutuzov no, only plaintext; see also #2159
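Since only plaintext is accepted, one workaround is to decompress a gzipped corpus to a plain-text file first and pass that path as corpus_file. The sketch below uses only the standard library; the helper decompress_corpus and the file names are placeholders for illustration, and it assumes there is enough disk space for the uncompressed copy.

```python
import gzip
import os
import shutil
import tempfile


def decompress_corpus(gz_path, txt_path):
    """Stream a gzipped corpus into a plain-text file without loading it into memory."""
    with gzip.open(gz_path, "rb") as fin, open(txt_path, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return txt_path


# Small self-check with a throwaway file; the file names are placeholders.
workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "my_corpus.txt.gz")
with gzip.open(gz_path, "wt", encoding="utf8") as f:
    f.write("first sentence here\nsecond sentence\n")

txt_path = decompress_corpus(gz_path, os.path.join(workdir, "my_corpus.txt"))
with open(txt_path, encoding="utf8") as f:
    lines = f.read().splitlines()
```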

@gojomo
Collaborator

gojomo commented Oct 5, 2018

Check report on discussion list – https://groups.google.com/forum/#!topic/gensim/GFif2UwPRys – of very-different aggregated vector magnitudes when using the new mode compared to old. (It might be aggravated by user's tiny dataset & peculiar parameters – something that disappears in real larger datasets – or some other unintentional excess training that's happening in the new mode.)

alreadytaikeune added a commit to alreadytaikeune/gensim that referenced this pull request Oct 18, 2018
In word2vec_inner.pyx, functions now used the new config object while still returning the number of samples.

In base_any2vec, logging includes the new loss values, (the addition of this branch)