Word2Vec does not run faster with more workers due to sentence length #1509
When you gave it giant lines, you may have seen deceptive speed indicators: due to implementation limits in the optimized code, texts over 10,000 tokens are truncated to 10,000 tokens – with the rest ignored. (Throwing away lots of data can make things look very fast!) When you give it small lines, it is internally batching them together for efficiency (though not in such a way that any context windows overlap the breaks you've supplied). But the rates you're seeing are probably a more accurate estimate of what it takes to train on all supplied words. The inability of the code to fully utilize all cores, or even increase throughput, with more than 3-16 workers (no matter how many cores are available) is a known limitation, mostly due to the 'global interpreter lock' single-threading imposed on Python code, and perhaps somewhat due to the current architecture of a single corpus-reading thread handing work out to multiple worker threads. (Though, that 2nd factor can be minimized if your corpus iterable is relatively efficient, such as by working with data only in RAM or from fast volumes.) See related discussion in issues like #1486, #1291, #532, & #336.
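If you want to avoid that silent 10,000-token truncation, one option is to pre-chunk over-long lines before handing them to the model. A minimal sketch, assuming the `MAX_WORDS_IN_BATCH` constant exposed by `gensim.models.word2vec` (not part of the original discussion):

```python
# A sketch, assuming gensim.models.word2vec exposes MAX_WORDS_IN_BATCH (10000):
# split any over-long line into <=10000-token pieces so no words are silently dropped.
from gensim.models.word2vec import MAX_WORDS_IN_BATCH

def chunked_corpus(path, max_len=MAX_WORDS_IN_BATCH):
    """Yield token lists no longer than max_len, splitting longer lines."""
    with open(path, encoding='utf-8') as fin:
        for line in fin:
            tokens = line.split()
            for start in range(0, len(tokens), max_len):
                yield tokens[start:start + max_len]
```

Note that for actual training the corpus must be restartable (e.g. wrapped in a class with `__iter__`), since Word2Vec makes one pass for the vocabulary scan plus one per epoch; a bare generator like the above only supports a single pass.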
Thanks very much for your reply. From the code, when I ran the raw text8 with 10 workers, the debug log snippet showed batch-size = 10000 and sentence count = 1, and the speed was 806140 words/s.
But when I split text8 into multiple lines, the log showed the speed was only about 169806 words/s.
Though it won't account for the full difference, note that rate timings a bit deeper into training, or for the full training, can be more stable than rates at the very beginning, before all threads are active and CPU caches are warm. How many cores does your system have? Which batch-size(s) are you tuning? Are you splitting the lines in memory, on the fly, or once into a separate line-broken file on disk? Are you sure you didn't do something else, in the 2nd case, to force smaller (1000-word) training batches? (The default of 10000 would mean those smaller batches shouldn't otherwise occur.) It's tough to be sure what varies in your tests without the full code.
Thank you for your patience; I didn't clearly report these details and I'm sorry for that. I cleaned up my test code and put it in the gist below, with no parameter tuning, so you can see the differences (a sketch of such a comparison follows the list). Comparison in short:
text8, 1 worker:
text8, 20 workers:
text8_split, 1 worker:
text8_split, 20 workers:
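The linked gist isn't reproduced here, but a comparison along these lines might look like the following sketch, assuming `text8_split` is the line-broken copy of text8 described above (this is not the original gist):

```python
# Hypothetical sketch of the comparison above: time Word2Vec training on text8
# vs. a line-broken text8_split, with 1 and 20 workers.
import time
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence, Text8Corpus

for path, corpus_class in [('text8', Text8Corpus), ('text8_split', LineSentence)]:
    for workers in (1, 20):
        start = time.time()
        Word2Vec(corpus_class(path), workers=workers)
        print('%s, %d worker(s): %.1f s' % (path, workers, time.time() - start))
```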
Thanks for the detailed report! That helps a lot. In other words, even the almost trivial loops here and here seem to become the bottleneck with super short documents. The fact that the 1-worker split run is actually faster is curious, too. I really don't know what we could do about this -- we're hitting the limits of Python itself here. Perhaps the most general solution would be to change the main API of gensim from "user supplies a stream of input data" to "user supplies multiple streams of data". It's fully backward compatible (one stream is just a special case of many streams), but it would allow us to parallelize more easily, without as much batching, data shuffling etc. Basically, advise users to split their large data into multiple parts, Spark/Hadoop-style. This applies to all algos (LDA, LSI, word2vec...). @menshikh-iv @gojomo thoughts?
Yes, the reason the 1-thread split run is faster is almost certainly that with skip-gram and window=5, having many short sentences means a lot less training is happening, because windows are truncated at sentence ends. IO may still be a factor, depending on your volume type and the behavior of LineSentence. Also, maximum throughput for the small-sentences case might be reached with a worker count between 1 and 20 - the contention of a larger number of threads for the single Python GIL may be a factor in starving the master thread. @piskvorky Yes, I think a shared abstraction for helping models open multiple non-contending (and ideally non-overlapping) streams into the corpus would be a good way to increase throughput. (The word2vec.c and fasttext.cc implementations just let every thread open its own handle into a different starting offset of the file, and continue cycling through until the desired total number of examples is read. Because of thread-overtaking issues there's no guarantee some parts of the file aren't read more times than others... but it probably doesn't matter in practice that their training samples aren't exactly N passes over each example.)
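A rough back-of-the-envelope illustration of the truncated-window effect (a sketch only, ignoring gensim's random window shrinking and frequent-word subsampling):

```python
# Count skip-gram (center, context) pairs for one long sentence vs. the same
# words broken into short sentences, with windows truncated at sentence ends.
def pair_count(sentence_len, window=5):
    return sum(min(i, window) + min(sentence_len - 1 - i, window)
               for i in range(sentence_len))

total_words = 10000
print(pair_count(total_words))              # one 10000-word sentence: 99,970 pairs
print((total_words // 5) * pair_count(5))   # 2000 five-word sentences: 40,000 pairs
```

So a corpus split into 5-word sentences generates roughly 40% of the training pairs per pass, which is consistent with the faster 1-worker split run above.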
@gojomo Yes, working off RAM (lists) will help isolate the issue, but I don't think the number of threads or the IO is a factor here. @Knighter is comparing two identical setups, on identical data (the same number of workers, same IO...). The only difference here is the sentence length. One setup is starved, one isn't. Regarding multiple streams: gensim would be agnostic as to where the streams come from. Seeking to different places in a single file is one option; reading from multiple separate files (possibly separate filesystems) another. In practice, I suspect most people simply use a local FS, so that's our target use-case to optimize.
@piskvorky If LineSentence is less efficient at reading the many-lined input, that might contribute to the starvation seen going from 20 workers (unsplit) to 20 workers (split). The concatenation of small examples into larger batches may be relevant – but that was added because it greatly increased multithreaded parallelism in tests, by having long no-GIL blocks, compared to small-examples-without-batching – at least in cases of 3-8 threads. Perhaps either of these processes – LineSentence IO or batching – gets starved for cycles when more threads are all waiting for the GIL. (That is: the trivial loops mean many more time-slicing events and context-switching overhead.) Is there a 'with GIL' mechanism to force a block of related statements to not be open to normal interpreter GIL-sharing?
@gojomo I have the same question. Will a sentence longer than 10000 words be cut to 10000, with the rest of the data discarded during training? I do not see this process declared anywhere in the API documentation.
Yes, there's still a hard limit on sentence length (= max effective number of words in a document). Btw, the performance issues around the GIL were solved in #2127, closing this ticket. See the tutorial at https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Any2Vec_Filebased.ipynb CC @mpenkov @gojomo – let's link to that tutorial straight from the API docs. I had trouble locating it myself; it's pretty well hidden. The API docs only mention "You may use …".
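For reference, a minimal sketch of what the file-based training path looks like; the parameter names here follow the gensim 4.x API and the values are illustrative:

```python
# A sketch of the corpus_file training path (added by #2127): each worker reads
# its own region of a LineSentence-format file, sidestepping the GIL-bound
# single-stream iterator.
from gensim.models import Word2Vec

model = Word2Vec(
    corpus_file='corpus.txt',   # one sentence per line, whitespace-separated tokens
    workers=8,                  # throughput scales much better with cores in this mode
    vector_size=100,
    window=5,
    epochs=5,
)
```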
OK, I'll deal with that as part of the documentation refactoring.
Hello @mpenkov, I still have a question. Does the batch size affect the max length of a sentence? If I set batch size = 128, will the max sentence length be 10000 or 128?
@shuxiaobo
I think the corpus_file implementation doesn't solve this issue. The ability to scale linearly with the number of cores is most needed when you have more data, but that's exactly the situation in which you can't use LineSentence. I was wondering, why not use multiprocessing.Process instead of threading.Thread?
Why not?
Because, if I understand correctly, having my corpus in LineSentence format implies serializing the whole corpus to disk – something I cannot afford to do, because I don't have enough space (I'm actually streaming from a remote data source).
No, LineSentence is a streamed format: it reads and yields the corpus one line at a time.
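Roughly, and as a simplified sketch only (the real class also caps each sentence at a max_sentence_length and opens files via smart_open), its behavior is:

```python
# Simplified sketch of LineSentence's behavior: nothing beyond the current line
# is held in memory, and the object can be iterated repeatedly (one pass per epoch).
class MyLineSentence:
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding='utf-8') as fin:
            for line in fin:
                yield line.split()   # one whitespace-tokenized sentence per line
```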
Ah ok, nice.
LineSentence is streamed, which keeps RAM usage constant, but the whole corpus still has to exist on disk.
That is true. Where do you pull your documents from for processing, if they don't even fit on disk? Gensim aims at RAM-independence (corpora larger than RAM), but not disk independence (corpora larger than hard disk).
I understand. My pre-processing prunes a lot of words (common English words, for instance) but also produces different versions of the same sentences (the sentence as-is, the sentence with n-grams taken from NER annotators, and the sentence with NER annotations replaced with ontology IDs). I can't estimate the size of this, but it's certainly not less than 500 GB.
I suspect that if you're pulling documents "live" from Solr, and preprocessing them on the fly, then training is not your bottleneck. I would be surprised if you needed more than one Gensim worker to saturate such a pipeline. In other words, the speed of procuring the training data will be the bottleneck, not the training itself.
The preprocessing pipeline involves multiple parallel loaders and multiple parallel pre-processors. I have all cores at 100% usage 100% of the time, and my pre-processing is very quick. Things change when I train: all my sentences go into a queue that gensim pulls from through the iterator interface. But so far the best performance I've gotten is 25k words/s, which is very low.
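For context, an adapter of the kind described might look roughly like this hypothetical sketch (the sentinel convention and producer count are illustrative, not the poster's actual code):

```python
# Hypothetical sketch of a queue-fed corpus: pre-processor processes put token
# lists on a multiprocessing queue; this adapter exposes them as an iterable.
# Note: gensim iterates the corpus once for the vocab scan and once per epoch,
# so the producers must refill the queue for every pass.
import multiprocessing as mp

SENTINEL = None  # each producer puts this when it is done

class QueueCorpus:
    def __init__(self, queue, n_producers):
        self.queue = queue
        self.n_producers = n_producers

    def __iter__(self):
        finished = 0
        while finished < self.n_producers:
            item = self.queue.get()
            if item is SENTINEL:
                finished += 1
            else:
                yield item  # a list of tokens
```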
Another option would be training in batches of documents. Will this result in the same model? Or is training where multiple epochs correspond to multiple training sets an issue?
Yeah, 25k words/s sounds low… especially if you have a large corpus (billions of tokens) to chew through. To be clear – is it the initial vocabulary scan (the 1st corpus pass) that is slow, or also the actual training, i.e. the subsequent corpus passes? If your preprocessing consumes 100% CPU 100% of the time, that indicates to me that it is indeed the bottleneck; I don't see any reason why multiple workers would only get to 25k words/s otherwise. About epochs: this seems too involved and specific, and we're getting off topic, sorry. Did you time the corpus iteration? No training, no gensim – just go through the input corpus iterator:
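The snippet that originally accompanied that suggestion isn't shown here; a minimal sketch of such a timing check might be:

```python
# Time the raw input pipeline: no gensim, no training, just consume the
# corpus iterable and report words per second.
import time

start, n_words = time.time(), 0
for sentence in sentences:   # `sentences` is your existing corpus iterable
    n_words += len(sentence)
elapsed = time.time() - start
print('%d words in %.1f s = %.0f words/s' % (n_words, elapsed, n_words / elapsed))
```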
I used this code to benchmark, where sentences is my iterator passed to FastText.train()
With 6 pre-processors and 4 loaders: ~103k words/s (peak 150k words/s). Faster, but still slower than I expected.
Thanks, that's good progress. Yes, 150k words/s is still slow. IIRC fasttext can do >1m words/s easily, on a single powerful machine. Still doesn't explain the drop from 150k to 25k. There is some overhead for sure but I don't see why it should be that large, with enough workers and enough spare CPU (vs 100% CPU already used for the on-the-fly preprocessing…).
Yes – that's why there's the corpus_file mode.
I'll use corpus_file. If it doesn't fit on disk, I will train in chunks. Thank you for your guidance.
Optimizing specific usage scenarios would be better discussed on the project discussion list than in an old, closed bug-tracking issue. But, some observations:
I perfectly agree with you on all points. In general, "separation of concerns" is a best practice in software development and should always be encouraged.
Description
Word2Vec does not run faster with more workers, due to sentence length:
When I use the raw text8 data, multiple cores work fine. But my corpus is short text, where a single line contains only a few words. When I randomly split the text8 data into multiple lines (e.g. only 3~8 words per line), I found that more workers become useless.
Steps/Code/Corpus to Reproduce
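The original reproduction code isn't preserved here; a sketch of the setup described above (splitting text8 into 3~8-word lines, then training with varying worker counts, e.g. via the comparison loop shown earlier in this thread) might look like:

```python
# Hypothetical reproduction sketch: break the single-line text8 corpus into
# short lines of 3-8 words each, producing a text8_split file to train on
# with different worker counts and compare words/s.
import random

random.seed(0)
with open('text8') as fin, open('text8_split', 'w') as fout:
    words = fin.read().split()
    i = 0
    while i < len(words):
        n = random.randint(3, 8)
        fout.write(' '.join(words[i:i + n]) + '\n')
        i += n
```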
Expected Results
Actual Results
Versions
Linux-3.10.0-229.7.2.el7.x86_64-x86_64-with-centos-7.1.1503-Core
('Python', '2.7.5 (default, Nov 20 2015, 02:00:19) \n[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)]')
('NumPy', '1.13.1')
('SciPy', '0.19.1')
('gensim', '2.3.0')
('FAST_VERSION', 1)