Doc2vec not parallelizing #532

Closed
fccoelho opened this issue Nov 17, 2015 · 55 comments
Labels
bug (Issue described a bug); difficulty hard (Hard issue: required deep gensim understanding & high python/cython skills)

Comments

@fccoelho
Contributor

Doc2vec does not use all my cores despite my setting workers=8 when I instantiate it.

My install passes the assert below:

assert gensim.models.doc2vec.FAST_VERSION > -1

Do I have to do something else?
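For context, a minimal sketch of the setup being described; the documents variable and parameter values are placeholders, not the reporter's actual code:

import gensim
from gensim.models.doc2vec import Doc2Vec

assert gensim.models.doc2vec.FAST_VERSION > -1  # optimized cython code is compiled in

# `documents` stands in for an iterable of TaggedDocument objects.
model = Doc2Vec(documents, size=300, workers=8)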

@gojomo
Collaborator

gojomo commented Nov 18, 2015

As soon as FAST_VERSION is not -1, the compute-intensive codepaths avoid holding the Python global interpreter lock (GIL), and thus you should start to see multiple cores engaged.

However, there are still bottlenecks (e.g., the discussion in #336, with some maybe-workarounds) that limit how well all the cores are engaged, especially as the number of cores/workers grows beyond 4 or 8. So you should expect to see some, but not full, core engagement.

In general, training that spends more time inside the optimized code will achieve higher utilization. That means more dimensions, larger text examples (more words), larger window values, or a larger count of negative samples (if using negative sampling). In fact, I've noticed that when training isn't yet saturating all cores, upping some of those parameters (which would normally mean more work per example, and thus slower completion) can come 'for free'.
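For illustration, a minimal sketch (parameter values are only examples, not recommendations) of settings that push more of the per-example work into the optimized, GIL-free code:

from gensim.models.doc2vec import Doc2Vec

# More dimensions, a wider window and more negative samples all mean more
# time per example inside the optimized nogil code, which tends to raise
# multi-core utilization (at the cost of more total work per example).
model = Doc2Vec(size=300,      # renamed vector_size in later gensim versions
                window=10,
                hs=0, negative=15,
                workers=8)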

@fccoelho
Contributor Author

OK, thanks for the explanations. Besides this, I noticed that as it runs it gets slower and slower, as measured by the number of words/second, and when you look at it in htop it is consuming just 0.7% of one core... it started at 3.5k words/sec and after a few hours of running it is down to 36/sec.

Do you get this kind of behavior too?

@gojomo
Collaborator

gojomo commented Nov 19, 2015

Are you on OSX? If so it might be this 'app nap' issue: #493.

If not, I've not seen that behavior and wouldn't expect it: after the initial model setup, training should proceed at about the same rate early or late. If you're seeing it, I would first take a look at possible IO/network bottlenecks (or maybe throttling) in reading the data, or whether the data has been sorted in some way that makes the later examples very different, or whether some other code (including perhaps the code feeding in the examples) has a performance issue (e.g., a linear scan from the start for each example) or is triggering swapping.
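One way to separate a feeding-side stall from a training-side stall is to time the corpus iterator itself; a minimal sketch (the wrapper class is hypothetical, not part of gensim):

import time

# Wrap any corpus iterable and report how quickly the *feeding* side yields
# examples; if these reports slow down over hours, the bottleneck is in the
# data pipeline rather than in the training code.
class TimedCorpus(object):
    def __init__(self, corpus, report_every=10000):
        self.corpus = corpus
        self.report_every = report_every

    def __iter__(self):
        start = time.time()
        for i, doc in enumerate(self.corpus):
            if i and i % self.report_every == 0:
                rate = i / (time.time() - start)
                print("fed %d docs so far (%.0f docs/sec)" % (i, rate))
            yield doc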

@fccoelho
Contributor Author

No, I am not.

There is no I/O bottleneck that I can detect. I am streaming the examples directly from a local MongoDB collection. I have trained word2vec models from the same dataset under the same circumstances without any issues.

@gojomo
Collaborator

gojomo commented Nov 21, 2015

There's little difference between what the Doc2Vec and Word2Vec models are doing during training – certainly no extra or harder steps that would account for a slowdown at the end.

The Doc2Vec model might be using much more addressable memory, if there are far more documents/doctags than vocabulary words. That might show up as swapping.

But I would most strongly suspect something with whatever code fetches and feeds the examples to the model. (If the data can fit in main memory, perhaps compose it all there rather than having another DB/disk in the loop during processing. Or even if not, perhaps eliminate any DB/API as a factor by streaming the corpus as text from a fast local volume like an SSD.)
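As an illustration of the "compose it all in memory" suggestion, a minimal sketch (the pymongo collection and field names are assumptions) that pulls every example out of the database once, before training:

from gensim.models.doc2vec import TaggedDocument

# One pass over the MongoDB collection up front; training then iterates over
# plain in-memory objects, so the database is out of the loop entirely.
def load_corpus(collection):
    corpus = []
    for i, record in enumerate(collection.find()):
        words = record["text"].lower().split()      # field name is an assumption
        corpus.append(TaggedDocument(words=words, tags=["DOC_%d" % i]))
    return corpus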

@tmylk
Contributor

tmylk commented Jan 9, 2016

@gojomo Is this reproducible? Otherwise suggest closing the issue.

@mmroden

mmroden commented Apr 29, 2016

This is absolutely a problem for me. I'm creating a doc2vec model using the command:

import multiprocessing

from gensim.models.doc2vec import Doc2Vec, LabeledSentence

# clean_doc, corpus and config are defined elsewhere in my code
NUM_WORKERS = multiprocessing.cpu_count()
NUM_VECTORS = 500

sentences = list(LabeledSentence(clean_doc(key),
                                 value) for key, value in corpus)
vector_count = NUM_VECTORS
model = Doc2Vec(size=vector_count, min_count=1, dm=0, iter=1,
                dm_mean=1, dbow_words=1, workers=NUM_WORKERS)
model.build_vocab(sentences)

start_alpha = config.start_alpha  # 0.025 by default
alpha_step = config.alpha_step    # 0.001 by default
epochs = config.epochs            # 25 by default

model.alpha = start_alpha

for epoch in range(epochs):
    print("Training epoch", epoch)
    model.train(sentences)
    model.alpha -= alpha_step
    model.min_alpha = model.alpha

(I've paraphrased a lot here.) Using htop, I can see that only one core is actually in use; the rest are idle.

I start from an Ubuntu 14.04.4 image on an EC2 c4.8xlarge. My provisioning script:

#!/bin/bash

add-apt-repository -y ppa:rwky/redis
sudo DEBIAN_FRONTEND=noninteractive apt-get -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" update
sudo DEBIAN_FRONTEND=noninteractive apt-get -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" upgrade
apt-get install -y redis-server ntp python-software-properties software-properties-common python-virtualenv
apt-get install -y python-dev python-pip git gfortran libopenblas-dev liblapack-dev postgresql postgresql-contrib libpq-dev python-numpy # python-scipy
pip install virtualenv

My requirements:

behave>=1.2.4
boto>=2.32.0
pep8
python-dateutil
gevent
grequests
psycopg2
cython
scipy==0.16.1
gensim==0.12.4
parse
parse_type

I've tried setting scipy to 0.15.1 and gensim to 0.12.1, then all versions of gensim up to 0.12.4 with scipy 0.16.1. All say FAST_VERSION = 1. None of them uses more than one core at a time.

I've also tried the numpy from apt (1.8.6), as well as the latest (1.11.0), with no change.

@gojomo
Collaborator

gojomo commented Apr 29, 2016

How many examples and words are in your 'sentences'?

@mmroden

mmroden commented Apr 29, 2016

Here's a sample from when the code starts running:

2016-04-29 21:27:07,360 - gensim.models.doc2vec - INFO - collecting all words and their counts
2016-04-29 21:27:07,360 - gensim.models.doc2vec - INFO - PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2016-04-29 21:27:08,747 - gensim.models.doc2vec - INFO - PROGRESS: at example #10000, processed 6997928 words (5047407/s), 201811 word types, 5535 tags
2016-04-29 21:27:10,101 - gensim.models.doc2vec - INFO - PROGRESS: at example #20000, processed 14040529 words (5201383/s), 286697 word types, 9189 tags
2016-04-29 21:27:11,004 - gensim.models.doc2vec - INFO - collected 328446 word types and 11126 unique tags from a corpus of 26368 examples and 18625550 words
2016-04-29 21:27:12,441 - gensim.models.word2vec - INFO - min_count=1 retains 328446 unique words (drops 0)
2016-04-29 21:27:12,441 - gensim.models.word2vec - INFO - min_count leaves 18625550 word corpus (100% of original 18625550)
2016-04-29 21:27:12,996 - gensim.models.word2vec - INFO - deleting the raw counts dictionary of 328446 items
2016-04-29 21:27:13,014 - gensim.models.word2vec - INFO - sample=0 downsamples 0 most-common words
2016-04-29 21:27:13,014 - gensim.models.word2vec - INFO - downsampling leaves estimated 18625550 word corpus (100.0% of prior 18625550)
2016-04-29 21:27:13,014 - gensim.models.word2vec - INFO - estimated required memory for 328446 words and 500 dimensions: 1568173400 bytes
2016-04-29 21:27:13,435 - gensim.models.word2vec - INFO - constructing a huffman tree from 328446 words
2016-04-29 21:27:22,904 - gensim.models.word2vec - INFO - built huffman tree with maximum node depth 24
2016-04-29 21:27:23,029 - gensim.models.word2vec - INFO - resetting layer weights
('Training epoch', 0)
2016-04-29 21:27:27,697 - gensim.models.word2vec - INFO - training model with 36 workers on 328446 vocabulary and 500 features, using sg=1 hs=1 sample=0 negative=0
2016-04-29 21:27:27,697 - gensim.models.word2vec - INFO - expecting 26368 sentences, matching count from corpus used for vocabulary survey
2016-04-29 21:27:44,930 - gensim.models.word2vec - INFO - PROGRESS: at 0.06% examples, 578 words/s, in_qsize 72, out_qsize 71
2016-04-29 21:28:01,875 - gensim.models.word2vec - INFO - PROGRESS: at 3.65% examples, 20037 words/s, in_qsize 72, out_qsize 71
2016-04-29 21:28:19,105 - gensim.models.word2vec - INFO - PROGRESS: at 7.30% examples, 26352 words/s, in_qsize 72, out_qsize 71
2016-04-29 21:28:35,995 - gensim.models.word2vec - INFO - PROGRESS: at 10.98% examples, 29808 words/s, in_qsize 72, out_qsize 71

@mmroden

mmroden commented Apr 29, 2016

Looking at this, I'm wondering if the 18 million words being constrained to 300k at a time might be a bottleneck. I have more than enough RAM to handle more than that; any way to increase? Or could there be something else at play? If I set dbow_words=0, things are certainly faster, but still only use one core.

@gojomo
Collaborator

gojomo commented Apr 29, 2016

Where are you seeing a "300k at a time" constraint?

With all the version-variants you've tried, you may want to make absolutely sure of the effective FAST_VERSION value, by printing it just before the training. (Though, you should also see a logged warning if using the pure-Python fallback.)

Given how much of the process is still in Python and subject to the GIL, you're unlikely to see full 36-core parallelism, but you should be seeing something better than 1. It's possible the attempt at 36 threads is making things harder, so it'd be worth trying smaller worker counts – especially 2 to 16. (If none of these show any hint of using more cores, which I have seen working on Ubuntu 14.04 with 8 workers on an 8-core machine, then I'd start to suspect some other odd system-specific limit: something in the support libraries or processor-affinity settings or some such.)

The part that can thread-parallelize is the cython 'nogil' regions. Generally, things which make the algorithm spend more time there increase the parallelism (when it's working at all). So: a larger window, or more negative-examples (when using negative-sampling), or enabling dbow_words concurrent word-training, can sometimes be marginally 'free' (in total run-time) because even though more work is being done, it's all taken by the higher parallelism.

Recent gensim also batches multiple examples, up to a total word-count of 10000 (MAX_WORDS_IN_BATCH = MAX_SENTENCE_LEN), into a single nogil training loop entry – this helped parallelism a lot with very-small examples (5-20 words). That may not help as much with your average 700-word examples (but it also shouldn't be backfiring). Essentially, your ~18M words will be broken into ~1800 batches to the individual threads, which should be enough to see some multi-core usage. If you're comfortable modifying the PYX code, re-cython-compiling and then re-c-compiling, you could up that 10K to a much higher number to try even larger batches into the parallelizable blocks – but it's probably only worth trying that incremental step after you see at least some parallelism.
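A minimal sketch of the sanity checks mentioned here, assuming a gensim version recent enough to expose MAX_WORDS_IN_BATCH in gensim.models.word2vec:

import gensim.models.word2vec as w2v
import gensim.models.doc2vec as d2v

# A non-negative FAST_VERSION confirms the cython paths are active in *this*
# interpreter; MAX_WORDS_IN_BATCH is the word budget per nogil training call.
print("word2vec FAST_VERSION:", w2v.FAST_VERSION)
print("doc2vec FAST_VERSION:", d2v.FAST_VERSION)
print("MAX_WORDS_IN_BATCH:", w2v.MAX_WORDS_IN_BATCH)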

@mmroden

mmroden commented Apr 30, 2016

So I've tried dropping the number of workers to 1, 2, 4, etc., and they all go at the same speed as on the 36-core instance, and htop shows a single core being used. It just feels like, somehow, the threads are still being constrained to a single core.

I remember getting this to work in the past, so I'm wondering if I was misremembering that, or if there is something that's changed in those apt-get updates that's breaking things.

How do I enable dbow_words concurrent word-training? Is that changing the setting from 1 to 0?

I don't know how to read the numpy config settings, but is there something in here that looks incorrect?

>>> import gensim
>>> import scipy
>>> scipy.show_config()
blas_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib']
    language = f77
lapack_info:
    libraries = ['lapack']
    library_dirs = ['/usr/lib']
    language = f77
atlas_threads_info:
  NOT AVAILABLE
blas_opt_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
openblas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
lapack_opt_info:
    libraries = ['lapack', 'blas']
    library_dirs = ['/usr/lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
atlas_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE
>>> gensim.models.doc2vec.FAST_VERSION
1

It looks similar enough to the output in #336 that nothing stood out as a warning flag to me, but as I mentioned, I'm not sure how to read that.

@tmylk
Contributor

tmylk commented May 1, 2016

You have most probably checked this, but maybe there was some CPU-binding setup using the taskset command? Can you get a simple Python parallel-process example to use multiple cores?

@piskvorky
Owner

@tmylk don't use this style of (email?) replies -- it pollutes GitHub with quoted text and makes the conversation hard to follow.

@gojomo
Collaborator

gojomo commented May 2, 2016

dbow_words=1 means you've already got the simultaneous word-training active (which would normally help parallelism a little).

Occasionally when people have more than one python/virtualenv, they're checking the FAST_VERSION in a different one than is running their application code – so I would make absolutely sure that the code that's running slow/single-core is reporting a non-negative FAST_VERSION.

Also, while I have no reason to think it's a problem here, I generally try to install anything that can be installed with pip, with pip (rather than apt-get) – so I'd not install python-numpy or python-virtualenv via apt-get. (You might be making some redundant or not-optimally-available installations.)

You could also try using conda, which I've had good luck with on Ubuntu. I usually start with the 'miniconda' installer (http://conda.pydata.org/miniconda.html), and create an environment based on the packages that conda does well (numpy/scipy/ipython/notebook). But then, use 'pip' to install gensim (because conda's version lags). Again, no specific reason to think this will help with your issues, but it's worth a try just to mix things up, given the mystery.

@mmroden

mmroden commented May 3, 2016

What would I be looking for if some kind of taskset-style command were run?

For this provisioned instance (or in a local vagrant, whichever), there's only one venv, and gensim is only installed in the venv. I can only import gensim to check FAST_VERSION in that venv, so I feel pretty confident that there's no pollution that way.

I tried to use 1.11 of numpy, but that didn't seem to alleviate the problem. I'm going to wipe my venv and try again, but I've tried that numerous times in the last little while.

I think maybe the best thing would be that I not use my own code, but maybe use some kind of test case. Is there one, and if so, where is it?

@tmylk
Contributor

tmylk commented May 3, 2016

Hi mmroden

There is a gensim doc2vec tutorial that is parallelized at https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb

It is also easy to test if Python can parallelise some simple task, say x^2, on many cores on your instance.

@mmroden

mmroden commented May 3, 2016

It seems like it might be useful to have this kind of test available in general, given that I'm not the one who opened the ticket. Let me see what I can write up, but would it make sense to roll whatever I make back as a potential contribution?

@tmylk
Contributor

tmylk commented May 3, 2016

It would be useful to have an example that tests whether any Python parallelisation is possible. It has a place in this issue and in the FAQ for troubleshooting. But it is a general Python thing, not particularly related to gensim, so we would not merge it in as a PR.

@mmroden

mmroden commented May 3, 2016

Here is my timing test code. When it runs, I can clearly see all processors being used in htop. When I run the model creation code above, I only see one core working (i.e., the original problem). So this isn't a general Python multithreading issue.

I'm running this locally, rather than in the ec2 instance, on a virtual machine with 4 cores. The machine is provisioned using that exact same provision.sh script, and I updated the requirements to include numpy > 1.10 (which overrides the system numpy).

Is there something wrong in the way that I'm creating the model, some parameter I didn't set? I really am genuinely baffled here.

import time
from multiprocessing import Pool
from math import sqrt


def f(x):
    return sqrt(x) * sqrt(x) * sqrt(x) * sqrt(x)  # just some busywork


def print_timing(func):
    def wrapper(*arg):
        t1 = time.time()
        res = func(*arg)
        t2 = time.time()
        print '%s took %0.3f ms' % (func.func_name, (t2-t1)*1000.0)
        return res
    return wrapper


@print_timing
def get_single_thread_time(n):
    return_list = []
    for i in xrange(0, n):
        return_list += [f(i)]
    return return_list


@print_timing
def get_pool_time(n):
    pool = Pool()
    return pool.map(f, xrange(n))


if __name__ == "__main__":
    n = 10000000
    single_thread = get_single_thread_time(n)
    pooled = get_pool_time(n)
    assert pooled == single_thread

@gojomo
Collaborator

gojomo commented May 3, 2016

Unfortunately multiprocessing.Pool uses full-blown processes, so isn't really testing the same sort of multithreading as gensim Word2Vec/Doc2Vec. Using a ThreadPool and numpy operations that will likely release the Python GIL (allowing multithreaded speedup) would be a better test.

Locally, when I run the following (in an ipython notebook on OSX), it takes about 18 seconds, and reported CPU utilization of the Python process exceeds 500% – indicating multiple cores are engaged. I would also expect it to show multiple cores engaged under htop. Does it in your environment?

import numpy as np

def f(x): 
    return np.sum(np.sqrt(x))

from multiprocessing.pool import ThreadPool

pool = ThreadPool(8)
a = np.arange(10000000)
b = [a] * 1000
c = pool.map(f, b)

(While this ticket's description seems appropriate for your issue, I highly suspect that the original reporter's issue was something in their corpus streaming. It looks like your code brings all the examples into python-objects in main-memory before training begins, so I don't suspect that yours is really the same issue. Though, if in fact your corpus iterator is hiding some automatic on-demand fetching of content on each access/repeated-iteration, that might be a factor for you too. Is the single core that htop is showing in use at max-CPU-utilization?)

@mmroden

mmroden commented May 4, 2016

So the posted code pegs all CPUs, and htop shows that one cpu is at max during gensim training.

Would there be a way to check whether there could be some resource contention or locks? In the other ticket I noticed some people using gdb to attach to the process at random intervals and finding randint calls; could something like that be at work here as well?

@gojomo
Collaborator

gojomo commented May 4, 2016

The randint issue was fixed a few releases ago, so it's not (exactly) that. Does your corpus iterator return simple, in-memory lists-of-strings? Does the doc2vec-IMDB.ipynb demo notebook manage to utilize multiple cores on your system?
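A minimal sketch of that check, assuming sentences is the in-memory sequence built earlier:

# Each item should be a plain words-list plus tags-list, with nothing lazy
# or DB-backed hiding behind the attributes.
first = sentences[0]
print(type(first), type(first.words), type(first.tags))
print(first.words[:10], first.tags)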

@mmroden

mmroden commented May 4, 2016

More debugging: I do see that all cores are used when this step is shown:

2016-05-04 02:56:20,311 - gensim.models.doc2vec - INFO - precomputing L2-norms of doc weight vectors

That leads me to believe that there's some hidden generator in my corpus. I'm creating it like so:

def clean_doc(doc):
    # return re.sub('[^a-zA-Z]', ' ', doc).lower().split()
    return doc.lower().split()  # retaining punctuation

sentences = tuple(LabeledSentence(clean_doc(key),
                                  value) for key, value in corpus)

where tuple can be swapped out interchangeably with list and there's no difference in behavior. With lots of domain-specific revisions, the actual 'corpus' is built via:

training_data = []
for url_data in url_collection:
    body = url_data.body
    training_data.append([body, tuple("ID_" + str(id) for id in training_hash[url_data.url])])

Which looks like it would hold everything in memory to me, and when I change that tuple to an array, there's no behavior change. Maybe a profiler or something is the right way to go here, because if I am getting choked on some kind of generator, I'm not seeing it.

Side note: when I create the model like so:

model = Doc2Vec(size=vector_count, min_count=1, dm=0, iter=5,
                dm_mean=1, dbow_words=0, workers=NUM_WORKERS)

(note iter=5)

I get this crash:

2016-05-04 12:56:13,546 - gensim.models.word2vec - INFO - PROGRESS: at 99.96% examples, 239200 words/s, in_qsize 1, out_qsize 7
2016-05-04 12:56:13,546 - gensim.models.word2vec - INFO - worker thread finished; awaiting finish of 3 more threads
2016-05-04 12:56:13,546 - gensim.models.word2vec - INFO - worker thread finished; awaiting finish of 2 more threads
2016-05-04 12:56:13,547 - gensim.models.word2vec - INFO - worker thread finished; awaiting finish of 1 more threads
something broke in trial 0, continuing

Not sure if that's a separate issue. I can get the stack trace if that helps. This issue does not occur if iter=1.

@tmylk
Contributor

tmylk commented May 4, 2016

Regarding the side note: a stack trace in a separate issue would be greatly appreciated.

@mmroden

mmroden commented May 4, 2016

I've removed all try/excepts -- I'm having a hell of a time reproducing it. If I see it again, I'll file the trace as another issue.

@gojomo
Collaborator

gojomo commented May 4, 2016

Does the doc2vec-IMDB.ipynb demo notebook manage to utilize multiple cores on your system? This is an important question to resolve, since if it does, we know that the Python/numpy/scipy/gensim-cython code is working on your system, and don't need to do further investigation of those factors.

The "something broke in trial 0, continuing" message does not look like a gensim printout. What does it mean? Is your code printing it after some sort of timeout or other test of the results?

So that training_data object is also exactly your corpus variable? That the variable names vary makes me suspicious that some other changes are happening between creation and conversion to sentences.

You could try to make certain the sentences object is a simple list of TaggedDocuments (the preferred replacement for LabeledSentence), with simple lists-of-strings as its 'words' and 'tags', by doing:

sentences = [TaggedDocument(words=clean_doc(key), tags=[str(t) for t in value]) 
             for key, value in corpus]

@mmroden

mmroden commented May 4, 2016

So it looks like the model training from the ipython notebook does run in parallel-- well, this line works:

>>> for epoch in range(passes):
...     shuffle(doc_list)  # shuffling gets best results
...     
...     for name, train_model in models_by_name.items():
...         # train
...         duration = 'na'
...         train_model.alpha, train_model.min_alpha = alpha, alpha
...         with elapsed_timer() as elapsed:
...             train_model.train(doc_list)
...             duration = '%.1f' % elapsed()

So that means that the problem is more in the data preparation side, right?

The 'something broke' message is from my code-- I was getting random failures when I was setting everything up, but those failures stopped, so I didn't remove it until now.

I actually have a group of training data sets, and a holdout set-- I was comparing training set formation and its effect on whether the system would reproduce the label on the set. Hence the different names, there's a function call in between.

When I switch to TaggedDocument instead of LabeledSentence, htop shows that the sum of processor usage across all cores is 100% (i.e., all cores get activated to various degrees, but the total work done adds up to 100%), whereas with LabeledSentence, one core would be maxed out at a time.

@gojomo
Collaborator

gojomo commented May 4, 2016

LabeledSentence is a synonym for TaggedDocument retained for backward-compatibility, so it's not that change that's making a difference. It's instead the forcing of tags to be a simple list-of-strings. There was likely something indirect/laggy about the 'value' objects coming back from your corpus object.

Now that you've seen multiple cores engaged (in both the doc2vec-IMDB.ipynb example, and your own code), we know that's working – the remaining issue is increasing core-utilization. You should again try varying things like window, workers, size, and the training-modes to see if there are combinations that drive utilization higher. But, as I'd mentioned, there is still enough of the process that's happening in single-GIL-threaded python that you're unlikely to get 36 cores fully engaged. But you should be able to get some speedup over workers=1 – that's the goal to optimize. (If tweaks improve runtime over a workers=1 case, then you're getting some benefit from multiple cores.)
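For illustration, a minimal sketch of that workers=1 comparison, using the train() signature of the gensim versions discussed in this thread (parameter values are just examples):

import time
from gensim.models.doc2vec import Doc2Vec

# Train the same in-memory corpus with several worker counts; whichever
# beats the workers=1 wall-clock time is delivering real multi-core benefit.
for workers in (1, 2, 4, 8):
    model = Doc2Vec(size=500, min_count=1, dm=0, dbow_words=1, workers=workers)
    model.build_vocab(sentences)
    start = time.time()
    model.train(sentences)
    print("%d workers: %.1f s" % (workers, time.time() - start))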

@mmroden

mmroden commented May 4, 2016

I literally just swapped out LabeledSentence for TaggedDocument -- I made no other changes. And the summation of CPU usage is still capped at 100%; it's just spread out over more cores. I'm doing this on a 4-core local virtual machine, since running the 36-core machine for this is a bit too inefficient. But I do get that there must be something blocking the use of multiple cores here. I'll check out those other parameters.

@piskvorky
Owner

piskvorky commented May 5, 2016

I think the issue should remain open -- we're really interested in figuring this out. We'll assist you with the debugging any way we can.

The sentences = ... line that @gojomo suggested is a good starting point. Does that work, i.e. use more than 100% CPU with workers > 1? What is the distribution of document lengths & number of tags per document?

@mmroden

mmroden commented May 5, 2016

This is the format of the data that goes into the corpus line:

[u'Cancel\nA migrant runs away from tear gas during clashes with Macedonian police at the northern Greek border point of Idomeni, Greece, Wednesday, April 13, 2016. Thousands of migrants protested at the border and clashed with Macedonian police. (AP Photo/Amel Emric) The Associated Press\nBy COSTAS KANTOURIS, Associated Press\nIDOMENI, Greece (AP) \u2014 More than 100 migrants engaged in running battles Wednesday with Macedonian police on the other side of a fence on Greece\'s border with the country, in clashes that sent clouds of tear gas wafting over a crowded tent city of stranded refugees and other migrants.\nThe violence stopped a planned tour of the border fence in Macedonia by the visiting presidents of Croatia and Slovenia.\nNo injuries were reported from the clashes at the closed Idomeni crossing, while Greek riot police monitoring the stone-throwing migrants on their side of the fence made no arrests, did little to intervene and retreated during the tear-gas barrage.\nMacedonian police fired scores of tear gas canisters, stun grenades and rubber bullets at the protesters, who had earlier tried to scale the border fence using blankets issued by humanitarian groups to get over coils of razor wire. Many of the canisters were neutralized by blankets and earth thrown over them by the protesters.\nAbout 11,000 people have been living in the informal camp for weeks, since Macedonia closed its border to transient refugees and other migrants hoping to move north towards Europe\'s prosperous heartland. Before the shutdown, which was triggered by a similar move in Austria, further north on the migration corridor, about 850,000 people who had arrived in Greece on smugglers\' boats from Turkey had entered Macedonia from Idomeni.\nThe camp residents \u2014 mostly Syrian, Iraqi and Afghan refugees \u2014 have ignored repeated calls from Greek authorities to relocate to organized camps, and attempted several mass incursions into Macedonia in recent weeks, trying to bypass the fence or break through it.\nAlaeddin Mohamad, a 26-year-old law student from Aleppo, Syria, who has lived in the camp for a month, told The Associated Press that the protest started with a peaceful sit-down in front of the fence, and Macedonian police responded with tear gas.\n"We don\'t want to clash. We want the borders to open and get on with our lives," Mohamad said. "I want to continue my studies in Europe. I will stay here until the border opens. Otherwise, I will die here."\nOn Sunday, severe clashes between stone-throwing migrants and Macedonian police using tear gas, stun grenades, rubber bullets and a water cannon left scores injured. The violence increased friction between the two Balkan neighbors \u2014 at odds for a quarter-century over Macedonia\'s official name \u2014 with Macedonia accusing Greece of doing nothing to stop the rioters and Athens denouncing Skopje\'s heavy-handed response.\nOn the Macedonian side of the border, the presidents of Macedonia, Croatia and Slovenia \u2014 whose countries have sent police to help Macedonia guard its border \u2014 met in the town of Gevgelija, a few kilometers from Idomeni.\nCroatia\'s Kolinda Grabar Kitarovic said the European Union should clarify its immigration policy and send a clear message to migrants stranded in Balkan countries who are hoping the old route will reopen.\nIn comments to the press, she said that the wave of immigration will not stop "until (migrants) got a clear message" on who is eligible for asylum. 
The clashes prevented a planned visit to the closed border crossing.\n___\nKonstantin Testorides contributed from Gevgelija, Macedonia.\nCopyright 2016 The\xa0 Associated Press . All rights reserved. This material may not be published, broadcast, rewritten or redistributed.\n', 
['ID_30618076']]

As such, the key gets sent through clean_doc (as described above), and the tags are already set to an array of string values. Should tags be set as an array of arrays? I think I'm sending things in properly, as I get very reasonable results for my holdout-set validation, which suggests to me that the model has the labels.

The corpus is randomly selected for each holdout run, but a typical run will have ~26k - ~30k items. The document lengths are in the neighborhood of an average of ~706 words, with a max of ~18821 and a min of 3. The number of tags can range from 1 to 11, with an average of 1.05 (ie, most documents are labeled once).

@gojomo
Collaborator

gojomo commented May 5, 2016

Please, on a system where you can confirm seeing the "only one core used" problem with the original sentences = … line, try the alternate line I suggested. (In that line, the significant change was not the shift to TaggedDocument. You can leave it as LabeledSentence but make the other changes on that line.) Is that one change alone enough to change the number-of-cores-utilized (or change the observed behavior/results in any way)? If so, there's a bottleneck or bit of unclear magic in how your corpus object works.

Otherwise, the size/shape of the data shouldn't have big effects on the throughput, at least not in the latest release (where a batching-of-small-examples optimization exists).

(Side note: due to implementation limits in the cython path, a document of more than 10000 words will only have its first 10000 words considered. A workaround suitable for most cases is to split the document at 10000-word intervals, but re-use the exact same tags for each sub-document. The training effect is essentially the same, and identical but for small ordering effects in pure DBOW.)
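A minimal sketch of that workaround (the helper name is hypothetical): split any over-long document into 10000-word chunks that all reuse the same tags:

from gensim.models.doc2vec import TaggedDocument

MAX_WORDS = 10000  # the cython path silently ignores words beyond this limit

def split_long_doc(words, tags, max_words=MAX_WORDS):
    # Each chunk keeps the document's full tag list, so the training effect
    # is essentially the same as if the whole document had been used.
    return [TaggedDocument(words=words[i:i + max_words], tags=tags)
            for i in range(0, len(words), max_words)]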

@mmroden

mmroden commented May 5, 2016

I have done the alteration, no change.

I'm thinking that what we may need to do is set up some kind of remote debugging session, especially since you guys want to see this in action. Would that work? Who would I contact about doing that? Best case scenario, after five seconds, it's obvious that I screwed something up, and that becomes apparent when the whole system is in view.

@gojomo
Collaborator

gojomo commented May 5, 2016

So was it "only one core" both before and after the change? Earlier you mentioned something changed the observed behavior from "only one core" to "many cores but still not more than 100% total". Can you still toggle between those two behaviors?

A more precise way to log core-utilization may be to use a command like mpstat. For example, mpstat -P ALL 2 30 will print per-core utilization every 2 seconds for 30 intervals. Starting this, then executing the train() step in another session, would give a stronger picture of how cores are/aren't utilized as training gets underway. (The transition from idle to all-threads-started, and then just 10-20 seconds of training-in-progress, would tell a lot, even if the full train() isn't recorded.)

Is this on the local VM or the AWS VM? If the local VM, has the local VM ever reproduced the "exactly one core" condition? What is the local virtualization system used?

For open-source support, I prefer the back-and-forth of a discussion log: it forces precise communication and incremental isolation, and creates an archived series of reference steps others can learn from in future similar situations.

If a full set of code & data that reproduces the problem elsewhere can be shared, that's great too.

But if fixing it requires looking (even just for a little while) at proprietary code/systems that can't be shared, that's a consulting gig.

@mmroden

mmroden commented May 26, 2016

I've looked further into this, and I think I have a potential solution (sorry for the delay, but, you know, life).

I'm seeing a single core being used when the number of sentences is near to the number of labels; that is, if I have 25k sentences and 10k labels, that uses one core. But! If I have 160k sentences and 1k labels, that uses all cores.

So this:

10737 unique tags from a corpus of 23144 examples

Is 1 core, regardless of the size of the box, while this:

1081 unique tags from a corpus of 156068 examples

Uses 36 cores pretty well.

Does that observation jibe with how the algorithm is parallelizing internally?

@piskvorky
Owner

Interesting finding and good question, @mmroden! @gojomo ideas?

@jlorince

jlorince commented Oct 3, 2016

I'll just add to the discussion, noting that I'm having a similar issue. On both Windows and an Ubuntu box, I see no obvious evidence of Doc2vec parallelizing. I'm feeding the model with a TaggedLineDocument (~20M lines) like so:

documents = TaggedLineDocument(abs_dir+'docs{}.txt.gz'.format(test))
model = Doc2Vec(documents, size=200, window=5, min_count=5,workers=16)

This was originally happening with 0.12.4, but upgrading to 0.13.2 didn't change things.

Was there any movement on the point by @mmroden about the ratio of labels to examples? In my case I'm not labeling any documents, so if I understand things correctly the number of labels and documents should be equal.

@gojomo
Collaborator

gojomo commented Oct 3, 2016

@mmroden - I can't think of a good reason that'd make a difference... and to the extent I can imagine possible relationships, I might expect the alternate relation to hold: more tags might parallelize better, because different cores (which might not share the same CPU cache) would be writing to overlapping memory ranges slightly less often.

The bigger difference I see between your two scenarios is the number of examples. (The batching that happens means small datasets may never get a chance to spread over many threads.) Does the perceived relationship persist with a larger dataset?

@gojomo
Collaborator

gojomo commented Oct 3, 2016

@jlorince The 1st thing to check: are you sure the cython-optimized versions are running? (There's effectively no parallelism possible without them.) You can check this by ensuring gensim.models.doc2vec.FAST_VERSION > -1.

Next is whether your IO/decompression (which can only happen in the master thread) is the real bottleneck. If you have the memory to read all examples into a list, then use that as the corpus, does that help?

Because large parts of the code are still subject to the Python GIL, saturating all cores is unattainable, and indeed trying to use more workers can decrease total throughput through more contention (so also try values between 1 and NUM_CORES to see if it helps). But I've always seen at least some multi-thread activity. How are you monitoring the thread activity (specifically on Ubuntu) and determining that parallelization is not happening?

@jlorince

jlorince commented Oct 4, 2016

I checked on both systems, and got a FAST_VERSION of 1, so that doesn't seem to be the problem. I'll experiment with loading the data fully into memory if I can (would the simplest approach be doclist = [doc for doc in documents], where documents = TaggedLineDocument(path_to_file)?)

@tmylk
Contributor

tmylk commented Oct 4, 2016

@jlorince Confirmed: loading the docs into a list is the right approach to load them into memory and exclude I/O.

@gojomo
Collaborator

gojomo commented Oct 4, 2016

@jlorince Yes, doclist = [doc for doc in documents] should work, if you've got the memory. (And if not, on some large subset of the documents that does fit.) It may be interesting to time the construction of the list, just to get a sense of the IO/decompression overhead... for reference even if you later go back to repeated-iteration-from-disk.
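A minimal sketch of that timing (the file path is a placeholder):

import time
from gensim.models.doc2vec import TaggedLineDocument

documents = TaggedLineDocument('docs.txt.gz')   # placeholder path

start = time.time()
doclist = [doc for doc in documents]            # one full pass: IO + gunzip + tokenizing
print("materialized %d docs in %.1f s" % (len(doclist), time.time() - start))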

@jlorince

jlorince commented Oct 4, 2016

OK, just ran it on a subsample of documents, loading everything into memory, and I do see evidence of parallelization (in htop on Linux; will test on Windows next). Regarding the discussion above of the pros/cons of multiple cores: if we have the memory to load all the raw data into RAM, is there still a downside to using as many workers as cores?

@gojomo
Collaborator

gojomo commented Oct 4, 2016

The optimal number of worker threads will probably still be somewhere between 1 and the number of cores. Pre-loading into memory eliminates the bottleneck of single-threaded IO/decompression as a cause of idle threads. There's still the issue of the Python GIL, which means the pure-Python portions of the process contend with each other, and beyond some count of workers, such contention can mean more workers will stop helping and start hurting overall throughput. (As noted in some of my comments above, some parameter choices, by allowing more to be done inside the optimized cython GIL-oblivious blocks, can also see less-contention/higher-utilization.)

@tmylk
Contributor

tmylk commented Oct 5, 2016

@jlorince Thanks for testing it. Waiting for an update on Windows. If it parallelizes there, then I will close the issue.

@tmylk
Contributor

tmylk commented Oct 5, 2016

Leaving open to investigate the relationship between parallelization and the tags/docs ratio in #532 (comment).

@tmylk added the bug and difficulty hard labels on Oct 5, 2016
@christinazavou

christinazavou commented May 8, 2017

My code also does not parallelize.

"""--------------------------------------------My inputs----------------------------------------------------"""
class LabeledLineSentence(object):
def init(self, filename, field):
self.filename = filename
self.field = field

def __iter__(self):
    df = read_df(self.filename)
    for idx, row in df.iterrows():
        text = eval_utf(row[self.field])
        tokens = [item for sublist in text for item in sublist]
        yield TaggedDocument(words=tokens, tags=['SENT_%s' % idx])

inputs = LabeledLineSentence(self.filename, self.field)

"""------------------------------------1st WAY----------------------------------------------------------"""
self.model = Doc2Vec(inputs, size=500, window=10, min_count=5, workers=3,
dbow_words=False, max_vocab_size=200000, iter=20, dm_mean=1)

"""------------------------------------2nd WAY----------------------------------------------------------"""
self.model = Doc2Vec(size=500, window=10, min_count=5, workers=3,
dbow_words=False, max_vocab_size=200000, , dm_mean=1, alpha=0.025, min_alpha=0.025)
self.model.build_vocab(inputs)
for epoch in range(iter):
self.model.train(inputs)
self.model.alpha -= 0.002
self.model.min_alpha = self.model.alpha

I've tried both ways on Windows, with gensim.models.doc2vec.FAST_VERSION > -1 confirmed, but neither way utilizes more than one core. I also don't understand why the first way runs in 5 minutes and the second way runs in 25 minutes.

I hope I don't do anything stupid. Thanks in advance.

P.S. I have 10845 unique tags from a corpus of 10845 examples. Could this, in the end, be the reason the rest of the cores aren't being utilized?

@tmylk
Contributor

tmylk commented May 10, 2017

@christinazavou Try inputs = list(LabeledLineSentence(self.filename, self.field)) to make sure that it is pre-loaded in memory and exclude any I/O bottlenecks.

@christinazavou

christinazavou commented May 10, 2017

@tmylk Thanks! This reduces time (the 1st way now runs in 1 minute and the 2nd way in 7 minutes). However, it still utilizes only one core. (I shouldn't have problems running on multiple cores, because gensim's LdaMulticore runs as expected.)

@tmylk
Contributor

tmylk commented May 15, 2017

@christinazavou That sounds like an issue. Could you please post more about your config?

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)

@gojomo
Collaborator

gojomo commented May 15, 2017

@christinazavou This would be better discussed on the project discussion list, as it does not appear you're hitting the known issue tracked here (that gensim doesn't parallelize as much as we'd like beyond 2-8 cores), but rather something specific to your usage.

That said, there's no surprise that "2nd way" is slower - every call to train() will iterate over your corpus multiple times per the value of the model's iter parameter (default 5). (It's rare to need to call train() multiple times, and not-recommended, and the latest versions of gensim will actually break this style of usage unless an explicit epochs parameter is used which confirms the caller knows what they're doing.)

Also, what are you using to monitor that only one core is being used, and at what stage of the process? The initial creation of the list (reading from files), and the initial scan of that list to discover the corpus vocabulary, will each only use one core no matter what; it's only the model training passes that can start to use multiple cores. So only after your logging indicates that training has begun should you check if multiple cores are engaged, and with workers=3 you'll see at most 4 cores engaged – the main thread and 3 workers.
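For illustration, a minimal sketch of the single-call style being recommended for the gensim versions discussed in this thread, reusing the parameters from the snippet above (the alpha schedule is left to the model):

from gensim.models.doc2vec import Doc2Vec

# One constructor call: the model runs `iter` passes itself and decays the
# learning rate from alpha to min_alpha; no manual loop over train() needed.
model = Doc2Vec(inputs, size=500, window=10, min_count=5, workers=3,
                max_vocab_size=200000, iter=20, dm_mean=1,
                alpha=0.025, min_alpha=0.0001)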

@gojomo
Collaborator

gojomo commented Oct 10, 2017

While there's some useful discussion here, the oldest report of such Word2Vec/Doc2Vec bottlenecks (also with useful discussion) is #336. Closing this issue in favor of continuing discussion of potential improvements there.
