
word2vec (& doc2vec) training doesn't benefit from all CPU cores with high workers values #336

Closed
jticknor opened this issue May 7, 2015 · 42 comments
Labels: difficulty hard (Hard issue: requires deep gensim understanding & high python/cython skills), performance (Issue related to performance, in HW meaning), wishlist (Feature request)

Comments

@jticknor

jticknor commented May 7, 2015

I have been using word2vec with great success on a MacBook Pro, seeing nice speedups with additional workers. However, after moving the code over to a MacPro (24 virtual cores), training actually takes longer as more workers are added. When I run "gensim.models.word2vec.FAST_VERSION" I get a value of 0; the same code gives me a value of 1 on the MacBook Pro. Is there something I am missing in moving over to the Xeon processors?

@piskvorky
Owner

Maybe your input sentence iterator gets saturated? What is the throughput of your sentence stream? Can it sustain that many cores?

@jticknor
Author

jticknor commented May 7, 2015

I am using the LineSentence iterator that comes with gensim. I should have said that on the MacPro the code is slower for any number of workers beyond one. On the MacBook Pro, however, I see improvements similar to those in your blog post (going all the way to 8 virtual cores), with the same code and datasets. My concern was that the output of the FAST_VERSION check differs between the two machines.
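
(To rule out the LineSentence stream as the bottleneck asked about above, here is a minimal sketch that times the iterator on its own; the file path is a placeholder:)

import time
from gensim.models.word2vec import LineSentence

# Time the corpus iterator alone: if it cannot sustain several hundred
# thousand words per second by itself, it will cap multi-worker training.
start, n_words = time.time(), 0
for sentence in LineSentence("corpus.txt"):  # placeholder path
    n_words += len(sentence)
elapsed = time.time() - start
print("iterator alone: %.0f words/s" % (n_words / elapsed))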

@piskvorky
Owner

Hmm. FAST_VERSION reflects the return type of your BLAS interface (single vs. double). Both 0 and 1 should be fine, it's an internal detail.

Can you post the output of:

import scipy
scipy.show_config()

on both machines?

@jticknor
Author

jticknor commented May 7, 2015

This is for the MacPro:

atlas_threads_info:
NOT AVAILABLE
blas_opt_info:
extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
define_macros = [('NO_ATLAS_INFO', 3)]
atlas_blas_threads_info:
NOT AVAILABLE
openblas_info:
NOT AVAILABLE
lapack_opt_info:
extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
extra_compile_args = ['-msse3']
define_macros = [('NO_ATLAS_INFO', 3)]
openblas_lapack_info:
NOT AVAILABLE
atlas_info:
NOT AVAILABLE
lapack_mkl_info:
NOT AVAILABLE
blas_mkl_info:
NOT AVAILABLE
atlas_blas_info:
NOT AVAILABLE
mkl_info:
NOT AVAILABLE

@thomaskern

I have the same problem on two independent machines, CentOS and Ubuntu for that matter. On my MacBook Pro it works as expected, as shown in your blog post.
The Ubuntu machine is an EC2 c4.2xlarge:

1 workers:
2015-05-08 07:32:42,118 - gensim.models.word2vec - INFO - training on 3406813 words took 8.3s, 409933 words/s
8.73291110992

4 workers:
2015-05-08 07:33:03,731 - gensim.models.word2vec - INFO - training on 3406813 words took 10.5s, 324499 words/s
10.9156110287

8 workers:
2015-05-08 07:29:55,316 - gensim.models.word2vec - INFO - training on 3406813 words took 30.3s, 112300 words/s
30.7568218708

The CentOS machine shows exactly the same behavior. FAST_VERSION is 1 on all 3 machines. The last line of each paragraph is the elapsed CPU time.

ec2:
In [2]: scipy.show_config()
lapack_info:
libraries = ['lapack']
library_dirs = ['/usr/lib']
language = f77
lapack_opt_info:
libraries = ['lapack', 'blas']
library_dirs = ['/usr/lib']
language = f77
define_macros = [('NO_ATLAS_INFO', 1)]
openblas_lapack_info:
NOT AVAILABLE
blas_info:
libraries = ['blas']
library_dirs = ['/usr/lib']
language = f77
atlas_3_10_blas_threads_info:
NOT AVAILABLE
atlas_threads_info:
NOT AVAILABLE
atlas_3_10_threads_info:
NOT AVAILABLE
atlas_blas_info:
NOT AVAILABLE
atlas_3_10_blas_info:
NOT AVAILABLE
atlas_blas_threads_info:
NOT AVAILABLE
openblas_info:
NOT AVAILABLE
blas_mkl_info:
NOT AVAILABLE
blas_opt_info:
libraries = ['blas']
library_dirs = ['/usr/lib']
language = f77
define_macros = [('NO_ATLAS_INFO', 1)]
atlas_info:
NOT AVAILABLE
atlas_3_10_info:
NOT AVAILABLE
lapack_mkl_info:
NOT AVAILABLE
mkl_info:
NOT AVAILABLE

local centos:
In [4]: scipy.show_config()
lapack_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/opt/OpenBLAS/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
runtime_library_dirs = ['/opt/OpenBLAS/lib']
blas_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/opt/OpenBLAS/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
runtime_library_dirs = ['/opt/OpenBLAS/lib']
openblas_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/opt/OpenBLAS/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
runtime_library_dirs = ['/opt/OpenBLAS/lib']
openblas_lapack_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/opt/OpenBLAS/lib']
define_macros = [('HAVE_CBLAS', None)]
language = c
runtime_library_dirs = ['/opt/OpenBLAS/lib']
blas_mkl_info:
NOT AVAILABLE

macbook:
In [2]: scipy.show_config()
lapack_opt_info:
extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
extra_compile_args = ['-msse3']
define_macros = [('NO_ATLAS_INFO', 3)]
blas_opt_info:
extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
define_macros = [('NO_ATLAS_INFO', 3)]

This is the code I am running:

import gensim
import time
import os, sys

from gensim import models
from gensim.models.word2vec import logger, FAST_VERSION
import logging

logger.setLevel(logging.DEBUG)
print FAST_VERSION
ch = logging.StreamHandler(sys.stdout)
ch.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
ch.setFormatter(formatter)
logger.addHandler(ch)


with open("english-text.json", "r") as f:
    lines = [x.lower().strip() for x in f.readlines()]


print "Data loaded"
model = models.Word2Vec(size=100,window=5,min_count=5,workers=4)
model.build_vocab(lines)
model.train(lines)

@piskvorky
Owner

Since your input iterator is fast (for both of you), it must be some sort of resource contention. Maybe CPU cache? Or threading failing for some reason?

I'm out of simple ideas :( we'll have to profile properly and see what's going on in detail.
@gojomo @sebastien-j any ideas?

@thomaskern

@piskvorky
If you want, I can give you access to the EC2 box. If you give me your email address, I can send you the private SSH key.

@sebastien-j
Contributor

While I have never experienced that problem, I get some speed-up on my desktop by forcing BLAS to be single-threaded (in my case export OPENBLAS_NUM_THREADS=1). If each worker tries to use BLAS with multiple cores, I suppose that there could be scheduling issues.

Otherwise, I can't really tell what is going on. It could be worth trying out without hyper-threading, but I'd try to exhaust other possibilities before modifying BIOS settings.
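
(For reference, a minimal sketch of forcing single-threaded OpenBLAS from Python; the variable must be set before numpy/scipy first load the BLAS library, and MKL/ATLAS builds use differently named variables:)

import os

# Equivalent to `export OPENBLAS_NUM_THREADS=1` in the shell; it has to be set
# before numpy (and therefore gensim) is imported, or it has no effect.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy
import gensim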

@gojomo
Collaborator

gojomo commented May 9, 2015

My testing has so far been limited to an MBP where going from 1 to 4 workers has given the expected speedup, the FAST_VERSION reports as '0' (doubles), and the scipy.show_config() info is similar to the above MacPro and MBP reports. (There's one other extra_compile_args value, '-DAPPLE_ACCELERATE_SGEMV_PATCH', but that appears to be related to a different issue.)

It certainly seems like a BLAS issue, but for completeness, it would be good to record the exact OSX and Python versions in play.

A wild guess: perhaps BLAS tries to use "all" the cores it sees on each operation. So each of your 4 python threads may, on each BLAS operation, try to use all 24 virtual cores – meaning a peak attempted use of 96 cores, and thus the throughput loss from contention (both scheduling & L1 cache-stomping). If so, when trying the OPENBLAS_NUM_THREADS setting @sebastien-j mentions, there's a chance that values more than 1 but less than the actual number of cores would be optimal. (For example, 4 workers on "24" virtual cores might be best served by a BLAS num-threads value of ~6. Unsure if that's how that setting works, but it seems worth a try.)

@piskvorky
Owner

Thanks guys.

@gojomo sdot operations (BLAS level 1) are too small to spawn threads, so I doubt that's it. But I have no other ideas, so it may be worth testing anyway :) Something to do with a global lock somewhere / cache contention would be my guess too.

@thomaskern

The setting OPENBLAS_NUM_THREAD has no impact on the behavior, no matter the number (tried 1-8).

@thomaskern

I just ran strace -f on the Python process running word2vec on CentOS 6, and I see millions of lines like the following, repeating once the initialising is done. Is this expected behavior?

[pid 41871] futex(0xe97760, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 41868] futex(0xe97760, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 41871] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 41868] <... futex resumed> )       = 1
[pid 41864] <... futex resumed> )       = 0
[pid 41871] futex(0x104eed0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 41868] futex(0x104eed0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 41864] futex(0xe97760, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 41871] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 41868] <... futex resumed> )       = 0
[pid 41871] futex(0xe97760, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 41868] futex(0x104eed0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 41879] <... futex resumed> )       = 0
[pid 41871] <... futex resumed> )       = 1
[pid 41879] futex(0xe97760, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 41871] futex(0x104eed0, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 41868] <... futex resumed> )       = 0
[pid 41871] futex(0xe97760, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 41868] futex(0xe97760, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 41871] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 41868] <... futex resumed> )       = 1
[pid 41878] <... futex resumed> )       = 0
[pid 41871] futex(0x104eed0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 41878] futex(0xe97760, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 41868] futex(0x104eed0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 41871] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 41868] <... futex resumed> )       = 0
[pid 41871] futex(0xe97760, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 41868] futex(0x104eed0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 41871] <... futex resumed> )       = 1
[pid 41885] <... futex resumed> )       = 0
[pid 41871] futex(0x104eed0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 41885] futex(0xe97760, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 41871] <... futex resumed> )       = 1
[pid 41868] <... futex resumed> )       = 0
[pid 41871] futex(0xe97760, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 41868] futex(0xe97760, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 41871] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 41868] <... futex resumed> )       = 1
[pid 41850] <... futex resumed> )       = 0
[pid 41871] futex(0x104eed0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 41868] futex(0x104eed0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 41850] futex(0xe97760, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 41871] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 41868] <... futex resumed> )       = 0
[pid 41871] futex(0xe97760, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 41868] futex(0x104eed0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 41871] <... futex resumed> )       = 1
[pid 41884] <... futex resumed> )       = 0
[pid 41871] futex(0x104eed0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 41884] futex(0xe97760, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>

On my Mac, executing sudo dtruss -f -p 85586 yields no such thing; I see a lot of this:

85586/0x4ec79d:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5AB0019C5AD00, 0x19C5AB00)         = 256 0
85586/0x4ec79e:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C5B00219C5B100, 0x19C5B000)       = 0 0
85586/0x4ec79e:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C5B10219C5B200, 0x19C5B100)       = 258 0
85586/0x4ec79e:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C5B20219C5B300, 0x19C5B200)       = 0 0
85586/0x4ec79e:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C5B30219C5B400, 0x19C5B300)       = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5B40019C5B500, 0x19C5B400)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5B50019C5B600, 0x19C5B500)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5B60019C5B700, 0x19C5B600)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5B70019C5B800, 0x19C5B700)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5B80019C5B900, 0x19C5B800)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5B90019C5BA00, 0x19C5B900)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5BA0019C5BB00, 0x19C5BA00)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5BB0019C5BC00, 0x19C5BB00)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5BC0019C5BD00, 0x19C5BC00)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5BD0019C5BE00, 0x19C5BD00)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5BE0019C5BF00, 0x19C5BE00)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5BF0019C5C000, 0x19C5BF00)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5C00019C5C100, 0x19C5C000)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5C10019C5C200, 0x19C5C100)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5C20019C5C300, 0x19C5C200)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5C30019C5C400, 0x19C5C300)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5C40019C5C500, 0x19C5C400)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5C50019C5C600, 0x19C5C500)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5C60019C5C700, 0x19C5C600)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5C70019C5C800, 0x19C5C700)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5C80019C5C900, 0x19C5C800)         = 258 0
85586/0x4ec6be:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C5C90219C5CA00, 0x19C5C900)       = 0 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5CA0019C5CB00, 0x19C5CA00)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5CB0019C5CC00, 0x19C5CB00)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5CC0019C5CD00, 0x19C5CC00)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5CD0019C5CE00, 0x19C5CD00)         = 258 0
85586/0x4ec6be:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5CE0019C5CF00, 0x19C5CE00)         = 258 0
85586/0x4ec79d:  psynch_mutexwait(0x7F96DBD6CB08, 0xB016C803, 0xB016C600)        = -1340684285 0
85586/0x4ec79d:  psynch_cvsignal(0x7F96DBD6CAD8, 0x19C5D70019C5D900, 0x19C5D700)         = 256 0
85586/0x4ec79e:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C5E00219C5E100, 0x19C5E000)       = 0 0
85586/0x4ec79e:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C5E10219C5E200, 0x19C5E100)       = 0 0
85586/0x4ec79e:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C5E20219C5E300, 0x19C5E200)       = 0 0
85586/0x4ec79e:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C5E30219C5E400, 0x19C5E300)       = 0 0
85586/0x4ec79e:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C5E40219C5E500, 0x19C5E400)       = 0 0
85586/0x4ec79e:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C5E50219C5E600, 0x19C5E500)       = 258 0
85586/0x4ec79e:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C5E60219C5E700, 0x19C5E600)       = 0 0
85586/0x4ec79e:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C5E70219C5E800, 0x19C5E700)       = 258 0
85586/0x4ec79e:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C5E80219C5E900, 0x19C5E800)       = 0 0
85586/0x4ec79e:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C5E90219C5EA00, 0x19C5E900)       = 0 0
85586/0x4ec79d:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C5F90219C5FA00, 0x19C5F900)       = 0 0
85586/0x4ec79d:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C5FD0219C5FE00, 0x19C5FD00)       = 0 0
85586/0x4ec79d:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C5FE0219C5FF00, 0x19C5FE00)       = 0 0
85586/0x4ec79d:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C5FF0219C60000, 0x19C5FF00)       = 0 0
85586/0x4ec79d:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C6000219C60100, 0x19C60000)       = 0 0
85586/0x4ec79d:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C6010219C60200, 0x19C60100)       = 0 0
85586/0x4ec79d:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C6020219C60300, 0x19C60200)       = 0 0
85586/0x4ec79d:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C6030219C60400, 0x19C60300)       = 0 0
85586/0x4ec79d:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C6040219C60500, 0x19C60400)       = 0 0
85586/0x4ec79d:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C6050219C60600, 0x19C60500)       = 0 0
85586/0x4ec79d:  psynch_cvwait(0x7F96DBD6CAD8, 0x19C6060219C60700, 0x19C60600)       = 0 0

Is this all expected behavior?

@thomaskern

Using gdb on CentOS 6 (20 cores, 40 virtual ones), I randomly picked one of the 20 worker threads (utilization 5%) and saw this:

Thread 1 (process 10075):
#0  0x00000035e640d930 in sem_wait () from /lib64/libpthread.so.0
#1  0x00000000004ddbd8 in PyThread_acquire_lock (lock=0xb468a0, waitflag=<value optimized out>) at Python/thread_pthread.h:324
#2  0x00000000004e21f2 in lock_PyThread_acquire_lock (self=0x7f2dd7523390, args=<value optimized out>)
   at ./Modules/threadmodule.c:52
#3  0x00007f2dcb768c8f in __Pyx_PyObject_Call (func=0x7f2dc59ddb48, arg=<value optimized out>, kw=<value optimized out>)
   at numpy/random/mtrand/mtrand.c:29167
#4  0x00007f2dcb76dfe4 in __pyx_pf_6mtrand_11RandomState_20randint (__pyx_v_self=0x7f2dcb722e50, __pyx_args=<value optimized out>, 
   __pyx_kwds=<value optimized out>) at numpy/random/mtrand/mtrand.c:9604
#5  __pyx_pw_6mtrand_11RandomState_21randint (__pyx_v_self=0x7f2dcb722e50, __pyx_args=<value optimized out>, 
   __pyx_kwds=<value optimized out>) at numpy/random/mtrand/mtrand.c:9459
#6  0x00007f2dc61db181 in __Pyx_PyObject_Call (func=0x7f2dcb723488, arg=<value optimized out>, kw=0x0)
   at ./gensim/models/word2vec_inner.c:9880
#7  0x00007f2dc61dd37f in __Pyx__PyObject_CallOneArg (__pyx_self=<value optimized out>, __pyx_v_model=<value optimized out>, 
   __pyx_v_sentence=0x7f2dc59d4e60, __pyx_v_alpha=<value optimized out>, __pyx_v__work=0x7f2d00000008)
   at ./gensim/models/word2vec_inner.c:9995
#8  __Pyx_PyObject_CallOneArg (__pyx_self=<value optimized out>, __pyx_v_model=<value optimized out>, 
   __pyx_v_sentence=0x7f2dc59d4e60, __pyx_v_alpha=<value optimized out>, __pyx_v__work=0x7f2d00000008)
   at ./gensim/models/word2vec_inner.c:10009
#9  __pyx_pf_5trunk_6gensim_6models_14word2vec_inner_train_sentence_sg (__pyx_self=<value optimized out>, 
   __pyx_v_model=<value optimized out>, __pyx_v_sentence=0x7f2dc59d4e60, __pyx_v_alpha=<value optimized out>, 
   __pyx_v__work=0x7f2d00000008) at ./gensim/models/word2vec_inner.c:5598

Is it possible you are calling randint a bit too often and numpy's RandomState is locking itself out?

@gojomo
Collaborator

gojomo commented May 13, 2015

Re: OPENBLAS_NUM_THREADS – note that should be plural (just in case you tried it as singular as you typed it here), and also the OpenBLAS README mentions in some cases (OPEN_MP in use?) that a different variant is required.

I suppose it's possible the np.random.randint call is a bottleneck; it happens on each (neg-sampling) train_* while still holding the python GIL. If you confirm more threads all waiting there, it may be worth refactoring the code to call it less frequently. But maybe you're seeing a more general issue, where since a significant amount of the training is still in pythonland, its GIL is preventing full multithreading benefits. Improving that might require deeper refactoring. (Perhaps, changing the train_sentence_* methods to take batches as train_sentences_*, and then doing larger batches of work without the GIL.)

@gojomo
Collaborator

gojomo commented Jun 6, 2015

It looks like the randint() calls are the main problem. I recently started testing performance on an 8-core Ubuntu machine, and began seeing the same problem: workers=8 was much, much slower than workers=1. In fact, even workers=3 was worse. And stack samples confirmed many of the threads at numpy's mtrand.c, as per @thomaskern's report. I'm not sure why numpy.random.randint() is such a problem, but the fact that it does its own GIL-release is likely a factor. It was being called once per word, to calculate the reduced_window values; simply replacing that with a call that gets a bunch at once has led to giant speedups. (See preceding PR #355, from single-issue branch https://github.com/gojomo/gensim/tree/fix_randint_slows_workers)

Where more workers had hurt (Ubuntu), they now help up through the number of cores I can test (8). And even where more workers were already helping (OSX), or where the only contention was with the job-dispatcher thread (workers=1), I've seen reductions in training-time for the same amount of data of 20%-40%.
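
(A minimal sketch of the idea behind that change — not the actual gensim internals, just the per-word vs. batched numpy pattern; `window` and `sentence` are illustrative stand-ins:)

import numpy as np

window = 5
sentence = ["the", "quick", "brown", "fox", "jumps"]

# Old pattern: one randint() call per word; every call takes numpy's
# RandomState lock, which serializes the otherwise-parallel worker threads.
reduced_windows_slow = [np.random.randint(window) for _ in sentence]

# The fix's idea: draw all reduced_window values for a sentence in one
# vectorized call, so the lock is taken once per sentence instead.
reduced_windows_fast = np.random.randint(window, size=len(sentence))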

@piskvorky
Owner

That's amazing. Thanks guys for reporting, investigating & @gojomo for fixing! (see my comment in the PR)

@jticknor @thomaskern can you test & check this fix? Does it help for you?

@tkern

tkern commented Jun 22, 2015

@piskvorky sorry, was on vacation. I see you merged it into develop 2 weeks ago. Do you want me to test it against the fix_randint_slows_workers branch or the current develop branch?

@gojomo
Collaborator

gojomo commented Jun 22, 2015

For verifying this fix, 'develop' is good. (If you're doing any doc2vec stuff, broader testing of the PR #356 "bigdocvec_pr branch" would be helpful but not needed for checking this issue.)

@thomaskern

htop looks good! 20 cores working at ~75% each is a massive increase.

[htop screenshot]

However, the training times still look poor:

times for a 50k set:
20 cores: training on 3456813 words took 54.4s, 63526 words/s
02 cores: training on 3456813 words took 28.1s, 122845 words/s
01 cores: training on 3456813 words took 50.8s, 68006 words/s

times for a 500k set
20 cores: training on 4416512 words took 47.3s, 93401 words/s
2 cores: training on 4416512 words took 92.0s, 48024 words/s

@gojomo
Collaborator

gojomo commented Jun 22, 2015

Progress! There may still be other contention or bottlenecks that prevent 20 cores from being the fastest. (I was once worried about the locking around the progress-count... but it wasn't a factor in my smaller tests. Still, at many workers similar little hotspots might grow in impact.)

It'd be interesting to know at what worker-count (between 2 & 20) throughput was maximized. Bumping up the size of the job-queue (source-code edit: maxsize more than 2x thread-count) or the job-chunksize (optional parameter to train()) would be other small changes that might move the needle in a contended situation.
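
(A hedged sketch of those two tweaks; the chunksize keyword name and default are assumptions based on gensim releases of that era, and the queue change is described rather than given as an exact diff:)

from gensim.models import Word2Vec

sentences = [["first", "sentence"], ["second", "sentence"]]  # placeholder corpus

model = Word2Vec(size=100, window=5, min_count=1, workers=8)
model.build_vocab(sentences)

# Larger job chunks mean fewer hand-offs between the job-dispatcher thread and
# the workers; the keyword is assumed to be `chunksize` (default around 100).
model.train(sentences, chunksize=800)

# Bumping the job-queue size itself was a source edit at the time: in
# gensim/models/word2vec.py the queue's maxsize is tied to ~2x the worker count,
# and raising that multiple is the change suggested above.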

@gojomo
Collaborator

gojomo commented Jun 22, 2015

And one more thought: at the margin, it seems that more-dimensions, more negative-examples, larger windows, or longer sentences/documents each tend to require relatively more time inside the no-GIL/optimized code – and more time there can be a good thing for parallel progress. So all other things held equal, you may get more utilization out of all the cores by upping those parameters. (Whether those parameter tweaks improve results or are worth the net extra time/space cost, could go either way. But I'd expect the 'observed optimal core count' to get higher with each of those values rising.)

@thomaskern

I just tried it with 3 workers on the 50k one:
training on 3456813 words took 40.1s, 86185 words/s

That is already slower than the 2-core run (which I just re-ran and got the same results).

@thomaskern

Using window=10, negative=100, dimensions=400 (unchanged) on the 50k set with 2 and 20 cores I get:

  • 100% cpu usage in each of the 20 cores
  • 20 cores is faster, for once! ;)

20 cores: 6837 words/s
2 cores: 2331 words/s

@fortiema

fortiema commented Jul 1, 2015

Hi guys!

I also got interested in word2vec parallel training performance recently, and observed strange results. I ran some benchmarks of my own on 2 different machines (using LineSentence, 300 dim., 5 words, 10e-5 subsampling and compiled OpenBLAS from source).

Unit is Kwords/sec:

Core i5-4570, 16GB RAM, Ubuntu 14.04, Python 2.7.6:

version       algo    2 workers  4 workers
0.11.1-1      NEG-5   108        131
              NEG-15  47         63
              HS      101        134
0.11.1.post1  NEG-5   114        150
              NEG-15  51         67
              HS      113        165

Dual Xeon E5-2420 v2, 32GB RAM, CentOS 6, Python 2.7.10:

version       algo    2 workers  4 workers  8 workers  20 workers
0.11.1-1      NEG-5   62         82         -          65
              NEG-15  33         51         -          61
              HS      46         61         -          60
0.11.1.post1  NEG-5   70         92         108        103
              NEG-15  37         61         76         78
              HS      58         79         -          81

I ran a basic numpy/BLAS benchmark on both machines (dot() on 1000x1000 matrix).

Desktop: [numpy dot() benchmark screenshot]
Server: [numpy dot() benchmark screenshot]

Note: Playing with OPENBLAS_NUM_THREADS value had marginal effect in every case.

Although recent commits provided some welcome gains, it appears the current code has trouble scaling beyond 4 threads, and tops out around 8 threads. Also, I have no clue how to explain the substantial performance difference favoring my simple i5 desktop, even though numpy/OpenBLAS performance is higher on the server... Maybe you can help me shed some light on all that?

Thanks!

@gojomo
Collaborator

gojomo commented Jul 3, 2015

@fortiema - Thanks for the detailed info-gathering! I've got no sure answers, but a dump of thoughts:

  • my latest PR #380 ("Split build_vocab to scan, scale, finalize; train() loop/locking refactor; downsampling into cython") tries to move some extra work to worker threads, and eliminate a few possible places of contention; I've seen a slight improvement up to 8 cores but it might help more at larger core counts.
  • it also adds a 'queue_factor' parameter to train() which can be used to make the job-queues relatively bigger (as a multiple of worker-count) – in some scheduling scenarios that might smooth over occasional queue-full hold-ups. (Making it 100 – up from 2 – seemed to be worth a few percent speedup for me.)
  • the largeness of the model arrays, and the randomness of updates across the whole arrays, may be a big factor preventing CPU cache warmness... at some point more workers may be thrashing the cache as much as helping on computations. New ways of batching the stages might help, but I don't see quick/simple fixes. (Maybe sorting syn0 by word-frequency so all the hottest words are adjacent would help? Haven't tried.)
  • have you eliminated all IO from the loop in both places, and ideally all tokenization (as if in a sentences generator) as well? (See the sketch after this list.)
  • Ubuntu vs CentOS and desktop vs server might be using different task scheduler settings or 'tick' resolutions? Anything discussed at https://doc.opensuse.org/documentation/html/openSUSE_121/opensuse-tuning/cha.tuning.taskscheduler.html#sec.tuning.taskscheduler.cfs.tuning that's different between the two might be fertile ground for tweaking. I don't have a great understanding of which direction to try, other than that in general, since throughput rather than latency is the goal, anything that lets single threads run longer in the same place might help.
  • maybe one has a python binary compiled slightly more specifically for its local processor? Are both 64-bit? I vaguely recall once seeing a slight drop when switching between Python 2.7.x versions (though not at the magnitude of your desktop vs server difference).
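
(On the IO/tokenization point — a minimal sketch of taking the iterator out of the equation by materializing the tokenized corpus in RAM once; the path and parameter values are placeholders:)

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Pay the file IO and whitespace-splitting cost once, up front, rather than on
# every pass inside the single-threaded corpus iterator.
sentences = list(LineSentence("corpus.txt"))  # placeholder path

model = Word2Vec(sentences, size=300, window=5, min_count=5, workers=4)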

@fortiema

fortiema commented Jul 3, 2015

@gojomo Thanks for the pointers!

About the last point, I compiled Python from source on the server, versus using built-in Python on the desktop so if there is a difference the server should have an advantage here.

I'm using LineSentence for my benches, so this is as raw IO as it gets I guess? The only thing being done is splitting on whitespace.

I'll have a look at those scheduler parameters, this is mostly new to me!

Looking forward to your PR being merged so I can run some additional tests!

P.S.: I would also have a look under the hood, but I'm totally new to Cython and parallel Python, and currently lack the time to dig into this. Hopefully I get a chance to pick it up soon; I'd be more than happy to help!

@stevenxxiu

I'm also having a similar problem. I'm using doc2vec with negative sampling and the DBOW model. With 64 threads (32 cores), the usage of each thread is similar and hovers around 8%, and I'm getting about 89k words/s. It takes word2vec <10 minutes whereas gensim is taking 1 hour.

I've also messed with queue_factor but didn't get any speedups.

Some system stats:

numpy==1.10.1
scipy==0.16.1
gensim==0.12.3

python: 3.4.3 x64

The corpus is fed in as a list of tagged documents, with words themselves being lists too so hopefully that's not a bottleneck.

Here's a breakdown with different numbers of workers:

threads     words/s        usage/thread
4           186070          95%
8           219110          70%
16          188950          32%
32          108063          16%
64          89324           8%

Setting export OPENBLAS_NUM_THREADS=1 made no difference.

This might be related to #450, my sentences are tweets.

@gojomo
Collaborator

gojomo commented Nov 8, 2015

@jskda, any chance you could try the same tests as in your 'breakdown' table, with the code in this branch: https://github.com/gojomo/gensim/tree/batching ? It's my experimental approach to passing larger runs-of-small-examples to the cython nogil code regions.

(It requires recompilation of the cython extensions and thus the process is roughly: (1) clear existing gensim installations from environment; (2) clone the git branch; (3) pip install from repo directory: pip install -e .)

When constructing your model, add a batch_target=10000 (or larger) parameter to the initialization-parameters. (Then, small examples to pass up-to that many words in at once will be combined to one call into the cython training routines.)

I don't think this will help with workers > 32 (actual cores), but it may help with worker counts in the 4-32 range, making the optimal number for your task higher. (My ideal would be workers = cores will always perform best, but I don't think this change is enough yet to achieve that, for large numbers of cores.)
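
(Putting the above together — a sketch of the suggested test; batch_target exists only on that experimental branch, and the corpus and other parameters are placeholders mirroring the earlier DBOW setup:)

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(["some", "tweet", "tokens"], [0]),
        TaggedDocument(["another", "tweet"], [1])]  # placeholder corpus

# dm=0 selects DBOW; batch_target is the experimental branch's knob for how many
# words' worth of small examples get combined into one cython training call.
model = Doc2Vec(dm=0, negative=5, workers=32, batch_target=10000)
model.build_vocab(docs)
model.train(docs)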

@stevenxxiu

I've run the same tests on your branch (ran cython on the pyx files and pip install) and it seems to improve by a few percentage points for more cores, but strangely it reduces performance for 8 threads. I've added in some word2vec counts for comparison too.

threads     words/s         usage/thread    word2vec words/s (computed from  x*threads*1000)
4           103263          99%             82000
8           180497          99%             181360
16          167833          96%             320000
32          237236          80%             460480
64          241351          40%             732800

@gojomo
Collaborator

gojomo commented Nov 9, 2015

Thanks for the detailed numbers! I am a bit perplexed by the results, both before and after.

I didn't expect, but could kind of understand, a small batching penalty at low worker counts... but it's surprisingly large, and I haven't noticed similar issues in my ad-hoc testing. I'm surprised that the throughput for 64 threads has increased: I would've expected performance to saturate and even suffer from extra contention earlier, at least in python.

And the word2vec.c numbers are also mysterious: I don't have a mental model that could explain peak performance at just 4 threads, and worst at 8, but 64-almost-as-good-as-4.

Just stabbing in the dark:

What are the chances of some hidden, under-controlled variable? Are there other processes competing for IO/CPU, or is this a virtualized environment (where such competition might be hidden)? Is the iterator providing the examples doing something atypical, either in IO or decompression/preprocessing? Are your doc-vecs memory-mapped and randomly-accessed? (Memory-mapping for larger-than-core arrays of doctags-in-training should work best if training passes make an orderly sweep over the mmapped array.)

@stevenxxiu

@gojomo Sorry, the word2vec figure for 4 threads had an extra 0, so it is increasing. This is running on a dedicated server; there are some other processes going on, but they're fairly constant in CPU usage. I'll try again now to see if the numbers change much.

The documents are provided as a list of TaggedDocument, each of which is in turn a tuple of lists. I'm not sure what you mean by memory-mapped? The doc tags are just a single integer, [i].

@gojomo
Collaborator

gojomo commented Nov 9, 2015

Re: memory-mapping

There's an optional Doc2Vec argument (docvecs_mapfile) which can make the array of docvecs-in-training be backed by a memory-mapped file. It can make using a training-set that wouldn't fit in RAM work a little better than just triggering normal swapping. Probably using it would be neither a positive nor negative factor here (and its use would be more likely to show up as CPU underutilization – occasional pauses to page address ranges in/out).
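
(For completeness, a sketch of that option; the argument name is taken from the comment above, and the path and remaining parameters are placeholders:)

from gensim.models.doc2vec import Doc2Vec

# Back the docvecs-in-training array with a memory-mapped file instead of plain RAM.
model = Doc2Vec(dm=0, negative=5, workers=8, docvecs_mapfile="/tmp/docvecs.mmap")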

Are you (re)shuffling the examples each training pass? While that can be best for results, it might thwart CPU cache efficiency, especially with very small documents.

@stevenxxiu

The training set does fit in RAM; while training there's 11 GB used (the machine has 500 GB). Training is done like this; there's only a single pass:

# docs is some iterator at the start
model = Doc2Vec(...)
docs = list(TaggedDocument(doc, [i]) for i, doc in enumerate(docs))
model.build_vocab(docs)
model.train(docs)

I can give you a version of my script to reproduce those numbers.

@gojomo
Collaborator

gojomo commented Nov 11, 2015

@jskda – sure, the script could be helpful. I want to dig deeper into whatever's blocking greater utilization with large numbers of cores… but may not have much time for it anytime soon.

@gojomo changed the title from "word2vec not improving with workers - MacPro" to "word2vec (& doc2vec) not improving with workers - MacPro" on Nov 18, 2015
@kudkudak

kudkudak commented Aug 19, 2016

Noticed that as well: I ran word2vec on a large CPU-focused EC2 instance with 64 threads, and htop indicates 10% CPU usage per thread (the word rate is 800k). Any progress on that?

@gojomo
Collaborator

gojomo commented Aug 19, 2016

64 threads is going to be a challenge given the amount of GIL-locked code that still needs to run: you may get higher utilization with fewer contending threads. If the texts are small, make sure you're using the latest gensim – older versions didn't bundle as many together to minimize GIL-noGIL transitions.

But also: make sure your corpus iteration isn't a bottleneck (it's still single-threaded) – if you can pretokenize & keep-in-memory the whole corpus, that may help.

Finally, some options that would normally cost in performance – larger windows, more negative examples – may become nearly costless in a situation where contention is a factor. (They help keep each thread in the noGIL blocks longer, using 'free' CPU that'd be idle anyway.) So it's something else to try, if increasing those parameters is helpful to your application. (They aren't always.)

Other than that, you may be hitting inherent limits of the current implementation. I'd like to explore working-around such limits further, but don't often have an idle more-than-8-core machine and the time to dig deeper with a dataset/config that shows the problem.

@kudkudak

Thanks!

Re latest gensim - yes, I used the dev version
Re in-memory - tried that, haven't profiled it though
Re parameters - used mostly defaults, as above - haven't checked that more thoroughly

In general it is fast enough for me; it just seemed like it could be a few times faster given the CPU utilization. I can share scripts to reproduce if that helps; I am actually fitting the Wikipedia corpus from gensim.

@gojomo
Collaborator

gojomo commented Aug 19, 2016

Sharing the scripts could help to reproduce (or trigger other suggestions) – but even if they're available, I might not have a chance to look at or run them in a similar setup for a long while. But barring code optimizations, my main recommendations are to read pre-tokenized text from memory, and to try fewer workers than cores (because it seems the overhead of contention itself becomes an issue) if searching for maximal throughput. (Tweaking other parameters would affect the properties of your end-vectors, and thus would be a project-specific consideration.)

@tmylk added the wishlist and difficulty hard labels on Oct 5, 2016
@menshikh-iv added the performance label on Oct 3, 2017
@gojomo
Collaborator

gojomo commented Oct 10, 2017

See especially #532 for other discussion of important factors - but this issue should capture any discussion/decisions about improvements in Word2Vec/Doc2Vec parallel thread utilization/throughput.

@gojomo changed the title from "word2vec (& doc2vec) not improving with workers - MacPro" to "word2vec (& doc2vec) training doesn't benefit from all CPU cores with high workers values" on Oct 10, 2017
@raphaelauv

raphaelauv commented Oct 27, 2020

With a corpus_file, we found optimal speed at workers = the number of CPUs (vCPUs) available on the instance.

And this uses all the CPUs of the instance.

@raphaelauv

@piskvorky maybe this issue could be closed?

@piskvorky
Owner

piskvorky commented Oct 29, 2020

Yes, the issue is stale. If the problem persists, please open a new ticket.
