-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Decouple number of OpenBLAS threads and max number of OpenMP threads #2985
Comments
In OMP parallel section OpenBLAS falls down to one thread, not 4 or 40. Something OMP 3.0 feature level or so. It allocates some fixed structures based on system size, like starting thread for starters. Questionable performance-wise would be to restart thread pool on each call... |
Maybe I wasn't clear enough. Here a code sample in pseudo code:
Or other way round:
It is also a problem because when compiling OpenBLAS you need to set the maximum number of threads you want it to use. So for this case you'd set it to 4 at compile time because you explicitely tell OpenBLAS to only use 4 at runtime. However now you cannot use 16 OMP threads in your main program as that will crash OpenBLAS due to exceeding its limits. |
If you call sgemm inside parallel section, not right after it, it runs on a single core. |
Sure. But that was not my point. |
My point is known limitation that it does not play optimally with OMP thread limits, normally you would just set OMP thread hierarchy variables, and/or make another parallel section nested in outer one. That would suffice to avoid wrong size thread pool in all cases. |
My expectation is two-fold:
To be compatible with the non-omp version openblas should not increase it thread count when the omp thread count is increased. As mentioned above this is very important because the max number of openblas threads is fixed at compile time. |
OMP provides hierarchy mechanism that openblas does not instrument. Usually re-invented wheel comes out square at best. |
Can you elaborate how hierarchy is involved here or what form of hierarchy you mean exactly? OMP is a runtime for threading. I don't see why any particularities there should mean that OpenBLAS ignores the users choice of number of threads to use (for OpenBLAS) or why it (OpenBLAS) should influence the users program in any way. |
Ok, so you were talking about running OpenBLAS withing an OMP parallel region (where OpenBLAS will run with 1 thread). Thanks for clearing that up! I.e.
|
Easiest would be to plainly alias get function to omp (B) and discard set parameter to align with (A) until (C) is heavily changed to permit multiple calls. EDIT: either way you are right that programming global thread number is wrong. |
I tested So aliasing the OpenBLAS function(s) to the OMP function would be wrong. Similar to how https://software.intel.com/content/www/us/en/develop/documentation/mkl-developer-reference-c/top/support-functions/threading-control/mkl-set-num-threads.html defines it:
There should be a separate state for OpenBLAS: The number of OMP threads that are used. Hence the set function should NOT call the OMP set function and the get function should change neither the number of OpenBLAS nor of the OMP threads. So my proposal would be simply: Set changes the OpenBLAS number ignoring any OMP setting, get returns the The behavior in parallel section (1 thread only) could be kept. Otherwise it should be checked whether nested parallelism is actually supported. |
It is influenced on initialisation. Ill look into it in weekend, functions look so badly scattered around the code that it is not as simple as commenting line 78 Normally you can have a machine with 2^N sockets , each socket with 2^n internal NUMA domains, OMP intends to model it well, MKL kind of permits to set domain numbers and volumes, OpenBLAS here considers all domains have one core. In real life python and R has parallel modules that spawn processes for domains with that many cores, but could wish nested sections to spawn OMP parallel section around sockets. |
Sure, using the number of OMP threads as the default is reasonable. Only setting the number of OpenBLAS threads should not influence OMP threads and vice versa (after the init/first OpenBLAS call)
Noticed that too, especially the variables (*num_cpus, *num_threads). The 2 linked functions above should help to get started, I even think changing those 2 is enough, but this needs checking. |
Hi, has the status on this evolved ? I find it unexpected as well that setting the number of threads for openblas changes the internal state of openmp. I have a program with some openblas calls and some openmp code (at different places, not nested) and I find it more than surprising that setting a limit on the number of threads for openblas impacts the openmp part of my program. I'd expect the result of |
Also I tested: openblas_set_num_threads(1);
omp_set_num_threads(original_omp_num_threads);
# caling openblas runs as if I didn't change any num_threads
# (at least it seems after doing some benchmarks)
# even if
openblas_get_num_threads(); # returns 1 Which adds to the confusion to me because you can end up in state where the value of |
PR #3773 added a new, OpenMP-compatible environment variable OPENBLAS_DEFAULT_NUM_THREADS |
It is possible to set the number of threads used by OpenBLAS via
openblas_set_num_threads
. For the "custom thread" solution this works quite well: Independent of what the application may do the set number of threads is used inside OpenBLAS.However for OpenMP this is not the case: An application might want to have OpenBLAS use 4 of 16 threads while using OpenMP itself to schedule other work or run 4 OpenBLAS operations in parallel each using 4 threads (up to the runtime if that is even possible, but the first use case should be). Another use case would be that OpenBLAS should use only 4 threads (e.g. due to performance reasons, usual matrix size, ...) but the application wants to use OpenMP (at other times, so not in parallel to OpenBLAS) with all 16 threads.
Now OpenBLAS does something nasty: It uses the max number of openmp threads and sets the max number of used threads to that value. So it is impossible to use less than the number of OpenMP threads.
In code the problem is 2-fold:
num_cpu_avail
which should only query does a modification: https://github.com/xianyi/OpenBLAS/blob/develop/common_thread.h#L154-L156So for a first fix I'd suggest to make
num_cpu_avail
return the lesser ofblas_cpu_number
andopenmp_nthreads
instead of setting anything.The text was updated successfully, but these errors were encountered: