Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decouple number of OpenBLAS threads and max number of OpenMP threads #2985

Closed
Flamefire opened this issue Nov 11, 2020 · 17 comments
Closed

Decouple number of OpenBLAS threads and max number of OpenMP threads #2985

Flamefire opened this issue Nov 11, 2020 · 17 comments

Comments

@Flamefire
Copy link
Contributor

It is possible to set the number of threads used by OpenBLAS via openblas_set_num_threads. For the "custom thread" solution this works quite well: Independent of what the application may do the set number of threads is used inside OpenBLAS.

However for OpenMP this is not the case: An application might want to have OpenBLAS use 4 of 16 threads while using OpenMP itself to schedule other work or run 4 OpenBLAS operations in parallel each using 4 threads (up to the runtime if that is even possible, but the first use case should be). Another use case would be that OpenBLAS should use only 4 threads (e.g. due to performance reasons, usual matrix size, ...) but the application wants to use OpenMP (at other times, so not in parallel to OpenBLAS) with all 16 threads.

Now OpenBLAS does something nasty: It uses the max number of openmp threads and sets the max number of used threads to that value. So it is impossible to use less than the number of OpenMP threads.

In code the problem is 2-fold:

So for a first fix I'd suggest to make num_cpu_avail return the lesser of blas_cpu_number and openmp_nthreads instead of setting anything.

@brada4
Copy link
Contributor

brada4 commented Nov 11, 2020

In OMP parallel section OpenBLAS falls down to one thread, not 4 or 40. Something OMP 3.0 feature level or so.

It allocates some fixed structures based on system size, like starting thread for starters. Questionable performance-wise would be to restart thread pool on each call...

@Flamefire
Copy link
Contributor Author

Flamefire commented Nov 11, 2020

Maybe I wasn't clear enough. Here a code sample in pseudo code:

// Program init
omp_set_num_threads(16);
openblas_set_num_threads(4);

// some code
#pragma omp parallel for // --> woops only 4 threads used, but 16 should be
for(...)
  extract_image(i)

blas_sgemm(image[0]) // Correct: 4 threads used

Or other way round:

// Program init
openblas_set_num_threads(4);
omp_set_num_threads(16);

// some code
#pragma omp parallel for // --> OK, 16 threads used
for(...)
  extract_image(i)

blas_sgemm(image[0]) // Woops: 16 threads used despite the first line setting it to 4

It is also a problem because when compiling OpenBLAS you need to set the maximum number of threads you want it to use. So for this case you'd set it to 4 at compile time because you explicitely tell OpenBLAS to only use 4 at runtime. However now you cannot use 16 OMP threads in your main program as that will crash OpenBLAS due to exceeding its limits.

@brada4
Copy link
Contributor

brada4 commented Nov 11, 2020

If you call sgemm inside parallel section, not right after it, it runs on a single core.

@Flamefire
Copy link
Contributor Author

Sure. But that was not my point.
My point is that OpenBLAS interacts with the maximum number of OpenMP threads in both directions: overwriting application set limits and overwriting user set OpenBLAS limits. See the "woops" lines above

@brada4
Copy link
Contributor

brada4 commented Nov 11, 2020

My point is known limitation that it does not play optimally with OMP thread limits, normally you would just set OMP thread hierarchy variables, and/or make another parallel section nested in outer one. That would suffice to avoid wrong size thread pool in all cases.
Do you think openblas_set_num_threads should somehow complement omp setting? I think it could be plain alias for omp function in general to stay compatible with non-omp versions.

@Flamefire
Copy link
Contributor Author

My expectation is two-fold:

  • During execution: when the number of OMP threads is higher than the number of openblas threads then the omp setting is not touched and the lower openblas num threads is used. I see no reasoning why it should be otherwise. I mean: The "native" thread server doesn't just keep creating a random amount of threads either but uses the set number of openblas threads.
  • setting the number of openblas threads: If the number of OMP threads is higher, then it should not be touched. If it is lower then a clearly communicated decision needs to be made. Either the value is retained an during execution the lower number is used, or (similar to current) the number of OMP threads is increased. Or the number of openblas threads is capped at the number of omp threads during the set. But again I see no point in decreasing the number of OMP threads allowed for an application from withing openblas.

To be compatible with the non-omp version openblas should not increase it thread count when the omp thread count is increased. As mentioned above this is very important because the max number of openblas threads is fixed at compile time.

@brada4
Copy link
Contributor

brada4 commented Nov 12, 2020

OMP provides hierarchy mechanism that openblas does not instrument. Usually re-invented wheel comes out square at best.

@Flamefire
Copy link
Contributor Author

Can you elaborate how hierarchy is involved here or what form of hierarchy you mean exactly? OMP is a runtime for threading. I don't see why any particularities there should mean that OpenBLAS ignores the users choice of number of threads to use (for OpenBLAS) or why it (OpenBLAS) should influence the users program in any way.

@Flamefire
Copy link
Contributor Author

Flamefire commented Nov 17, 2020

Ok, so you were talking about running OpenBLAS withing an OMP parallel region (where OpenBLAS will run with 1 thread). Thanks for clearing that up!
My suggestion is NOT about that case. I want to make the behavior in the "normal" (OpenBLAS outside a OMP parallel region) case more consistent with the non-OMP version and have it not influence the users program.

I.e.

@brada4
Copy link
Contributor

brada4 commented Nov 17, 2020

  • A Actual similar mkl_set_num_threads sets threads in subsequent invocations in parallel sections (where OpenBLAS has ONE invariably) - that might be "right" and "expected"
  • B Setting OMP threads is unexpected, especially setting them up may cause oversubscription of CPU cores.
  • C But there is a global lock actually limiting to one L3 call at a time per process.

Easiest would be to plainly alias get function to omp (B) and discard set parameter to align with (A) until (C) is heavily changed to permit multiple calls.

EDIT: either way you are right that programming global thread number is wrong.

@Flamefire
Copy link
Contributor Author

I tested mkl_set_num_threads/mkl_get_max_threads and that does not influence the omp numbers and is also not influenced by it.

So aliasing the OpenBLAS function(s) to the OMP function would be wrong. Similar to how https://software.intel.com/content/www/us/en/develop/documentation/mkl-developer-reference-c/top/support-functions/threading-control/mkl-set-num-threads.html defines it:

Specifies the number of OpenMP* threads to use.

There should be a separate state for OpenBLAS: The number of OMP threads that are used. Hence the set function should NOT call the OMP set function and the get function should change neither the number of OpenBLAS nor of the OMP threads.

So my proposal would be simply: Set changes the OpenBLAS number ignoring any OMP setting, get returns the min(OpenBLAS threads, OMP threads). This aligns with the above MKL definition: The number of OMP threads to use. If there are less available (user set omp_max_threads to 4 and OpenBLAS threads to 8) then all 4 OMP threads are used (not 8)

The behavior in parallel section (1 thread only) could be kept. Otherwise it should be checked whether nested parallelism is actually supported.

@brada4
Copy link
Contributor

brada4 commented Nov 17, 2020

It is influenced on initialisation. Ill look into it in weekend, functions look so badly scattered around the code that it is not as simple as commenting line 78

Normally you can have a machine with 2^N sockets , each socket with 2^n internal NUMA domains, OMP intends to model it well, MKL kind of permits to set domain numbers and volumes, OpenBLAS here considers all domains have one core. In real life python and R has parallel modules that spawn processes for domains with that many cores, but could wish nested sections to spawn OMP parallel section around sockets.

@Flamefire
Copy link
Contributor Author

It is influenced on initialisation.

Sure, using the number of OMP threads as the default is reasonable. Only setting the number of OpenBLAS threads should not influence OMP threads and vice versa (after the init/first OpenBLAS call)

functions look so badly scattered around the code that it is not as simple as commenting line 78

Noticed that too, especially the variables (*num_cpus, *num_threads). The 2 linked functions above should help to get started, I even think changing those 2 is enough, but this needs checking.

@jeremiedbb
Copy link

Hi, has the status on this evolved ? I find it unexpected as well that setting the number of threads for openblas changes the internal state of openmp.

I have a program with some openblas calls and some openmp code (at different places, not nested) and I find it more than surprising that setting a limit on the number of threads for openblas impacts the openmp part of my program.

I'd expect the result of omp_get_max_threads to be independent of calling openblas_set_num_threads or not.

@jeremiedbb
Copy link

Also I tested:

openblas_set_num_threads(1);
omp_set_num_threads(original_omp_num_threads);

# caling openblas runs as if I didn't change any num_threads
# (at least it seems after doing some benchmarks)
# even if
openblas_get_num_threads();  # returns 1

Which adds to the confusion to me because you can end up in state where the value of openblas_get_num_threads does not reflect the number of threads it will actually use.

@martin-frbg
Copy link
Collaborator

PR #3773 added a new, OpenMP-compatible environment variable OPENBLAS_DEFAULT_NUM_THREADS

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants