
[MRG] make it possible to protect against oversubscription for various threadpool nesting cases #16

Merged: 79 commits merged into joblib:master on Jun 3, 2019

Conversation

@ogrisel (Contributor) commented Apr 2, 2019

As discussed in #14 (comment).

For now, here is some test infrastructure that highlights the problem with 2 OpenMP-enabled Cython extensions.
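
To make the setup concrete, here is a minimal sketch of what the test exercises (the import path is only illustrative; the real helpers are two Cython extensions, one built against each OpenMP runtime):

# Minimal sketch of what the test exercises, not the actual test code.
# check_nested_openmp_loops is the compiled Cython test helper discussed
# later in this thread; the import path below is only illustrative.
from threadpoolctl import threadpool_limits
from tests._openmp_test_helper import check_nested_openmp_loops  # illustrative path

# Without a limit, both the outer prange and the inner OpenMP loop
# (built against a different OpenMP runtime) may use all available threads.
check_nested_openmp_loops(100)

# With a limit, the inner OpenMP loops should be capped to a single thread.
with threadpool_limits(limits=1):
    check_nested_openmp_loops(100)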

@ogrisel force-pushed the nested-openmp-tests branch from c3a64f5 to 5603a7c on April 25, 2019 at 18:01
@ogrisel (Contributor, Author) commented Apr 25, 2019

I force-pushed a debug commit 5603a7c and here is the unexpected output.

tests/test_threadpoolctl.py::test_openmp_nesting[None] # no limit
[0] libomp: 2, libgomp: 2, inner loop omp_get_num_threads: 2
[0] libomp: 2, libgomp: 2, inner loop omp_get_num_threads: 2
[0] libomp: 2, libgomp: 2, inner loop omp_get_num_threads: 2
[2] libomp: 2, libgomp: 2, inner loop omp_get_num_threads: 2
[2] libomp: 2, libgomp: 2, inner loop omp_get_num_threads: 2
[0] libomp: 2, libgomp: 2, inner loop omp_get_num_threads: 2
[2] libomp: 2, libgomp: 2, inner loop omp_get_num_threads: 2
[2] libomp: 2, libgomp: 2, inner loop omp_get_num_threads: 2
[3] libomp: 2, libgomp: 2, inner loop omp_get_num_threads: 2
[3] libomp: 2, libgomp: 2, inner loop omp_get_num_threads: 2
[1] libomp: 2, libgomp: 2, inner loop omp_get_num_threads: 2
[3] libomp: 2, libgomp: 2, inner loop omp_get_num_threads: 2
[1] libomp: 2, libgomp: 2, inner loop omp_get_num_threads: 2
[3] libomp: 2, libgomp: 2, inner loop omp_get_num_threads: 2
[1] libomp: 2, libgomp: 2, inner loop omp_get_num_threads: 2
[1] libomp: 2, libgomp: 2, inner loop omp_get_num_threads: 2
# inner openmp loop (clang-8) limited to 1
[2] libomp: 2, libgomp: 1, inner loop omp_get_num_threads: 2
[2] libomp: 2, libgomp: 2, inner loop omp_get_num_threads: 2
[2] libomp: 2, libgomp: 1, inner loop omp_get_num_threads: 2
[0] libomp: 1, libgomp: 1, inner loop omp_get_num_threads: 1
[2] libomp: 2, libgomp: 2, inner loop omp_get_num_threads: 2
[3] libomp: 2, libgomp: 1, inner loop omp_get_num_threads: 2
[0] libomp: 1, libgomp: 1, inner loop omp_get_num_threads: 1
[3] libomp: 2, libgomp: 2, inner loop omp_get_num_threads: 2
[3] libomp: 2, libgomp: 1, inner loop omp_get_num_threads: 2
[3] libomp: 2, libgomp: 2, inner loop omp_get_num_threads: 2
[0] libomp: 1, libgomp: 1, inner loop omp_get_num_threads: 1
[0] libomp: 1, libgomp: 1, inner loop omp_get_num_threads: 1
[1] libomp: 1, libgomp: 1, inner loop omp_get_num_threads: 1
[1] libomp: 1, libgomp: 1, inner loop omp_get_num_threads: 1
[1] libomp: 1, libgomp: 1, inner loop omp_get_num_threads: 1
[1] libomp: 1, libgomp: 1, inner loop omp_get_num_threads: 1
FAILED

The number in brackets is the index of the outer loop iteration (4 outer iterations x 4 inner iterations == 16 lines), printed twice (without and then with the limit).

The results are really hard to understand, but the inner calls to openmp.omp_get_num_threads() are always consistent with threadpool_info.
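
For reference, the per-runtime values printed above can be cross-checked against threadpoolctl's public threadpool_info() helper, which reports one entry per loaded threadpool library; a minimal sketch:

# Dump what threadpoolctl sees for each loaded OpenMP runtime, to compare
# with the omp_get_num_threads values printed by the test above.
from threadpoolctl import threadpool_info

for module in threadpool_info():
    if module["user_api"] == "openmp":
        # e.g. "libomp" or "libgomp" with the thread limit currently in effect
        print(module["prefix"], module["num_threads"])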

@ogrisel (Contributor, Author) commented Apr 25, 2019

Hum, my debug code is now causing the MKL-enabled case to deadlock at process shutdown time... and this test is not even using the BLAS routines. And the nested OpenMP tests pass with MKL.

Here is the output of that test, which passes but causes a deadlock at shutdown once all the tests have run... note that libgomp (the inner OpenMP runtime) reports both 2 and 1 threads when it is limited to 1, and that the inner call to omp_get_num_threads always returns 1:

tests/test_threadpoolctl.py::test_openmp_nesting[4] # no limit
[0] libiomp: 2, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
[2] libiomp: 2, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
[0] libiomp: 2, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
[2] libiomp: 2, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
[0] libiomp: 2, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
[2] libiomp: 2, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
[0] libiomp: 2, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
[2] libiomp: 2, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
[1] libiomp: 2, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
[3] libiomp: 2, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
[1] libiomp: 2, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
[3] libiomp: 2, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
[1] libiomp: 2, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
[3] libiomp: 2, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
[3] libiomp: 2, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
[1] libiomp: 2, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
# inner openmp loop (gcc) limited to 1
[0] libiomp: 1, libgomp: 1, libomp: 1, inner loop omp_get_num_threads: 1
[2] libiomp: 1, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
[0] libiomp: 1, libgomp: 1, libomp: 1, inner loop omp_get_num_threads: 1
[2] libiomp: 1, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
[2] libiomp: 1, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
[0] libiomp: 1, libgomp: 1, libomp: 1, inner loop omp_get_num_threads: 1
[2] libiomp: 1, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
[0] libiomp: 1, libgomp: 1, libomp: 1, inner loop omp_get_num_threads: 1
[3] libiomp: 1, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
[1] libiomp: 1, libgomp: 1, libomp: 1, inner loop omp_get_num_threads: 1
[3] libiomp: 1, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
[1] libiomp: 1, libgomp: 1, libomp: 1, inner loop omp_get_num_threads: 1
[3] libiomp: 1, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
[1] libiomp: 1, libgomp: 1, libomp: 1, inner loop omp_get_num_threads: 1
[3] libiomp: 1, libgomp: 2, libomp: 2, inner loop omp_get_num_threads: 1
[1] libiomp: 1, libgomp: 1, libomp: 1, inner loop omp_get_num_threads: 1
PASSED

@ogrisel (Contributor, Author) commented Apr 25, 2019

@jeremiedbb @tomMoral if you have any idea on what is going on, feel free to pitch in.

@jeremiedbb (Collaborator) commented:

I'm as lost as you are :'(
Some more info:

  • If I print the threadpool info from inside the outer loop, the OpenMP max threads of the inner-loop runtime is always 1 in thread 0, and always 2 in the other threads.
  • If I set the env var OMP_NUM_THREADS=1, the issue disappears.
  • If I reset the threadpool limits from inside the outer loop, the issue disappears (see the sketch below).
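
As a rough illustration of that last point, this is the kind of pattern that makes the issue disappear. This is only a conceptual sketch: the real outer loop is a Cython prange, so a plain Python thread pool stands in for it here, and run_inner_openmp_loop is a hypothetical stand-in for the inner OpenMP loop:

# Conceptual sketch only: re-applying the limit from inside each worker of
# the outer loop, with a Python thread pool standing in for the Cython prange.
from concurrent.futures import ThreadPoolExecutor
from threadpoolctl import threadpool_limits

def inner_task(i):
    # Re-entering threadpool_limits in the worker thread re-registers the
    # limit with the OpenMP runtime as seen from that thread, which is the
    # workaround described in the last bullet above.
    with threadpool_limits(limits=1):
        run_inner_openmp_loop()  # hypothetical stand-in for the inner OpenMP loop

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(inner_task, range(4)))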

@jeremiedbb (Collaborator) commented:

Another remark: setting the OpenMP max threads for each prefix independently, without setting the prange num_threads parameter as in the first API attempt, does not seem to work either.

from threadpoolctl import threadpool_limits
with threadpool_limits(limits={'libomp': 2, 'libgomp': 1}):
    check_nested_openmp_loops(n)

has the same issue

@jeremiedbb (Collaborator) commented:

Also, I'm able to reproduce this in pure C, but only when the outer loop is compiled with gcc and the inner one with clang.

@jeremiedbb (Collaborator) commented Apr 26, 2019

And I can confirm that it's not just a wrong value from omp_get_max_threads: the inner loops in the master thread of the outer loop use only 1 thread, while the inner loops in the other threads of the outer loop effectively use 2 threads.

@ogrisel (Contributor, Author) commented Apr 26, 2019

This might have to do with the note in the Cython documentation: "Functionality in this module may only be used from the main thread or parallel regions due to OpenMP restrictions."

https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html

@ogrisel (Contributor, Author) commented Apr 26, 2019

Maybe you can try to instrument your C code to report the thread id via gettid (https://linux.die.net/man/2/gettid), or we could also do it in the Cython version of this PR.

@jeremiedbb (Collaborator) commented May 29, 2019

Let me summarize the state of this.

We want to be able to limit the number of threads of the inner parallel region in nested parallelism. The inner parallelism can either use OpenMP directly or go through a BLAS lib.

  • Inner BLAS

    • In that case, the current API works well with threadpool_limits({'blas': n}).
    • There's one untested case (OpenBLAS built with clang/icc).
    • Note that in this case it's not easy to check the number of threads actually used by the BLAS lib: it might differ from the value returned by openblas_get_num_threads(). For example, if OpenBLAS is built with OpenMP and both inner and outer are built with the same compiler, the inner region will likely use only 1 thread because nested parallelism is disabled by default in OpenMP. This is not a big deal, however, since what we actually want when limiting the number of threads is an upper bound.

  • Inner OpenMP

    • In that case, when the inner and outer OpenMP runtimes are different, with GNU on the outside and Intel or LLVM on the inside, only the number of threads in the master thread of the outer region is correctly limited; in the other threads it is unlimited. It might be a bug in Intel OpenMP. I've reported it and am waiting for their answer.
    • Note that as before, if both are built with the same compiler, the inner region will likely fall back to sequential automatically (this is not currently tested since we always limit to 1 in the tests). Again, this is not problematic.
    • Note that this situation is not the typical use-case. In sklearn for instance, it will only be used to limit BLAS threads when called inside a prange (see the sketch after this summary).

I think it would be reasonable to temporarily skip the tests in the problematic environments, merge this PR and maybe even make a first release.
We should mention somewhere that this is a best effort which covers the most typical cases, and explain in which cases we currently have no solution.
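
To illustrate the typical use-case from the last bullet, here is a minimal sketch (the outer routine name is hypothetical): cap only the BLAS threads before entering the outer parallel region, so that BLAS calls made from inside it do not oversubscribe the machine.

# Minimal sketch of the typical use-case: limit only the BLAS threadpools
# around an outer OpenMP / prange region (run_outer_prange is hypothetical).
from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1, user_api="blas"):
    # Inside this block, any BLAS call (MKL, OpenBLAS, BLIS) made from the
    # outer parallel region is restricted to a single thread, while the
    # outer OpenMP parallelism itself is left untouched.
    run_outer_prange()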

@ogrisel (Contributor, Author) commented Jun 3, 2019

There's one un-tested case (OpenBLAS built with clang/icc)

We have not tested BLIS built with either clang/icc or gcc either.

@ogrisel (Contributor, Author) commented Jun 3, 2019

I do not see any simple way to add a test for the remaining uncovered lines from this PR, so I propose to leave it as it is.

@ogrisel changed the title from "[WIP] make it possible to protect against oversubscription for various threadpool nesting cases" to "[MRG] make it possible to protect against oversubscription for various threadpool nesting cases" on Jun 3, 2019
@ogrisel merged commit 57979da into joblib:master on Jun 3, 2019
@ogrisel deleted the nested-openmp-tests branch on June 3, 2019 at 13:34
@jeremiedbb (Collaborator) commented:

I've tested OpenBLAS built with icc, with the outer loop built with gcc.
And it works!

We have 3 libs: libopenblas, libiomp & libgomp.
I set the limits to 1 with user_api='blas'.
In the master thread of the outer loop, both libopenblas & libiomp have num_threads=1.
In the other threads, only libopenblas has num_threads=1; libiomp reports the number of cores.
However, looking at htop, only the expected number of threads is used. This seems to indicate that OpenBLAS only relies on its own openblas_num_threads setting to spawn the threadpool.
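
For completeness, a minimal sketch of the check described above: limit only the 'blas' user API to 1 thread and inspect what each runtime reports from the calling thread.

# Sketch of the check described above: limit only the 'blas' user API and
# then inspect the per-library thread limits reported from this thread.
from threadpoolctl import threadpool_limits, threadpool_info

with threadpool_limits(limits=1, user_api="blas"):
    for module in threadpool_info():
        # In the master thread, libopenblas and libiomp both report 1;
        # in the other worker threads only libopenblas is guaranteed to.
        print(module["prefix"], module["user_api"], module["num_threads"])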

@ogrisel (Contributor, Author) commented Jun 27, 2019

Good news. Could you please also try with BLIS compiled with clang or icc?

@jeremiedbb (Collaborator) commented Jun 27, 2019

I'm on it. I just finished building and installing it :)
(the doc is great!!!)

@jeremiedbb (Collaborator) commented:

I've tested BLIS (with OpenMP or pthreads threading) built with either gcc or icc inside an outer OpenMP parallel loop, and limiting the number of threads for BLIS works fine, just like with MKL or OpenBLAS \o/
I've opened #23 to support BLIS. I need to add that to the CI (built with clang instead of icc).

@ogrisel (Contributor, Author) commented Jun 28, 2019

Great news. Thanks.
