Slowdown when using openblas-pthreads alongside openmp based parallel code #3187

Closed
jeremiedbb opened this issue Apr 22, 2021 · 27 comments

@jeremiedbb

Hi,

I have code that mixes BLAS calls (gemm) and OpenMP-based parallel loops (they are not nested). When OpenBLAS is built with OpenMP everything is fine, but when OpenBLAS is built with pthreads there's a huge slowdown. Below is a reproducible example (sorry, it's Python/Cython).

%load_ext cython

%%cython -f --compile-args=-fopenmp --link-args=-fopenmp
# cython: profile=True, cdivision=True, boundscheck=False, wraparound=False

import time
import numpy as np
from cython.parallel import prange
 
def f2(double[:, ::1] A):
    cdef:
        double v = 0
        int m = A.shape[0]
        int n = A.shape[1]
        int i, j

    with nogil:
        for i in prange(m):      # OpenMP parallel for loop (*)
            for j in range(n):
                v += A[i, j]

    return v
    
    
def f1(U, V):
    v = 0
    for n_iter in range(100):
        UV = U @ V.T             # BLAS call (gemm)
        v += f2(UV)              # function runs an OpenMP parallel for loop
    return v
    

U = np.random.randn(10000, 100)
V = np.random.randn(10, 100)
 
t = time.time()
v = f1(U, V)
print(time.time() - t)

On my laptop (2 physical cores), when I use a sequential loop in (*), it runs in 0.26s. When I use a parallel loop, it runs in 2.6s (10x slower). This is with OpenBLAS 0.3.12 built with pthreads. This conda env reproduces it: conda create -n tmp -c conda-forge python numpy cython ipython.

However, if I use OpenBLAS built with OpenMP, it runs in 0.26s both with and without prange. This is with OpenBLAS 0.3.9 built with OpenMP. This conda env reproduces it: conda create -n tmp -c conda-forge python numpy cython ipython blas[build=openblas] libopenblas=0.3.9.

@martin-frbg
Collaborator

Most likely you get too many threads running in parallel in the non-OpenMP case.

@jeremiedbb
Author

omp_get_max_threads() returns 4 (I have 2 physical cores + hyperthreading). Do you think OpenBLAS is using too many threads?
It only occurs when I use an OpenMP loop, so there must be some interaction between OpenBLAS built with pthreads and OpenMP.

@jeremiedbb
Author

For what it's worth, the number of running threads shown by htop only increases by 4 when I run this.

@jeremiedbb
Author

Profiling with Linux perf shows that most of the time is spent in do_wait from libgomp.

@ogrisel
Contributor

ogrisel commented Apr 23, 2021

It would be worth checking this:

2020-07-17
Conda-forge is building openblas with both pthreads and openmp on Linux
The main change is that openblas will use pthreads for threading by default on Linux instead of the previous openmp default. The openmp builds can be recovered by installing libopenblas=*=*openmp*.

from https://conda-forge.org/docs/user/announcements.html

@isuruf
Contributor

isuruf commented Apr 23, 2021

Is it really libgomp? Note that we are using libomp (from LLVM) with the OpenBLAS OpenMP build because libgomp is not fork-safe. If you are loading libgomp.so.3 from conda-forge, it's a symlink to libomp, but if you are loading libgomp.so.3 from the system, it'll be the real libgomp, and you'll therefore have two OpenMP runtimes loaded, which is not advisable.

@jeremiedbb
Author

jeremiedbb commented Apr 23, 2021

My main concern is that the issue happens with OpenBLAS built with pthreads (not when built with OpenMP)!

When I execute my snippet above, libopenblas is loaded because numpy is imported and libgomp is loaded because of the prange loop. It can be confirmed with

from threadpoolctl import threadpool_info
threadpool_info()
[{'filepath': '/home/jeremie/miniconda/envs/tmp/lib/libopenblasp-r0.3.12.so',
  'prefix': 'libopenblas',
  'user_api': 'blas',
  'internal_api': 'openblas',
  'version': '0.3.12',
  'num_threads': 4,
  'threading_layer': 'pthreads'},
 {'filepath': '/home/jeremie/miniconda/envs/tmp/lib/libgomp.so.1.0.0',
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'internal_api': 'openmp',
  'version': None,
  'num_threads': 4}]

If I replace prange by range, libgomp disappears from the list, and if I don't import numpy, libopenblas disappears.

It also confirms that I have only 1 OpenMP runtime loaded (threadpoolctl takes symlinks into account).
Also, for this specific environment (conda create -n tmp -c conda-forge python numpy cython ipython), it's libgomp, not libomp, that conda-forge installs:

libgomp            conda-forge/linux-64::libgomp-9.3.0-h2828fa1_19

@isuruf
Contributor

isuruf commented Apr 23, 2021

Thanks for the info. That's to be expected though. When running OpenMP, you create a few threads, and then each thread calls BLAS, which in turn creates more threads. In a multi-threaded environment, it's safest to just do openblas_set_num_threads(1) before calling BLAS.
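
For illustration, a minimal C sketch of that pattern, assuming OpenBLAS's cblas.h (which declares the openblas_set_num_threads extension); the helper name is made up for the example:

#include "cblas.h"

// Force single-threaded BLAS when the caller already manages parallelism
// at a higher level (e.g. an OpenMP loop elsewhere in the program).
static void gemm_single_threaded(int m, int n, int k,
                                 const double *A, const double *B, double *C)
{
    openblas_set_num_threads(1);   // OpenBLAS-specific extension
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                m, n, k, 1.0, A, k, B, k, 0.0, C, n);   // C <- A @ B.T
}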

@jeremiedbb
Author

It's not an oversubscription issue here. They are not nested: I first call gemm and then call a function that executes a parallel loop (with no BLAS inside).

@jeremiedbb
Author

Also, I ran ps -o nlwp <pid> on the process running my Python script and it returns 8, which is what I'd expect: 4 from OpenBLAS and 4 from OpenMP.

@isuruf
Contributor

isuruf commented Apr 23, 2021

I see the same issue. 30x slowdown with libgomp and 2x slowdown with libomp.

@isuruf
Contributor

isuruf commented Apr 23, 2021

OMP_PROC_BIND=TRUE reduces the slowdown in libgomp.

@brada4
Contributor

brada4 commented Apr 23, 2021

OpenBLAS assumes each of its threads has a CPU to itself, with all the outermost caches. A 30x slowdown means multiple threads are stuck on the same CPU and actually spill to main memory instead of cache.
You can set OPENBLAS_NUM_THREADS=1 when running the samples right now, before building the OpenMP version, to address the oversubscription issue. The pthread version has no idea that OpenMP is spinning up a full room of threads, and spins up its own in each thread.
EDIT: hyperthreads do not add any value in HPC. You can disable them to get single-threaded work to finish faster; parallel work will essentially have half the cores at double speed, i.e. the same timing as with HT.

@jeremiedbb
Author

jeremiedbb commented Apr 26, 2021

Sorry, I'm not sure I understand your answer. I agree that HT is useless in HPC most of the time, but it does not seem to be the only issue here, since the program is fast when OpenBLAS is built with pthreads and I run the loop sequentially. The issue only appears when I also run the loop in parallel with OpenMP. I recall that they are not nested but run one after the other.

I'm posting a pure C reproducer here; I hope it makes my concerns clearer.

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include "cblas.h"


double f2(double *X, int n){
    double v = 0.0;

    #pragma omp parallel for reduction(+:v)
    for(int i=0; i<n; i++){
        v += X[i];
    }

    return v;
}


int main(int argc, char **argv){
    int m = 10000,
        n = 10,
        k = 100;

    double *A = (double*) malloc(m * k * sizeof(double)),
           *B = (double*) malloc(n * k * sizeof(double)),
           *C = (double*) malloc(m * n * sizeof(double));

    for(int i=0; i<m*k; i++){
        A[i] = 0.1;
    }
    for(int i=0; i<n*k; i++){
        B[i] = 1.2;
    }

    double v = 0.0;

    for(int i=0; i<1000; i++){
        // BLAS call
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans, m, n, k, 1.0, A, k, B, k, 0.0, C, n);    // C <- A @ B.T
        // Followed by parallel loop
        v += f2(C, m * n);
    }

    printf("v = %f\n", v);    // print the result so the reduction cannot be optimized away

    free(A), free(B), free(C);

    return 0;
}
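
For reference, assuming GCC and a standard OpenBLAS installation, a build command along these lines should work (the source file name is arbitrary and the library name may differ depending on the build):

gcc -O2 -fopenmp repro.c -o repro -lopenblas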

Here are the timings on a 40-core machine (20 physical cores + HT).

                   openblas pthreads    openblas openmp
sequential loop    1.22s                0.62s
parallel loop      21s                  0.51s

Focusing on the pthreads + parallel loop case:

  • If I set OPENBLAS_NUM_THREADS=20 and OMP_NUM_THREADS=20, the result becomes 0.34s, which confirms that HT is detrimental, but I don't understand what the difference is with the openblas openmp + sequential loop case.
  • If I set OPENBLAS_NUM_THREADS=21 and OMP_NUM_THREADS=21, the result is 0.5s, i.e. 50% slower than above, which I would not expect since it's only 1 more thread.
  • If I set OPENBLAS_NUM_THREADS=22 and OMP_NUM_THREADS=22, the result is 5s! I don't think the only issue here is basic oversubscription.

@isuruf
Contributor

isuruf commented Apr 26, 2021

@jeremiedbb, can you also try with OMP_PROC_BIND=TRUE?

@jeremiedbb
Author

can you also try with OMP_PROC_BIND=TRUE?

@isuruf It reduces the time from 21s to 2.6s. It's better, but still much slower than expected.

@brada4
Contributor

brada4 commented Apr 26, 2021

That just forces each new, unnecessary OpenBLAS pthread swarm onto a single core, which is slightly better for data locality than completely unbound, but still bad.
A 40-core machine might be 2 sockets with CoD (cluster-on-die) each, i.e. effectively 4 sockets. Fitting the task into one CoD node (10 threads) will make it land on a single cache (uncore) complex and perform optimally; you can manage 4 of those via OMP. It's a bit of a hack, but it works most of the time.
PS: you think a 50x slowdown is bad... 8-socket systems with subpar data locality do something like 500x.

@isuruf
Contributor

isuruf commented Apr 26, 2021

That just forces each new unnecessary OpenBLAS pthread swarm onto a single core,slightly better at data locality than completely unbound but still bad.

OpenBLAS pthread behaviour should not be affected by what OpenMP does, since the OpenBLAS calls are made from the main thread and unrelated work happens in the OpenMP threads.

@brada4
Contributor

brada4 commented Apr 27, 2021

behaviour should not be affected

The OMP placement policy actually sets CPU affinity for the OMP threads, so all pthreads created afterwards cannot escape that. There is no hidden side-step API that nobody else uses.

@ogrisel
Contributor

ogrisel commented Apr 27, 2021

@jeremiedbb maybe there is a way to introspect the CPU affinity of new pthreads started before and after calling the OMP f2 function in your minimal reproducer? That would help us confirm whether or not this is the cause of the problem.

https://man7.org/linux/man-pages/man3/pthread_getaffinity_np.3.html
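
For example, a small helper along these lines (a sketch, not from the thread, using the Linux-specific sched_getaffinity and sched_getcpu calls) could be called from the main thread around the dgemm call and from inside the OpenMP region to dump each thread's allowed CPUs:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

// Print which CPU the calling thread is on and which CPUs it may run on.
static void print_affinity(const char *tag)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) == 0) {   // pid 0 = calling thread
        printf("%s: on cpu %d, allowed:", tag, sched_getcpu());
        for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
            if (CPU_ISSET(cpu, &set))
                printf(" %d", cpu);
        printf("\n");
    }
}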

@brada4
Contributor

brada4 commented Apr 28, 2021

cat /proc/self/status | grep _allowed

@jeremiedbb
Author

jeremiedbb commented May 5, 2021

Here's the output of cat /proc/self/status | grep _allowed:

Cpus_allowed:	ff,ffffffff
Cpus_allowed_list:	0-39
Mems_allowed:	00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list:	0

Sorry but I've no idea how to interpret this :/

@brada4
Contributor

brada4 commented May 5, 2021

Put a one-minute sleep between the OMP loops and check thread binding to CPU cores. It is actually documented in the GOMP manual pages on any Linux system.

@jeremiedbb
Author

I introspected the affinities of both the OpenMP and OpenBLAS thread pools, and it turns out that no affinity constraint is set (OpenBLAS is built with NO_AFFINITY). Here's the output of sched_getaffinity from the snippet (I also checked the /proc/thread-self/status file and the results are the same):

* iteration 0
-- BLAS --
    # thread 34120
      - sched_getcpu: 2
      - sched_getaffinity: 0, 1, 2, 3, 
    # thread 34119
      - sched_getcpu: 3
      - sched_getaffinity: 0, 1, 2, 3, 
    # thread 34121
      - sched_getcpu: 0
      - sched_getaffinity: 0, 1, 2, 3, 
-- OMP --
    # thread 34124
      - sched_getcpu: 0
      - sched_getaffinity: 0,1,2,3,
    # thread 34122
      - sched_getcpu: 3
      - sched_getaffinity: 0,1,2,3,
    # thread 34123
      - sched_getcpu: 1
      - sched_getaffinity: 0,1,2,3,
    # thread 34118
      - sched_getcpu: 0
      - sched_getaffinity: 0,1,2,3,

So affinity does not seem to be the reason for the bad interaction between openblas-pthreads and OpenMP. However, I found that when the OpenMP loop ends, the threads keep waiting for work actively (OMP_WAIT_POLICY), which consumes resources and prevents OpenBLAS from starting its computations right away. By default, OpenMP makes waiting threads spin for a while.

Unfortunately, setting OMP_WAIT_POLICY=passive does not really improve performance on a machine with many cores, for some reason I don't understand yet. The best solution I have found so far is to set the number of threads for both libraries to half the number of logical CPUs, besides building OpenBLAS with OpenMP of course.
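
As a rough sketch of that workaround applied to the C reproducer above (assuming OpenBLAS's openblas_set_num_threads extension; the core count is machine-specific and passed in only for illustration):

#include <omp.h>
#include "cblas.h"

// Cap both thread pools at the number of physical cores
// (half the logical CPUs on the machines discussed above).
// Call once at program start, before the first BLAS call or parallel region.
static void cap_thread_pools(int n_physical_cores)
{
    omp_set_num_threads(n_physical_cores);        // OpenMP worker pool
    openblas_set_num_threads(n_physical_cores);   // OpenBLAS pthread pool
}

In the reproducer, calling cap_thread_pools(20) at the top of main() would correspond to the OPENBLAS_NUM_THREADS=20 / OMP_NUM_THREADS=20 setting reported above.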

I guess this is a won't-fix from the OpenBLAS side. OpenMP programs do not interact well with other libraries that manage their own thread pools. Feel free to close the issue if you think there's nothing more to add.

Still, I wonder if there is the same kind of wait policy in OpenBLAS.

@jeremiedbb
Author

@isuruf I think this issue is a good reason to always try to use an OpenBLAS built with OpenMP for the scikit-learn builds on conda-forge (I noticed that is not always the case).

@martin-frbg
Collaborator

@jeremiedbb OpenBLAS does have a similar wait policy for its threads, governed by the value of THREAD_TIMEOUT at build time (or the environment variable OPENBLAS_THREAD_TIMEOUT at runtime), which defines the number of clock cycles to wait as 1 << n, where the default n is 26 (which, according to the comment in Makefile.rule, is equivalent to about 25 ms at 3 GHz).
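
As a quick sanity check on that figure: 1 << 26 = 67,108,864 cycles, and 67,108,864 / (3 × 10^9 cycles per second) ≈ 0.022 s, which is in the ballpark of the ~25 ms quoted.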

@martin-frbg
Collaborator

Closing after copying the relevant information to the FAQ in the wiki.
