Slowdown when using openblas-pthreads alongside openmp based parallel code #3187

Closed
jeremiedbb opened this issue Apr 22, 2021 · 27 comments

@jeremiedbb

Hi,

I have code that mixes BLAS calls (gemm) and OpenMP-based parallel loops (they are not nested). When OpenBLAS is built with OpenMP everything is fine, but when OpenBLAS is built with pthreads there's a huge slowdown. Below is a reproducible example (sorry, it's Python/Cython).

%load_ext cython

%%cython -f --compile-args=-fopenmp --link-args=-fopenmp
# cython: profile=True, cdivision=True, boundscheck=False, wraparound=False

import time
import numpy as np
from cython.parallel import prange
 
def f2(double[:, ::1] A):
    cdef:
        double v = 0
        int m = A.shape[0]
        int n = A.shape[1]
        int i, j

    with nogil:
        for i in prange(m):      # OpenMP parallel for loop (*)
            for j in range(n):
                v += A[i, j]

    return v
    
    
def f1(U, V):
    v = 0
    for n_iter in range(100):
        UV = U @ V.T             # BLAS call (gemm)
        v += f2(UV)              # function runs an OpenMP parallel for loop
    return v
    

U = np.random.randn(10000, 100)
V = np.random.randn(10, 100)
 
t = time.time()
v = f1(U, V)
print(time.time() - t)

On my laptop (2 physical cores), when I use a sequential loop in (*), it runs in 0.26s. When I use a parallel loop, it runs in 2.6s (10x slower). This is with OpenBLAS 0.3.12 built with pthreads. This conda env reproduces it: conda create -n tmp -c conda-forge python numpy cython ipython.

However, if I use OpenBLAS built with OpenMP, it runs in 0.26s both with and without prange. This is with OpenBLAS 0.3.9 built with OpenMP. This conda env reproduces it: conda create -n tmp -c conda-forge python numpy cython ipython blas[build=openblas] libopenblas=0.3.9.

@martin-frbg
Collaborator

Most likely you get too many threads running in parallel in the non-OpenMP case.

@jeremiedbb
Author

omp_get_max_threads() returns 4 (I have 2 physical cores + hyperthreading). Do you think OpenBLAS is using too many threads?
It only occurs when I use an OpenMP loop, so there must be some interaction between OpenBLAS built with pthreads and OpenMP.

@jeremiedbb
Author

For what it's worth, the number of running threads shown by htop only increases by 4 when I run this.

@jeremiedbb
Author

Profiling with Linux perf shows that most of the time is spent in do_wait from libgomp.

@ogrisel
Contributor

ogrisel commented Apr 23, 2021

It would be worth checking this:

2020-07-17
Conda-forge is building openblas with both pthreads and openmp on Linux
The main change is that openblas will use pthreads for threading by default on Linux instead of the previous openmp default. The openmp builds can be recovered by installing libopenblas=*=*openmp*.

from https://conda-forge.org/docs/user/announcements.html

@isuruf
Contributor

isuruf commented Apr 23, 2021

Is it really libgomp? Note that we are using libomp (from LLVM) with the OpenBLAS OpenMP build because libgomp is not fork-safe. If you are loading libgomp.so.3 from conda-forge, it's a symlink to libomp, but if you are loading libgomp.so.3 from the system, it'll be the real libgomp, and you'll therefore have two OpenMP runtimes loaded, which is not advisable.

@jeremiedbb
Author

jeremiedbb commented Apr 23, 2021

My main concern is that the issue happens with OpenBLAS built with pthreads (not when built with OpenMP)!

When I execute my snippet above, libopenblas is loaded because numpy is imported and libgomp is loaded because of the prange loop. It can be confirmed with

from threadpoolctl import threadpool_info
threadpool_info()
[{'filepath': '/home/jeremie/miniconda/envs/tmp/lib/libopenblasp-r0.3.12.so',
  'prefix': 'libopenblas',
  'user_api': 'blas',
  'internal_api': 'openblas',
  'version': '0.3.12',
  'num_threads': 4,
  'threading_layer': 'pthreads'},
 {'filepath': '/home/jeremie/miniconda/envs/tmp/lib/libgomp.so.1.0.0',
  'prefix': 'libgomp',
  'user_api': 'openmp',
  'internal_api': 'openmp',
  'version': None,
  'num_threads': 4}]

If I replace prange by range, libgomp disappears from the list, and if I don't import numpy, libopenblas disappears.

It also confirms that I have only 1 OpenMP runtime loaded (threadpoolctl takes symlinks into account).
Also, for this specific environment (conda create -n tmp -c conda-forge python numpy cython ipython), it's libgomp, not libomp, that conda-forge installs:

libgomp            conda-forge/linux-64::libgomp-9.3.0-h2828fa1_19

@isuruf
Contributor

isuruf commented Apr 23, 2021

Thanks for the info. That's to be expected though. When running OpenMP, you create a few threads, and then each thread calls BLAS, which in turn creates more threads. In a multi-threaded environment, it's safest to just do openblas_set_num_threads(1) before calling BLAS.
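
For illustration, a minimal C sketch of that pattern, assuming OpenBLAS's cblas.h (which declares the openblas_set_num_threads extension); the helper name is made up for the example:

#include "cblas.h"

// Force single-threaded BLAS when the caller already manages parallelism
// at a higher level (e.g. an OpenMP loop elsewhere in the program).
static void gemm_single_threaded(int m, int n, int k,
                                 const double *A, const double *B, double *C)
{
    openblas_set_num_threads(1);   // OpenBLAS-specific extension
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                m, n, k, 1.0, A, k, B, k, 0.0, C, n);   // C <- A @ B.T
}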

@jeremiedbb
Author

It's not an oversubscription issue here. They are not nested: I first call gemm and then call a function that executes a parallel loop (with no BLAS inside).

@jeremiedbb
Author

Also, I ran ps -o nlwp <pid> on the process running my Python script and it returns 8, which is what I'd expect: 4 from OpenBLAS and 4 from OpenMP.

@isuruf
Contributor

isuruf commented Apr 23, 2021

I see the same issue. 30x slowdown with libgomp and 2x slowdown with libomp.

@isuruf
Contributor

isuruf commented Apr 23, 2021

OMP_PROC_BIND=TRUE reduces the slowdown in libgomp.

@brada4
Contributor

brada4 commented Apr 23, 2021

OpenBLAS assumes each of its threads has a CPU to itself, with all the outermost caches. A 30x slowdown means multiple threads are stuck on the same CPU and actually spill to main memory instead of cache.
You can set OPENBLAS_NUM_THREADS=1 when running the samples right now, before building the OpenMP version, to address the oversubscription issue. The pthread version has no idea that OpenMP is spinning up a full room of threads, and spins up its own in each thread.
EDIT: hyperthreads do not add any value in HPC. You can disable them to get single-threaded work to finish faster; parallel work will essentially have half the cores at double speed, i.e. the same timing as with HT.

@jeremiedbb
Author

jeremiedbb commented Apr 26, 2021

Sorry, I'm not sure I understand your answer. I agree that HT is useless in HPC most of the time, but it does not seem to be the only issue here, since the program is fast when OpenBLAS is built with pthreads and I run the loop sequentially. The issue only appears when I also run the loop in parallel with OpenMP. I recall that they are not nested but run one after the other.

I'm posting a pure C reproducer here; I hope it makes my concerns clearer.

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include "cblas.h"


double f2(double *X, int n){
    double v = 0.0;

    #pragma omp parallel for reduction(+:v)
    for(int i=0; i<n; i++){
        v += X[i];
    }

    return v;
}


int main(int argc, char **argv){
    int m = 10000,
        n = 10,
        k = 100;

    double *A = (double*) malloc(m * k * sizeof(double)),
           *B = (double*) malloc(n * k * sizeof(double)),
           *C = (double*) malloc(m * n * sizeof(double));

    for(int i=0; i<m*k; i++){
        A[i] = 0.1;
    }
    for(int i=0; i<n*k; i++){
        B[i] = 1.2;
    }

    double v = 0.0;

    for(int i=0; i<1000; i++){
        // BLAS call
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans, m, n, k, 1.0, A, k, B, k, 0.0, C, n);    // C <- A @ B.T
        // Followed by parallel loop
        v += f2(C, m * n);
    }

    printf("v = %f\n", v);    // print the result so the reduction cannot be optimized away

    free(A), free(B), free(C);

    return 0;
}
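
For reference, assuming GCC and a standard OpenBLAS installation, a build command along these lines should work (the source file name is arbitrary and the library name may differ depending on the build):

gcc -O2 -fopenmp repro.c -o repro -lopenblas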

Here are the timings on a 40-core machine (20 physical cores + HT).

                   openblas pthreads    openblas openmp
sequential loop    1.22s                0.62s
parallel loop      21s                  0.51s

Focusing on the pthreads + parallel loop case:

  • If I set OPENBLAS_NUM_THREADS=20 and OMP_NUM_THREADS=20, the result becomes 0.34s, which confirms that HT is detrimental, but I don't understand what the difference is with the openblas openmp + sequential loop case.
  • If I set OPENBLAS_NUM_THREADS=21 and OMP_NUM_THREADS=21, the result is 0.5s, i.e. 50% slower than above, which I would not expect since it's only 1 more thread.
  • If I set OPENBLAS_NUM_THREADS=22 and OMP_NUM_THREADS=22, the result is 5s! I don't think the only issue here is basic oversubscription.

@isuruf
Contributor

isuruf commented Apr 26, 2021

@jeremiedbb, can you also try with OMP_PROC_BIND=TRUE?

@jeremiedbb
Author

can you also try with OMP_PROC_BIND=TRUE?

@isuruf It reduces the time from 21s to 2.6s. It's better, but still much slower than expected.

@brada4
Contributor

brada4 commented Apr 26, 2021

That just forces each new, unnecessary OpenBLAS pthread swarm onto a single core, which is slightly better for data locality than completely unbound, but still bad.
A 40-core machine might be 2 sockets with CoD (cluster-on-die) each, i.e. effectively 4 sockets. Fitting the task into one CoD node (10 threads) will make it land on a single cache (uncore) complex and perform optimally; you can manage 4 of those via OMP. It's a bit of a hack, but it works most of the time.
PS: you think a 50x slowdown is bad... 8-socket systems with subpar data locality do something like 500x.

@isuruf
Contributor

isuruf commented Apr 26, 2021

That just forces each new unnecessary OpenBLAS pthread swarm onto a single core,slightly better at data locality than completely unbound but still bad.

OpenBLAS pthread behaviour should not be affected by what OpenMP does, since the OpenBLAS calls are made from the main thread and unrelated work happens in the OpenMP threads.

@brada4
Contributor

brada4 commented Apr 27, 2021

behaviour should not be affected

The OMP placement policy actually sets CPU affinity for the OMP threads, so all pthreads created afterwards cannot escape that. There is no hidden side-step API that nobody else uses.

@ogrisel
Contributor

ogrisel commented Apr 27, 2021

@jeremiedbb maybe there is a way to introspect the CPU affinity of new pthreads started before and after calling the OMP f2 function in your minimal reproducer? That would help us confirm whether or not this is the cause of the problem.

https://man7.org/linux/man-pages/man3/pthread_getaffinity_np.3.html
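
For example, a small helper along these lines (a sketch, not from the thread, using the Linux-specific sched_getaffinity and sched_getcpu calls) could be called from the main thread around the dgemm call and from inside the OpenMP region to dump each thread's allowed CPUs:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

// Print which CPU the calling thread is on and which CPUs it may run on.
static void print_affinity(const char *tag)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) == 0) {   // pid 0 = calling thread
        printf("%s: on cpu %d, allowed:", tag, sched_getcpu());
        for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
            if (CPU_ISSET(cpu, &set))
                printf(" %d", cpu);
        printf("\n");
    }
}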

@brada4
Contributor

brada4 commented Apr 28, 2021

cat /proc/self/status | grep _allowed

@jeremiedbb
Author

jeremiedbb commented May 5, 2021

Here's the output of cat /proc/self/status | grep _allowed:

Cpus_allowed:	ff,ffffffff
Cpus_allowed_list:	0-39
Mems_allowed:	00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list:	0

Sorry but I've no idea how to interpret this :/

@brada4
Contributor

brada4 commented May 5, 2021

Put a one-minute sleep between the OMP loops and check thread binding to CPU cores. It is actually documented in the GOMP manual pages on any Linux system.

@jeremiedbb
Author

I introspected the affinities of both the OpenMP and OpenBLAS thread pools, and it turns out that no affinity constraint is set (OpenBLAS is built with NO_AFFINITY). Here's the output of sched_getaffinity from the snippet (I also checked the /proc/thread-self/status file and the results are the same):

* iteration 0
-- BLAS --
    # thread 34120
      - sched_getcpu: 2
      - sched_getaffinity: 0, 1, 2, 3, 
    # thread 34119
      - sched_getcpu: 3
      - sched_getaffinity: 0, 1, 2, 3, 
    # thread 34121
      - sched_getcpu: 0
      - sched_getaffinity: 0, 1, 2, 3, 
-- OMP --
    # thread 34124
      - sched_getcpu: 0
      - sched_getaffinity: 0,1,2,3,
    # thread 34122
      - sched_getcpu: 3
      - sched_getaffinity: 0,1,2,3,
    # thread 34123
      - sched_getcpu: 1
      - sched_getaffinity: 0,1,2,3,
    # thread 34118
      - sched_getcpu: 0
      - sched_getaffinity: 0,1,2,3,

So affinity does not seem to be the reason for the bad interaction between openblas-pthreads and OpenMP. However, I found that when the OpenMP loop ends, the threads keep waiting for work actively (OMP_WAIT_POLICY), which consumes resources and prevents OpenBLAS from starting its computations right away. By default, OpenMP makes waiting threads spin for a while.

Unfortunately, setting OMP_WAIT_POLICY=passive does not really improve performance on a machine with many cores, for some reason I don't understand yet. The best solution I have found so far is to set the number of threads for both libraries to half the number of logical CPUs, besides building OpenBLAS with OpenMP of course.
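
As a rough sketch of that workaround applied to the C reproducer above (assuming OpenBLAS's openblas_set_num_threads extension; the core count is machine-specific and passed in only for illustration):

#include <omp.h>
#include "cblas.h"

// Cap both thread pools at the number of physical cores
// (half the logical CPUs on the machines discussed above).
// Call once at program start, before the first BLAS call or parallel region.
static void cap_thread_pools(int n_physical_cores)
{
    omp_set_num_threads(n_physical_cores);        // OpenMP worker pool
    openblas_set_num_threads(n_physical_cores);   // OpenBLAS pthread pool
}

In the reproducer, calling cap_thread_pools(20) at the top of main() would correspond to the OPENBLAS_NUM_THREADS=20 / OMP_NUM_THREADS=20 setting reported above.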

I guess this is a won't-fix from the OpenBLAS side. OpenMP programs do not interact well with other libraries that manage their own thread pools. Feel free to close the issue if you think there's nothing more to add.

Still, I wonder if there is the same kind of wait policy in OpenBLAS.

@jeremiedbb
Author

@isuruf I think this issue is a good reason to always try to use an OpenBLAS built with OpenMP for the scikit-learn builds on conda-forge (I noticed that is not always the case).

@martin-frbg
Collaborator

@jeremiedbb OpenBLAS does have a similar wait policy for its threads, governed by the value of THREAD_TIMEOUT at build time (or the environment variable OPENBLAS_THREAD_TIMEOUT at runtime), which defines the number of clock cycles to wait as 1 << n, where the default n is 26 (which, according to the comment in Makefile.rule, is equivalent to about 25 ms at 3 GHz).
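
As a quick sanity check on that figure: 1 << 26 = 67,108,864 cycles, and 67,108,864 / (3 × 10^9 cycles per second) ≈ 0.022 s, which is in the ballpark of the ~25 ms quoted.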

@martin-frbg
Collaborator

Closing after copying the relevant information to the FAQ in the wiki.
