
Linking against Intel MKL BLAS recommended; OpenBLAS can cause 100x slowdown #1

Closed
ghost opened this issue Oct 23, 2019 · 29 comments
Labels: external bug (porting issue, or problem with external library or system)

Comments

@ghost commented Oct 23, 2019

For version 5.4, SuiteSparse_config.mk recommended OpenBLAS. For version 5.6, Intel MKL is now recommended, and the config file looks like it wants to link against Intel OpenMP.

TL;DR: is the CHOLMOD performance degradation related to the threading library (pthreads vs GNU OpenMP vs Intel OpenMP)? Can you explain a little more (or just link to a discussion about the issue)? Does this issue affect version 5.4 when linked against OpenBLAS/pthreads?

Some background about why I'm curious about which threading library to use:

I am a "causal" user of suitesparse mostly as it can be used by other packages such as Sundials, numpy, scikit-umfpack, sparseqr, scikit-sparse, etc...

I've built my own SuiteSparse shared libraries (and the downstream packages mentioned above) on Linux systems (mostly Ubuntu) for several years now - one set of libraries linked against OpenBLAS/pthreads and one linked against MKL/GNU OpenMP. Mostly I stick with OpenBLAS/pthreads, as I get a sufficient performance boost without the headaches of trying to understand which threading library to use (or how).

However, after noting the (new) recommendation for SuiteSparse 5.6 and subsequently trying to work out what to do about it, I've run across discussions like this or this, as well as the Intel MKL linking recommendations here.

The hobgoblin in my little mind is now screaming at me to be foolishly consistent and either link SuiteSparse and all downstream packages that depend on it against GNU OpenMP or against Intel OpenMP - but not both (mixed), and certainly not mixed with pthreads.

Is this consistent with your understanding about threading library choice with respect to Suitesparse?

@DrTimothyAldenDavis (Owner) commented Oct 23, 2019 via email

@ghost (Author) commented Oct 23, 2019

Thank you for responding. I understand that it is broken with respect to CHOLMOD; I was hoping to learn a little more about why, and whether it affects my choice of threading library.

That said, I also understand that OpenBLAS, MKL, threading libraries, and my choice of downstream libraries to use with SuiteSparse are not really your concern (and I'm certainly able to figure it out for myself), so I will close this.

Thank you for your efforts on SuiteSparse and for making it available as open source.

@ghost closed this as completed Oct 23, 2019
@DrTimothyAldenDavis (Owner)

I'd like to keep this as an open issue. SuiteSparse has to play nicely with other libraries, and I don't want its performance to suffer if the wrong mix of libraries is used. I at least need to figure out how to navigate the linking process with multiple threading libraries, and provide some documentation on how end users should link the various libraries.

@copyme commented Oct 28, 2019

@DrTimothyAldenDavis I think this should be documented in a more visible place. I found this thread by chance and, as suggested, recompiled SuiteSparse with MKL, and found a significant speed improvement. Thanks!

@DrTimothyAldenDavis (Owner)

I added a comment about this to the top-level README.md file, in bold. This is a recent discovery. I'm hoping to track down the source of this slowdown and figure out a workaround. It may be a bug in OpenBLAS, or perhaps it's a library build issue. In the meantime, use the Intel MKL.

@ViralBShah (Contributor)

One thing to try is turning off multi-threading in OpenBLAS for small sizes with a build flag like GEMM_MULTITHREADING_THRESHOLD=50. Does that fix the issue?

@gruenich (Contributor)

Can we improve the title? This issue is difficult to spot. I was searching for "performance slowdown with multiple threads".

@vorticle

I don’t know if this is an option, but it seems that BLIS is offering what is needed to remedy this situation:

https://github.com/flame/blis/blob/master/docs/Multithreading.md

They implement a new BLAS-like interface that allows specifying the number of threads for each call individually. They also seem to offer a BLAS compatibility layer.

@drhpc commented May 22, 2021

I am considering packaging SuiteSparse with pkgsrc to be used in Octave and SciPy. So far we had an aging prototype package for version 4.0.2 that I used for building software for my users. I do not know if anyone would be affected by this bug here; I am not using Octave or SciPy myself.

Do you have a simple test case that I can run to check whether our system (16 cores per node) is affected by the OpenBLAS bug with a current build of OpenBLAS, with either pthreads or OpenMP?

@DrTimothyAldenDavis (Owner) commented May 22, 2021 via email

@drhpc commented May 23, 2021

As I have no experience with SuiteSparse, I have to guess … but wouldn't any call to CHOLMOD with some generated test data show the performance degradation when comparing differing BLAS implementations (or even just different OpenBLAS or OpenMP thread counts)? I'd just be happy with some code snippet that triggers the right paths for manual comparison.
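
Something along these lines is roughly what I have in mind - an untested sketch that reads a symmetric positive definite matrix in Matrix Market format from stdin and times the numeric factorization, so the same binary can be compared across BLAS builds and thread-count settings (cholmod_read_sparse and SuiteSparse_time both come with stock SuiteSparse):

#include <stdio.h>
#include "cholmod.h"

int main (void)
{
    cholmod_common Common ;
    cholmod_start (&Common) ;
    cholmod_sparse *A = cholmod_read_sparse (stdin, &Common) ;  // Matrix Market input
    cholmod_factor *L = cholmod_analyze (A, &Common) ;          // symbolic analysis
    double t = SuiteSparse_time ( ) ;
    cholmod_factorize (A, L, &Common) ;                         // numeric factorization (calls the BLAS)
    printf ("factorization time: %g seconds\n", SuiteSparse_time ( ) - t) ;
    cholmod_free_factor (&L, &Common) ;
    cholmod_free_sparse (&A, &Common) ;
    cholmod_finish (&Common) ;
    return (0) ;
}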

It is disheartening to see that this issue is somewhat known on the OpenBLAS side and here, but nobody has had the time to address it yet. I am wondering how prevalent the issue actually is.

Also, did this only start to happen with certain versions of SuiteSparse? Or just when CPU core counts became high enough for people to notice the threading issues with excessive locking?

@DrTimothyAldenDavis (Owner) commented May 24, 2021 via email

@drhpc commented May 25, 2021

That would be really great! I think once we have a self-contained program that shows the issue (not inside some Python binding, Octave, etc.), it is much more likely that the OpenBLAS people will fix it. So far it is something they generally have on the radar, but have no grip on.

@drhpc commented May 29, 2021

Nudge from the other side … OpenMathLib/OpenBLAS#1886

I have recently identified a historical oversight where a memory buffer is allocated although it is never used […] speed difference can reach a factor of ten, and some of these functions (mostly level 2 BLAS) appear to be on the code path of cholmod. Old versions of OpenBLAS until about three years ago did not employ any locking around buffer allocations so were much faster but wildly thread-unsafe. This may be the origin of the advice in the suitesparse code.

Does that fit the occurrence of the issue with SuiteSparse? Three years ago?

@DrTimothyAldenDavis (Owner) commented Jun 2, 2021 via email

@martin-frbg

Would running the cholmod_demo application with one of the provided test cases from its Matrix directory be expected to reproduce the problem, if it still exists? Or do you recall it requiring particular conditions or matrix sizes not represented there?

@DrTimothyAldenDavis (Owner) commented Jun 2, 2021 via email

@drhpc commented Jun 9, 2021

We got the OpenBLAS people interested, but so far we have only seen things like a 20% performance difference compared to MKL.

Are we sure this 100-fold slowdown thread-deadlock thing wasn't some bad interaction of OpenMP/pthreads with too many threads?

We need a test case that demonstrates this. Don't we have some SuiteSparse user who would be affected? Is there a forum where you could put out a call? For something as strong as the recommendation to stick to MKL here, there needs to be evidence. You know, science and all that ;-)

@DrTimothyAldenDavis (Owner) commented Jun 9, 2021 via email

@pjaap commented Nov 9, 2021

Hi! I followed this thread and remembered that I posted a hopefully related issue once in the Ubuntu bug tracker:
https://bugs.launchpad.net/ubuntu/+source/suitesparse/+bug/1843277

There is a small test program attached. I just tested it on a Debian 12 machine with libopenblas-dev installed and got a run-time factor of about 6 on a 12-thread/6-core CPU. To emphasize: in parallel it takes 6 times longer than sequentially!

Maybe this test helps to find the problem? If I remove libopenblas-dev (I guess libatlas-base-dev is used then), the run time is the same for both parallel and sequential runs (aside from the 400% vs. 100% CPU load).

@martin-frbg

@pjaap thank you very much. Your test case does indeed display pathological behaviour not seen in my earlier CHOLMOD tests with very small and very large matrices. (Now to find where and why threads are spinning instead of doing actual work...)

@martin-frbg

...though the absurd slowdown only occurs when a non-OpenMP build of OpenBLAS is paired with SuiteSparse, probably just leading to oversubscription. When both projects are built for OpenMP, it appears to be just an issue of finding the sweet spot between using all cores or only one. (With the test case on a 6-core Haswell with multithreading disabled, running on all cores takes twice as long as the optimum of running on only two, while a single thread is 20 percent slower. The non-OpenMP OpenBLAS, however, is ninety times slower on all cores compared to a single one, and does not benefit from multithreading at all.)

@DrTimothyAldenDavis changed the title from "Request for more information about why linking against mkl is now recommended for Suitesparse 5.6.0" to "Linking against Intel MKL BLAS recommend; OpenBLAS can cause 100x slowdown" on Nov 16, 2021
@DrTimothyAldenDavis (Owner)

Thanks for the test case. I'll give it a try. I've also revised the title of this issue.

@DrTimothyAldenDavis added the "external bug" label (porting issue, or problem with external library or system) on Dec 11, 2022
@freeseek commented Mar 8, 2023

I have encountered the same issue in a completely reproducible way. I used cholmod_gpu_stats() to display how much time was spent in each function. When CHOLMOD 4.0.3 was linked against MKL I got this:

CHOLMOD GPU/CPU statistics:
SYRK  CPU calls          856 time   1.6010e-03
      GPU calls            0 time   0.0000e+00
GEMM  CPU calls          633 time   9.2912e-04
      GPU calls            0 time   0.0000e+00
POTRF CPU calls          235 time   4.7159e-04
      GPU calls            0 time   0.0000e+00
TRSM  CPU calls          223 time   5.6744e-04
      GPU calls            0 time   0.0000e+00
time in the BLAS: CPU   3.5691e-03 GPU   0.0000e+00 total:   3.5691e-03
assembly time   0.0000e+00    0.0000e+00

When running with the Debian version of SuiteSparse, which uses CHOLMOD 3.0.14 and OpenBLAS:

CHOLMOD GPU/CPU statistics:
SYRK  CPU calls          830 time   3.6409e-03
      GPU calls            0 time   0.0000e+00
GEMM  CPU calls          610 time   1.3313e-03
      GPU calls            0 time   0.0000e+00
POTRF CPU calls          232 time   1.8505e-03
      GPU calls            0 time   0.0000e+00
TRSM  CPU calls          220 time   1.3902e-01
      GPU calls            0 time   0.0000e+00
time in the BLAS: CPU   1.4584e-01 GPU   0.0000e+00 total:   1.4584e-01
assembly time   0.0000e+00    0.0000e+00

The TRSM calls are indeed about 100x slower. Is there a way to check at runtime which BLAS implementation is being used? Is there a way I could code a check for this problem into my software, so that I could inform users that they should recompile and link CHOLMOD against a different BLAS implementation? The main issue is that this happens with the default Debian version of SuiteSparse.
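
One heuristic for the runtime check (an untested sketch, not CHOLMOD functionality): probe the running process for vendor-specific symbols with dlsym. openblas_get_config() is exported by OpenBLAS and MKL_Get_Version_String() by MKL, but neither will be visible if the BLAS was linked statically with hidden symbols, so treat the result as best-effort. Assumes a POSIX system; link with -ldl:

#include <stdio.h>
#include <dlfcn.h>

// Best-effort guess at which BLAS the current process is using.
static void report_blas (void)
{
    void *self = dlopen (NULL, RTLD_LAZY) ;     // main program plus its loaded libraries
    if (self == NULL)
    {
        printf ("cannot inspect process symbols\n") ;
        return ;
    }
    char *(*openblas_config) (void) =
        (char *(*) (void)) dlsym (self, "openblas_get_config") ;
    if (openblas_config != NULL)
    {
        printf ("BLAS looks like OpenBLAS: %s\n", openblas_config ( )) ;
    }
    else if (dlsym (self, "MKL_Get_Version_String") != NULL ||
             dlsym (self, "mkl_get_version_string") != NULL)
    {
        printf ("BLAS looks like Intel MKL\n") ;
    }
    else
    {
        printf ("BLAS vendor not identified\n") ;
    }
    dlclose (self) ;
}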

@martin-frbg

Same reproducer as in the Ubuntu ticket, or something different? I really, really want to fix this on the OpenBLAS side. (With the original reproducer, at least the current version realizes that it should run only one thread for the TRSM call, but the overhead from having all the other threads sitting around asking for work still makes it 5x slower. This looks to be a design flaw going back to the earliest GotoBLAS. I'm experimenting with actually calling openblas_set_num_threads(1) internally - which, by the way, could already be called from SuiteSparse-enabled programs as well if the workload is known to be small.)
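
From the caller's side that would look roughly like this (untested sketch; openblas_set_num_threads and openblas_get_num_threads are the standard OpenBLAS entry points, declared directly here because the header providing them varies between installations, and the CHOLMOD calls are only there to show where the cap would apply):

#include "cholmod.h"

extern void openblas_set_num_threads (int) ;
extern int  openblas_get_num_threads (void) ;

// Factorize a problem known to be small with the BLAS capped at one thread.
void factorize_small (cholmod_sparse *A, cholmod_factor **L, cholmod_common *cc)
{
    int saved = openblas_get_num_threads ( ) ;
    openblas_set_num_threads (1) ;              // small workload: avoid the thread pool overhead
    *L = cholmod_analyze (A, cc) ;
    cholmod_factorize (A, *L, cc) ;
    openblas_set_num_threads (saved) ;          // restore the caller's setting
}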

@freeseek

@martin-frbg, this is not the same reproducer. Somehow it seems that all of my matrices are affected by this issue. And I had not understood before that CHOLMOD 4 limits the number of threads for the TRSM calls.

For other users bumping into this issue, CHOLMOD 4 now has two new variables in the cholmod_common structure:

    double chunk ;      // chunksize for computing # of threads to use.
                        // Given nwork work to do, # of threads is
                        // max (1, min (floor (work / chunk), nthreads_max))
    int nthreads_max ;  // max # of threads to use in CHOLMOD.  Defaults to
                        // SUITESPARSE_OPENMP_MAX_THREADS.

These are initialized by cholmod_start() as:

    // -------------------------------------------------------------------------
    // OpenMP initializations
    // -------------------------------------------------------------------------

    Common->chunk = 128000 ;
    Common->nthreads_max = SUITESPARSE_OPENMP_MAX_THREADS ;

Together with the new function cholmod_nthreads():

//==============================================================================
//=== openmp support ===========================================================
//==============================================================================

static inline int cholmod_nthreads  // returns # of OpenMP threads to use
(
    double work,                    // total work to do
    cholmod_common *Common
)
{ 
    #ifdef _OPENMP
    double chunk = Common->chunk ;  // give each thread at least this much work
    int nthreads_max = Common->nthreads_max ;   // max # of threads to use
    if (nthreads_max <= 0)
    {
        nthreads_max = SUITESPARSE_OPENMP_MAX_THREADS ;
    }
    work  = MAX (work, 1) ;
    chunk = MAX (chunk, 1) ;
    int64_t nthreads = (int64_t) floor (work / chunk) ;
    nthreads = MIN (nthreads, nthreads_max) ;
    nthreads = MAX (nthreads, 1) ;
    return ((int) nthreads) ;
    #else
    return (1) ;
    #endif
}

This is now used in t_cholmod_super_numeric.c for the supernodal numeric factorization, and the number of threads is then set through:

#pragma omp parallel for num_threads(nthreads)

I am guessing here that CHOLMOD 4 is trying to limit the number of threads for small matrices, to avoid the slowdown issue. However, for some reason, this infrastructure is not present in t_cholmod_super_solve.c.
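
In user code this means the heuristic can be tuned per cholmod_common before factorizing, using the two fields quoted above; the values below are only illustrative:

    cholmod_common Common ;
    cholmod_start (&Common) ;       // sets chunk = 128000 and the default nthreads_max
    Common.nthreads_max = 2 ;       // cap the OpenMP threads used inside CHOLMOD
    Common.chunk = 256000 ;         // require more work per thread before going parallel
    // ... cholmod_analyze / cholmod_factorize as usual ...
    cholmod_finish (&Common) ;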

So maybe it would be enough to reach out to @svillemot, the maintainer of the suitesparse Debian package, and see if he can update it to a more recent version, as the current version is still based on CHOLMOD 3.

@martin-frbg

That may be a different part of the problem, though not directly related to OpenBLAS. I'm currently testing with the latest release of SuiteSparse built from source, so I am not affected by any delays in getting new versions of either package into Debian. (By the way, the OpenBLAS provided by Debian is possibly a bit older as well.)

@svillemot (Contributor)

Debian sid/unstable currently has CHOLMOD 3.0.14 from SuiteSparse 5.12.0. For the time being I cannot update it, because Debian is currently frozen (preparing for the release of Debian “Bookworm” 12).
Regarding OpenBLAS, Debian sid/unstable has version 0.3.21, which is up-to-date.

@DrTimothyAldenDavis changed the title from "Linking against Intel MKL BLAS recommend; OpenBLAS can cause 100x slowdown" to "Linking against Intel MKL BLAS recommended; OpenBLAS can cause 100x slowdown" on Oct 9, 2023
@DrTimothyAldenDavis (Owner) commented Jan 23, 2024

This issue is likely fixed by OpenMathLib/OpenBLAS#4441, which fixes OpenMathLib/OpenBLAS#2265 and OpenMathLib/OpenBLAS#2846. The fix will appear in OpenBLAS 0.3.27.
