Linking against Intel MKL BLAS recommended; OpenBLAS can cause 100x slowdown #1
OpenBLAS causes a 100x slowdown in CHOLMOD, as compared to the non-supernodal method and as compared to the supernodal method with the MKL BLAS. It's broken.
|
Thank you for responding. I understand that it is broken with respect to CHOLMOD; I was hoping to learn a little more about why, and whether it affects my choice of threading library. That said, I also understand that OpenBLAS, MKL, threading libraries, and my choice of downstream libraries to use with SuiteSparse are not really your concern (and I'm certainly able to figure it out for myself), so I will close this. Thank you for your efforts on SuiteSparse and for making it available as open source. |
I'd like to keep this as an open issue. SuiteSparse has to play nicely with other libraries, and I don't want its performance to suffer if the wrong mix of libraries is used. I at least need to figure out how to navigate the linking process with multiple threading libraries, and provide some documentation on how end users should link the various libraries. |
@DrTimothyAldenDavis I think this should be documented in a more visible place. I found this thread by chance and as suggested I recompiled SuiteSparse with MKL, only to find a significant speed improvement. Thanks! |
I added a comment to this in the top-level README.md file, in bold. This is a recent discovery. I'm hoping to track down the source of this slowdown, and figure out a workaround. It may be a bug in OpenBLAS, or perhaps it's a library build issue. In the meantime, use the Intel MKL. |
One option is to turn off multi-threading in OpenBLAS for small sizes with a build flag like […] |
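Until such a fix is in place, the usual stop-gap is to cap the BLAS thread count at runtime. A minimal sketch, assuming the standard OpenBLAS/OpenMP environment variables are honored by the build in use (they must be set before the BLAS library is loaded):

```python
import os

# OPENBLAS_NUM_THREADS is specific to OpenBLAS; OMP_NUM_THREADS applies
# to OpenMP builds. In a Python process these must be set before the
# first import of numpy/scipy, since that is when the BLAS is loaded.
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
os.environ.setdefault("OMP_NUM_THREADS", "1")
print(os.environ["OPENBLAS_NUM_THREADS"], os.environ["OMP_NUM_THREADS"])
```

Whether capping at one thread helps or hurts depends on the matrix sizes involved, so it is worth benchmarking both settings.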
Can we improve the title? This issue is difficult to spot. I was searching for "performance slowdown with multiple threads". |
I don’t know if this is an option, but it seems that BLIS offers what is needed to remedy this situation: https://github.com/flame/blis/blob/master/docs/Multithreading.md They implement a BLAS-like interface that lets callers specify the number of threads for each call individually. They also seem to offer a BLAS compatibility layer. |
I am considering packaging SuiteSparse with pkgsrc to be used in Octave and SciPy. We had an aging prototype package for version 4.0.2 so far that I used for building software for my users. I do not know if anyone would be affected by this bug here; I am not using Octave or SciPy myself. Do you have a simple test case that I can run to check whether our system (16 cores per node) is affected by the OpenBLAS bug with a current build of OpenBLAS with pthreads or OpenMP? |
No, I don't have a test for this bug. I do need to figure out a workaround
or at least a test, but I haven't had a chance to do that yet. It's not
easy working around bugs in other packages.
|
As I have no experience with SuiteSparse, I have to guess … but wouldn't any call to CHOLMOD with some generated test data show the performance degradation when comparing it across differing BLAS implementations (or even just with different OpenBLAS or OpenMP thread counts)? I'd just be happy with some code snippet that triggers the right paths for manual comparison. It is disheartening to see that this issue is somewhat known on the OpenBLAS side and here, but nobody has had the time to address it yet. I am wondering how prevalent the issue is at all. Also, did this only start to happen with certain versions of SuiteSparse? Or just when CPU core counts became high enough for people to notice the threading issues with excessive locking? |
I can see about putting a test case together. The issue would not likely arise in dense matrix computations, and I would think that's where a BLAS library gets tested by its developers. The issue would arise within a sparse multifrontal or supernodal method, which makes a huge number of "tiny BLAS" calls, a smaller number of "medium size" calls, fewer still of "big size", and so on.
Most high-performance sparse direct methods are like this, so this
performance issue would affect most of them.
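This call-size argument can be made concrete with a toy model. A hedged sketch, where the 5 µs fixed per-call cost and the 1 GFLOP/s rate are made-up illustrative numbers, not measurements:

```python
def overhead_ratio(n, per_call_us=5.0, gflops=1.0):
    """Ratio of an assumed fixed per-call cost (locking, waking worker
    threads) to the ~2*n^3 flop cost of one n-by-n GEMM call."""
    compute_us = 2.0 * n ** 3 / (gflops * 1000.0)  # 1 GFLOP/s = 1000 flops/us
    return per_call_us / compute_us

# Under these assumptions the fixed cost swamps a "tiny BLAS" call but is
# negligible for a big one, which is why dense benchmarks miss the problem.
for n in (4, 64, 512):
    print(n, overhead_ratio(n))
```

With these numbers a 4x4 call spends tens of times longer in overhead than in arithmetic, while a 512x512 call barely notices it; a factorization dominated by tiny calls can therefore slow down by a large factor overall.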
|
That would be really great! I think once we have a self-contained program that shows the issue (not inside some Python binding, Octave, etc.), it is much more likely that the OpenBLAS people will fix it. So far it is something they generally have on the radar, but have no grip on. |
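In that spirit, a rough harness can be sketched even without CHOLMOD itself: hammer the linked BLAS/LAPACK with many small factorizations, which mimics the access pattern a supernodal solver produces. A sketch using NumPy (this assumes NumPy is linked against the BLAS under test; the matrix size and repetition count are arbitrary choices, and the run should be repeated under different thread-count settings for comparison):

```python
import time
import numpy as np

def tiny_chol_burst(n=8, reps=2000, seed=0):
    """Time many Cholesky factorizations of one small SPD matrix."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    spd = a @ a.T + n * np.eye(n)  # well-conditioned SPD matrix
    t0 = time.perf_counter()
    for _ in range(reps):
        np.linalg.cholesky(spd)   # calls into the linked LAPACK/BLAS
    return time.perf_counter() - t0

print(f"{tiny_chol_burst():.3f} s for 2000 tiny Cholesky calls")
```

If the pathological per-call overhead is present, the elapsed time should grow sharply with the BLAS thread count rather than shrink.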
Nudge from the other side … OpenMathLib/OpenBLAS#1886
Does that fit the occurrence of the issue with SuiteSparse, about three years ago? |
That sounds about the time that I saw the huge 100x slowdown (it was more
than 10x). So perhaps this is fixed now.
(Quoted from OpenMathLib/OpenBLAS#1886: "I have recently identified a historical oversight where a memory buffer is allocated although it is never used […] speed difference can reach a factor of ten, and some of these functions (mostly level 2 BLAS) appear to be on the code path of cholmod. Old versions of OpenBLAS until about three years ago did not employ any locking around buffer allocations, so they were much faster but wildly thread-unsafe. This may be the origin of the advice in the suitesparse code.")
|
Would running the cholmod_demo application with one of the provided test cases from its Matrix directory be expected to reproduce the problem, if it still exists? Or do you recall it requiring particular conditions or matrix sizes not represented there? |
I doubt it. The matrix would need to be larger. I can't recall the
conditions or matrix sizes that led to the 100x slowdown.
|
We got OpenBLAS people interested, but so far we have only seen things like a 20% performance difference compared to MKL. Are we sure this 100-fold slowdown, thread-deadlock thing wasn't some bad interaction of OpenMP/pthreads with too many threads? We need a test case that demonstrates this. Don't we have some SuiteSparse user that would be affected? Is there a forum where you could place a call? For something as strong as the recommendation to stick to MKL here, there needs to be evidence. You know, science and all that ;-) |
I understand the need for a test case. I will try to dig it up. I saw the
100x slowdown myself, in my own tests, but I need to locate that test case.
|
Hi! I followed this thread and remembered that I posted a hopefully related issue once in the Ubuntu bug tracker; there is a small test program attached. I just tested it on a Debian 12 machine with […]. Maybe this test helps to find the problem? If I remove […] |
@pjaap thank you very much. Your test case does indeed display pathological behaviour not seen in my earlier CHOLMOD tests with very small and very large matrices. (Now to find where and why threads are spinning instead of doing actual work...) |
...though the absurd slowdown only occurs when a non-OpenMP build of OpenBLAS is paired with SuiteSparse, probably just leading to oversubscription. When both projects are built for OpenMP, it appears to be just an issue of finding the sweet spot between using all cores and only one. (With the test case on a 6-core Haswell with multithreading disabled, running on all cores takes twice as long as the optimum of running on only two, while a single thread is 20 percent slower.) The non-OpenMP OpenBLAS, however, is ninety times slower on all cores compared to a single one, and does not benefit from multithreading at all. |
Thanks for the test case. I'll give it a try. I've also revised the title to this issue. |
I have encountered the same issue in a completely reproducible way. I used cholmod_gpu_stats() to display how much time was spent in each function. When CHOLMOD 4.0.3 was linked to MKL I got this:
When running with the Debian version of SuiteSparse which uses CHOLMOD 3.0.14 and OpenBLAS:
The TRSM calls are indeed 100x slower. Is there a way to check at runtime what version of BLAS is being used? Is there a way I could code into my software a check for the presence of this problem so that I could inform the user that he should recompile and link CHOLMOD to a different BLAS implementation? The main issue is that this happens with the default Debian version of SuiteSparse. |
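On the runtime-detection question: OpenBLAS exports an `openblas_get_config()` function that returns a version/configuration string, so a process can probe whether OpenBLAS is the BLAS it actually loaded. A sketch via `ctypes` (the probe is a heuristic and an assumption on my part; it only sees symbols visible in the process's global namespace, so a BLAS hidden behind a private namespace would not be found):

```python
import ctypes

def blas_flavour():
    """Probe the current process image for OpenBLAS's config string."""
    try:
        proc = ctypes.CDLL(None)  # symbols already loaded in this process
        proc.openblas_get_config.restype = ctypes.c_char_p
        return proc.openblas_get_config().decode()
    except (OSError, AttributeError, ValueError):
        return "OpenBLAS not detected"

print(blas_flavour())
```

A C program can do the equivalent with `dlsym(RTLD_DEFAULT, "openblas_get_config")`, which would let an application warn its users that a problematic BLAS is in use.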
Same reproducer as in the Ubuntu ticket, or something different? I really, really want to fix this on the OpenBLAS side. (With the original reproducer, at least the current version realizes that it should run only one thread for the TRSM call, but the overhead from having all the other threads sitting around asking for work still makes it 5x slower. Looks to be a design flaw going back to the earliest GotoBLAS. I'm experimenting with actually using openblas_set_num_threads(1) internally, which BTW could already be called from SuiteSparse-enabled programs as well if the workload is known to be small.) |
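The `openblas_set_num_threads()` entry point mentioned above can likewise be reached from an application without a compile-time dependency on OpenBLAS. A sketch via `ctypes` (an assumption-laden illustration, not SuiteSparse's own mechanism; it is a no-op when OpenBLAS is not the loaded BLAS):

```python
import ctypes

def limit_openblas_threads(n=1):
    """Ask an already-loaded OpenBLAS to use n threads; returns False
    when OpenBLAS is not present in this process."""
    try:
        ctypes.CDLL(None).openblas_set_num_threads(int(n))
        return True
    except (OSError, AttributeError, ValueError):
        return False

# e.g. call limit_openblas_threads(1) just before a burst of tiny solves
print(limit_openblas_threads(1))
```

Calling this with 1 right before a run dominated by small factorizations, and restoring a larger count afterwards, is one way for an application to work around the overhead without rebuilding anything.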
@martin-frbg, this is not the same reproducer. Somehow it seems that all of my matrices are affected by this issue. And I had not understood before that CHOLMOD 4 limits the number of threads for the TRSM calls. For other users bumping into this issue: CHOLMOD 4 now has two new variables in the […], which are initialized by […], together with the new function […] that is now used in […]. I am guessing here that CHOLMOD 4 is trying to limit the number of threads for small matrices, to avoid the slowdown issue. Though, for some reason, this infrastructure is not present in […]. So maybe it would be enough to reach out to @svillemot, the maintainer of the suitesparse Debian package, and see if he can update it to a more recent version, as the current version is still based on CHOLMOD 3. |
That may be a different part of the problem, though not directly related to OpenBLAS. I'm currently testing with the latest release of SuiteSparse built from source, so not affected by any delays in getting new versions of either package into Debian. (BTW, the OpenBLAS provided by Debian is possibly a bit older as well.) |
Debian sid/unstable currently has CHOLMOD 3.0.14 from SuiteSparse 5.12.0. For the time being I cannot update it, because Debian is currently frozen (preparing for the release of Debian “Bookworm” 12). |
This issue is likely fixed by OpenMathLib/OpenBLAS#4441 which fixes OpenMathLib/OpenBLAS#2265 and OpenMathLib/OpenBLAS#2846 . The fix will appear on OpenBLAS 0.3.27. |
For version 5.4, SuiteSparse_config.mk recommended OpenBLAS. For version 5.6, Intel MKL is now recommended, and the config file looks like it wants to link against Intel OpenMP.
TL;DR: is the CHOLMOD performance degradation related to the threading library (pthreads vs. GNU OpenMP vs. Intel OpenMP)? Can you explain a little more (or just link to a discussion of the issue)? Does this issue affect version 5.4 when linked against OpenBLAS/pthreads?
Some background about why I'm curious about which threading library to use:
I am a "casual" user of SuiteSparse, mostly as it can be used by other packages such as Sundials, numpy, scikit-umfpack, sparseqr, scikit-sparse, etc.
I've built my own SuiteSparse shared libraries (and the downstream packages mentioned above) on Linux systems (mostly Ubuntu) for several years now: one set of libraries linked against OpenBLAS/pthreads and one linked against MKL/GNU OpenMP. Mostly I stick with OpenBLAS/pthreads, as I get a sufficient performance boost without the headaches of trying to understand which threading library to use (or how).
However, after noting the (new) recommendation for SuiteSparse 5.6 and subsequently trying to work out what to do about it, I've run across discussions like this or this, as well as the Intel MKL linking recommendations here.
The hobgoblin in my little mind is now screaming at me to be foolishly consistent and either link Suitesparse and all subsequent downstream packages that depend on it against GNU openmp or against intel openmp - but not both (mixed) and certainly not mixed with pthreads.
Is this consistent with your understanding about threading library choice with respect to Suitesparse?