
BLAS threading preferences when building SuiteSparse #432

Closed
ViralBShah opened this issue Oct 8, 2023 · 6 comments

@ViralBShah
Contributor

ViralBShah commented Oct 8, 2023

Is there a preferred way to build a BLAS library when linking to SuiteSparse? With OpenBLAS, we see many cases of poor performance when UMFPACK uses a multi-threaded build.

OpenBLAS, however, also offers an OpenMP build, and I was wondering whether there is experience with that working better.

We see good out-of-the-box performance with MKL, but it is a very heavy, platform-specific dependency, and we are wondering whether there are defaults that provide better out-of-the-box performance with OpenBLAS.

@ViralBShah ViralBShah changed the title Threading preferences when building SuiteSparse BLAS threading preferences when building SuiteSparse Oct 8, 2023
@DrTimothyAldenDavis
Owner

I've had performance issues with OpenBLAS and its multithreading, sometimes leading to a 100x slowdown. MKL works fine. I'm not sure what's causing it; it's an open issue in SuiteSparse. I haven't tried building OpenBLAS myself.

@mmuetzel
Contributor

mmuetzel commented Oct 9, 2023

If you are building SuiteSparse with OpenMP (the default if OpenMP is available), you should also use an OpenBLAS library that was built with OpenMP. Otherwise, you might end up with many times the expected number of threads (e.g., each OpenMP thread starting its own set of OpenBLAS pthreads), which can hurt performance significantly.
See: https://github.com/OpenMathLib/OpenBLAS/wiki/faq#multi-threaded

Depending on the CPU you are using, it may also help to limit the number of OpenMP threads to the number of physical cores of your CPU(s) (as opposed to the number of possible hyperthreads). IIUC, that is because hyperthreads running on the same physical core can evict (some of) each other's CPU cache contents, leading to repeated transfers between CPU caches and main memory. That can also have a notable performance impact.
For that, set OMP_NUM_THREADS to the number of physical cores.
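A minimal sketch of that setup on Linux. It assumes `lscpu` is available; the `getconf` fallback counts logical CPUs, not physical cores, so it is only a rough substitute:

```shell
# lscpu -p=Core,Socket prints one line per logical CPU, so the number of
# unique (core,socket) pairs is the number of physical cores.
physical_cores=$(lscpu -p=Core,Socket 2>/dev/null | grep -v '^#' | sort -u | wc -l)
# Fallback if lscpu is unavailable: logical CPU count (includes hyperthreads).
[ "$physical_cores" -gt 0 ] 2>/dev/null || physical_cores=$(getconf _NPROCESSORS_ONLN)

# One OpenMP thread per physical core, for SuiteSparse and an
# OpenMP-built OpenBLAS alike.
export OMP_NUM_THREADS="$physical_cores"
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```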

@ViralBShah
Contributor Author

In Julia, we build OpenBLAS and SuiteSparse with pthreads. I suspect the slowdown comes from OpenBLAS multi-threading: for small matmuls, it slows down by using all the available threads, so it is perhaps best to use single-threaded OpenBLAS with UMFPACK.
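As a sketch, a pthreads build of OpenBLAS can be pinned to one thread at runtime via its own environment variable (for an OpenMP build, OMP_NUM_THREADS applies instead, which would also throttle SuiteSparse's own OpenMP parallelism):

```shell
# Force single-threaded OpenBLAS inside UMFPACK's BLAS calls.
# For a pthreads build this does not affect SuiteSparse's OpenMP threads.
export OPENBLAS_NUM_THREADS=1
```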

@DrTimothyAldenDavis
Owner

That would slow down UMFPACK, CHOLMOD, and SPQR on large matrices.

Also, in the future, I will be revising SPQR to replace TBB with OpenMP tasking. In that case, I would need to use nested parallelism, where multiple fronts can be factorized in parallel, and each front uses a different number of threads. MKL supports that but OpenBLAS does not.

We also have a new parallel version of UMFPACK, called ParU, which will be added soon to SuiteSparse. That uses the same strategy, and it only works well with the Intel MKL BLAS.

Ideally, there would be a way to use OpenBLAS like this but I'm not sure if that's possible.

@DrTimothyAldenDavis
Owner

See also #1

@ViralBShah
Contributor Author

I am closing this in favour of #1. Looking forward to ParU and the new SPQR.
