Stack overflow on large lu decomposition with OpenBLAS #877
On another system, where I built OpenBLAS locally instead of using BinaryBuilder:
julia> versioninfo()
Julia Version 1.8.0-DEV.660
Commit 153db7a7a8* (2021-10-05 16:17 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, cascadelake)
Environment:
JULIA_NUM_THREADS = 36
julia> using LinearAlgebra
julia> A = rand(10,10);
julia> lu(A);
julia> A = rand(100,100);
julia> lu(A);
julia> A = rand(1000,1000);
julia> lu(A);
julia> using MKL
julia> lu(A);
julia> BLAS.get_num_threads()
18
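Until the underlying OpenBLAS issue is fixed, a minimal workaround sketch (not from the thread; the thread cap of 8 is an assumption, pick whatever is safe on your machine) is to reduce the BLAS thread count before factorizing, since the overflow only appears with many threads:

using LinearAlgebra

A = rand(1000, 1000)

old = BLAS.get_num_threads()   # 18 in the session above
BLAS.set_num_threads(8)        # assumed-safe cap, not a recommendation from the thread
F = lu(A)                      # should now complete without a StackOverflowError
BLAS.set_num_threads(old)      # restore the previous setting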
Can you get an rr trace (on Linux x86_64)? One way to do so is to download the
I sent the file to you on Slack, because GitHub doesn't allow
@chriselrod Could you bisect?
@Keno @ianatol What would be the best way for @chriselrod to share his rr trace with you? DM it to you on Slack?
😭 |
@Keno Is it possible that this bug was introduced by Reference-LAPACK/lapack#625?
It can't be those changes. The routines there are related to eigenvalue computations, and the issue here is the LU factorization. It's not obvious where a stack overflow in https://github.com/JuliaLang/julia/blob/592cacc7bd29e40eaa8a47f0db60605ec8fda5f9/stdlib/LinearAlgebra/src/lapack.jl#L575-L579 would come from.
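For reference, a rough sketch of what a getrf!-style wrapper does (illustrative only, not the exact stdlib code at the linked lines, and the library name is a placeholder): the Julia side merely allocates the pivot vector and an info flag before handing the matrix to LAPACK's dgetrf_, so any large stack allocation would have to happen inside the native library.

using LinearAlgebra
using LinearAlgebra: BlasInt

function getrf_sketch!(A::Matrix{Float64})
    m, n = size(A)
    lda  = max(1, stride(A, 2))
    ipiv = Vector{BlasInt}(undef, min(m, n))  # pivot indices, heap-allocated
    info = Ref{BlasInt}()
    # "libopenblas" is a placeholder for whichever LAPACK library is actually
    # loaded (recent Julia versions route this through libblastrampoline).
    ccall((:dgetrf_, "libopenblas"), Cvoid,
          (Ref{BlasInt}, Ref{BlasInt}, Ptr{Float64}, Ref{BlasInt},
           Ptr{BlasInt}, Ref{BlasInt}),
          m, n, A, lda, ipiv, info)
    return A, ipiv, info[]
end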
Presumably OpenBLAS is internally allocating a large amount of stack memory for something when multithreaded for large enough matrices.
Could it be related to the changes mentioned in https://discourse.julialang.org/t/blas-performance-testing-for-julia-1-8/69520
Yes, my guess is that we are allocating a lot of stack memory, and 4096 max threads may be too much. 4096 was a bit too aggressive, I thought, and perhaps we need to go to 512 for now. This is the stuff in the OpenBLAS getrf code; I wonder if their OpenMP versions are better on this front: https://github.com/xianyi/OpenBLAS/blob/develop/lapack/getrf/getrf_parallel_omp.c
You will start seeing these in Julia source builds that don't use BB after JuliaLang/julia#42584 is merged to bring the source build in line with BB.
@chriselrod Can you provide a reasonably good number for max threads that we can get away with for now? Is it 1024, 512, or even smaller?
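As a back-of-the-envelope illustration of that concern (the per-entry sizes below are guesses, not measurements of OpenBLAS's actual job structures): with on-stack arrays sized by a compile-time maximum of 4096 threads, even a modest per-entry footprint consumes a large share of a typical 8 MiB thread stack, while 512 leaves plenty of headroom.

# Rough arithmetic only; entry_bytes is an assumed per-thread job/buffer footprint.
stack_bytes = 8 * 1024^2              # common default thread stack size on Linux
for max_threads in (4096, 1024, 512), entry_bytes in (512, 2048)
    used = max_threads * entry_bytes
    pct  = round(100 * used / stack_bytes; digits = 1)
    println("max threads = $max_threads, $entry_bytes B/entry => $pct% of an 8 MiB stack")
end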
I feel we should be able to force
I had not seen
I also don't really see why we can't use variable-length arrays (added in C99).
Wouldn't that require significant patching and testing of OpenBLAS? Not something I want to sign up for myself!
Are the largest core counts today 128? The biggest Epycs are 64 cores, I believe in up to a 2-socket configuration for 128 cores. (If a system has 128 cores / 256 threads, we'd prefer 128 BLAS threads.) Zen4 will be out in about a year; it is moving from 7 nm to 5 nm, so I'd expect it to bump up core counts again. Depending on how long you want this change to be good for, 256 will probably last a couple of years for x86.
It'd also be interesting to actually run benchmarks on systems with these huge core counts to see how things scale as a function of matrix size. Note that OpenBLAS doesn't look like it's doing well with non-monolithic memory: e.g., the monolithic Intel 10980XE seems over 4x faster than the AMD 5950X for matmul, instead of the expected roughly 2x. So it may take some tuning to get a large percentage of peak FLOPS.
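A simple benchmarking sketch along those lines (the sizes and thread counts here are my own choices, not from the thread): sweep matrix sizes and BLAS thread counts and report rough GFLOPS for the LU factorization.

using LinearAlgebra

for n in (256, 1024, 4096), nthreads in (1, 8, 16)
    BLAS.set_num_threads(nthreads)
    A = rand(n, n)
    lu(copy(A))                          # warm-up run
    t = @elapsed lu(copy(A))
    gflops = (2 / 3) * n^3 / t / 1e9     # approximate LU flop count: (2/3) n^3
    println("n = $n, threads = $nthreads: ", round(t; digits = 4), " s, ",
            round(gflops; digits = 1), " GFLOPS")
end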
@chriselrod Is 512 good enough to avoid stack overflows?
Because VLAs were made optional in C11, but there's always been alloca.
I'd say the non-monolithic cases are handled by vendor BLAS through our LBT mechanism. We just want to pick reasonable defaults out of the box for OpenBLAS for now. IMO, any real effort we put in should go into our Julia-native versions.
@KristofferC this is what the latest OpenBLAS binaries fix.
So this looks more like an OpenBLAS bug.
Only shows up when using multiple threads: