Stack overflow on large lu decomposition with OpenBLAS #877
On another system, where I built OpenBLAS locally instead of using BinaryBuilder:
julia> versioninfo()
Julia Version 1.8.0-DEV.660
Commit 153db7a7a8* (2021-10-05 16:17 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, cascadelake)
Environment:
JULIA_NUM_THREADS = 36
julia> using LinearAlgebra
julia> A = rand(10,10);
julia> lu(A);
julia> A = rand(100,100);
julia> lu(A);
julia> A = rand(1000,1000);
julia> lu(A);
julia> using MKL
julia> lu(A);
julia> BLAS.get_num_threads()
18
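Until the underlying OpenBLAS issue is fixed, a minimal workaround sketch (not from the thread; the thread cap of 8 is an assumption, pick whatever is safe on your machine) is to reduce the BLAS thread count before factorizing, since the overflow only appears with many threads:

using LinearAlgebra

A = rand(1000, 1000)

old = BLAS.get_num_threads()   # 18 in the session above
BLAS.set_num_threads(8)        # assumed-safe cap, not a recommendation from the thread
F = lu(A)                      # should now complete without a StackOverflowError
BLAS.set_num_threads(old)      # restore the previous setting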
Can you get an rr trace (on Linux x86_64)? One way to do so is to download the
I sent the file to you on Slack, because GitHub doesn't allow
@chriselrod Could you bisect?
@Keno @ianatol What would be the best way for @chriselrod to share his rr trace with you? DM it to you on Slack?
😭 |
@Keno Is it possible that this bug was introduced by Reference-LAPACK/lapack#625?
It can't be those changes. The routines there are related to eigenvalue computations, and the issue here is the LU factorization. It's not obvious where a stack overflow in https://github.com/JuliaLang/julia/blob/592cacc7bd29e40eaa8a47f0db60605ec8fda5f9/stdlib/LinearAlgebra/src/lapack.jl#L575-L579 would come from.
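For reference, a rough sketch of what a getrf!-style wrapper does (illustrative only, not the exact stdlib code at the linked lines, and the library name is a placeholder): the Julia side merely allocates the pivot vector and an info flag before handing the matrix to LAPACK's dgetrf_, so any large stack allocation would have to happen inside the native library.

using LinearAlgebra
using LinearAlgebra: BlasInt

function getrf_sketch!(A::Matrix{Float64})
    m, n = size(A)
    lda  = max(1, stride(A, 2))
    ipiv = Vector{BlasInt}(undef, min(m, n))  # pivot indices, heap-allocated
    info = Ref{BlasInt}()
    # "libopenblas" is a placeholder for whichever LAPACK library is actually
    # loaded (recent Julia versions route this through libblastrampoline).
    ccall((:dgetrf_, "libopenblas"), Cvoid,
          (Ref{BlasInt}, Ref{BlasInt}, Ptr{Float64}, Ref{BlasInt},
           Ptr{BlasInt}, Ref{BlasInt}),
          m, n, A, lda, ipiv, info)
    return A, ipiv, info[]
end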
Presumably OpenBLAS is internally allocating a large amount of stack memory for something when multithreaded for large enough matrices.
Could it be related to the changes mentioned in https://discourse.julialang.org/t/blas-performance-testing-for-julia-1-8/69520
Yes, my guess is that we are allocating a lot of stack memory, and 4096 max threads may be too much. 4096 was a bit too aggressive, I thought, and perhaps we need to go to 512 for now. This is the stuff in the OpenBLAS getrf code; I wonder if their OpenMP versions are better on this front: https://github.com/xianyi/OpenBLAS/blob/develop/lapack/getrf/getrf_parallel_omp.c
You will start seeing these in Julia source builds that don't use BB after JuliaLang/julia#42584 is merged to bring the source build in line with BB.
@chriselrod Can you provide a reasonably good number for max threads that we can get away with for now? Is it 1024, 512, or even smaller?
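As a back-of-the-envelope illustration of that concern (the per-entry sizes below are guesses, not measurements of OpenBLAS's actual job structures): with on-stack arrays sized by a compile-time maximum of 4096 threads, even a modest per-entry footprint consumes a large share of a typical 8 MiB thread stack, while 512 leaves plenty of headroom.

# Rough arithmetic only; entry_bytes is an assumed per-thread job/buffer footprint.
stack_bytes = 8 * 1024^2              # common default thread stack size on Linux
for max_threads in (4096, 1024, 512), entry_bytes in (512, 2048)
    used = max_threads * entry_bytes
    pct  = round(100 * used / stack_bytes; digits = 1)
    println("max threads = $max_threads, $entry_bytes B/entry => $pct% of an 8 MiB stack")
end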
I feel we should be able to force
I had not seen
I also don't really see why we can't use variable-length arrays (added in C99).
Wouldn't that require significant patching and testing of OpenBLAS? Not something I want to sign up for myself!
Are the largest core counts today 128? The biggest Epycs are 64 cores, I believe in up to a 2-socket configuration for 128 cores. (If a system has 128 cores / 256 threads, we'd prefer 128 BLAS threads.) Zen4 will be out in about a year; it is moving from 7 nm to 5 nm, so I'd expect it to bump up core counts again. Depending on how long you want this change to be good for, 256 will probably last a couple of years for x86.
It'd also be interesting to actually run benchmarks on systems with these huge core counts to see how things scale as a function of matrix size. Note that OpenBLAS doesn't look like it's doing well with non-monolithic memory: e.g., the monolithic Intel 10980XE seems over 4x faster than the AMD 5950X for matmul, instead of the expected roughly 2x. So it may take some tuning to get a large percentage of peak FLOPS.
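A simple benchmarking sketch along those lines (the sizes and thread counts here are my own choices, not from the thread): sweep matrix sizes and BLAS thread counts and report rough GFLOPS for the LU factorization.

using LinearAlgebra

for n in (256, 1024, 4096), nthreads in (1, 8, 16)
    BLAS.set_num_threads(nthreads)
    A = rand(n, n)
    lu(copy(A))                          # warm-up run
    t = @elapsed lu(copy(A))
    gflops = (2 / 3) * n^3 / t / 1e9     # approximate LU flop count: (2/3) n^3
    println("n = $n, threads = $nthreads: ", round(t; digits = 4), " s, ",
            round(gflops; digits = 1), " GFLOPS")
end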
@chriselrod Is 512 good enough to avoid stack overflows?
Because VLAs were made optional in C11, but there's always been alloca.
I'd say the non-monolithic cases are handled by vendor BLAS through our LBT mechanism. We just want to pick reasonable defaults out of the box for OpenBLAS for now. IMO, any real effort we put in should go into our Julia-native versions.
@KristofferC this is what the latest OpenBLAS binaries fix.
So this looks more like an OpenBLAS bug.
Only shows up when using multiple threads: