Stack overflow when compiled with 4096 NUM_THREADS #3403

Closed
ViralBShah opened this issue Oct 11, 2021 · 12 comments

Comments

@ViralBShah
Contributor

We decided to set a very high NUM_THREADS for building OpenBLAS: 4096. That leads to a stack overflow in getrf_parallel. I believed this setting should have caused ALLOC_HEAP to be used, but perhaps 4096 is still too large for all the other buffers that get stack-allocated.

JuliaLang/LinearAlgebra.jl#877

@martin-frbg
Collaborator

4096?? Why would you even want to do that? Do you have (or know of) any hardware with that many cores? Presumably you would also need to raise the default ulimit your operating system puts on stack size.

@ViralBShah
Contributor Author

ViralBShah commented Oct 12, 2021

We used to keep it conservative and then had to bump it up repeatedly. So we decided to just make it really big and stop worrying about it.

A really nice feature would be if it were possible to heap allocate always based on the number of threads openblas started with - so that this was not a compile time decision (for the default Julia binaries we make available for download).

@martin-frbg
Collaborator

I can look into allocating the queue array on the heap as well when ALLOC_HEAP is set, but just choosing an unrealistically high number of threads and hoping to get away with it without increasing the stack limits imposed by the shell is asking for trouble IMHO.
BTW 0.3.17 already added a fallback that allocates space for another 512 threads on the heap when the compile-time NUM_THREADS runs out, so hopefully it is no longer necessary to build with a scary NUM_THREADS for distribution.

@ViralBShah
Contributor Author

ViralBShah commented Oct 12, 2021

Oh, that's interesting, and I suppose it would solve our problem. What NUM_THREADS would you recommend building with? It would be great to allocate more on the heap, especially when ALLOC_HEAP is set. Also, is it possible to use arg->nthreads and allocate for the number of threads OpenBLAS is actually using, instead of for MAX_CPU_NUMBER (which IIUC will be NUM_THREADS)?

@staticfloat Please note this point.

@martin-frbg
Collaborator

I have no idea of the range of hardware your software encounters, but I'd be surprised if you needed more than 512. Allocating based on the current arg->nthreads is a bit tricky, as some things probably have to be in place already by the point where that number becomes known. (Remember that the core of this is mostly 15-20 year old code with little original documentation and a poorly documented history from its early life as GotoBLAS.) Also, I do not currently have access to hardware with really large numbers of cores.

@ViralBShah
Contributor Author

I came to the same conclusion - that for now and the foreseeable future 512 is more than sufficient.

@ViralBShah
Contributor Author

ViralBShah commented Nov 1, 2021

We have now come down to 512 threads by default, and that is working reliably in most cases. However, there are still some challenges here: when starting a distributed job (multiple Julia processes) on the same node, each process tries to initialize OpenBLAS with the maximum number of threads, and we run into some kind of OS thread limit. For example, on aarch64 we get this error:

OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max

On distributed jobs we usually set the number of OpenBLAS threads to 1, but that happens after all the OpenBLAS library initialization and buffer allocation, which is perhaps a bit too late in the process.

@martin-frbg
Collaborator

It should be possible to use smaller default now and have OpenBLAS allocate auxiliary buffer structures in case of overflow. (Maybe the number of buffers to add could be made into a build-time parameter, right now it is fixed at 512)

@ViralBShah
Contributor Author

Does OpenBLAS allocate these auxiliary buffers automatically? If so, can we move down from something like 512 to 16 threads? Or even fewer?

@martin-frbg
Collaborator

Yes it does, but the number of auxiliary buffers is currently fixed at 512, though it should be trivial to make that configurable at build time. Also, there is just one round of expansion.

@ViralBShah
Contributor Author

ViralBShah commented Jan 24, 2024

@martin-frbg Thanks for your comment. Is that different from setting NUM_THREADS or does NUM_THREADS still represent the maximum number of threads openblas can use?

It would be nice to reduce the default number of auxiliary buffers if they allocate significant memory, or at least to make it configurable.

@giordano
Contributor

@martin-frbg based on quick experiments with the configuration `make DYNAMIC_ARCH=1 LIBPREFIX=libopenblas64_ INTERFACE64=1 SYMBOLSUFFIX=64_ NUM_THREADS=16 -j36`, the Makefile variable NUM_THREADS appears to still set the maximum number of threads one can possibly have:

```julia
julia> using OpenBLAS_jll, LinearAlgebra

julia> strip(unsafe_string(ccall((BLAS.@blasfunc(openblas_get_config), libopenblas), Ptr{UInt8}, () )))
"OpenBLAS 0.3.26.dev  USE64BITINT DYNAMIC_ARCH NO_AFFINITY neoversev1 MAX_THREADS=16"

julia> BLAS.get_num_threads()
16

julia> BLAS.set_num_threads(72)

julia> BLAS.get_num_threads()
16

julia> BLAS.set_num_threads(8)

julia> BLAS.get_num_threads()
8
```

Even though this system has 72 threads, set_num_threads refuses to set a number of threads larger than NUM_THREADS, which was 16 at compile time. Is that accurate? @ViralBShah if so, I don't think we want to reduce NUM_THREADS in our builds.
