Stack overflow when compiled with 4096 NUM_THREADS #3403
Comments
4096? Why would you even want to do that? Do you have (or know of) any hardware with that many cores? Presumably you would also need to raise the default stack size limit (ulimit) that your operating system imposes.
We used to keep it conservative and had to bump it up repeatedly. So we decided to just make it really big and not have to worry about it. A really nice feature would be the ability to always heap allocate based on the number of threads OpenBLAS started with, so that this is not a compile-time decision (for the default Julia binaries we make available for download).
I can look into allocating the queue array on the heap as well when ALLOC_HEAP is set, but choosing an unrealistically high number of threads and hoping to get away with it, without increasing the stack limits imposed by the shell, is asking for trouble IMHO.
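To make the stack-versus-heap point above concrete, here is a minimal C sketch. It is not OpenBLAS's actual code: `queue_entry_t`, `SKETCH_MAX_THREADS`, `queue_bytes` and `alloc_queue` are hypothetical names, and the entry layout and size are made up for illustration. The idea is only that an array sized by a compile-time maximum costs the same stack space on every call, while a heap allocation can be sized by the thread count actually in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-thread work-queue entry.
   The real OpenBLAS structure is different. */
typedef struct {
    void *args;
    long  range[2];
    char  scratch[256];   /* made-up payload size, for illustration only */
} queue_entry_t;

/* Hypothetical compile-time limit, mirroring a NUM_THREADS=4096 build. */
#define SKETCH_MAX_THREADS 4096

/* Number of bytes an array of `nthreads` queue entries occupies.
   A local `queue_entry_t queue[SKETCH_MAX_THREADS]` would take
   queue_bytes(SKETCH_MAX_THREADS) bytes of stack in every call. */
size_t queue_bytes(size_t nthreads) {
    return nthreads * sizeof(queue_entry_t);
}

/* Heap-allocate a zeroed queue sized by the runtime thread count.
   Returns NULL on allocation failure; the caller must free() it. */
queue_entry_t *alloc_queue(size_t nthreads) {
    assert(nthreads > 0);
    return calloc(nthreads, sizeof(queue_entry_t));
}
```

With these made-up sizes, the compile-time array needs more than 1 MiB per call frame, whereas `alloc_queue(8)` on an 8-thread machine requests only a few KiB and never touches the shell's stack limit.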
Oh, that's interesting, and I suppose it would solve our problem. So what value of NUM_THREADS would you recommend building with? It would be great to allocate more on the heap, especially when ALLOC_HEAP is set. Also, is it possible to use @staticfloat Please note this point.
I have no idea of the range of hardware your software encounters, but I'd be surprised if you needed more than 512. Allocating based on the current
I came to the same conclusion: for now and the foreseeable future, 512 is more than sufficient.
We have now come down to 512 threads by default, and that is working reliably in most cases. However, there are still some challenges here: when starting a distributed job (multiple Julia processes) on the same node, each process tries to initialize OpenBLAS with the maximum number of threads, and the OS hits some kind of thread limit. For example, on
On distributed jobs, we usually set the number of OpenBLAS threads to
It should be possible to use a smaller default now and have OpenBLAS allocate auxiliary buffer structures in case of overflow. (Maybe the number of buffers to add could be made into a build-time parameter; right now it is fixed at 512.)
Does OpenBLAS allocate these auxiliary buffers automatically? If so, can we move down from something like 512 to 16 threads? Or even fewer?
Yes, it does, but the number of auxiliary buffers is currently fixed at 512, though it should be trivial to make that configurable at build time. Also, there is just one round of expansion.
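The behaviour described in the last few comments (a small fixed table, plus one block of auxiliary slots allocated on first overflow, with no further growth) can be sketched in C as follows. This is an illustration of the general technique under stated assumptions, not OpenBLAS's implementation: `slot_pool_t`, `pool_acquire`, `pool_release`, `BASE_SLOTS` and `AUX_SLOTS` are all hypothetical names, and the sizes are kept tiny here (the thread says the auxiliary count is currently fixed at 512).

```c
#include <stdlib.h>

#define BASE_SLOTS 4   /* hypothetical small compile-time default */
#define AUX_SLOTS  8   /* hypothetical size of the single expansion block */

typedef struct {
    int  base_used[BASE_SLOTS];  /* statically sized table */
    int *aux_used;               /* NULL until the first overflow */
} slot_pool_t;

/* Claim a free slot. Returns its index, or -1 if the base table and the
   one-time auxiliary block are both full (or the expansion failed). */
int pool_acquire(slot_pool_t *p) {
    for (int i = 0; i < BASE_SLOTS; i++) {
        if (!p->base_used[i]) { p->base_used[i] = 1; return i; }
    }
    if (p->aux_used == NULL) {
        /* First overflow: allocate the auxiliary block exactly once. */
        p->aux_used = calloc(AUX_SLOTS, sizeof(int));
        if (p->aux_used == NULL) return -1;
    }
    for (int i = 0; i < AUX_SLOTS; i++) {
        if (!p->aux_used[i]) { p->aux_used[i] = 1; return BASE_SLOTS + i; }
    }
    return -1;  /* only one round of expansion: no further growth */
}

/* Mark a previously acquired slot as free again. */
void pool_release(slot_pool_t *p, int slot) {
    if (slot < 0) return;
    if (slot < BASE_SLOTS) {
        p->base_used[slot] = 0;
    } else if (p->aux_used != NULL && slot < BASE_SLOTS + AUX_SLOTS) {
        p->aux_used[slot - BASE_SLOTS] = 0;
    }
}
```

In this sketch a process that never exceeds `BASE_SLOTS` pays nothing for the auxiliary block, which is the motivation for building with a smaller default. Making `AUX_SLOTS` a build-time parameter would correspond to the configurability suggested above.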
@martin-frbg Thanks for your comment. Is that different from setting It would be nice to reduce the default number of auxiliary buffers if they allocate significant memory, or at least to make it configurable.
@martin-frbg based on quick experiments with this configuration
Even if on this system I have 72 threads,
We decided to set a very high NUM_THREADS for building OpenBLAS: 4096. That leads to a stack overflow in getrf_parallel. I believe this setting should have led to ALLOC_HEAP being used, but perhaps 4096 is still too large for all the other buffers getting stack allocated.
JuliaLang/LinearAlgebra.jl#877