Issues when calling cblas_sgemm from multiple threads #2126
Comments
Which CPU and operating system is this? I guess you could try compiling OpenBLAS with USE_SIMPLE_THREADED_LEVEL3=1, and/or see how the test case from #1755 behaves on your system. If your OS is Linux, you could also try running your code under valgrind to see if it flags any thread races (unless you are using OpenMP, which confuses the tool) or invalid memory accesses. |
Just going by your username, if you are building on ARMV8 with CMAKE this could also be something similar to #1870 (although that one should be fixed as well) |
The error occurs on x86_64 running Ubuntu. |
That should at least exclude #1870. At some point it could be useful to know the CPU model, as the choice of optimized gemm kernel depends on it. Have you checked that your program behaves correctly with the reference BLAS (or any other, like ATLAS), just to exclude some subtle problem elsewhere in your application? |
CPU model: Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz. I have tested with BLIS and MKL and there is no error. With the sample code from #1755 (adding the missing * so that it builds) and without the printf('.'), I get errors. Note that I get no errors when the printf('.') is present and adds a delay. With the sample code I get errors both with and without USE_SIMPLE_THREADED_LEVEL3 and USE_TLS. Valgrind reports no errors, and the test does not fail under it because valgrind adds a big delay. |
Hmm. 5820 is regular HASWELL kernel. At least we seem to have a simple reproducer now thanks to #1755 (pity about the printf actually hiding the problem). valgrind (with its default tool memcheck) at least suggests that there are no out-of-bounds accesses, next would be to run it with tool=helgrind to do an even slower check for race conditions. (Not saying that the burden is on you, just describing a workflow.) |
Valgrind with tool=helgrind lists multiple possible race conditions when running the sample code. Here is one:
|
#1851 could also be related (although either its fix or USE_SIMPLE_THREADED_LEVEL3 should both have taken care of this) |
Interestingly I cannot seem to reproduce the problem with the (corrected) test case from issue 1755 on my i7-8700K (also a hexacore, same HASWELL target as far as OpenBLAS is concerned). |
Yes I'm using the version 0.3.6. Can you reproduce the issue with the sample code without DEBUG=1? Timings seem important to reproduce the issue. |
That occurred to me as well, but I still cannot reproduce the issue even without DEBUG=1. Perhaps the compiler version plays a role as well? |
I used this compiler to build the sample code. g++ --version |
7.2.1 on my system, but if anything I would have expected compiler-related problems to show up with gcc 8 or 9 (whose more aggressive reuse of registers has uncovered a few bugs in the assembly kernels in the past). This is now beginning to look more like a hardware or firmware issue again. |
Have you compiled with USE_THREAD OFF and NUM_THREADS 0 as I did ? |
Good point, seems I got fixated on #1851 (and now I am not even sure if I had noticed the USE_THREAD=0 in #1755, based on the comments I made back then...). 🙁 |
Same issue with USE_THREAD=ON and NUM_THREADS=1 |
To be sure, here is how I changed the sample code from #1755 to build and reproduce the issue. Is it the same as yours?
|
Yep, reproduced with USE_THREAD=0. It can be mitigated ("BAD" result gone, but lots of helgrind warnings remaining) by removing the "#if defined(SMP)" from a bunch of locking calls between lines 2700 and 2800 of memory.c. Will see later if helgrind goes quiet when I remove the remaining "SMP" clauses, but there are also some potential conflicts between sgemm kernels that suggest the level3
Should be fixed by #2136 , which introduces a new option USE_LOCKING to be used in conjunction with USE_THREAD=0 for creating a single-threaded but thread-aware OpenBLAS. This avoids the (probable) performance penalty caused by the locking calls in use cases where OpenBLAS is not expected to be called from concurrent threads. |
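To illustrate why locking matters even in a build without internal threading, here is a minimal sketch of a shared scratch-buffer pool whose lock calls can be compiled in or out. It is not the code from memory.c: the pool, `TOY_USE_LOCKING`, `pool_acquire`, `pool_release` and `pool_stress` are all hypothetical names made up for this example. With the lock compiled out, two caller threads can be handed the same slot, which is the kind of corruption described in this issue.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define POOL_SLOTS 4
#define SLOT_LEN   256

/* Hypothetical switch for this sketch: 1 compiles the lock calls in,
 * 0 compiles them out (single-threaded callers only). */
#ifndef TOY_USE_LOCKING
#define TOY_USE_LOCKING 1
#endif

#if TOY_USE_LOCKING
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
#define POOL_LOCK()   pthread_mutex_lock(&pool_lock)
#define POOL_UNLOCK() pthread_mutex_unlock(&pool_lock)
#else
#define POOL_LOCK()   ((void)0)
#define POOL_UNLOCK() ((void)0)
#endif

static int   slot_used[POOL_SLOTS];
static float slot_data[POOL_SLOTS][SLOT_LEN];

/* Hand out a free scratch buffer, or NULL if every slot is taken. */
float *pool_acquire(void)
{
    float *buf = NULL;
    POOL_LOCK();
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (!slot_used[i]) {
            slot_used[i] = 1;
            buf = slot_data[i];
            break;
        }
    }
    POOL_UNLOCK();
    return buf;
}

/* Mark the slot that owns buf as free again. */
void pool_release(float *buf)
{
    POOL_LOCK();
    for (int i = 0; i < POOL_SLOTS; i++)
        if (slot_data[i] == buf)
            slot_used[i] = 0;
    POOL_UNLOCK();
}

struct stress_args { float tag; int iterations; int corrupted; };

/* Fill an acquired buffer with this thread's tag and check nobody overwrote it. */
static void *stress_worker(void *p)
{
    struct stress_args *s = p;
    for (int it = 0; it < s->iterations; it++) {
        float *buf = pool_acquire();
        if (buf == NULL)
            continue;               /* pool exhausted, retry on next iteration */
        for (int i = 0; i < SLOT_LEN; i++)
            buf[i] = s->tag;
        for (int i = 0; i < SLOT_LEN; i++)
            if (buf[i] != s->tag) { s->corrupted++; break; }
        pool_release(buf);
    }
    return NULL;
}

/* Run num_threads workers and return how many corrupted buffers were seen. */
int pool_stress(int num_threads, int iterations)
{
    pthread_t tid[8];
    struct stress_args args[8];
    int total = 0;
    assert(num_threads >= 1 && num_threads <= 8);
    for (int t = 0; t < num_threads; t++) {
        args[t] = (struct stress_args){ (float)(t + 1), iterations, 0 };
        int rc = pthread_create(&tid[t], NULL, stress_worker, &args[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < num_threads; t++) {
        pthread_join(tid[t], NULL);
        total += args[t].corrupted;
    }
    return total;
}
```

Calling `pool_stress(4, 2000)` with the lock compiled in should report no corrupted buffers. Building with `-DTOY_USE_LOCKING=0` removes the protection and makes the pool unsafe for concurrent callers, which is fine only when the library is never called from more than one thread at a time.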
Thanks for your support, it fixes my issue |
When calling cblas_sgemm in parallel from different threads, I sometimes get wrong output.
In my application I have two threads and use OpenBLAS with threading disabled (USE_THREAD OFF, NUM_THREADS 0).
I can easily test it by running the same gemm twice (in the same thread) and verifying that the results are the same.
Sometimes the results differ when multiple threads work on different data. With a single thread the comparison succeeds.
The problem looks similar to #1755, but it still occurs on version 0.3.6, with and without USE_TLS.
The difference is huge, not a minor rounding error.
I have not managed to reproduce it yet with a simple example.
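The check described above (run the same gemm twice in one thread and compare, while other threads do the same on their own data) could be sketched roughly as follows. This is an illustration rather than the application's code: `gemm_fn`, `naive_gemm` and `count_mismatches` are hypothetical names, and a naive multiply stands in for the BLAS call so that the sketch builds without any library.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Routine under test: C = A * B for row-major n x n matrices.
 * Hypothetical hook; a real test would wrap cblas_sgemm here. */
typedef void (*gemm_fn)(int n, const float *a, const float *b, float *c);

/* Naive stand-in so that the sketch builds without a BLAS library. */
static void naive_gemm(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int k = 0; k < n; k++)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
}

/* Small per-thread generator, returns values in [0, 1). */
static float next_value(unsigned *state)
{
    *state = *state * 1664525u + 1013904223u;
    return (float)(*state >> 8) / 16777216.0f;
}

struct worker_args { gemm_fn gemm; int n; int iterations; unsigned seed; int mismatches; };

/* Each thread owns its data, runs the same product twice and compares bitwise. */
static void *worker(void *p)
{
    struct worker_args *w = p;
    size_t count = (size_t)w->n * (size_t)w->n;
    float *a  = malloc(count * sizeof *a);
    float *b  = malloc(count * sizeof *b);
    float *c1 = malloc(count * sizeof *c1);
    float *c2 = malloc(count * sizeof *c2);
    assert(a && b && c1 && c2);
    unsigned state = w->seed;
    for (size_t i = 0; i < count; i++) {
        a[i] = next_value(&state);
        b[i] = next_value(&state);
    }
    for (int it = 0; it < w->iterations; it++) {
        w->gemm(w->n, a, b, c1);
        w->gemm(w->n, a, b, c2);
        if (memcmp(c1, c2, count * sizeof *c1) != 0)
            w->mismatches++;
    }
    free(a); free(b); free(c1); free(c2);
    return NULL;
}

/* Run the check on num_threads threads and return the total mismatch count. */
int count_mismatches(gemm_fn gemm, int num_threads, int n, int iterations)
{
    pthread_t tid[16];
    struct worker_args args[16];
    int total = 0;
    assert(num_threads >= 1 && num_threads <= 16);
    for (int t = 0; t < num_threads; t++) {
        args[t] = (struct worker_args){ gemm, n, iterations, 1234u + (unsigned)t, 0 };
        int rc = pthread_create(&tid[t], NULL, worker, &args[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < num_threads; t++) {
        pthread_join(tid[t], NULL);
        total += args[t].mismatches;
    }
    return total;
}
```

To exercise OpenBLAS, `naive_gemm` would be replaced by a small wrapper that calls `cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n, 1.0f, a, n, b, n, 0.0f, c, n)`, and the program linked against the library. A non-zero return from `count_mismatches(wrapper, 2, n, iterations)` would then correspond to the wrong results reported here. The matrix size and iteration count needed to trigger the problem are not known and would have to be found by experiment.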