Thread safety issue with dgemm and >2 cores #1851
After changing …
OK, I think I get the story of this bug.

The way the main thread looks for idle threads is to cycle through them, and assign one task from the list to any worker that is found idle. When there is only one main thread, all worker threads are certain to be idle, which means the workers are always assigned in the same order. With multiple main threads this is no longer the case: the main thread will busy-wait for workers, assigning the second task on the list to whichever becomes idle the earliest (the first task is left for the main thread). So the correspondence between tasks and workers becomes essentially random.

Now, each task on the … So here comes the punchline. The new value of … This is where another main thread comes in with a new invocation of …
Also, I just found another way to arrive at the same bug: among the first batch of tasks with …
As a short-term measure, building with USE_SIMPLE_THREADED_LEVEL3=1 can be used to disable the additional splitting in the driver routine. This appears to be sufficient to "fix" the problem on a six-core CPU.
If I read your analysis correctly, it might be possible to get around the problem by ensuring that each main thread only assigns tasks to its own set of worker threads (which I think should be doable by keeping track of thread IDs, perhaps simply by storing the ID of the master thread in the thread_status struct)?
@martin-frbg That's correct (I assume you are proposing to spawn a new set of worker threads when you think you are the master thread, but discover that your own thread ID differs from the stored master thread ID). Although this may cause a new problem, namely spawning too many threads, it is still better than silently giving incorrect results.
Yes, I guess the only alternative would be to block until the other caller has finished, which does not look sensible. Unfortunately I have not figured out quite where to store and compare the master thread ID (perhaps repurpose the blas_server_avail flag?), nor how to avoid stepping on the existing thread structure (probably by calling goto_set_num_threads). This is getting confusing rather quickly...
Serialisation of calls would be the task of the callers... We cannot detect, across all the OpenBLAS build configurations, that a caller is burning cores with some other/different threading mechanism... The N**2 oversubscription is expected to get worse and worse over time.
Unfortunately, after enabling secondary master threads to add their own workers to the pool, I am currently stuck with what appear to be new deadlocks. Perhaps it would make sense to make USE_SIMPLE_THREADED_LEVEL3=1 the default in 0.3.4.
If a secondary master thread is spawning its own workers, then those should probably be in their own pool. Synchronization objects like …

On the other hand, the "block until the other caller has finished" idea may not be as unreasonable as it sounds. From my understanding, once OpenBLAS decides to use multithreading for a matrix operation, it will always use all the worker threads and the master thread evenly (except for very particular matrix shapes). If …

For a concrete example, suppose a computer has 4 cores, there are two master threads A and B, and they share the same three worker threads X, Y, and Z. This idea essentially says that at any given time, either A, X, Y, Z are working together, or B, X, Y, Z are working together. I don't see any major problem with that.
Trying to serialize with a mutex in the level3 thread dispatcher now, comments welcome. I have excluded OpenMP for now as the introduction of the NUM_PARALLEL option in #1536 supposedly provides a workaround for that already. |
You mean it will serialize multiple parallel calls to ensure cores are not wildly oversubscribed? Shouldn't it be wrapped in …?
I do not think a non-SMP build would call level3_thread() at all, would it?
It will not, indeed.... |
Should be fixed by #1875, though we may want to revisit this at some point, as parallel calls would probably improve performance if we can somehow get each "instance" of OpenBLAS to operate on its own subset of the available cores.
@bbbbbbbbba Thank you for your analysis of this issue; it was very helpful.
On my test computer with 8 cores, the following test snippet fails as long as OPENBLAS_NUM_THREADS is set to something >= 3 (or left unset): …

The version of OpenBLAS I am using is libopenblasp-r0-8dca6697.3.0.dev.so (bundled with numpy 1.15.3).

Originally posted by @bbbbbbbbba in #1844 (comment)