
Thread safety issue with dgemm and >2 cores #1851

Closed
bbbbbbbbba opened this issue Nov 3, 2018 · 16 comments

@bbbbbbbbba

On my test computer with 8 cores, the following test snippet fails as long as OPENBLAS_NUM_THREADS is set to something >= 3 (or left unset):

import numpy as np
import threading

def do_test(idx):
    for i in range(100 * idx, 100 * idx + 100):
        a = np.ones((1 << 20, 2)) * i
        b = np.eye(2)
        mat_0 = a.dot(b).reshape(-1)
        mat_1 = a.dot(b).reshape(-1)
        if np.any(mat_0 != mat_1):
            indices = np.nonzero(mat_0 != mat_1)[0]
            if len(indices) > 100: print('Error %d %d' % (i, len(indices)))
            else: print('Error %d %s' % (i, [(j, mat_0[j], mat_1[j]) for j in indices]))

for idx in range(8):
    threading.Thread(target=do_test, args=(idx,)).start()

The version of OpenBLAS I am using is libopenblasp-r0-8dca6697.3.0.dev.so (bundled with numpy 1.15.3).

Originally posted by @bbbbbbbbba in #1844 (comment)

@bbbbbbbbba
Author

After changing np.ones((1 << 20, 2)) to np.ones((100000, 2)), the bug only happens with OPENBLAS_NUM_THREADS set to 3, 4, 5, or 6. Debugging suggests that for my installation GEMM_R is set to 15856, which would mean the bug only happens when the column-splitting loop in gemm_driver is executed more than once.
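As a sanity check on that claim, here is a purely illustrative calculation (assuming the threaded dgemm call sees n = 100000 columns, and using the observed GEMM_R = 15856; this is not OpenBLAS code):

/* Illustrative only: estimate how many batches of tasks gemm_driver would create
 * for n = 100000 columns, assuming each batch covers up to GEMM_R * nthreads columns. */
#include <stdio.h>

int main(void) {
    const int n = 100000;      /* columns seen by the threaded dgemm call (assumed) */
    const int gemm_r = 15856;  /* per-task column block size (observed) */
    for (int nthreads = 3; nthreads <= 8; nthreads++) {
        int per_batch = gemm_r * nthreads;
        int batches = (n + per_batch - 1) / per_batch;  /* ceil(n / per_batch) */
        printf("nthreads = %d -> %d batch(es)\n", nthreads, batches);
    }
    return 0;
}

With these numbers the loop runs more than once only for nthreads up to 6 (6 * 15856 = 95136 < 100000, while 7 * 15856 = 110992 >= 100000), which matches the observation that the bug appears for OPENBLAS_NUM_THREADS between 3 and 6.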

@bbbbbbbbba
Author

OK, I think I get the story of this bug.

With OPENBLAS_NUM_THREADS set to nthreads, every operation large enough for OpenBLAS multithreading is split into tasks, which come in batches of nthreads (or smaller, but that doesn't come up here). The main thread dispatches nthreads - 1 of those tasks to worker threads (there are nthreads - 1 of them, and they are shared between all main threads; the main thread holds a global exec_queue_lock to prevent anyone else from messing with task assignment) and leaves one task for itself. Then it waits for each worker thread's result before going on.

The way the main thread looks for idle threads is to cycle through them and assign the next task on the list to any worker it finds idle. When there is only one main thread, all worker threads are certain to be idle, which means the workers are always assigned in the same order. With multiple main threads this is no longer the case: the main thread busy-waits for workers, assigning the second task on the list to whichever becomes idle earliest (the first task is left for the main thread itself). So the correspondence between tasks and workers becomes essentially random.
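For illustration, the dispatch described above behaves roughly like this toy model (worker_is_idle, give_task and NUM_WORKERS are placeholders, not the real blas_server.c identifiers):

/* Toy model of the dispatch behaviour described above; not OpenBLAS source. */
typedef struct task { struct task *next; } task_t;

extern int  worker_is_idle(int i);        /* placeholder: is worker i waiting for work? */
extern void give_task(int i, task_t *t);  /* placeholder: hand entry t to worker i      */
#define NUM_WORKERS 7                     /* nthreads - 1 workers, shared by all callers */

static void dispatch(task_t *queue, int ntasks) {
    task_t *t = queue->next;              /* the first entry is kept for the caller itself */
    int assigned = 1;
    while (assigned < ntasks) {           /* busy-wait until every entry has a worker */
        for (int i = 0; i < NUM_WORKERS && assigned < ntasks; i++) {
            if (worker_is_idle(i)) {      /* whichever worker is found idle first ... */
                give_task(i, t);          /* ... gets the next entry on the list      */
                t = t->next;
                assigned++;
            }
        }
    }
    /* With one caller all workers are idle on entry, so the task -> worker mapping is
     * deterministic; with several callers competing, it is essentially random. */
}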

Now, gemm_driver does something special. Due to the way gemm uses buffers, it needs to split the matrix B vertically into blocks, so that each task is responsible for buffering at most GEMM_R columns. When n > GEMM_R * nthreads, this means multiple batches of tasks are created. However, those batches all reuse the same task list, queue; only range_N is changed for each batch. So the same queue is passed to exec_blas multiple times.

Each task on the queue, except the first one, is eventually received by a worker thread. Before executing the task, the worker thread first makes sure that the buffers sa and sb are not NULL; they will be NULL the first time, since gemm_driver only prepares them for the first task on the queue. If they are NULL, they are set to point into a buffer specific to that worker thread.

So here comes the punchline. The new value of sb is not only passed to the inner routine, but also written back into the queue entry struct. Of course, as mentioned before, that queue entry has a high chance of being assigned to another worker thread in a later iteration of the loop. Now the second worker thread is using the first worker thread's thread-specific buffer.
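Roughly, the worker-side setup described in the previous two paragraphs can be paraphrased like this (the struct and names are illustrative stand-ins, not the literal blas_thread_server source):

/* Paraphrase of the worker-side buffer setup described above; identifiers are illustrative. */
typedef struct {
    void *sa;
    void *sb;
} queue_entry_sketch_t;

static void prepare_buffers(queue_entry_sketch_t *queue, char *my_buffer, long sb_offset) {
    if (queue->sa == NULL)
        queue->sa = my_buffer;              /* this worker's private scratch area     */
    if (queue->sb == NULL)
        queue->sb = my_buffer + sb_offset;  /* also carved out of the private buffer  */
    /* Both pointers are written back into the shared queue entry, so when gemm_driver
     * reuses the same entry for a later batch, whichever worker picks it up will run
     * with *this* worker's buffer. */
}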

This is where another main thread comes in with a new invocation of gemm_driver, finds the first worker thread idle, and gives it a task with sb = NULL. Now two tasks running concurrently are sharing the same buffer. The result isn't pretty.

@bbbbbbbbba
Author

Also, I just found another way to arrive at the same bug: among the first batch of tasks with sb = NULL, it is entirely possible for two to be assigned to the same worker thread. For that to happen, a worker thread needs to begin running immediately after being assigned one task and finish in time for the same main thread, still looking for an idle worker, to see it idle again (this requires at least one other worker to be tied up by some other main thread, so it still won't happen with a single main thread). Of course, those two tasks are by definition not run concurrently, so they themselves are safe, but their successors, which inherit their sb, are not.

@martin-frbg
Collaborator

As a short-term measure, building with USE_SIMPLE_THREADED_LEVEL3=1 disables the additional splitting in the driver routine. This appears to be sufficient to "fix" the problem on a six-core cpu.
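For anyone who wants to try this workaround, the flag is passed at build time, e.g.:

make USE_SIMPLE_THREADED_LEVEL3=1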

@martin-frbg
Collaborator

If I read your analysis correctly, it might be possible to get around the problem by ensuring that each main thread will only assign tasks to its own set of worker threads (which I think should be doable by keeping track of thread IDs - perhaps simply by storing the ID of the master thread in the thread_status struct)?
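Purely as a sketch of what that bookkeeping could look like (the owner field and the helper below are hypothetical additions, not existing OpenBLAS code):

/* Hypothetical sketch of per-worker ownership tracking; the "owner" field does not
 * exist in the real thread_status struct. */
#include <pthread.h>

typedef struct {
    volatile int status;   /* idle / queued / running, as in the existing server */
    pthread_t    owner;    /* hypothetical: ID of the master thread using this worker */
} thread_status_sketch_t;

/* A master would only hand tasks to workers it owns, spawning or claiming new ones otherwise. */
static int may_use_worker(thread_status_sketch_t *ts, pthread_t self) {
    return pthread_equal(ts->owner, self);
}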

@bbbbbbbbba
Author

@martin-frbg That's correct (I assume you are proposing to spawn a new set of worker threads when you think you are the master thread, but discover that your own thread ID is different from the stored master thread ID). Although this may cause a new problem --- namely, spawning too many threads --- it is still better than silently giving incorrect results.

@martin-frbg
Collaborator

Yes, I guess the only alternative would be to block until the other caller has finished, which does not look sensible. Unfortunately I have not yet figured out quite where to store and compare the master thread ID (perhaps repurpose the blas_server_avail flag?) or how to avoid stepping on the existing thread structure (probably by calling goto_set_num_threads). This is getting confusing rather quickly...

@brada4
Contributor

brada4 commented Nov 14, 2018

Serialisation of calls would be the task of the callers... We cannot detect, in every OpenBLAS build configuration, that a caller is already burning cores with some other/different threading mechanism... The N**2 oversubscription is expected to get worse and worse over time.

@martin-frbg
Collaborator

martin-frbg commented Nov 15, 2018

Unfortunately, after enabling secondary master threads to add their own workers to the pool, I am currently stuck with what appear to be new deadlocks. Perhaps it would make sense to make USE_SIMPLE_THREADED_LEVEL3=1 the default in 0.3.4.
(I notice that helgrind is also flagging conflicting read/write and even write/write accesses by dgemm_oncopy, so it could be that a lot more needs to be duplicated than just the worker threads - unless this is another unrelated bug in the level3 code - but I am running out of spare time again now)

@bbbbbbbbba
Author

If a secondary master thread is spawning its own workers, then those probably should be in its own pool. Synchronization objects like server_lock and exec_queue_lock (and maybe others) may need to be duplicated, although I am not sure if they are even necessary if we don't let the master threads actually share a worker pool. The "master thread id" field might need to be protected by a new lock, or otherwise made atomic.

On the other hand, the "block until the other caller has finished" idea may not be as unreasonable as it sounds. From my understanding, once OpenBLAS decides to use multithreading for a matrix operation, it will always use all the worker threads and the master thread evenly (except for very particular matrix shapes). If OPENBLAS_NUM_THREADS is set to the total number of CPU cores in the system (as it is by default), then allowing different master threads to run concurrently may not actually increase the parallelism.

For a concrete example, suppose a computer has 4 cores, there are two master threads A and B, and they share the same three worker threads X, Y, and Z. This idea is essentially saying, at any given time, either A, X, Y, Z are working together, or B, X, Y, Z are working together. I don't see any major problem with that.

@martin-frbg
Collaborator

Trying to serialize with a mutex in the level3 thread dispatcher now, comments welcome. I have excluded OpenMP for now as the introduction of the NUM_PARALLEL option in #1536 supposedly provides a workaround for that already.
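In spirit, the serialization amounts to something like the following around the multithreaded level3 dispatch (a sketch only; the actual lock name and placement may differ from what ends up in the code):

/* Sketch of serializing concurrent multithreaded level3 calls with one global mutex;
 * names are illustrative, not necessarily those used in the actual change. */
#include <pthread.h>

static pthread_mutex_t level3_lock = PTHREAD_MUTEX_INITIALIZER;

static int gemm_driver_serialized(void) {
    pthread_mutex_lock(&level3_lock);    /* a second caller blocks here until the first
                                            caller's batches have all been dispatched  */
    /* ... split the work, fill the queue, call exec_blas ... */
    pthread_mutex_unlock(&level3_lock);
    return 0;
}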

@brada4
Contributor

brada4 commented Nov 19, 2018

You mean it will serialize multiple parallel calls to ensure cores are not wildly oversubscribed? Shouldn't it be wrapped in #ifdef SMP too? Some may use the single-threaded build in parallel (still subject to other possible thread safety issues, though).

@martin-frbg
Collaborator

I do not think a non-SMP build would call level3_thread() at all, would it?

@brada4
Contributor

brada4 commented Nov 19, 2018

It will not, indeed....

martin-frbg added this to the 0.3.4 milestone Nov 25, 2018
@martin-frbg
Collaborator

Should be fixed by #1875, though we may want to revisit this at some point as parallel calls would probably improve performance if we can somehow get each "instance" of OpenBLAS to operate on its own subset of available cores.

@embray
Contributor

embray commented Feb 1, 2019

@bbbbbbbbba Thank you for your analysis of this issue--it was very helpful.
