
Thread safety issue with dgemm and >2 cores #1851

Closed
bbbbbbbbba opened this issue Nov 3, 2018 · 16 comments

@bbbbbbbbba

On my test computer with 8 cores, the following test snippet fails as long as OPENBLAS_NUM_THREADS is set to something >= 3 (or left unset):

import numpy as np
import threading

def do_test(idx):
    for i in range(100 * idx, 100 * idx + 100):
        a = np.ones((1 << 20, 2)) * i
        b = np.eye(2)
        mat_0 = a.dot(b).reshape(-1)
        mat_1 = a.dot(b).reshape(-1)
        if np.any(mat_0 != mat_1):
            indices = np.nonzero(mat_0 != mat_1)[0]
            if len(indices) > 100: print('Error %d %d' % (i, len(indices)))
            else: print('Error %d %s' % (i, [(j, mat_0[j], mat_1[j]) for j in indices]))

for idx in range(8):
    threading.Thread(target=do_test, args=(idx,)).start()

The version of OpenBLAS I am using is libopenblasp-r0-8dca6697.3.0.dev.so (bundled with numpy 1.15.3).

Originally posted by @bbbbbbbbba in #1844 (comment)

@bbbbbbbbba
Author

After changing np.ones((1 << 20, 2)) to np.ones((100000, 2)), the bug only happens with OPENBLAS_NUM_THREADS set to 3, 4, 5, or 6. Debugging suggests that for my installation GEMM_R is set to 15856, which would mean the bug only happens when the column-splitting loop in gemm_driver is executed more than once.
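As a sanity check on that claim, here is a purely illustrative calculation (assuming the threaded dgemm call sees n = 100000 columns, and using the observed GEMM_R = 15856; this is not OpenBLAS code):

/* Illustrative only: estimate how many batches of tasks gemm_driver would create
 * for n = 100000 columns, assuming each batch covers up to GEMM_R * nthreads columns. */
#include <stdio.h>

int main(void) {
    const int n = 100000;      /* columns seen by the threaded dgemm call (assumed) */
    const int gemm_r = 15856;  /* per-task column block size (observed) */
    for (int nthreads = 3; nthreads <= 8; nthreads++) {
        int per_batch = gemm_r * nthreads;
        int batches = (n + per_batch - 1) / per_batch;  /* ceil(n / per_batch) */
        printf("nthreads = %d -> %d batch(es)\n", nthreads, batches);
    }
    return 0;
}

With these numbers the loop runs more than once only for nthreads up to 6 (6 * 15856 = 95136 < 100000, while 7 * 15856 = 110992 >= 100000), which matches the observation that the bug appears for OPENBLAS_NUM_THREADS between 3 and 6.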

@bbbbbbbbba
Author

OK, I think I get the story of this bug.

With OPENBLAS_NUM_THREADS set to nthreads, every operation large enough for OpenBLAS multithreading is split into tasks, which come in batches of nthreads (or smaller, but that doesn't come up here). The main thread dispatches nthreads - 1 of those tasks to worker threads (there are nthreads - 1 of them, and they are shared between all main threads; the main thread holds a global exec_queue_lock to prevent anyone else from messing with task assignment) and leaves one task for itself. Then it waits for each worker thread's result before going on.

The way the main thread looks for idle threads is to cycle through them and assign the next task on the list to any worker it finds idle. When there is only one main thread, all worker threads are certain to be idle, which means the workers are always assigned in the same order. With multiple main threads this is no longer the case: the main thread busy-waits for workers, assigning the second task on the list to whichever becomes idle earliest (the first task is left for the main thread itself). So the correspondence between tasks and workers becomes essentially random.
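For illustration, the dispatch described above behaves roughly like this toy model (worker_is_idle, give_task and NUM_WORKERS are placeholders, not the real blas_server.c identifiers):

/* Toy model of the dispatch behaviour described above; not OpenBLAS source. */
typedef struct task { struct task *next; } task_t;

extern int  worker_is_idle(int i);        /* placeholder: is worker i waiting for work? */
extern void give_task(int i, task_t *t);  /* placeholder: hand entry t to worker i      */
#define NUM_WORKERS 7                     /* nthreads - 1 workers, shared by all callers */

static void dispatch(task_t *queue, int ntasks) {
    task_t *t = queue->next;              /* the first entry is kept for the caller itself */
    int assigned = 1;
    while (assigned < ntasks) {           /* busy-wait until every entry has a worker */
        for (int i = 0; i < NUM_WORKERS && assigned < ntasks; i++) {
            if (worker_is_idle(i)) {      /* whichever worker is found idle first ... */
                give_task(i, t);          /* ... gets the next entry on the list      */
                t = t->next;
                assigned++;
            }
        }
    }
    /* With one caller all workers are idle on entry, so the task -> worker mapping is
     * deterministic; with several callers competing, it is essentially random. */
}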

Now, gemm_driver does something special. Due to the way gemm uses buffers, it needs to split the matrix B vertically into blocks, so that each task is responsible for buffering at most GEMM_R columns. When n > GEMM_R * nthreads, this means multiple batches of tasks are created. However, those batches all reuse the same task list, queue; only range_N is changed for each batch. So the same queue is passed to exec_blas multiple times.

Each task on the queue, except the first one, is eventually received by a worker thread. Before executing the task, the worker thread first makes sure that the buffers sa and sb are not NULL; they will be NULL the first time, since gemm_driver only prepares them for the first task on the queue. If they are NULL, they are set to point into a buffer specific to that worker thread.

So here comes the punchline. The new value of sb is not only passed to the inner routine, but also written back into the queue entry struct. Of course, as mentioned before, that queue entry has a high chance of being assigned to another worker thread in a later iteration of the loop. Now the second worker thread is using the first worker thread's thread-specific buffer.
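Roughly, the worker-side setup described in the previous two paragraphs can be paraphrased like this (the struct and names are illustrative stand-ins, not the literal blas_thread_server source):

/* Paraphrase of the worker-side buffer setup described above; identifiers are illustrative. */
typedef struct {
    void *sa;
    void *sb;
} queue_entry_sketch_t;

static void prepare_buffers(queue_entry_sketch_t *queue, char *my_buffer, long sb_offset) {
    if (queue->sa == NULL)
        queue->sa = my_buffer;              /* this worker's private scratch area     */
    if (queue->sb == NULL)
        queue->sb = my_buffer + sb_offset;  /* also carved out of the private buffer  */
    /* Both pointers are written back into the shared queue entry, so when gemm_driver
     * reuses the same entry for a later batch, whichever worker picks it up will run
     * with *this* worker's buffer. */
}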

This is where another main thread comes in with a new invocation of gemm_driver, finds the first worker thread idle, and gives it a task with sb = NULL. Now two tasks running concurrently are sharing the same buffer. The result isn't pretty.

@bbbbbbbbba
Author

Also, I just found another way to arrive at the same bug: among the first batch of tasks with sb = NULL, it is entirely possible for two to be assigned to the same worker thread. For that to happen, a worker thread needs to begin running immediately after being assigned one task and finish in time for the same main thread, still looking for an idle worker, to see it idle again (this requires at least one other worker to be tied up by some other main thread, so it still won't happen with a single main thread). Of course, those two tasks are by definition not run concurrently, so they themselves are safe, but their successors, which inherit their sb, are not.

@martin-frbg
Collaborator

As a short-term measure, building with USE_SIMPLE_THREADED_LEVEL3=1 disables the additional splitting in the driver routine. This appears to be sufficient to "fix" the problem on a six-core cpu.
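For anyone who wants to try this workaround, the flag is passed at build time, e.g.:

make USE_SIMPLE_THREADED_LEVEL3=1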

@martin-frbg
Collaborator

If I read your analysis correctly, it might be possible to get around the problem by ensuring that each main thread will only assign tasks to its own set of worker threads (which I think should be doable by keeping track of thread IDs - perhaps simply by storing the ID of the master thread in the thread_status struct)?
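Purely as a sketch of what that bookkeeping could look like (the owner field and the helper below are hypothetical additions, not existing OpenBLAS code):

/* Hypothetical sketch of per-worker ownership tracking; the "owner" field does not
 * exist in the real thread_status struct. */
#include <pthread.h>

typedef struct {
    volatile int status;   /* idle / queued / running, as in the existing server */
    pthread_t    owner;    /* hypothetical: ID of the master thread using this worker */
} thread_status_sketch_t;

/* A master would only hand tasks to workers it owns, spawning or claiming new ones otherwise. */
static int may_use_worker(thread_status_sketch_t *ts, pthread_t self) {
    return pthread_equal(ts->owner, self);
}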

@bbbbbbbbba
Author

@martin-frbg That's correct (I assume you are proposing to spawn a new set of worker threads when you think you are the master thread, but discover that your own thread ID is different from the stored master thread ID). Although this may cause a new problem --- namely, spawning too many threads --- it is still better than silently giving incorrect results.

@martin-frbg
Collaborator

Yes, I guess the only alternative would be to block until the other caller has finished, which does not look sensible. Unfortunately I have not yet figured out quite where to store and compare the master thread ID (perhaps repurpose the blas_server_avail flag?) or how to avoid stepping on the existing thread structure (probably by calling goto_set_num_threads). This is getting confusing rather quickly...

@brada4
Contributor

brada4 commented Nov 14, 2018

Serialisation of calls would be the task of the callers... We cannot detect, in every OpenBLAS build configuration, that a caller is already burning cores with some other/different threading mechanism... The N**2 oversubscription is expected to get worse and worse over time.

@martin-frbg
Collaborator

martin-frbg commented Nov 15, 2018

Unfortunately, after enabling secondary master threads to add their own workers to the pool, I am currently stuck with what appear to be new deadlocks. Perhaps it would make sense to make USE_SIMPLE_THREADED_LEVEL3=1 the default in 0.3.4.
(I notice that helgrind is also flagging conflicting read/write and even write/write accesses by dgemm_oncopy, so it could be that a lot more needs to be duplicated than just the worker threads - unless this is another unrelated bug in the level3 code - but I am running out of spare time again now)

@bbbbbbbbba
Author

If a secondary master thread is spawning its own workers, then those probably should be in its own pool. Synchronization objects like server_lock and exec_queue_lock (and maybe others) may need to be duplicated, although I am not sure if they are even necessary if we don't let the master threads actually share a worker pool. The "master thread id" field might need to be protected by a new lock, or otherwise made atomic.

On the other hand, the "block until the other caller has finished" idea may not be as unreasonable as it sounds. From my understanding, once OpenBLAS decides to use multithreading for a matrix operation, it will always use all the worker threads and the master thread evenly (except for very particular matrix shapes). If OPENBLAS_NUM_THREADS is set to the total number of CPU cores in the system (as it is by default), then allowing different master threads to run concurrently may not actually increase the parallelism.

For a concrete example, suppose a computer has 4 cores, there are two master threads A and B, and they share the same three worker threads X, Y, and Z. This idea is essentially saying, at any given time, either A, X, Y, Z are working together, or B, X, Y, Z are working together. I don't see any major problem with that.

@martin-frbg
Collaborator

Trying to serialize with a mutex in the level3 thread dispatcher now, comments welcome. I have excluded OpenMP for now as the introduction of the NUM_PARALLEL option in #1536 supposedly provides a workaround for that already.
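In spirit, the serialization amounts to something like the following around the multithreaded level3 dispatch (a sketch only; the actual lock name and placement may differ from what ends up in the code):

/* Sketch of serializing concurrent multithreaded level3 calls with one global mutex;
 * names are illustrative, not necessarily those used in the actual change. */
#include <pthread.h>

static pthread_mutex_t level3_lock = PTHREAD_MUTEX_INITIALIZER;

static int gemm_driver_serialized(void) {
    pthread_mutex_lock(&level3_lock);    /* a second caller blocks here until the first
                                            caller's batches have all been dispatched  */
    /* ... split the work, fill the queue, call exec_blas ... */
    pthread_mutex_unlock(&level3_lock);
    return 0;
}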

@brada4
Contributor

brada4 commented Nov 19, 2018

You mean it will serialize multiple parallel calls to ensure cores are not wildly oversubscribed? Shouldn't it be wrapped in #ifdef SMP too? Some may use the single-threaded build in parallel (still subject to other possible thread safety issues, though).

@martin-frbg
Collaborator

I do not think a non-SMP build would call level3_thread() at all, would it?

@brada4
Contributor

brada4 commented Nov 19, 2018

It will not, indeed....

martin-frbg added this to the 0.3.4 milestone Nov 25, 2018
@martin-frbg
Collaborator

Should be fixed by #1875, though we may want to revisit this at some point as parallel calls would probably improve performance if we can somehow get each "instance" of OpenBLAS to operate on its own subset of available cores.

@embray
Contributor

embray commented Feb 1, 2019

@bbbbbbbbba Thank you for your analysis of this issue--it was very helpful.
