Non-deterministic output with multiple OpenMP runtimes in the same process #2146
This is probably a variation of #1851 - sadly there is more than just a single global variable involved in this design limitation inherited from the original GotoBLAS. My crude workaround in #1875 left out (conventional) OpenMP on purpose, as that appeared to do a good enough job of avoiding conflicts. You could build OpenBLAS with USE_SIMPLE_THREADED_LEVEL3=1 to avoid using the problematic multithreaded GEMM path altogether, or try removing the
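For reference, the suggested workaround would look roughly like this when building from an OpenBLAS source checkout (USE_SIMPLE_THREADED_LEVEL3 and NUM_THREADS are real OpenBLAS Makefile flags; the thread count and install prefix here are illustrative):

```shell
# Sketch: build OpenBLAS without the multithreaded level-3 GEMM path,
# which relies on the shared queue structures implicated in this issue.
make clean
make USE_OPENMP=1 USE_SIMPLE_THREADED_LEVEL3=1 NUM_THREADS=40
make PREFIX=/opt/openblas-simple install   # adjust prefix as needed
```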
Thanks @martin-frbg for the information. Unfortunately I really do need these calls to run concurrently and with OpenMP for performance reasons, so serializing them inside OpenBLAS is not a reasonable option. Are there plans to eventually eliminate these global variables in OpenBLAS?
Depends on your definition of "eventually" - this project has only a small number of semi-regular contributors, so PRs are always welcome. Have you checked performance with USE_SIMPLE_THREADED_LEVEL3? It should still employ some parallelism.
There is just no way two OpenMP runtimes whose mutex functions are named the same could ever work. Your faulty setup is doomed to fail with or without OpenBLAS.
@brada4 Read my full post again more carefully. I'm not using one of the normal implementations of the OpenMP runtime. I have a special one that supports concurrent instances of the OpenMP runtime in the same process.
@martin-frbg I see calls into OpenMP being serialized as you describe, and the results are deterministic. Performance is 2X less than it should be, which for this particular workload is unacceptable. Pretty much the whole application consists of concurrent BLAS calls.
Is using the non-OMP threading (USE_OPENMP=0) an option for you?
@lightsighter both iomp and clang omp simulate gomp symbols to act as replacements. The library you point to just permits to nest KMP with OMP; you could nest native threads within any OpenMP, as @TiborGY points out.
@TiborGY using OpenMP is a requirement for me for performance reasons. @brada4 Your statement that "the library you point to just permits to nest KMP with OMP" is incorrect. That is not at all how the OpenMP library that I'm using works, nor its purpose. I also can't do lots of tiny little BLAS calls on different threads, for performance reasons. I need lots of threads cooperating on just a few very large BLAS operations. Any attempt to decompose them will degrade performance considerably.
You could try removing the "!defined(USE_OPENMP)" from the
I think you might have misunderstood what I tried to say. I am not saying you should stop using OpenMP in your program, but that you should compile OpenBLAS with USE_OPENMP=0 and USE_THREAD=1. This will use native threading inside OpenBLAS instead of the OMP threading.
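As a sketch, that configuration would be built along these lines (assuming a standard OpenBLAS source tree; the NUM_THREADS value is an illustrative assumption, sized to the pool you want OpenBLAS to manage):

```shell
# Sketch: build OpenBLAS with its own pthread-based threading,
# leaving OpenMP in the calling application untouched.
make clean
make USE_OPENMP=0 USE_THREAD=1 NUM_THREADS=40
```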
Just that you propose a similar, working idea that is close to the library in the OP, which in turn wraps IPP parallel sections inside OMP parallel ones. I just looked into it for signs of weird namespace masking/aliasing and found none.
@lightsighter unfortunately I'm having some problems with setting up Legion for testing - I'm trying to build on the provided omp_saxpy example, but that one already fails to run, with an error message from default_mapper about failing to find any valid variants for the task. Apparently it cannot figure out (the presence of?) the CPU - does this need any additional build- or runtime configuration? I did not notice anything about this in the documentation.
Yes, unfortunately the runtime under Legion (called Realm) still needs you to tell it how to configure the machine. In this case you'll want to use the
I might also recommend that you talk to @marcinz from the cunumeric project, as I know they have some tests that had to disable running with multiple OpenMP processors in Legion because of this issue, and they can probably give you a very simple command line that will do some BLAS computations using multiple OpenMP processors.
Thanks - indeed @marcinz, if you could point me to some test you had to xfail because of problems in OpenBLAS, that would be very helpful. Running cunumeric's test.py without modifications did not bring up any obvious differences between trying with
I suspect none of the tests in the test suite are actually big enough to exercise the bug. Legate's default partitioning heuristics won't over-partition the input arrays to span multiple OpenMP processors unless it is profitable to do so. You can force them to always decompose data across as many processors as available by running with
Thanks for the LEGATE_TEST=1 - I'll just try looping the test for now until something breaks and/or running on a really big machine (provided I can get it to run on arm64 or power9 server hardware). |
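A loop along these lines could hammer the suite until a mismatch shows up (the test invocation is a hypothetical placeholder based on the `cunumeric.test()` call mentioned elsewhere in this thread; substitute whatever command runs the suite in your setup):

```shell
# Sketch: rerun the tests with forced over-partitioning until one fails.
# LEGATE_TEST=1 makes Legate decompose data across all available processors.
for i in $(seq 1 100); do
    LEGATE_TEST=1 python -c "import cunumeric; cunumeric.test()" \
        || { echo "failed on iteration $i"; break; }
done
```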
I have a standalone OMP program which runs multi-threaded OpenBLAS dgemm under 4 concurrent pthreads, and it produced the correct flops and results. I compiled the OpenBLAS with
Make sure you increase the problem size as well. I run Legate pretty regularly on my M1, so it should work on arm64. I know there are folks running Legion on Power9 (Summit and Sierra), but I also know for certain that Legate does not have CI for PowerPC at the moment.
@eddy16112 I don't think that is sufficient to reproduce the issue. You've got four threads, but only one copy of the OpenMP runtime in the process. I know you're setting the number of OpenMP threads inside each pthread to a smaller value than the total number of threads in the thread pool, but I still think the fact that there is one OpenMP runtime will serialize the execution of those loops in some order by the OpenMP runtime. You need multiple copies of the OpenMP runtime in the same process to be capable of executing multiple loop bodies in parallel.
I am not sure how to create multiple OpenMP runtime instances; the OpenMP standard does not define such a mechanism. The goal of my reproducer is to make sure that we are able to run multiple OpenBLAS DGEMMs (or other kernels) concurrently, e.g. one DGEMM per NUMA node, with all the DGEMMs producing the correct flops (flops of a sequential DGEMM * number of cores used by each DGEMM).
I do not think so; as I said, all the DGEMMs within each pthread achieve the correct performance.
OK, is there an easy way to do this? Remember I am not at all familiar with cunumeric; so far I'm only invoking cunumeric.test()
OK, will try that then - running on a 12-core x86_64 did not bring up anything so far, and my build attempt on power9
Right, that is quite a non-standard implementation (if not to say, extension) of OpenMP... (I also found some rather strongly worded comments to this effect in an open issue ticket in tblis in the meantime.) I have some experimental code that duplicates all the internal queue structures NUM_PARALLEL times, which I think would remove any potential conflicts, but it would be a rather invasive and ugly change that would increase the memory footprint considerably.
I observe non-deterministic output when calling into cblas_sgemv concurrently from two different threads, each of which is bound to a different copy of an OpenMP runtime in the same process. I built OpenBLAS 0.3.5 with USE_OPENMP=1, USE_THREAD=1, NUM_PARALLEL=2, and NUM_THREADS=40. Each copy of the OpenMP runtime has 20 threads in its thread pool, and I call openblas_set_num_threads(20) once before doing any other OpenBLAS calls (which, with 2 copies of the OpenMP runtime, results in 40 total threads, the same as NUM_THREADS). Everything is deterministic if I take care to ensure that one thread completes its call to cblas_sgemv before the second thread makes its call to cblas_sgemv. However, if both threads call into cblas_sgemv concurrently, each with its own OpenMP runtime, then non-determinism results. My guess is that there are one or more global variables somewhere in OpenBLAS suffering from a race under concurrent invocations with OpenMP, but I've so far been unable to find them in the source.
Note you can't reproduce this with any of the standard OpenMP runtimes (e.g. the ones shipped by GCC, LLVM, or Intel), as they all have global variables that prevent multiple copies of their OpenMP runtimes from existing concurrently in the same process. I'm using the copy of OpenMP provided by the Realm runtime, which does support concurrent copies of the OpenMP runtime existing in the same process:
https://github.com/StanfordLegion/legion/blob/stable/runtime/realm/openmp/openmp_api.cc
I realize this is a very unusual use case for you to consider, but I'm hoping that it's an easy fix to just get rid of one or a few global variables somewhere.
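For concreteness, the build described above corresponds roughly to this invocation (a sketch of the reported configuration, not a verified reproducer; all four flags are standard OpenBLAS Makefile options):

```shell
# Sketch: OpenBLAS 0.3.5 built with OpenMP threading, two independent
# serialization domains (NUM_PARALLEL=2), and a 40-thread pool shared
# between the two 20-thread OpenMP runtime instances.
make USE_OPENMP=1 USE_THREAD=1 NUM_PARALLEL=2 NUM_THREADS=40
```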