Revert "Lazyly reinit threads after a fork in OMP mode" #2982
Conversation
This reverts commit 3094fc6.
Ah, pity - whether it works or not possibly depends on the exact version of libgomp. @Flamefire
Hm, that is strange. Before actually reverting this I'd like to understand why this is happening.
Not sure if the flush is needed, or if there even needs to be another one inside the if, after the init call. I'm also not sure whether the omp critical works when threads are spawned outside of OMP (hence my question why this is run by multiple threads), but I expect that to be the case. A sketch of the pattern in question follows below.
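For readers following along, this is roughly the shape of the lazily-reinit-after-fork pattern being debated - a minimal sketch, assuming a fork handler that sets a flag; the names threads_need_reinit and the stubbed blas_thread_init are illustrative stand-ins, not the exact OpenBLAS symbols:

```c
#include <stdio.h>

/* Illustrative stand-ins -- not the exact OpenBLAS symbols. */
static int threads_need_reinit = 1;   /* would be set by a pthread_atfork child handler */

static void blas_thread_init(void) {  /* stand-in for the real thread re-spawn */
  puts("re-initializing worker threads");
}

static void ensure_threads_initialized(void) {
  #pragma omp flush                    /* is this flush needed? (open question above) */
  if (threads_need_reinit) {
    #pragma omp critical (lazy_reinit)
    {
      /* Re-check inside the critical section: another thread may have
       * re-initialized already while we were waiting for the lock. */
      if (threads_need_reinit) {
        blas_thread_init();
        threads_need_reinit = 0;
        #pragma omp flush              /* ...and is a second one needed here? */
      }
    }
  }
}

int main(void) {
  /* omp critical only formally synchronizes OpenMP threads; whether it is
   * safe for threads spawned outside OMP is exactly the doubt raised above. */
  #pragma omp parallel
  ensure_threads_initialized();
  return 0;
}
```

The open questions above map directly onto this sketch: whether the leading flush is required, whether a second flush is needed after the init call, and whether `critical` protects threads that were not created by the OpenMP runtime.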
This is really outside my area of expertise. Using only one "core" to build and run the tests still segfaults, so my assumption about thread safety is likely wrong. But it seemed like the most plausible explanation at the time. There might be something else at play, but as I said, I'm not very familiar with this domain; I'm just trying to package software. Do you know what version of libgomp you used? I'm using gcc 9.3.0 in this example, so it might be specific to the version of libgomp I'm using.
still failing:
Then I guess this needs further work, as I suspect a larger issue here. Could you share the test you run? I.e. is it possible to test this by installing a numpy from PyPI, installing an openblas, running a test which fails, then installing a patched openblas, running the same test, and having it pass?
The problem seems to be that this only appears for some maintainers; as mentioned on the original thread, many others were able to build the package and run the tests. This may be a subtle error which only exists on Zen 2 architectures. I would be open to allowing someone to have SSH access to my server as an unprivileged user - Nixpkgs allows unprivileged users to install packages - and I can set up an environment where the changes can be verified with just running a single command.
@jonringer can you please provide the build options that Nixpkgs uses for the OpenBLAS build that failed the numpy tests?
Full info, but not meant to be human readable: openblas.drv
Thx. BTW, "TARGET=ATHLON" looks like an unusual choice for a 64-bit build.
It's most likely doing that to support as many CPU architectures as possible.
Wonder why you set "NO_BINARY_MODE=", which is not really supposed to be user-modified, but again this probably has no bearing on the (as yet unreproduced) problem.
It was set for previous aarch64 builds and may not be relevant anymore: https://github.com/NixOS/nixpkgs/blob/81bd63045480df99fc3446ab9bba58517fd8e3cf/pkgs/development/libraries/science/math/openblas/default.nix#L161
Setting BUFFERSIZE to 20 also doesn't work:
Interesting, thanks. Wonder if you could do an OpenBLAS build with DEBUG=1 (effectively just adding -g to its CFLAGS) to get a somewhat more meaningful backtrace?
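For reference, a debug rebuild along those lines might look like the following - a sketch reusing the flags from the derivation quoted above, with a placeholder install prefix:

```sh
make clean
make DEBUG=1 TARGET=ATHLON USE_OPENMP=1 NUM_THREADS=64
make DEBUG=1 PREFIX=/path/to/openblas-debug install
```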
Compiled openblas and numpy with debugging symbols, and did a test run:

I tried adding
Strange that it would not show line numbers (are you sure numpy picked up the intended build of libopenblas?), but what is visible now looks like a generic case of trashing the stack - gemm_itcopy_ZEN is the generic C gemm_tcopy_4 kernel, so no fancy assembly there; probably its "b" argument is garbage on entry. Annoyingly, this is/was also one of the manifestations of a too-small BUFFERSIZE... and I take it that the crash is no longer shown as happening "inside libgomp", now that libopenblas is built with debug symbols?
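To see why a garbage b pointer would trash the stack there: the generic tcopy-style packing kernels are essentially plain copy loops writing sequentially through b. A simplified, hypothetical sketch (not the actual OpenBLAS kernel):

```c
#include <stdio.h>

/* Simplified, hypothetical sketch of a gemm_tcopy-style packing routine:
 * copy an lda-strided m x n tile of A, transposed, into a contiguous
 * packed buffer b. If b is garbage on entry (or points at a buffer
 * smaller than m*n elements -- cf. the BUFFERSIZE discussion), these
 * stores scribble over whatever b happens to address, e.g. the stack. */
static void tcopy_sketch(long m, long n, const double *a, long lda, double *b) {
  for (long j = 0; j < n; j++)
    for (long i = 0; i < m; i++)
      *b++ = a[i * lda + j];   /* sequential, unchecked writes through b */
}

int main(void) {
  double a[2 * 3] = {1, 2, 3, 4, 5, 6};  /* 2x3 tile, lda = 3 */
  double packed[6];
  tcopy_sketch(2, 3, a, 3, packed);
  for (int k = 0; k < 6; k++) printf("%g ", packed[k]);
  printf("\n");                           /* prints: 1 4 2 5 3 6 */
  return 0;
}
```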
Uh, another potential gotcha - you are building with NUM_THREADS=64, but your TR testbed probably has HT, so 128 cores are seen at runtime?
I did some experiments with OpenBLAS 0.3.12 and Numpy 1.19.4 and can't get it to segfault on either a 12-core Intel system or a 256-core AMD system. Test script:
Maybe it is related to the 64-bit build? How does your site.cfg for the numpy build look?
Hmm. I did not see crashes with INTERFACE64=1 and the corresponding settings in python's site.cfg (and environment variables for the numpy build, as mentioned in the comments there), for either suffixed or non-suffixed symbols. Cannot go beyond 12 threads in my local testing right now, though.
For the non-suffixed build (see above but with INTERFACE64=1) I uncommented and set the corresponding options. Without the numpy modification I get the crash. For numpy 1.17 it crashes in
The "numpy modification" being what exactly - reducing BUFFERSIZE or something else ? (Note you may also need to set up the openblas includes in site.cfg to point to the INTERFACE64 version of the installed build |
Setting the variables for the numpy build to use the ILP64 mode. (Include and library paths are set up, I just didn't show those.)
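For anyone reproducing this, a numpy site.cfg pointing at a custom OpenBLAS usually looks roughly like the following - a sketch based on numpy's site.cfg.example, with placeholder paths rather than the exact file from this thread:

```ini
[openblas]
libraries = openblas
library_dirs = /path/to/openblas/lib
include_dirs = /path/to/openblas/include
runtime_library_dirs = /path/to/openblas/lib
```

For the ILP64 variants, numpy additionally expects NPY_USE_BLAS_ILP64=1 in the build environment and a correspondingly named section (e.g. [openblas64_] for the suffixed-symbol build).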
one other difference is that I'm using gcc 9.3.0
This seems to be my issue... which is odd, as these builds use cgroups to limit the number of cores available (I have mine limited to 32 cores on a single build). I was able to run the numpy tests just by bumping NUM_THREADS to a higher value.
According to https://github.com/xianyi/OpenBLAS/blob/develop/USAGE.md#troubleshooting, it looks like I should be able to bump the threads up to 256. Or I could limit the cores at runtime with some environment variables.
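Both runtime options exist: the OPENBLAS_NUM_THREADS environment variable, or OpenBLAS's C API. A minimal sketch of the latter, to be linked against libopenblas (openblas_get_num_procs, openblas_get_num_threads, and openblas_set_num_threads are part of the public interface declared in cblas.h; the cap of 32 is just the cgroup limit mentioned above):

```c
#include <stdio.h>
#include <cblas.h>   /* openblas_* prototypes ship with OpenBLAS */

int main(void) {
  printf("cores detected: %d, threads in use: %d\n",
         openblas_get_num_procs(), openblas_get_num_threads());

  /* Cap the thread count so cores beyond the build-time NUM_THREADS
   * (or beyond the cgroup allowance) are never used. */
  openblas_set_num_threads(32);
  return 0;
}
```

Setting OPENBLAS_NUM_THREADS=32 in the environment achieves the same cap without code changes.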
@Flamefire ok that's expected and unavoidable
I think this is expected. This package will be distributed to hosts with different hardware specs.
This is what I was afraid of. I don't want to pay the overhead of 256 cores only for it to run on a machine with just 16 cores... I'll think of something.
Conclusion: I will close the thread, as reverting the commit doesn't seem to be the correct course of action. However, it might be nice to emit some warning when openblas has less thread capacity than the host machine, and limit its thread usage accordingly.
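A rough sketch of what that proposed warning could look like - this is not existing OpenBLAS code; MAX_CPU_NUMBER here stands in for the build-time NUM_THREADS, the core count comes from POSIX sysconf, and the clamp reuses the public openblas_set_num_threads:

```c
#include <stdio.h>
#include <unistd.h>   /* sysconf */
#include <cblas.h>    /* openblas_set_num_threads */

/* Stand-in for the build-time NUM_THREADS baked into libopenblas. */
#define MAX_CPU_NUMBER 64

static void warn_if_undersized(void) {
  long cores = sysconf(_SC_NPROCESSORS_ONLN);   /* cores visible at runtime */
  if (cores > MAX_CPU_NUMBER) {
    fprintf(stderr,
            "OpenBLAS warning: built with NUM_THREADS=%d but %ld cores "
            "detected; limiting to %d threads.\n",
            MAX_CPU_NUMBER, cores, MAX_CPU_NUMBER);
    openblas_set_num_threads(MAX_CPU_NUMBER);
  }
}

int main(void) { warn_if_undersized(); return 0; }
```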
Runtime overhead of unused entries should be around 80 bytes or so each, IIRC.
Oh, that's a pretty low bar for most machines.
Many functions allocate stack-memory arrays, the biggest being a blas_queue_t array where each element is ~168 bytes. There may also be a job_t array where each of the NUM_THREADS entries has a size of NUM_THREADS * ~16 bytes, so the memory requirement is quadratic. 43 kB of stack memory for the queues is a lot (for 256 max threads). @jonringer Can you open an issue referencing this? IMO setting NUM_THREADS too low must not lead to a crash, so it should work with your current setting. Or am I missing anything? Edit: Another side effect of setting NUM_THREADS: it affects/is related to the number of OpenMP threads that are used by the calling program, see #2985.
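A quick back-of-the-envelope check of those figures, using the approximate element sizes quoted above (~168 B per blas_queue_t, ~16 B per job_t element):

```c
#include <stdio.h>

int main(void) {
  const long num_threads = 256;   /* hypothetical build-time maximum */
  const long queue_entry = 168;   /* ~size of one blas_queue_t       */
  const long job_entry   = 16;    /* ~size of one job_t element      */

  long queue_bytes = num_threads * queue_entry;               /* linear    */
  long job_bytes   = num_threads * num_threads * job_entry;   /* quadratic */

  printf("queues: %ld bytes (~%ld kB)\n", queue_bytes, queue_bytes / 1024);
  printf("jobs:   %ld bytes (~%ld kB)\n", job_bytes, job_bytes / 1024);
  /* -> queues: 43008 bytes (~42 kB); jobs: 1048576 bytes (~1024 kB) */
  return 0;
}
```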
I opened an issue to continue this discussion, but this PR doesn't seem relevant anymore. Sorry for being a bit presumptuous about causation - I was just following git bisect, and was able to reproduce the issue consistently.
This reverts commit 3094fc6.
Causes seg faults in some scenarios, see #2970 (comment)