
Revert "Lazyly reinit threads after a fork in OMP mode" #2982

Closed
wants to merge 1 commit

Conversation

@jonringer commented Nov 11, 2020

This reverts commit 3094fc6.

Causes seg faults in some scenarios, see #2970 (comment)

@martin-frbg
Collaborator

Ah, pity - whether it works or not possibly depends on the exact version of libgomp @Flamefire

@Flamefire
Contributor

Hm, that is strange. Before actually reverting this I'd like to understand why this is happening.
For this situation to occur, a fork must have happened earlier. That would be the first thing to check, although I'd just assume it happened.
Then, IIUC, you said this is because the code is not thread-safe. That would mean exec_blas is run by multiple threads concurrently. Is that true? If so, how?
In that case I think it would be better to make it thread-safe than to penalize all future executions. The simplest approach might be:

if (unlikely(blas_server_avail == 0)){
  #pragma omp critical
  {
    #pragma omp flush(blas_server_avail)
    if (unlikely(blas_server_avail == 0)) blas_thread_init();
  }
}

I'm not sure whether the flush is needed, or whether there even needs to be another one inside the if, after the init call. I'm also not sure whether the omp critical works when the threads are spawned outside of OpenMP (hence my question about why this is run by multiple threads), but I expect that it does.

@jonringer
Author

This is really outside my area of expertise.

Using only one "core" to build and run tests still segfaults, so my assumption about thread safety is likely wrong, but it seemed like the most plausible explanation to me at the time. There might be something else at play; like I said, I'm not very familiar with this domain, I'm just trying to package software.

Do you know what version of libgomp you used? I'm using gcc 9.3.0 in this example. It might be specific to the version of libgomp I'm using.

@jonringer
Author

still failing:

$ git diff
diff --git a/driver/others/blas_server_omp.c b/driver/others/blas_server_omp.c
index a8b3e9a4..600e75bb 100644
--- a/driver/others/blas_server_omp.c
+++ b/driver/others/blas_server_omp.c
@@ -376,8 +376,14 @@ fprintf(stderr,"UNHANDLED COMPLEX\n");

 int exec_blas(BLASLONG num, blas_queue_t *queue){

-  // Handle lazy re-init of the thread-pool after a POSIX fork
-  if (unlikely(blas_server_avail == 0)) blas_thread_init();
+  if (unlikely(blas_server_avail == 0)){
+    #pragma omp critical
+    {
+      #pragma omp flush(blas_server_avail)
+      if (unlikely(blas_server_avail == 0)) blas_thread_init();
+    }
+  }
+

   BLASLONG i, buf_index;
../core/tests/test_multiarray.py::TestDot::test_all <- ../../nix/store/71yxinqs8gd0zk93agvja8g1nmk0jgr0-python3.8-numpy-1.19.4/lib/python3.8/site-packages/numpy/core/tests/test_multiarray.py PASSED [ 18%]
../core/tests/test_multiarray.py::TestDot::test_vecobject <- ../../nix/store/71yxinqs8gd0zk93agvja8g1nmk0jgr0-python3.8-numpy-1.19.4/lib/python3.8/site-packages/numpy/core/tests/test_multiarray.py PASSED [ 18%]
../core/tests/test_multiarray.py::TestDot::test_dot_2args <- ../../nix/store/71yxinqs8gd0zk93agvja8g1nmk0jgr0-python3.8-numpy-1.19.4/lib/python3.8/site-packages/numpy/core/tests/test_multiarray.py PASSED [ 18%]
../core/tests/test_multiarray.py::TestDot::test_dot_3args <- ../../nix/store/71yxinqs8gd0zk93agvja8g1nmk0jgr0-python3.8-numpy-1.19.4/lib/python3.8/site-packages/numpy/core/tests/test_multiarray.py Fatal Python error: Segmentation fault

Thread 0x00007ffff783ff80 (most recent call first):
  File "<__array_function__ internals>", line 5 in dot
  File "/nix/store/71yxinqs8gd0zk93agvja8g1nmk0jgr0-python3.8-numpy-1.19.4/lib/python3.8/site-packages/numpy/core/tests/test_multiarray.py", line 5811 in test_dot_3args
  File "/nix/store/996kx9yv5gdgja728hd50ifm2dnhk9yg-python3.8-pytest-5.4.3/lib/python3.8/site-packages/_pytest/python.py", line 182 in pytest_pyfunc_call
  File "/nix/store/ln231rdy72rb4nd46c54yn4iaknb97j1-python3.8-pluggy-0.13.1/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/nix/store/ln231rdy72rb4nd46c54yn4iaknb97j1-python3.8-pluggy-0.13.1/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/nix/store/ln231rdy72rb4nd46c54yn4iaknb97j1-python3.8-pluggy-0.13.1/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/nix/store/ln231rdy72rb4nd46c54yn4iaknb97j1-python3.8-pluggy-0.13.1/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/nix/store/996kx9yv5gdgja728hd50ifm2dnhk9yg-python3.8-pytest-5.4.3/lib/python3.8/site-packages/_pytest/python.py", line 1477 in runtest
  File "/nix/store/996kx9yv5gdgja728hd50ifm2dnhk9yg-python3.8-pytest-5.4.3/lib/python3.8/site-packages/_pytest/runner.py", line 135 in pytest_runtest_call
  File "/nix/store/ln231rdy72rb4nd46c54yn4iaknb97j1-python3.8-pluggy-0.13.1/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/nix/store/ln231rdy72rb4nd46c54yn4iaknb97j1-python3.8-pluggy-0.13.1/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/nix/store/ln231rdy72rb4nd46c54yn4iaknb97j1-python3.8-pluggy-0.13.1/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/nix/store/ln231rdy72rb4nd46c54yn4iaknb97j1-python3.8-pluggy-0.13.1/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/nix/store/996kx9yv5gdgja728hd50ifm2dnhk9yg-python3.8-pytest-5.4.3/lib/python3.8/site-packages/_pytest/runner.py", line 217 in <lambda>
  File "/nix/store/996kx9yv5gdgja728hd50ifm2dnhk9yg-python3.8-pytest-5.4.3/lib/python3.8/site-packages/_pytest/runner.py", line 244 in from_call
  File "/nix/store/996kx9yv5gdgja728hd50ifm2dnhk9yg-python3.8-pytest-5.4.3/lib/python3.8/site-packages/_pytest/runner.py", line 216 in call_runtest_hook
  File "/nix/store/996kx9yv5gdgja728hd50ifm2dnhk9yg-python3.8-pytest-5.4.3/lib/python3.8/site-packages/_pytest/runner.py", line 186 in call_and_report
  File "/nix/store/996kx9yv5gdgja728hd50ifm2dnhk9yg-python3.8-pytest-5.4.3/lib/python3.8/site-packages/_pytest/runner.py", line 100 in runtestprotocol
  File "/nix/store/996kx9yv5gdgja728hd50ifm2dnhk9yg-python3.8-pytest-5.4.3/lib/python3.8/site-packages/_pytest/runner.py", line 85 in pytest_runtest_protocol
  File "/nix/store/ln231rdy72rb4nd46c54yn4iaknb97j1-python3.8-pluggy-0.13.1/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/nix/store/ln231rdy72rb4nd46c54yn4iaknb97j1-python3.8-pluggy-0.13.1/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/nix/store/ln231rdy72rb4nd46c54yn4iaknb97j1-python3.8-pluggy-0.13.1/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/nix/store/ln231rdy72rb4nd46c54yn4iaknb97j1-python3.8-pluggy-0.13.1/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/nix/store/996kx9yv5gdgja728hd50ifm2dnhk9yg-python3.8-pytest-5.4.3/lib/python3.8/site-packages/_pytest/main.py", line 272 in pytest_runtestloop
  File "/nix/store/ln231rdy72rb4nd46c54yn4iaknb97j1-python3.8-pluggy-0.13.1/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/nix/store/ln231rdy72rb4nd46c54yn4iaknb97j1-python3.8-pluggy-0.13.1/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/nix/store/ln231rdy72rb4nd46c54yn4iaknb97j1-python3.8-pluggy-0.13.1/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/nix/store/ln231rdy72rb4nd46c54yn4iaknb97j1-python3.8-pluggy-0.13.1/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/nix/store/996kx9yv5gdgja728hd50ifm2dnhk9yg-python3.8-pytest-5.4.3/lib/python3.8/site-packages/_pytest/main.py", line 247 in _main
  File "/nix/store/996kx9yv5gdgja728hd50ifm2dnhk9yg-python3.8-pytest-5.4.3/lib/python3.8/site-packages/_pytest/main.py", line 191 in wrap_session
  File "/nix/store/996kx9yv5gdgja728hd50ifm2dnhk9yg-python3.8-pytest-5.4.3/lib/python3.8/site-packages/_pytest/main.py", line 240 in pytest_cmdline_main
  File "/nix/store/ln231rdy72rb4nd46c54yn4iaknb97j1-python3.8-pluggy-0.13.1/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/nix/store/ln231rdy72rb4nd46c54yn4iaknb97j1-python3.8-pluggy-0.13.1/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/nix/store/ln231rdy72rb4nd46c54yn4iaknb97j1-python3.8-pluggy-0.13.1/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/nix/store/ln231rdy72rb4nd46c54yn4iaknb97j1-python3.8-pluggy-0.13.1/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/nix/store/996kx9yv5gdgja728hd50ifm2dnhk9yg-python3.8-pytest-5.4.3/lib/python3.8/site-packages/_pytest/config/__init__.py", line 124 in main
  File "/nix/store/71yxinqs8gd0zk93agvja8g1nmk0jgr0-python3.8-numpy-1.19.4/lib/python3.8/site-packages/numpy/_pytesttester.py", line 206 in __call__
  File "<string>", line 1 in <module>
/nix/store/77983lbcimy5h6rqhfq6hvvif4ngmsak-stdenv-linux/setup: line 1303:  3450 Segmentation fault      (core dumped) /nix/store/wfd59kag54hby8kv3f2k9sgasa001qkx-python3-3.8.6/bin/python3.8 -c 'import numpy; numpy.test("fast", verbose=10)'
builder for '/nix/store/bnzlbyzz2kfhic3n1vszl4z4rzz6dacy-python3.8-numpy-1.19.4.drv' failed with exit code 139
error: build of '/nix/store/bnzlbyzz2kfhic3n1vszl4z4rzz6dacy-python3.8-numpy-1.19.4.drv' failed

@Flamefire
Contributor

Then I guess this needs further work, as I suspect a larger issue here. Could you share the test you run? I.e. is it possible to reproduce this by installing numpy from PyPI, installing an OpenBLAS, running a test that fails, then installing a patched OpenBLAS and having the same test pass?

@jonringer
Author

jonringer commented Nov 13, 2020

The problem is that this only appears for some maintainers; as mentioned on the original thread, many others were able to build the package and run the tests. This may be a subtle error that only shows up on Zen 2 architectures.

I would be open to giving someone ssh access to my server as an unprivileged user. Nixpkgs allows unprivileged users to install packages, and I can set up an environment where the changes can be verified by just running nix-build -A python3Packages.numpy.

@martin-frbg
Collaborator

@jonringer can you please provide the build options that Nixpkgs uses for the OpenBLAS build that failed the numpy tests ?

@jonringer
Author

$ nix eval -f default.nix openblas.makeFlags
[ "BINARY=64" "CC=cc" "CROSS=0" "DYNAMIC_ARCH=1" "FC=gfortran" "HOSTCC=cc" "INTERFACE64=1" "NO_AVX512=1" "NO_BINARY_MODE=" "NO_SHARED=0" "NO_STATIC=1" "NUM_THREADS=64" "PREFIX=/1rz4g4znpzjwh1xymhjpm42vipw92pr73vdgl6xs1hycac8kf2n9" "TARGET=ATHLON" "USE_OPENMP=1" ]

Full info, but not meant to be human readable:

openblas.drv
$ nix show-derivation $(nix-instantiate -A openblas)
warning: you did not specify '--add-root'; the result might be removed by the garbage collector
{
  "/nix/store/ldwvyq008gfrparpgbwrck68qpwqgpx2-openblas-0.3.12.drv": {
    "outputs": {
      "dev": {
        "path": "/nix/store/0sym4nmbi11g1h5s13biz9cwjvjhp40q-openblas-0.3.12-dev"
      },
      "out": {
        "path": "/nix/store/7z4a1fmqpnawza19x9xjrv8drkfzvcvn-openblas-0.3.12"
      }
    },
    "inputSrcs": [
      "/nix/store/9krlzvny65gdc8s7kpb6lkx8cd02c25b-default-builder.sh"
    ],
    "inputDrvs": {
      "/nix/store/0zvrnkv6id4qhdk9yfhz53y13sj4czpd-gcc-wrapper-9.3.0.drv": [
        "out"
      ],
      "/nix/store/5grqm6in8vwd9g4mlhh7ac9d083hxjlz-gfortran-wrapper-9.3.0.drv": [
        "out"
      ],
      "/nix/store/bjrd5biij6yxcyp3ihx78q4d3qfc698p-source.drv": [
        "out"
      ],
      "/nix/store/in4iwy53v9rj5wwdy2dihb9lg2za9zy2-bash-4.4-p23.drv": [
        "out"
      ],
      "/nix/store/kxhpn2r3ph78ay6hwjxkvlkz5vdh3hni-perl-5.32.0.drv": [
        "out"
      ],
      "/nix/store/m0vqaflq37wssgh40wdijk1jfn7414cr-which-2.21.drv": [
        "out"
      ],
      "/nix/store/s5l7hky1aiyr5k1lhr0plrfv49xajnzr-stdenv-linux.drv": [
        "out"
      ]
    },
    "platform": "x86_64-linux",
    "builder": "/nix/store/pmrhk324fkidrm5ffd5jckb21s9zys6r-bash-4.4-p23/bin/bash",
    "args": [
      "-e",
      "/nix/store/9krlzvny65gdc8s7kpb6lkx8cd02c25b-default-builder.sh"
    ],
    "env": {
      "NIX_HARDENING_ENABLE": "fortify format",
      "blas64": "1",
      "buildInputs": "",
      "builder": "/nix/store/pmrhk324fkidrm5ffd5jckb21s9zys6r-bash-4.4-p23/bin/bash",
      "checkTarget": "tests",
      "configureFlags": "",
      "depsBuildBuild": "/nix/store/10h0hfgz39h0wk76dgy2a0ajgxzqyfxg-gfortran-wrapper-9.3.0 /nix/store/lfl1davviy5mai3g80mpy2gkb97rh9j5-gcc-wrapper-9.3.0",
      "depsBuildBuildPropagated": "",
      "depsBuildTarget": "",
      "depsBuildTargetPropagated": "",
      "depsHostHost": "",
      "depsHostHostPropagated": "",
      "depsTargetTarget": "",
      "depsTargetTargetPropagated": "",
      "dev": "/nix/store/0sym4nmbi11g1h5s13biz9cwjvjhp40q-openblas-0.3.12-dev",
      "doCheck": "1",
      "doInstallCheck": "",
      "hardeningDisable": "stackprotector pic strictoverflow relro bindnow",
      "makeFlags": "BINARY=64 CC=cc CROSS=0 DYNAMIC_ARCH=1 FC=gfortran HOSTCC=cc INTERFACE64=1 NO_AVX512=1 NO_BINARY_MODE= NO_SHARED=0 NO_STATIC=1 NUM_THREADS=64 PREFIX=/1rz4g4znpzjwh1xymhjpm42vipw92pr73vdgl6xs1hycac8kf2n9 TARGET=ATHLON USE_OPENMP=1",
      "name": "openblas-0.3.12",
      "nativeBuildInputs": "/nix/store/97b3dhba8nxz9ss8wfckmq0nz82fmqh5-perl-5.32.0 /nix/store/9vi62ib3qzj4cxb4h9sj4ydcmddgckba-which-2.21",
      "out": "/nix/store/7z4a1fmqpnawza19x9xjrv8drkfzvcvn-openblas-0.3.12",
      "outputs": "out dev",
      "patches": "",
      "pname": "openblas",
      "postInstall": "    # Write pkgconfig aliases. Upstream report:\n    # https://github.com/xianyi/OpenBLAS/issues/1740\n    for alias in blas cblas lapack; do\n      cat <<EOF > $out/lib/pkgconfig/$alias.pc\nName: $alias\nVersion: 0.3.12\nDescription: $alias provided by the OpenBLAS package.\nCflags: -I$out/include\nLibs: -L$out/lib -lopenblas\nEOF\n    done\n\n    # Setup symlinks for blas / lapack\n    ln -s $out/lib/libopenblas.so $out/lib/libblas.so\n    ln -s $out/lib/libopenblas.so $out/lib/libcblas.so\n    ln -s $out/lib/libopenblas.so $out/lib/liblapack.so\n    ln -s $out/lib/libopenblas.so $out/lib/liblapacke.so\nln -s $out/lib/libopenblas.so $out/lib/libblas.so.3\nln -s $out/lib/libopenblas.so $out/lib/libcblas.so.3\nln -s $out/lib/libopenblas.so $out/lib/liblapack.so.3\nln -s $out/lib/libopenblas.so $out/lib/liblapacke.so.3\n",
      "propagatedBuildInputs": "",
      "propagatedNativeBuildInputs": "",
      "src": "/nix/store/3g2xlgi6s6nd465rfba44wrvhlxmmz0y-source",
      "stdenv": "/nix/store/8qagiljbmjs1m6ndchl7b2h9b2vvcx7x-stdenv-linux",
      "strictDeps": "",
      "system": "x86_64-linux",
      "version": "0.3.12"
    }
  }
}

@martin-frbg
Collaborator

Thx. BTW "TARGET=ATHLON" looks like an unusual choice for a 64-bit build.

@jonringer
Author

It's most likely set that way to support as many CPU architectures as possible.

@martin-frbg
Collaborator

Wonder why you set "NO_BINARY_MODE=" which is not really supposed to be user-modified, but again this probably has no bearing on the (as yet unreproduced) problem.

@jonringer
Author

Reducing the BUFFER_SIZE shift to 20 also doesn't work:

$ git diff
diff --git a/common_x86_64.h b/common_x86_64.h
index b813336c..41d9b5b7 100644
--- a/common_x86_64.h
+++ b/common_x86_64.h
@@ -251,7 +251,7 @@ static __inline unsigned int blas_quickdivide(unsigned int x, unsigned int y){
 #define HUGE_PAGESIZE  ( 2 << 20)

 #ifndef BUFFERSIZE
-#define BUFFER_SIZE    (32 << 22)
+#define BUFFER_SIZE    (32 << 20)
 #else
 #define BUFFER_SIZE    (32 << BUFFERSIZE)
 #endif

$ nix-build -A python3Packages.numpy
....
__.py", line 124 in main
  File "/nix/store/p3k94wfs2mdh3b10r5bj5i9bi7wvixs9-python3.8-numpy-1.19.4/lib/python3.8/site-packages/numpy/_pytesttester.py", line 206 in __call__
  File "<string>", line 1 in <module>
/nix/store/frh7ir9rcv19b3ym33ck64s813yr3xrr-stdenv-linux/setup: line 1303:  3450 Segmentation fault      (core dumped) /nix/store/18656kvqazm74bj7k3mdkwmdlqfyf581-python3-3.8.6/bin/python3.8 -c 'import numpy; numpy.test("fast", verbose=10)'
builder for '/nix/store/l5ds3s3b35shmsl897l9gvj0lzyki3jc-python3.8-numpy-1.19.4.drv' failed with exit code 139
error: build of '/nix/store/l5ds3s3b35shmsl897l9gvj0lzyki3jc-python3.8-numpy-1.19.4.drv' failed

@martin-frbg
Collaborator

Interesting, thanks. I wonder if you could do an OpenBLAS build with DEBUG=1 (effectively just adding -g to its CFLAGS) to get a somewhat more meaningful backtrace?

@jonringer
Author

jonringer commented Nov 15, 2020

Compiled openblas and numpy with debugging symbols, and did a test run:

[Switching to Thread 0x7ffebba6e640 (LWP 104473)]
0x00007fffe95cbe70 in dgemm_itcopy_ZEN () from /nix/store/3v2i5ga27y9al3z0d8ccwaa389qz3ma6-lapack-3/lib/liblapack.so.3
(gdb) backtrace
#0  0x00007fffe95cbe70 in dgemm_itcopy_ZEN () from /nix/store/3v2i5ga27y9al3z0d8ccwaa389qz3ma6-lapack-3/lib/liblapack.so.3
#1  0x00007fffe882416a in inner_thread () from /nix/store/3v2i5ga27y9al3z0d8ccwaa389qz3ma6-lapack-3/lib/liblapack.so.3
#2  0x00007fffe8956bf6 in exec_blas._omp_fn () from /nix/store/3v2i5ga27y9al3z0d8ccwaa389qz3ma6-lapack-3/lib/liblapack.so.3
#3  0x00007fffe347f916 in ?? () from /nix/store/f23sq7lk6xfrvz467ffkpzackyk5q8dm-gfortran-9.3.0-lib/lib/libgomp.so.1
#4  0x00007ffff7c20e9e in start_thread () from /nix/store/5didcr1sjp2rlx8abbzv92rgahsarqd9-glibc-2.32/lib/libpthread.so.0
#5  0x00007ffff793866f in clone () from /nix/store/5didcr1sjp2rlx8abbzv92rgahsarqd9-glibc-2.32/lib/libc.so.6

I tried adding -g to CFLAGS for openblas, but it seems to not print line numbers :(

$ nix eval -f default.nix openblas.makeFlags
[ "BINARY=64" "CC=cc" "CROSS=0" "DYNAMIC_ARCH=1" "FC=gfortran" "HOSTCC=cc" "INTERFACE64=1" "NO_AVX512=1" "NO_BINARY_MODE=" "NO_SHARED=0" "NO_STATIC=1" "NUM_THREADS=64" "PREFIX=/1rz4g4znpzjwh1xymhjpm42vipw92pr73vdgl6xs1hycac8kf2n9" "TARGET=ATHLON" "USE_OPENMP=1" "DEBUG=1" ]
$ nix eval -f default.nix openblas.dontStrip
true
$ nix eval -f default.nix openblas.NIX_CFLAGS_COMPILE
[ "-g" ]

@martin-frbg
Collaborator

Strange that it would not show line numbers (are you sure numpy picked up the intended build of libopenblas?), but what is visible now looks like a generic case of trashing the stack (gemm_itcopy_ZEN is the generic C gemm_tcopy_4 kernel, so no fancy assembly there - probably its "b" argument is garbage on entry). Annoyingly this is/was also one of the manifestations of a too-small BUFFERSIZE... and I take it that the crash is no longer shown as happening "inside libgomp", now that libopenblas is somewhat debuggable?

@martin-frbg
Collaborator

Uh, another potential gotcha - you are building with NUM_THREADS=64, but your TR testbed probably has HT so 128 cores seen at runtime ?

@Flamefire
Contributor

I did a few tests.

Using INTERFACE64=1 makes numpy crash during the test inside a copy kernel from OpenBLAS (likely what you got @jonringer).

I did some experiments with OpenBLAS 0.3.12 and Numpy 1.19.4 and can't get it to segfault on a 12-core Intel system or a 256-core AMD system. Test script:

rm -rf /tmp/tmpinstall/software/OpenBLAS/0.3.7-GCC-8.3.0/*
rm -rf /tmp/install_pt/lib/python3.7/site-packages/numpy*

cd /tmp/OpenBLAS
git clean -fxd

VARS=("" install)
for i in "${VARS[@]}"; do
  make -j$(nproc) "BINARY=64" "CC=gcc" "CROSS=0" "DYNAMIC_ARCH=1" "FC=gfortran" "HOSTCC=gcc" "NO_AVX512=1" "NO_BINARY_MODE=" "NO_SHARED=0" "NO_STATIC=1" "NUM_THREADS=256" "PREFIX=/tmp/tmpinstall/software/OpenBLAS/0.3.7-GCC-8.3.0" TARGET=ATHLON "USE_OPENMP=1" DEBUG=1 $i
done

cd /tmp/numpy
git clean -fxd
python setup.py build -j 4 install --prefix /tmp/install_pt
cd /tmp/OpenBLAS

python -c 'import numpy; numpy.test(verbose=2)'

Maybe it is related to the 64-bit build? What does your site.cfg for the numpy build look like?

@martin-frbg
Collaborator

Hmm. I did not see crashes with INTERFACE64=1 and the corresponding settings in python's site.cfg (and environment variables for the numpy build as mentioned in the comments there) for either suffixed or non-suffixed symbols. Cannot go beyond 12 threads in my local testing right now though.

@Flamefire
Contributor

For the non-suffixed build (see above but with INTERFACE64=1) I uncommented

  [openblas_ilp64]
  libraries = openblas

and set NPY_USE_BLAS_ILP64=1 NPY_BLAS_ILP64_ORDER=openblas_ilp64 NPY_LAPACK_ILP64_ORDER=openblas_ilp64 before the numpy build. This works.

Without the numpy modification I get the crash. For numpy 1.17 it crashes in dcopy_k_ZEN() at ../kernel/x86_64/copy_sse2.S:592; for 1.19.4, in cgemv_kernel_4x2 at ../kernel/x86_64/cgemv_t_microk_haswell-4.c:262.
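
Putting that together, a rough sketch of the ILP64 setup described above (the /opt/openblas-ilp64 prefix and the extra site.cfg path keys are placeholders, not taken from this thread):

# site.cfg in the numpy source tree, pointing at an OpenBLAS built with INTERFACE64=1
cat > site.cfg <<'EOF'
[openblas_ilp64]
libraries = openblas
library_dirs = /opt/openblas-ilp64/lib
include_dirs = /opt/openblas-ilp64/include
EOF

# tell the numpy build to use the ILP64 BLAS/LAPACK
export NPY_USE_BLAS_ILP64=1
export NPY_BLAS_ILP64_ORDER=openblas_ilp64
export NPY_LAPACK_ILP64_ORDER=openblas_ilp64
python setup.py build -j 4 install --prefix /tmp/install_pt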

@martin-frbg
Collaborator

The "numpy modification" being what exactly - reducing BUFFERSIZE or something else ? (Note you may also need to set up the openblas includes in site.cfg to point to the INTERFACE64 version of the installed build

@Flamefire
Contributor

The "numpy modification" being what exactly

Setting the variables for the numpy build to use the ILP64 mode. (Include and library paths are set up, just didn't show those)

@jonringer
Author

jonringer commented Nov 17, 2020

One other difference is that I'm using gcc 9.3.0.

Uh, another potential gotcha - you are building with NUM_THREADS=64, but your TR testbed probably has HT so 128 cores seen at runtime ?

This seems to be my issue... which is odd, as these builds use cgroups to limit the number of cores available (I have mine limited to 32 cores for a single build).

I was able to run the numpy tests just by bumping NUM_THREADS to 128. Is there a penalty to having NUM_THREADS significantly higher than the actual core count? (I would assume you may get hit with significant context switching at some point.)

@jonringer
Author

According to https://github.com/xianyi/OpenBLAS/blob/develop/USAGE.md#troubleshooting, it looks like I should be able to bump the threads up to 256.

Or, I could limit the cores at runtime with some environment variables.
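
For example (a sketch, assuming the NUM_THREADS=64 build above; since it uses USE_OPENMP=1, the OMP_NUM_THREADS setting should apply, and OPENBLAS_NUM_THREADS is the library-specific knob described in USAGE.md):

# cap runtime threads at the compile-time NUM_THREADS, regardless of host core count
export OMP_NUM_THREADS=64
export OPENBLAS_NUM_THREADS=64
python3.8 -c 'import numpy; numpy.test("fast", verbose=10)'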

@martin-frbg
Collaborator

@Flamefire ok that's expected and unavoidable
@jonringer could be that cgroup is ignored when building with the default NO_AFFINITY=1. Potential penalty from high NUM_THREADS is some memory overhead for structs allocated at compile time, but actual number of threads at runtime is capped at what is physically available on the host.

@jonringer
Author

could be that cgroup is ignored when building with the default NO_AFFINITY=1.

I think this is expected. This package will be distributed to hosts with different hardware specs.

Potential penalty from high NUM_THREADS is some memory overhead for structs allocated at compile time

This is what I was afraid of. I don't want to pay the overhead of 256 cores when it may run on a machine with only 16 cores...

I'll think of something.

@jonringer
Author

The conclusion is that if (unlikely(blas_server_avail == 0)) blas_thread_init(); seems to be correct, and the way we (nixpkgs) were packaging this only worked on systems with <=64 cores. The breakage just didn't show up at runtime until that commit was added.

I will close this PR, as reverting the commit doesn't seem to be the correct course of action.

However, it might be nice to emit a warning when openblas has less thread capacity than the host machine, and to limit its thread usage accordingly.

@jonringer closed this Nov 17, 2020
@jonringer deleted the revert-309 branch November 17, 2020 15:28
@martin-frbg
Collaborator

Runtime overhead of unused entries should be around 80 bytes or so each, IIRC.

@jonringer
Author

Runtime overhead of unused entries should be around 80 bytes or so each, IIRC.

Oh, that's a pretty low bar for most x86_64 machines.

@Flamefire
Contributor

Flamefire commented Nov 17, 2020

Runtime overhead of unused entries should be around 80 bytes or so each, IIRC.

Many functions allocate stack memory arrays, the biggest being a blas_queue_t array where each element is ~168 bytes. There may also be a job_t array where each of the NUM_THREADS entries has a size of NUM_THREADS * ~16 bytes, so the memory requirement is quadratic. 43 kB of stack memory for the queues is a lot (for 256 max threads).
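
As a quick back-of-the-envelope check of those figures (treating ~168 B and ~16 B as rough element sizes, not exact struct layouts):

NT=256
echo "blas_queue_t array: $(( NT * 168 )) bytes"     # ~43 kB, matching the number above
echo "job_t array:        $(( NT * NT * 16 )) bytes" # ~1 MB, growing with NUM_THREADS^2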

@jonringer Can you open an issue referencing this? IMO setting NUM_THREADS too low must not lead to a crash, so it should work for your current setting. Or am I missing anything?

Edit: another side effect of setting NUM_THREADS is that it affects / is related to the number of OpenMP threads used by the calling program, see #2985.

@jonringer
Author

I opened an issue to continue this discussion, but this PR doesn't seem relevant anymore.

Sorry for being a bit presumptuous about causation, but I was just following git bisect and was able to reproduce the issue consistently.
