BUG: ARPACK's eigsh & OpenBLAS from Apple Silicon M1 (arm64) nightly build triggers macOS kernel panic / reboot #14688
Comments
That wheel is built for 64bit integers in Fortran, and has some symbols suffixed with |
The numpy 1.21.0 wheel is fine:
so I assume we have to upgrade threadpoolctl to also look for Edit: I fixed a bug in threadpoolctl that caused the threading layer to be reported as "disabled" instead of "pthreads". |
I still get the kernel panic with the numpy 1.21.0 wheel but it does so after 5 or 6 iterations of the loop instead of the first...
|
To get a feeling of the experience :) |
Not sure why this is. I get |
Can you try deleting |
I had a bug in my dev version of threadpoolctl, fixed. |
That fixes the crash!
So indeed the 64 bit integer build of OpenBLAS has a problem. But why that would cause a kernel panic is a mystery... |
It's not the 64 bit integer build right? You tried with numpy=1.21.0 which was not the 64-bit integer build. |
I am confused. The problem does not appear with numpy 1.21.0 and the symlink pointing to it in scipy's
I will re-try with numpy 1.21.2 and the symlink next. |
This wouldn't work as the two libraries are ABI incompatible. |
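To illustrate why the two builds cannot simply be swapped, here is a small sketch (not taken from the thread) of what happens when an integer array written with 64-bit integers is read by code that expects 32-bit integers. It uses only Python's `struct` module and assumes a little-endian layout, as on arm64 and x86-64. The array values are made up.

```python
import struct

# Three integers as a caller using 64-bit integers (ILP64) would lay them out.
# The values are arbitrary and only serve the illustration.
values = (1, 2, 3)
buffer_64 = struct.pack("<3q", *values)

# A library built for 32-bit integers (LP64) reading the same memory sees
# twice as many elements, with zero padding interleaved between the values.
seen_as_32 = struct.unpack("<6i", buffer_64)

print(seen_as_32)  # (1, 0, 2, 0, 3, 0)
```

Scalar arguments passed by reference may happen to survive this on little-endian machines, but integer arrays would be misread as shown.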
No problem with numpy 1.21.2 and the symlink either. |
1.21.0 and 1.21.2 both have 32-bit integer wheels and only 1.22.0.dev0 has a 64-bit integer wheel build. |
Indeed that is not possible with numpy-1.22.0.dev0+949.ga90677a0b. |
Can you narrow down which library is causing this? Start with a new env and create a symlink only for |
I reinstalled a new env with the nightly builds: numpy Edit: same for I will check again with numpy Edit: symlinking |
Just in case, I archived a copy of the numpy and scipy nightly build wheels here: https://github.com/ogrisel/tmpblobs/releases/tag/m1-dev-wheels |
I can reproduce it on my MacBookAir as well, following the above instruction. |
BTW, I also reported the macOS kernel panic to Apple as https://feedbackassistant.apple.com/feedback/9594377 (probably only visible by the Apple support team) but did not receive any reply yet. |
@charris FYI there's an issue with the 64-bit OpenBLAS build for macOS arm64. I kind of missed the whole move to 64-bit OpenBLAS, was there a driver beyond "looks like a good idea"? |
Running the tests on Master returned errors on
../../scipy/sparse/linalg/eigen/tests/test_svds.py:260: in test_svds_parameter_tol
assert error < accuracy
E assert 1.4567177559457185e-15 < 1e-15
A = <100x100 sparse matrix of type '<class 'numpy.float64'>'
with 6624 stored elements in Compressed Sparse Column format>
_ = array([[-0.06798121, -0.06969187, -0.1634784 , ..., -0.08870255,
-0.08289897, -0.07725935],
[-0.0130691...722, 0.12174589],
[ 0.03737577, 0.07600953, -0.04798564, ..., -0.08283926,
-0.04644228, 0.04356348]])
accuracies = {'arpack': [1e-15, 1e-10, 1e-10], 'lobpcg': [1e-11, 0.001, 10], 'propack': [1e-12, 1e-06, 0.0001]}
accuracy = 1e-15
err = <function SVDSCommonTests.test_svds_parameter_tol.<locals>.err at 0x10e2858b0>
error = 1.4567177559457185e-15
k = 3
n = 100
rng = Generator(PCG64) at 0x13381E2E0
s = array([3.57532733e-01, 1.27553875e-01, 1.19870964e-01, 1.15829350e-01,
1.12545658e-01, 1.09114116e-01, 1.028058...2.69065672e-04, 1.72149810e-04, 1.65442589e-04,
7.44147686e-05, 3.32524683e-05, 2.12349114e-06, 1.50715281e-07])
self = <scipy.sparse.linalg.eigen.tests.test_svds.Test_SVDS_ARPACK object at 0x10e2c0460>
tol = 0.0001
tols = [0.0001, 0.01, 1.0]
========================================================================================================== short test summary info ===========================================================================================================
FAILED ../../scipy/sparse/linalg/eigen/tests/test_svds.py::Test_SVDS_ARPACK::test_svds_parameter_tol - assert 1.4567177559457185e-15 < 1e-15
I am running on a MacBook Pro (Core i7) with: SciPy/NumPy/Python version information
Does it have something to do with Openblas? |
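For scale, the failing bound can be compared with double-precision machine epsilon. This sketch only does arithmetic on the two numbers from the traceback above. It does not say which library is responsible.

```python
import sys

# Values copied from the failing assertion in the traceback.
error = 1.4567177559457185e-15
accuracy = 1e-15

eps = sys.float_info.epsilon  # float64 machine epsilon, about 2.22e-16

print(f"tolerance = {accuracy / eps:.1f} eps")
print(f"error     = {error / eps:.1f} eps")
```

Both numbers are only a few multiples of epsilon, so small rounding differences between BLAS builds or CPUs may be enough to flip this particular assertion.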
@V0lantis there are more The kernel panic should be fixed in macOS 12. Which is great, but of course doesn't yet help users now when we release arm64 wheels. So we should ensure that the test suite doesn't trigger this issue. @ogrisel can you confirm that this is the situation:
|
I did not run the full scipy suite yet. Let me try to do it now... Hopefully it won't crash my laptop while I am working on something else. We can probably try to find all the scikit-learn tests that cause this and skip them. But if there are more than a few of them, tracking them all will be very very painful. And furthermore releasing such a scipy could cause panics on other systems outside of the scikit-learn user base. However if the wheel filename is specialized to be |
The panic also happens when running the scipy suite. I ran the tests with https://gist.github.com/ogrisel/fe230e33b81f9e910ad521a61eef1cb6 So the culprit is likely the test that comes after It's too time consuming for me to try to "bisect" the potential culprits in the scipy test suite because I have the feeling it would prevent me from working on anything else for several hours. |
@sunilshah can you please install threadpoolctl and run The problem seems only related to wheel packaging. |
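For reference, a minimal sketch of inspecting the loaded BLAS libraries through threadpoolctl's Python API. It is not necessarily the command requested above. `threadpool_info()` is the real threadpoolctl function, while the helper names and the sample data (paths and version string) are made up for illustration.

```python
def entries_missing_version(info):
    """Return the BLAS entries for which no version string was detected.

    `info` is a list of dicts shaped like the output of
    threadpoolctl.threadpool_info().
    """
    return [
        entry for entry in info
        if entry.get("user_api") == "blas" and entry.get("version") is None
    ]


def report_loaded_blas():
    """Print the libraries threadpoolctl detects after importing numpy."""
    try:
        import numpy  # noqa: F401  (importing numpy loads its bundled BLAS)
        from threadpoolctl import threadpool_info
    except ImportError:
        print("numpy and threadpoolctl are required for this check")
        return
    for entry in threadpool_info():
        print(entry.get("internal_api"), entry.get("version"),
              entry.get("threading_layer"), entry.get("filepath"))


# Hypothetical sample shaped like threadpool_info() output.
sample = [
    {"user_api": "blas", "internal_api": "openblas", "version": None,
     "threading_layer": "pthreads", "filepath": "/hypothetical/libopenblas_a.dylib"},
    {"user_api": "blas", "internal_api": "openblas", "version": "x.y.z",
     "threading_layer": "pthreads", "filepath": "/hypothetical/libopenblas_b.dylib"},
]

for entry in entries_missing_version(sample):
    print("no version detected for", entry["filepath"])
```

Calling `report_loaded_blas()` in the affected environment would show whether one or two OpenBLAS copies are loaded and which of them lacks version information.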
Here is a version that changes the number of threads interactively during the execution of the code:
>>> from time import perf_counter
... import numpy as np
... from scipy.sparse.linalg import eigsh
... from threadpoolctl import threadpool_limits
...
...
... n_samples, n_features = 2000, 10
... rng = np.random.default_rng(0)
... X = rng.normal(size=(n_samples, n_features))
... K = X @ X.T
...
... for n_threads in range(1, 9):
...     print(f"using blas with {n_threads=}")
...     with threadpool_limits(limits={"blas": n_threads}):
...         for i in range(2):
...             print("running eigsh...")
...             tic = perf_counter()
...             s, _ = eigsh(K, 3, which="LA", tol=0)
...             toc = perf_counter()
...             print(f"computed {s} in {toc - tic:.3f} s")
...
using blas with n_threads=1
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.034 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.019 s
using blas with n_threads=2
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.044 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.034 s
using blas with n_threads=3
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.084 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.045 s
using blas with n_threads=4
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.022 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.027 s
using blas with n_threads=5
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.093 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.091 s
using blas with n_threads=6
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.521 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.506 s
using blas with n_threads=7
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 1.236 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.938 s
using blas with n_threads=8
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 1.471 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 1.900 s
So there is a performance degradation that starts with
@rgommers since the original kernel panic is fixed in macOS 12, do you want to close this issue and release Apple M1 wheels only for macOS 12+? I can open a dedicated issue for the performance problem that remains with a large number of threads. |
On my configuration, I do not see the same performance change with the number of threads that you see in your example.
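The timings printed above can be condensed into a slowdown factor relative to the single-thread run. This is only a sketch over the numbers reported in this comment (best of the two runs per thread count). The function name is made up.

```python
# Wall-clock times in seconds copied from the run above:
# two eigsh calls per thread count.
timings = {
    1: (0.034, 0.019),
    2: (0.044, 0.034),
    3: (0.084, 0.045),
    4: (0.022, 0.027),
    5: (0.093, 0.091),
    6: (0.521, 0.506),
    7: (1.236, 0.938),
    8: (1.471, 1.900),
}


def slowdown_vs_single_thread(timings):
    """Best time per thread count divided by the best single-thread time."""
    baseline = min(timings[1])
    return {n: min(times) / baseline for n, times in timings.items()}


for n_threads, factor in slowdown_vs_single_thread(timings).items():
    print(f"{n_threads} threads: {factor:5.1f}x the single-thread time")
```

With so few repetitions the factors are noisy, but the jump beyond 4 threads is far larger than the run-to-run variation.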
|
I just realized that
|
I'm trying to look at this again last-minute and figure out whether it's a problem if NumPy
And yes indeed, the code sample in the issue description causes a kernel panic on my
EDIT 2: trying with a conda-forge env with Python,
|
Hmm, that would be non-ideal - but macOS 12.0 is available since 25 Oct 2021, so it does make sense. The alternative here would be to remove the shipped I am going to mention this plan on the mailing list, then wait a few days and upgrade to 12.0 so I can test locally.
Yes, that sounds good, please do! And let's keep this issue open until we have actually released some wheels. |
This is still a concern right? We don't have a |
If 64-bit indexed openblas numpy wheels are shipped, wouldn't that require scipy to support 64-bit indexed blas operations? |
Yes, but we can still decide to not do that for NumPy 1.22.0. RC1 is coming out within a couple of days, that's why I was refreshing my memory on this issue. But either way, I don't think it's healthy to try to reuse the openblas bundled with numpy.
We may be able to get away with setting |
Or alternatively, we rebuild (build config is at https://github.com/MacPython/openblas-libs/blob/master/.github/workflows/multibuild.yml) |
Which of course is pessimistic - it's not clear to me if the limit should be set to half the number of cores, the number of performance cores, or if even that won't be reliable. There's a number of CPU variants out already:
|
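As a rough sketch of the two pessimistic options mentioned here (number of performance cores, or half the cores), a default could be picked as follows. The function names are made up, and the `hw.perflevel0.physicalcpu` sysctl key is an assumption that may not hold on Intel Macs or older macOS releases, hence the fallback.

```python
import os
import subprocess
import sys


def performance_core_count():
    """Return the number of performance cores on macOS, or None if unknown.

    Assumption: the sysctl key `hw.perflevel0.physicalcpu` is available.
    """
    if sys.platform != "darwin":
        return None
    try:
        out = subprocess.run(
            ["sysctl", "-n", "hw.perflevel0.physicalcpu"],
            capture_output=True, text=True, check=True,
        ).stdout
        return int(out.strip())
    except (OSError, subprocess.CalledProcessError, ValueError):
        return None


def default_blas_thread_limit():
    """Pick a pessimistic default: performance cores if known, else half the cores."""
    perf_cores = performance_core_count()
    if perf_cores:
        return perf_cores
    return max(1, (os.cpu_count() or 1) // 2)


print(default_blas_thread_limit())
```

The result could then be passed to `threadpool_limits(limits=..., user_api="blas")`. Whether either choice is reliable across CPU variants is exactly the open question raised above.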
Tried an install from source, and this slowdown then happens as well (see #14688 (comment)). Just not the crash when building from source even though there are 2 openblas's; picking up the conda-forge |
Are you restricting your comments to Conda installs? My experience with a Homebrew install is quite different from yours on Conda. On my M1 Mac mini on macOS 12.0.1, with pip-installed scipy 1.7.2, numpy 1.21.4 and Python 3.9.8, I see little slowdown when increasing the number of threads to 8 in the example code posted by @ogrisel :
IMHO it is best to not hardwire max threads in the code. Because the kernel panic is gone, optimization of threads should be left to the users with some guidance. It is surely very dependent upon the example and the system configuration / test conditions. |
That's what I have tested so far. Thanks for pointing out that Homebrew is different. I still don't fully understand what is going on under the hood here. Can you add the output of
I don't think we want the default to be 50x slower because in some circumstances we may get a 30-50% (?) speedup for more threads. The last few cores added are the efficiency cores, so they don't help much anyway (they may even hurt). Once we release wheels, that's what 99% of users will use, not from-source builds. |
Did some timing of the whole test suite:
gives:
With
With
Some comments:
Overall conclusion: disabling parallelism in OpenBLAS completely seems best in this situation, both for the slowest tests and for overall runtime of the test suite. |
As long as what you do does not change the behavior of Homebrew / pip installed Scipy, I personally have no issues.
What does this tell you?
It is best to refrain from making changes until you understand the reasons for the slowdown in your tests and configuration, as they may slow down code that has been tuned by others in different circumstances. For example, I have specific, tuned application codes that give me 2-10x performance gains on multicore machines. Your conclusion about disabling parallelism completely in OpenBLAS seems a bit drastic based upon the limited test suite results. In any case, do allow users the option of setting the thread count using both threadpoolctl and environment variables. I agree with @ogrisel that the performance issue is distinct from the kernel panic, which happened even without python/scipy. Wheel packaging for Conda and performance tuning are best done as a separate new issue. |
Yes, still trying to get a better understanding of what the root cause is here.
Yes, no argument there:)
I wanted to see (a) if Homebrew
Yes agreed, whatever we do for a default, there should be a way to override it.
There's no such thing. We're building one wheel per Python version, and it goes onto PyPI. And that wheel can then be installed into any Python install/environment.
Yes, I think a new issue with a good summary will be useful, because this issue is getting pretty long. |
The 1.7.3 wheel slows down my performance on M1 to the same level as reported for conda-forge 1.7.2. So whatever was done considerably worsens the performance. On the 1.7.3 scipy wheel from PyPI
Downgrading back to 1.7.2,
|
Python 3.8-3.10 only (as was done for SciPy)
use Python 3.9 for the macOS builds
arm64 case needs to download the arm64 version of libomp
skip universal2 for now until universal2 libomp is available
update MACOSX_DEPLOYMENT_TARGET to 12.0 for arm64
This follows SciPy (scipy/scipy#14688)
Describe your issue.
Executing the following script makes all 8 CPU cores run at 100% for 1 or 2 seconds, then macOS 11.5.1 (20G80) crashes (on a MacBook Air M1 from December 2020).
Note that this machine has 8 cores (4 performance cores and 4 efficiency cores). Running with OPENBLAS_NUM_THREADS=n, with n <= 4, runs fine and quickly. With n == 5, the code runs significantly slower.
With n>=6, I get the full system crash almost 100% of the time, sometimes at the 2nd or 3rd iteration of the loop but mostly at the first iteration.
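As a possible stopgap on affected machines, the variable can be capped from Python before NumPy or SciPy is imported. This is a minimal sketch, not something proposed in the thread. The helper name is made up, and the default cap of 4 comes from the observation above that n <= 4 runs fine on this machine.

```python
import os


def cap_openblas_threads(max_threads=4):
    """Ensure OPENBLAS_NUM_THREADS is set and does not exceed max_threads.

    Must be called before numpy or scipy is imported, because OpenBLAS
    reads the variable when the library is loaded.
    """
    current = os.environ.get("OPENBLAS_NUM_THREADS", "")
    try:
        requested = int(current)
    except ValueError:
        requested = max_threads  # unset or not a number: fall back to the cap
    capped = max(1, min(requested, max_threads))
    os.environ["OPENBLAS_NUM_THREADS"] = str(capped)
    return capped


cap_openblas_threads(4)
# import numpy, scipy, ... only after this point
```

A value already set below the cap is left unchanged, so a user who asked for fewer threads keeps that choice.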
Note that I do not reproduce the crash when running the same code with scipy installed from conda-forge using the macos arm64 installer of miniforge.
Also note that there is probably a problem with the OpenBLAS binary of the numpy wheel because threadpoolctl cannot find the version information:
Reproducing Code Example
Install the scipy nightly build wheel (or download and install a local copy from this archive):
from #13409 (comment).
Then try to execute this script:
Error message
Here is the panic report from macOS after system reboot:
SciPy/NumPy/Python version information
1.8.0.dev0+1675.774941b 1.22.0.dev0+949.ga90677a0b sys.version_info(major=3, minor=9, micro=6, releaselevel='final', serial=0)