Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: ARPACK's eigsh & OpenBLAS from Apple Silicon M1 (arm64) nightly build triggers macOS kernel panic / reboot #14688

Closed
ogrisel opened this issue Sep 2, 2021 · 62 comments
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected
Milestone

Comments

@ogrisel
Copy link
Contributor

ogrisel commented Sep 2, 2021

Describe your issue.

Executing the following script makes all the 8 CPU cores run 100% for 1 or 2 seconds then macOS 11.5.1 (20G80) crashes (on a MacBook Air M1 from December 2020).

Note that this machine has 8 cores (4 performance cores and 4 efficiency cores). Running with OPENBLAS_NUM_THREADS=n set with n <= 4 runs fine and quickly.

With n == 5, the code runs significantly slower.

With n>=6, I get the full system crash almost 100% of the time, sometimes at the 2nd or 3rd iteration of the loop but mostly at the first iteration.

Note that I do not reproduce the crash when running the same code with scipy installed from conda-forge using the macos arm64 installer of miniforge.

Also note that there is probably a problem with the OpenBLAS binary of the numpy wheel because threadpoolctl cannot find the version information:

❯ python -m threadpoolctl -i scipy
[
  {
    "filepath": "/Users/ogrisel/miniforge3/envs/tmp/lib/python3.9/site-packages/numpy/.dylibs/libopenblas64_.0.dylib",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": null,
    "num_threads": null,
    "threading_layer": "unknown",
    "architecture": null
  }
]

Reproducing Code Example

Install the scipy nightly build wheel (or download and install a local copy from this archive):

❯ pip install --pre -i https://pypi.anaconda.org/scipy-wheels-nightly/simple scipy

from #13409 (comment).

Then try to execute this script:

from time import perf_counter
import numpy as np
from scipy.sparse.linalg import eigsh


n_samples, n_features = 2000, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(n_samples, n_features))
K = X @ X.T

for i in range(10):
    print("running eigsh...")
    tic = perf_counter()
    s, _ = eigsh(K, 3, which="LA", tol=0)
    toc = perf_counter()
    print(f"computed {s} in {toc - tic:.3f} s")

Error message

Here is the panic report from macOS after system reboot:

panic(cpu 3 caller 0xfffffe0018fab460): "Spinlock timeout after 12585145 ticks, 0xfffffe001cd04588 = fffffe1668958661"
Debugger message: panic
Memory ID: 0xff
OS release type: User
OS version: 20G80
Kernel version: Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:27 PDT 2021; root:xnu-7195.141.2~5/RELEASE_ARM64_T8101
Fileset Kernelcache UUID: E46841F89DC3FD7ACEC6F404AC995579
Kernel UUID: AC4A14A7-8A8E-3AE6-85A6-55E6B2502BF9
iBoot version: iBoot-6723.140.2
secure boot?: YES
Paniclog version: 13
KernelCache slide: 0x0000000011330000
KernelCache base:  0xfffffe0018334000
Kernel slide:      0x0000000011e78000
Kernel text base:  0xfffffe0018e7c000
Kernel text exec base:  0xfffffe0018f48000
mach_absolute_time: 0xa9b77fbf5
Epoch Time:        sec       usec
  Boot    : 0x613143f7 0x00087da9
  Sleep   : 0x00000000 0x00000000
  Wake    : 0x00000000 0x00000000
  Calendar: 0x61314b5c 0x00031a5d

CORE 0 recently retired instr at 0xfffffe00190bad6c
CORE 1 recently retired instr at 0xfffffe00190bad6c
CORE 2 recently retired instr at 0xfffffe00190bad6c
CORE 3 recently retired instr at 0xfffffe00190b96a4
CORE 4 recently retired instr at 0xfffffe00190bad70
CORE 5 recently retired instr at 0xfffffe00190bad70
CORE 6 recently retired instr at 0xfffffe00190bad70
CORE 7 recently retired instr at 0xfffffe00190bad70
CORE 0 PVH locks held: None
CORE 1 PVH locks held: None
CORE 2 PVH locks held: None
CORE 3 PVH locks held: None
CORE 4 PVH locks held: None
CORE 5 PVH locks held: None
CORE 6 PVH locks held: None
CORE 7 PVH locks held: None
CORE 0: PC=0xfffffe0018f911f0, LR=0xfffffe0018f911f0, FP=0xfffffe307bbebd50
CORE 1: PC=0xfffffe0018f911f0, LR=0xfffffe0018f911f0, FP=0xfffffe30b50e3d50
CORE 2: PC=0xfffffe0018f911f0, LR=0xfffffe0018f911f0, FP=0xfffffe3087d13d50
CORE 3 is the one that panicked. Check the full backtrace for details.
CORE 4: PC=0xfffffe00190a7204, LR=0xfffffe00190a71e0, FP=0xfffffe3087ca3d50
CORE 5: PC=0xfffffe0018f91200, LR=0xfffffe00194bb268, FP=0xfffffe30b596bdb0
CORE 6: PC=0xfffffe0018f91200, LR=0xfffffe00194bb268, FP=0xfffffe30b5a1bdb0
CORE 7: PC=0x00000001034415bc, LR=0x0000000103441564, FP=0x000000016d7890d0
Panicked task 0xfffffe166ecfbac8: 4385 pages, 15 threads: pid 5061: python3.9
Panicked thread: 0xfffffe166ef73960, backtrace: 0xfffffe30b6a536d0, tid: 25548
		  lr: 0xfffffe0018f96b68  fp: 0xfffffe30b6a53740
		  lr: 0xfffffe0018f9694c  fp: 0xfffffe30b6a537b0
		  lr: 0xfffffe00190c01c8  fp: 0xfffffe30b6a537d0
		  lr: 0xfffffe00190b1674  fp: 0xfffffe30b6a53880
		  lr: 0xfffffe0018f4f7e8  fp: 0xfffffe30b6a53890
		  lr: 0xfffffe0018f965dc  fp: 0xfffffe30b6a53c20
		  lr: 0xfffffe0018f965dc  fp: 0xfffffe30b6a53c90
		  lr: 0xfffffe0019748e80  fp: 0xfffffe30b6a53cb0
		  lr: 0xfffffe0018fab460  fp: 0xfffffe30b6a53ce0
		  lr: 0xfffffe0018fab2cc  fp: 0xfffffe30b6a53d00
		  lr: 0xfffffe0018f91174  fp: 0xfffffe30b6a53d50
		  lr: 0xfffffe00194bb268  fp: 0xfffffe30b6a53db0
		  lr: 0xfffffe001959e0e8  fp: 0xfffffe30b6a53e40
		  lr: 0xfffffe00190b1350  fp: 0xfffffe30b6a53ef0
		  lr: 0xfffffe0018f4f7e8  fp: 0xfffffe30b6a53f00

last started kext at 694061903: com.apple.driver.AppleHIDKeyboard	224 (addr 0xfffffe00186a0000, size 16384)
loaded kexts:
com.apple.driver.AppleTopCaseHIDEventDriver	4050.1
com.apple.filesystems.autofs	3.0
com.apple.fileutil	20.036.15
com.apple.iokit.IOBluetoothSerialManager	8.0.5d7
com.apple.driver.AppleBiometricServices	1
com.apple.driver.usb.realtek8153patcher	5.0.0
com.apple.driver.usb.AppleUSBHostBillboardDevice	1.0
com.apple.driver.DiskImages.ReadWriteDiskImage	493.0.0
com.apple.driver.DiskImages.UDIFDiskImage	493.0.0
com.apple.driver.DiskImages.RAMBackingStore	493.0.0
com.apple.driver.DiskImages.FileBackingStore	493.0.0
com.apple.iokit.SCSITaskUserClient	436.140.1
com.apple.filesystems.apfs	1677.141.1
com.apple.driver.BCMWLANFirmware4378.Hashstore	1
com.apple.driver.CoreKDL	1
com.apple.driver.SEPHibernation	1
com.apple.driver.AppleUSBDeviceNCM	5.0.0
com.apple.driver.AppleSmartBatteryManager	161.0.0
com.apple.driver.AppleThunderboltIP	4.0.3
com.apple.driver.AppleALSColorSensor	1.0.0d1
com.apple.driver.AppleAOPVoiceTrigger	11.5
com.apple.driver.AppleFileSystemDriver	3.0.1
com.apple.nke.l2tp	1.9
com.apple.filesystems.tmpfs	1
com.apple.IOTextEncryptionFamily	1.0.0
com.apple.filesystems.hfs.kext	556.100.11
com.apple.security.BootPolicy	1
com.apple.BootCache	40
com.apple.AppleFSCompression.AppleFSCompressionTypeZlib	1.0.0
com.apple.AppleFSCompression.AppleFSCompressionTypeDataless	1.0.0d1
com.apple.driver.ApplePMP	1
com.apple.driver.AppleSmartIO2	1
com.apple.driver.ApplePMPFirmware	1
com.apple.AppleEmbeddedSimpleSPINORFlasher	1
com.apple.driver.AppleCS42L83Audio	442.26
com.apple.driver.AppleSPMIPMU	1.0.1
com.apple.driver.AppleDPDisplayTCON	1
com.apple.driver.AppleTAS5770LAmp	442.26
com.apple.driver.AppleT8020SOCTuner	1
com.apple.driver.AppleT8103CLPCv3	1
com.apple.AGXG13G	173.28.7
com.apple.driver.AppleAVD	385
com.apple.driver.AppleAVE2	401.73.4
com.apple.driver.AppleJPEGDriver	4.6.0
com.apple.driver.AppleMobileDispH13G-DCP	140.0
com.apple.driver.usb.AppleUSBHostT8103	1
com.apple.driver.AudioDMAController-T8103	1.60.5
com.apple.driver.AppleS5L8960XNCO	1
com.apple.driver.AppleT8103PMGR	1
com.apple.driver.AppleS8000AES	1
com.apple.driver.AppleS8000DWI	1.0.0d1
com.apple.driver.AppleS5L8960XWatchDogTimer	1
com.apple.driver.AppleInterruptController	1.0.0d1
com.apple.driver.AppleT8020DART	1
com.apple.driver.AppleBluetoothModule	1
com.apple.driver.AppleSamsungSerial	1.0.0d1
com.apple.driver.AppleBCMWLANBusInterfacePCIe	1
com.apple.driver.AppleS5L8920XPWM	1.0.0d1
com.apple.driver.AppleS5L8940XI2C	1.0.0d2
com.apple.driver.AppleSPIMC	1
com.apple.driver.AppleT8101	1
com.apple.driver.AppleM68Buttons	1.0.0d1
com.apple.iokit.IOUserEthernet	1.0.1
com.apple.driver.usb.AppleUSBUserHCI	1
com.apple.iokit.IOKitRegistryCompatibility	1
com.apple.iokit.EndpointSecurity	1
com.apple.driver.AppleDiskImages2	1
com.apple.AppleSystemPolicy	2.0.0
com.apple.nke.applicationfirewall	311
com.apple.kec.InvalidateHmac	1
com.apple.driver.AppleHIDKeyboard	224
com.apple.driver.AppleActuatorDriver	4440.3
com.apple.driver.AppleMultitouchDriver	4440.3
com.apple.driver.AppleHSBluetoothDriver	4050.1
com.apple.driver.IOBluetoothHIDDriver	8.0.5d7
com.apple.kext.triggers	1.0
com.apple.iokit.IOAVBFamily	940.4
com.apple.plugin.IOgPTPPlugin	985.2
com.apple.iokit.IOEthernetAVBController	1.1.0
com.apple.driver.AppleMesaSEPDriver	100.99
com.apple.iokit.IOBiometricFamily	1
com.apple.driver.usb.cdc.ecm	5.0.0
com.apple.driver.usb.cdc.acm	5.0.0
com.apple.driver.usb.serial	6.0.0
com.apple.driver.DiskImages.KernelBacked	493.0.0
com.apple.driver.usb.cdc	5.0.0
com.apple.driver.AppleUSBAudio	405.39
com.apple.iokit.IOAudioFamily	300.6.1
com.apple.vecLib.kext	1.2.0
com.apple.driver.AppleXsanScheme	3
com.apple.driver.AppleSEPHDCPManager	1.0.1
com.apple.driver.AppleTrustedAccessory	1
com.apple.iokit.AppleSEPGenericTransfer	1
com.apple.driver.IOBluetoothHostControllerPCIeTransport	8.0.5d7
com.apple.iokit.IOBluetoothHostControllerTransport	8.0.5d7
com.apple.driver.AppleConvergedIPCOLYBTControl	1
com.apple.driver.AppleConvergedPCI	1
com.apple.driver.AppleBluetoothDebug	1
com.apple.driver.usb.networking	5.0.0
com.apple.driver.AppleBTM	1.0.1
com.apple.driver.AppleThunderboltDPInAdapter	8.1.4
com.apple.driver.AppleThunderboltDPAdapterFamily	8.1.4
com.apple.driver.AppleThunderboltUSBDownAdapter	1.0.4
com.apple.driver.AppleThunderboltPCIDownAdapter	4.1.1
com.apple.driver.AppleAOPAudio	16.2
com.apple.nke.ppp	1.9
com.apple.driver.AppleHIDTransportSPI	4400.35
com.apple.driver.AppleHIDTransport	4400.35
com.apple.driver.AppleInputDeviceSupport	4400.35
com.apple.driver.AppleDCPDPTXProxy	1.0.0
com.apple.driver.DCPDPFamilyProxy	1
com.apple.driver.AppleBSDKextStarter	3
com.apple.filesystems.hfs.encodings.kext	1
com.apple.driver.AppleSPU	1
com.apple.AGXFirmwareKextG13GRTBuddy	173.28.7
com.apple.AGXFirmwareKextRTBuddy64	173.28.7
com.apple.iokit.IONVMeFamily	2.1.0
com.apple.driver.AppleDiagnosticDataAccessReadOnly	1.0.0
com.apple.driver.AppleNANDConfigAccess	1.0.0
com.apple.driver.AppleHPM	3.4.4
com.apple.driver.AppleCSEmbeddedAudio	442.26
com.apple.driver.AppleDialogPMU	1.0.1
com.apple.driver.AppleEmbeddedAudio	442.26
com.apple.iokit.AppleARMIISAudio	80.34.1
com.apple.driver.AppleStockholmControl	1.0.0
com.apple.driver.ApplePassthroughPPM	3.0
com.apple.iokit.IOGPUFamily	21.11
com.apple.driver.AppleH11ANEInterface	4.75.0
com.apple.driver.AppleH13CameraInterface	3.58.0
com.apple.driver.AppleH10PearlCameraInterface	16.15.1
com.apple.driver.DCPAVFamilyProxy	1
com.apple.iokit.IOMobileGraphicsFamily-DCP	343.0.0
com.apple.driver.AppleDCP	1
com.apple.driver.AppleFirmwareKit	1
com.apple.iokit.IOMobileGraphicsFamily	343.0.0
com.apple.driver.AppleSPMI	1.0.1
com.apple.driver.AppleUSBXDCIARM	1.0
com.apple.driver.AppleUSBXDCI	1.0
com.apple.iokit.IOUSBDeviceFamily	2.0.0
com.apple.driver.usb.AppleUSBXHCIARM	1
com.apple.driver.usb.AppleUSBXHCI	1.2
com.apple.driver.AppleEmbeddedUSBHost	1
com.apple.driver.usb.AppleUSBHub	1.2
com.apple.driver.usb.AppleUSBHostCompositeDevice	1.2
com.apple.driver.AppleT8103TypeCPhy	1
com.apple.driver.AppleSART	1
com.apple.driver.ApplePMGR	1
com.apple.driver.watchdog	1
com.apple.driver.usb.AppleUSBHostPacketFilter	1.0
com.apple.driver.AppleDisplayCrossbar	1.0.0
com.apple.iokit.IODisplayPortFamily	1.0.0
com.apple.driver.AppleTypeCPhy	1
com.apple.driver.AppleThunderboltNHI	7.2.8
com.apple.driver.AppleT8103PCIeC	1
com.apple.iokit.IOThunderboltFamily	9.3.2
com.apple.driver.ApplePIODMA	1
com.apple.driver.AppleT8103PCIe	1
com.apple.driver.AppleMultiFunctionManager	1
com.apple.driver.AppleEmbeddedPCIE	1
com.apple.driver.AppleBluetoothDebugService	1
com.apple.driver.AppleBCMWLANCore	1.0.0
com.apple.iokit.IO80211FamilyV2	1200.12.2b1
com.apple.driver.IOImageLoader	1.0.0
com.apple.driver.corecapture	1.0.4
com.apple.driver.AppleMCA2-T8103	540.10
com.apple.driver.AppleGPIOICController	1.0.2
com.apple.driver.AppleFireStormErrorHandler	1
com.apple.driver.AppleMobileApNonce	1
com.apple.iokit.IOTimeSyncFamily	985.2
com.apple.driver.DiskImages	493.0.0
com.apple.iokit.IOGraphicsFamily	585.1
com.apple.iokit.IOBluetoothFamily	8.0.5d7
com.apple.iokit.IOBluetoothPacketLogger	8.0.5d7
com.apple.driver.FairPlayIOKit	68.7.1
com.apple.iokit.CoreAnalyticsFamily	1
com.apple.driver.AppleSSE	1.0
com.apple.driver.AppleSEPKeyStore	2
com.apple.driver.AppleUSBTDM	511.141.1
com.apple.iokit.IOUSBMassStorageDriver	184.140.2
com.apple.iokit.IOPCIFamily	2.9
com.apple.iokit.IOSCSIBlockCommandsDevice	436.140.1
com.apple.iokit.IOSCSIArchitectureModelFamily	436.140.1
com.apple.driver.AppleIPAppender	1.0
com.apple.driver.AppleFDEKeyStore	28.30
com.apple.driver.AppleEffaceableStorage	1.0
com.apple.driver.AppleCredentialManager	1.0
com.apple.driver.KernelRelayHost	1
com.apple.iokit.IOUSBHostFamily	1.2
com.apple.driver.AppleUSBHostMergeProperties	1.2
com.apple.driver.usb.AppleUSBCommon	1.0
com.apple.driver.AppleSMC	3.1.9
com.apple.driver.RTBuddy	1.0.0
com.apple.driver.AppleEmbeddedTempSensor	1.0.0
com.apple.driver.AppleARMPMU	1.0
com.apple.iokit.IOAccessoryManager	1.0.0
com.apple.driver.AppleOnboardSerial	1.0
com.apple.iokit.IOSkywalkFamily	1
com.apple.driver.mDNSOffloadUserClient	1.0.1b8
com.apple.iokit.IONetworkingFamily	3.4
com.apple.iokit.IOSerialFamily	11
com.apple.driver.AppleSEPManager	1.0.1
com.apple.driver.AppleA7IOP	1.0.2
com.apple.driver.IOSlaveProcessor	1
com.apple.driver.AppleBiometricSensor	2
com.apple.iokit.IOHIDFamily	2.0.0
com.apple.AUC	1.0
com.apple.iokit.IOAVFamily	1.0.0
com.apple.iokit.IOHDCPFamily	1.0.0
com.apple.iokit.IOCECFamily	1
com.apple.iokit.IOAudio2Family	1.0
com.apple.driver.AppleEmbeddedAudioLibs	1.17
com.apple.driver.AppleFirmwareUpdateKext	1
com.apple.driver.AppleM2ScalerCSCDriver	265.0.0
com.apple.iokit.IOSurface	290.8.1
com.apple.driver.IODARTFamily	1
com.apple.security.quarantine	4
com.apple.security.sandbox	300.0
com.apple.kext.AppleMatch	1.0.0d1
com.apple.driver.AppleMobileFileIntegrity	1.0.5
com.apple.kext.CoreTrust	1
com.apple.security.AppleImage4	3.0.0
com.apple.iokit.IOCryptoAcceleratorFamily	1.0.1
com.apple.driver.AppleARMPlatform	1.0.2
com.apple.iokit.IOStorageFamily	2.1
com.apple.iokit.IOSlowAdaptiveClockingFamily	1.0.0
com.apple.iokit.IOReportFamily	47
com.apple.kec.pthread	1
com.apple.kec.corecrypto	11.1
com.apple.kec.Libm	1



** Stackshot Succeeded ** Bytes Traced 341829 (Uncompressed 886656) **

SciPy/NumPy/Python version information

1.8.0.dev0+1675.774941b 1.22.0.dev0+949.ga90677a0b sys.version_info(major=3, minor=9, micro=6, releaselevel='final', serial=0)

@ogrisel ogrisel added the defect A clear bug or issue that prevents SciPy from being installed or used as expected label Sep 2, 2021
@ogrisel ogrisel changed the title BUG: arpack's eigsh & openblas from Apple Silicon M1 (arm64) nightly build triggers macOS kernel panic / reboot BUG: ARPACK's eigsh & OpenBLAS from Apple Silicon M1 (arm64) nightly build triggers macOS kernel panic / reboot Sep 2, 2021
@isuruf
Copy link
Contributor

isuruf commented Sep 2, 2021

Also note that there is probably a problem with the OpenBLAS binary of the numpy wheel because threadpoolctl cannot find the version information:

That wheel is built for 64bit integers in Fortran, and has some symbols suffixed with 64_. openblas_num_threads becomes openblas_num_threads64_. Can you try numpy=1.21.0 wheels?

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 2, 2021

The numpy 1.21.0 wheels is fine:

❯ python -m threadpoolctl -i scipy
[
  {
    "filepath": "/Users/ogrisel/miniforge3/envs/tmp/lib/python3.9/site-packages/numpy/.dylibs/libopenblas.0.dylib",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": "0.3.13.dev",
    "num_threads": 8,
    "threading_layer": "pthreads",
    "architecture": "armv8"
 }
]

so I assume we have to upgrade threadpolctl to also look for 64_ suffixed symbols.

Edit: I fixed a bug of threadpoolctl that caused the threading layer to be reported as "disabled" instead of "pthreads".

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 2, 2021

I still get the kernel panic with the numpy 1.21.0 wheel but it does so after 5 or 6 iterations of the loop instead of the first...

panic(cpu 0 caller 0xfffffe001012b460): "Spinlock timeout after 12583549 ticks, 0xfffffe0013e84588 = fffffe166ad1d321"
Debugger message: panic
Memory ID: 0xff
OS release type: User
OS version: 20G80
Kernel version: Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:27 PDT 2021; root:xnu-7195.141.2~5/RELEASE_ARM64_T8101
Fileset Kernelcache UUID: E46841F89DC3FD7ACEC6F404AC995579
Kernel UUID: AC4A14A7-8A8E-3AE6-85A6-55E6B2502BF9
iBoot version: iBoot-6723.140.2
secure boot?: YES
Paniclog version: 13
KernelCache slide: 0x00000000084b0000
KernelCache base:  0xfffffe000f4b4000
Kernel slide:      0x0000000008ff8000
Kernel text base:  0xfffffe000fffc000
Kernel text exec base:  0xfffffe00100c8000
mach_absolute_time: 0x512170d32
Epoch Time:        sec       usec
  Boot    : 0x61314b68 0x0001f503
  Sleep   : 0x00000000 0x00000000
  Wake    : 0x00000000 0x00000000
  Calendar: 0x61314eed 0x000f1b63

CORE 0 recently retired instr at 0xfffffe00102396a4
CORE 1 recently retired instr at 0xfffffe001023ad6c
CORE 2 recently retired instr at 0xfffffe001023ad6c
CORE 3 recently retired instr at 0xfffffe001023ad6c
CORE 4 recently retired instr at 0xfffffe001023ad70
CORE 5 recently retired instr at 0xfffffe001023ad70
CORE 6 recently retired instr at 0xfffffe001023ad70
CORE 7 recently retired instr at 0xfffffe001023ad70
CORE 0 PVH locks held: None
CORE 1 PVH locks held: None
CORE 2 PVH locks held: None
CORE 3 PVH locks held: None
CORE 4 PVH locks held: None
CORE 5 PVH locks held: None
CORE 6 PVH locks held: None
CORE 7 PVH locks held: None
CORE 0 is the one that panicked. Check the full backtrace for details.
CORE 1: PC=0xfffffe00100d8c28, LR=0xfffffe00100d8c28, FP=0x0000000000000000
CORE 2: PC=0xfffffe00100d8c28, LR=0xfffffe00100d8c28, FP=0x0000000000000000
CORE 3: PC=0xfffffe00100d8c28, LR=0xfffffe00100d8c28, FP=0x0000000000000000
CORE 4: PC=0xfffffe0010227204, LR=0xfffffe00102271e0, FP=0xfffffe30b4fb3d50
CORE 5: PC=0x000000018225ccd8, LR=0x0000000182173f90, FP=0x000000016decaf40
CORE 6: PC=0xfffffe0010227204, LR=0xfffffe00102271e0, FP=0xfffffe30b4f93d50
CORE 7: PC=0x0000000102e17144, LR=0x0000000102e170ec, FP=0x000000016dc990f0
Panicked task 0xfffffe16747efac8: 4553 pages, 15 threads: pid 3270: python3.9
Panicked thread: 0xfffffe166f24ccc0, backtrace: 0xfffffe30b51c36d0, tid: 17089
		  lr: 0xfffffe0010116b68  fp: 0xfffffe30b51c3740
		  lr: 0xfffffe001011694c  fp: 0xfffffe30b51c37b0
		  lr: 0xfffffe00102401c8  fp: 0xfffffe30b51c37d0
		  lr: 0xfffffe0010231674  fp: 0xfffffe30b51c3880
		  lr: 0xfffffe00100cf7e8  fp: 0xfffffe30b51c3890
		  lr: 0xfffffe00101165dc  fp: 0xfffffe30b51c3c20
		  lr: 0xfffffe00101165dc  fp: 0xfffffe30b51c3c90
		  lr: 0xfffffe00108c8e80  fp: 0xfffffe30b51c3cb0
		  lr: 0xfffffe001012b460  fp: 0xfffffe30b51c3ce0
		  lr: 0xfffffe001012b2cc  fp: 0xfffffe30b51c3d00
		  lr: 0xfffffe0010111174  fp: 0xfffffe30b51c3d50
		  lr: 0xfffffe001063b268  fp: 0xfffffe30b51c3db0
		  lr: 0xfffffe001071e0e8  fp: 0xfffffe30b51c3e40
		  lr: 0xfffffe0010231350  fp: 0xfffffe30b51c3ef0
		  lr: 0xfffffe00100cf7e8  fp: 0xfffffe30b51c3f00

last started kext at 659678031: com.apple.driver.AppleActuatorDriver	4440.3 (addr 0xfffffe000f670000, size 16384)
loaded kexts:
com.apple.driver.AppleTopCaseHIDEventDriver	4050.1
com.apple.filesystems.autofs	3.0
com.apple.iokit.IOBluetoothSerialManager	8.0.5d7
com.apple.fileutil	20.036.15
com.apple.driver.AppleBiometricServices	1
com.apple.driver.usb.AppleUSBHostBillboardDevice	1.0
com.apple.driver.usb.realtek8153patcher	5.0.0
com.apple.iokit.SCSITaskUserClient	436.140.1
com.apple.driver.BCMWLANFirmware4378.Hashstore	1
com.apple.driver.DiskImages.ReadWriteDiskImage	493.0.0
com.apple.driver.DiskImages.UDIFDiskImage	493.0.0
com.apple.driver.DiskImages.RAMBackingStore	493.0.0
com.apple.driver.DiskImages.FileBackingStore	493.0.0
com.apple.driver.CoreKDL	1
com.apple.driver.SEPHibernation	1
com.apple.filesystems.apfs	1677.141.1
com.apple.driver.AppleUSBDeviceNCM	5.0.0
com.apple.driver.AppleThunderboltIP	4.0.3
com.apple.driver.AppleSmartBatteryManager	161.0.0
com.apple.driver.AppleALSColorSensor	1.0.0d1
com.apple.driver.AppleAOPVoiceTrigger	11.5
com.apple.driver.AppleFileSystemDriver	3.0.1
com.apple.nke.l2tp	1.9
com.apple.filesystems.tmpfs	1
com.apple.driver.ApplePMP	1
com.apple.driver.ApplePMPFirmware	1
com.apple.driver.AppleSmartIO2	1
com.apple.IOTextEncryptionFamily	1.0.0
com.apple.filesystems.hfs.kext	556.100.11
com.apple.security.BootPolicy	1
com.apple.BootCache	40
com.apple.AppleFSCompression.AppleFSCompressionTypeZlib	1.0.0
com.apple.AppleFSCompression.AppleFSCompressionTypeDataless	1.0.0d1
com.apple.AppleEmbeddedSimpleSPINORFlasher	1
com.apple.driver.AppleDPDisplayTCON	1
com.apple.driver.AppleCS42L83Audio	442.26
com.apple.driver.AppleTAS5770LAmp	442.26
com.apple.driver.AppleSPMIPMU	1.0.1
com.apple.driver.AppleT8020SOCTuner	1
com.apple.driver.AppleT8103CLPCv3	1
com.apple.AGXG13G	173.28.7
com.apple.driver.AppleAVD	385
com.apple.driver.AppleAVE2	401.73.4
com.apple.driver.AppleJPEGDriver	4.6.0
com.apple.driver.AppleMobileDispH13G-DCP	140.0
com.apple.driver.usb.AppleUSBHostT8103	1
com.apple.driver.AudioDMAController-T8103	1.60.5
com.apple.driver.AppleS5L8960XNCO	1
com.apple.driver.AppleT8103PMGR	1
com.apple.driver.AppleS8000AES	1
com.apple.driver.AppleS8000DWI	1.0.0d1
com.apple.driver.AppleS5L8960XWatchDogTimer	1
com.apple.driver.AppleInterruptController	1.0.0d1
com.apple.driver.AppleT8020DART	1
com.apple.driver.AppleBluetoothModule	1
com.apple.driver.AppleSamsungSerial	1.0.0d1
com.apple.driver.AppleBCMWLANBusInterfacePCIe	1
com.apple.driver.AppleS5L8920XPWM	1.0.0d1
com.apple.driver.AppleS5L8940XI2C	1.0.0d2
com.apple.driver.AppleSPIMC	1
com.apple.driver.AppleT8101	1
com.apple.driver.AppleM68Buttons	1.0.0d1
com.apple.iokit.IOUserEthernet	1.0.1
com.apple.driver.usb.AppleUSBUserHCI	1
com.apple.iokit.IOKitRegistryCompatibility	1
com.apple.iokit.EndpointSecurity	1
com.apple.driver.AppleDiskImages2	1
com.apple.AppleSystemPolicy	2.0.0
com.apple.nke.applicationfirewall	311
com.apple.kec.InvalidateHmac	1
com.apple.driver.AppleActuatorDriver	4440.3
com.apple.driver.AppleHIDKeyboard	224
com.apple.driver.AppleMultitouchDriver	4440.3
com.apple.driver.AppleHSBluetoothDriver	4050.1
com.apple.driver.IOBluetoothHIDDriver	8.0.5d7
com.apple.kext.triggers	1.0
com.apple.iokit.IOAVBFamily	940.4
com.apple.plugin.IOgPTPPlugin	985.2
com.apple.iokit.IOEthernetAVBController	1.1.0
com.apple.driver.AppleMesaSEPDriver	100.99
com.apple.iokit.IOBiometricFamily	1
com.apple.driver.usb.cdc.ecm	5.0.0
com.apple.driver.usb.cdc.acm	5.0.0
com.apple.driver.usb.serial	6.0.0
com.apple.driver.usb.cdc	5.0.0
com.apple.driver.AppleUSBAudio	405.39
com.apple.iokit.IOAudioFamily	300.6.1
com.apple.vecLib.kext	1.2.0
com.apple.driver.DiskImages.KernelBacked	493.0.0
com.apple.driver.AppleSEPHDCPManager	1.0.1
com.apple.driver.AppleTrustedAccessory	1
com.apple.iokit.AppleSEPGenericTransfer	1
com.apple.driver.AppleXsanScheme	3
com.apple.driver.IOBluetoothHostControllerPCIeTransport	8.0.5d7
com.apple.iokit.IOBluetoothHostControllerTransport	8.0.5d7
com.apple.driver.AppleConvergedIPCOLYBTControl	1
com.apple.driver.AppleConvergedPCI	1
com.apple.driver.AppleBluetoothDebug	1
com.apple.driver.AppleBTM	1.0.1
com.apple.driver.usb.networking	5.0.0
com.apple.driver.AppleThunderboltPCIDownAdapter	4.1.1
com.apple.driver.AppleThunderboltUSBDownAdapter	1.0.4
com.apple.driver.AppleThunderboltDPInAdapter	8.1.4
com.apple.driver.AppleThunderboltDPAdapterFamily	8.1.4
com.apple.driver.AppleAOPAudio	16.2
com.apple.driver.AppleHIDTransportSPI	4400.35
com.apple.driver.AppleHIDTransport	4400.35
com.apple.driver.AppleInputDeviceSupport	4400.35
com.apple.driver.AppleDCPDPTXProxy	1.0.0
com.apple.driver.DCPDPFamilyProxy	1
com.apple.nke.ppp	1.9
com.apple.AGXFirmwareKextG13GRTBuddy	173.28.7
com.apple.AGXFirmwareKextRTBuddy64	173.28.7
com.apple.driver.AppleBSDKextStarter	3
com.apple.filesystems.hfs.encodings.kext	1
com.apple.driver.AppleSPU	1
com.apple.iokit.IONVMeFamily	2.1.0
com.apple.driver.AppleDiagnosticDataAccessReadOnly	1.0.0
com.apple.driver.AppleNANDConfigAccess	1.0.0
com.apple.driver.AppleStockholmControl	1.0.0
com.apple.driver.AppleHPM	3.4.4
com.apple.driver.AppleCSEmbeddedAudio	442.26
com.apple.driver.AppleEmbeddedAudio	442.26
com.apple.iokit.AppleARMIISAudio	80.34.1
com.apple.driver.AppleDialogPMU	1.0.1
com.apple.driver.ApplePassthroughPPM	3.0
com.apple.iokit.IOGPUFamily	21.11
com.apple.driver.AppleH11ANEInterface	4.75.0
com.apple.driver.AppleH13CameraInterface	3.58.0
com.apple.driver.AppleH10PearlCameraInterface	16.15.1
com.apple.driver.DCPAVFamilyProxy	1
com.apple.iokit.IOMobileGraphicsFamily-DCP	343.0.0
com.apple.driver.AppleDCP	1
com.apple.driver.AppleFirmwareKit	1
com.apple.iokit.IOMobileGraphicsFamily	343.0.0
com.apple.driver.AppleSPMI	1.0.1
com.apple.driver.AppleUSBXDCIARM	1.0
com.apple.driver.AppleUSBXDCI	1.0
com.apple.iokit.IOUSBDeviceFamily	2.0.0
com.apple.driver.usb.AppleUSBXHCIARM	1
com.apple.driver.usb.AppleUSBXHCI	1.2
com.apple.driver.AppleEmbeddedUSBHost	1
com.apple.driver.usb.AppleUSBHub	1.2
com.apple.driver.usb.AppleUSBHostCompositeDevice	1.2
com.apple.driver.AppleT8103TypeCPhy	1
com.apple.driver.AppleSART	1
com.apple.driver.ApplePMGR	1
com.apple.driver.watchdog	1
com.apple.driver.usb.AppleUSBHostPacketFilter	1.0
com.apple.driver.AppleDisplayCrossbar	1.0.0
com.apple.iokit.IODisplayPortFamily	1.0.0
com.apple.driver.AppleTypeCPhy	1
com.apple.driver.AppleThunderboltNHI	7.2.8
com.apple.driver.AppleT8103PCIeC	1
com.apple.iokit.IOThunderboltFamily	9.3.2
com.apple.driver.ApplePIODMA	1
com.apple.driver.AppleT8103PCIe	1
com.apple.driver.AppleMultiFunctionManager	1
com.apple.driver.AppleEmbeddedPCIE	1
com.apple.driver.AppleBluetoothDebugService	1
com.apple.driver.AppleBCMWLANCore	1.0.0
com.apple.iokit.IO80211FamilyV2	1200.12.2b1
com.apple.driver.IOImageLoader	1.0.0
com.apple.driver.corecapture	1.0.4
com.apple.driver.AppleMCA2-T8103	540.10
com.apple.driver.AppleGPIOICController	1.0.2
com.apple.driver.AppleFireStormErrorHandler	1
com.apple.driver.AppleMobileApNonce	1
com.apple.iokit.IOTimeSyncFamily	985.2
com.apple.driver.DiskImages	493.0.0
com.apple.iokit.IOGraphicsFamily	585.1
com.apple.iokit.IOBluetoothFamily	8.0.5d7
com.apple.iokit.IOBluetoothPacketLogger	8.0.5d7
com.apple.driver.FairPlayIOKit	68.7.1
com.apple.iokit.CoreAnalyticsFamily	1
com.apple.driver.AppleSSE	1.0
com.apple.driver.AppleSEPKeyStore	2
com.apple.driver.AppleUSBTDM	511.141.1
com.apple.iokit.IOUSBMassStorageDriver	184.140.2
com.apple.iokit.IOPCIFamily	2.9
com.apple.iokit.IOSCSIBlockCommandsDevice	436.140.1
com.apple.iokit.IOSCSIArchitectureModelFamily	436.140.1
com.apple.driver.AppleIPAppender	1.0
com.apple.driver.AppleFDEKeyStore	28.30
com.apple.driver.AppleEffaceableStorage	1.0
com.apple.driver.AppleCredentialManager	1.0
com.apple.driver.KernelRelayHost	1
com.apple.iokit.IOUSBHostFamily	1.2
com.apple.driver.AppleUSBHostMergeProperties	1.2
com.apple.driver.usb.AppleUSBCommon	1.0
com.apple.driver.AppleSMC	3.1.9
com.apple.driver.RTBuddy	1.0.0
com.apple.driver.AppleEmbeddedTempSensor	1.0.0
com.apple.driver.AppleARMPMU	1.0
com.apple.iokit.IOAccessoryManager	1.0.0
com.apple.driver.AppleOnboardSerial	1.0
com.apple.iokit.IOSkywalkFamily	1
com.apple.driver.mDNSOffloadUserClient	1.0.1b8
com.apple.iokit.IONetworkingFamily	3.4
com.apple.iokit.IOSerialFamily	11
com.apple.driver.AppleSEPManager	1.0.1
com.apple.driver.AppleA7IOP	1.0.2
com.apple.driver.IOSlaveProcessor	1
com.apple.driver.AppleBiometricSensor	2
com.apple.iokit.IOHIDFamily	2.0.0
com.apple.AUC	1.0
com.apple.iokit.IOAVFamily	1.0.0
com.apple.iokit.IOHDCPFamily	1.0.0
com.apple.iokit.IOCECFamily	1
com.apple.iokit.IOAudio2Family	1.0
com.apple.driver.AppleEmbeddedAudioLibs	1.17
com.apple.driver.AppleFirmwareUpdateKext	1
com.apple.driver.AppleM2ScalerCSCDriver	265.0.0
com.apple.iokit.IOSurface	290.8.1
com.apple.driver.IODARTFamily	1
com.apple.security.quarantine	4
com.apple.security.sandbox	300.0
com.apple.kext.AppleMatch	1.0.0d1
com.apple.driver.AppleMobileFileIntegrity	1.0.5
com.apple.kext.CoreTrust	1
com.apple.security.AppleImage4	3.0.0
com.apple.iokit.IOCryptoAcceleratorFamily	1.0.1
com.apple.driver.AppleARMPlatform	1.0.2
com.apple.iokit.IOStorageFamily	2.1
com.apple.iokit.IOSlowAdaptiveClockingFamily	1.0.0
com.apple.iokit.IOReportFamily	47
com.apple.kec.pthread	1
com.apple.kec.corecrypto	11.1
com.apple.kec.Libm	1



** Stackshot Succeeded ** Bytes Traced 326132 (Uncompressed 837568) **

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 2, 2021

To get a feeling of the experience :)

https://photos.app.goo.gl/ayoAPkeirg721Cud6

@isuruf
Copy link
Contributor

isuruf commented Sep 2, 2021

"threading_layer": "disabled"

Not sure why this is. I get "threading_layer": "pthreads" with the numpy=1.21.0 wheels

@isuruf
Copy link
Contributor

isuruf commented Sep 2, 2021

Can you try deleting scipy/.dylibs and make it a symlink to numpy/.dylibs?

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 2, 2021

Not sure why this is. I get "threading_layer": "pthreads" with the numpy=1.21.0 wheels

I had a bug in my dev version of threadpoolctl, fixed.

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 2, 2021

Can you try deleting scipy/.dylibs and make it a symlink to numpy/.dylibs?

That fixes the crash!

tmp ❯ ls -l /Users/ogrisel/miniforge3/envs/tmp/lib/python3.9/site-packages/scipy/.dylibs/libopenblas.0.dylib
lrwxr-xr-x  1 ogrisel  staff  96 Sep  3 00:53 /Users/ogrisel/miniforge3/envs/tmp/lib/python3.9/site-packages/scipy/.dylibs/libopenblas.0.dylib -> /Users/ogrisel/miniforge3/envs/tmp/lib/python3.9/site-packages/numpy/.dylibs/libopenblas.0.dylib

~
tmp ❯ python repro_crash.py
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.302 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.042 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.019 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.018 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.020 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.018 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.016 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.016 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.019 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.021 s

So indeed the 64 bit integer build of OpenBLAS has a problem. But why would that cause a kernel panic is a mystery...

@isuruf
Copy link
Contributor

isuruf commented Sep 2, 2021

It's not the 64 bit integer build right? You tried with numpy=1.21.0 which was not the 64-bit integer build.

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 2, 2021

I am confused. The problem does not appear with numpy 1.21.0 and the symlink pointing to it in scipy's .dylibs.

tmp ❯ pip show numpy scipy
Name: numpy
Version: 1.21.0
Summary: NumPy is the fundamental package for array computing with Python.
Home-page: https://www.numpy.org
Author: Travis E. Oliphant et al.
Author-email: 
License: BSD
Location: /Users/ogrisel/miniforge3/envs/tmp/lib/python3.9/site-packages
Requires: 
Required-by: scipy, scikit-learn
---
Name: scipy
Version: 1.8.0.dev0+1675.774941b
Summary: SciPy: Scientific Library for Python
Home-page: https://www.scipy.org
Author: 
Author-email: 
License: BSD
Location: /Users/ogrisel/miniforge3/envs/tmp/lib/python3.9/site-packages
Requires: numpy
Required-by: scikit-learn

~
tmp ❯ python -m threadpoolctl -i scipy
[
  {
    "filepath": "/Users/ogrisel/miniforge3/envs/tmp/lib/python3.9/site-packages/numpy/.dylibs/libopenblas.0.dylib",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": "0.3.13.dev",
    "num_threads": 8,
    "threading_layer": "pthreads",
    "architecture": "armv8"
  }

I will re-try with numpy 1.21.2 and the symlink next.

@isuruf
Copy link
Contributor

isuruf commented Sep 2, 2021

I will re-try with numpy 1.22.0 and the symlink next.

This wouldn't work as the two libraries are ABI incompatible.

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 2, 2021

No problem with numpy 1.21.2 and the symlink either.

@isuruf
Copy link
Contributor

isuruf commented Sep 2, 2021

1.21.0 and 1.21.2 both have 32-bit integer wheels and only 1.22.0.dev0 has a 64-bit integer wheel build.

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 2, 2021

I will re-try with numpy 1.22.0 and the symlink next.

This wouldn't work as the two libraries are ABI incompatible.

Indeed that is not possible with numpy-1.22.0.dev0+949.ga90677a0b.

@isuruf
Copy link
Contributor

isuruf commented Sep 2, 2021

Can you narrow down which library is causing this? Start with a new env and create a symlink only for libgfortran.5.dylib and libgcc_s.2.dylib and see if that crashes too. If it does libopenblas.dylib is to blame

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 3, 2021

I reinstalled a new env with the nightly builds: numpy 1.22.0.dev0+949.ga90677a0b and scipy 1.8.0.dev0+1675.774941b, I replaced scipy/.dylibs/libgfortran.5.dylib by a symlink pointing to numpy/.dylibs/libgfortran.5.dylib and I still get the crash.

Edit: same for libgcc_s.2.dylib: still crash.

I will check again with numpy 1.21.2 and scipy 1.8.0.dev0+1675.774941b.

Edit: symlinking libgfortran.5.dylib and libgcc_s.2.dylib only does not fix the crash. Only forcing scipy to to use numpy 1.21.2's libopenblas.0.dylib fixes the problem (and makes the code run as fast with 8 threads as with 4 threads).

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 3, 2021

Just in case, I archived a copy of the numpy and scipy nightly build wheels here:

https://github.com/ogrisel/tmpblobs/releases/tag/m1-dev-wheels

@glemaitre
Copy link
Contributor

I can reproduce it on my MacBookAir as well, following the above instruction.

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 6, 2021

BTW, I also reported the macOS kernel panic to Apple as https://feedbackassistant.apple.com/feedback/9594377 (probably only visible by the Apple support team) but did not receive any reply yet.

@rgommers
Copy link
Member

rgommers commented Sep 7, 2021

@charris FYI there's an issue with the 64-bit OpenBLAS build for macOS arm64. I kind of missed the whole move to 64-bit OpenBLAS, was there a driver beyond "looks like a good idea"?

@rgommers rgommers added this to the 1.7.2 milestone Sep 9, 2021
@V0lantis
Copy link
Contributor

Running the tests on Master returned errors on linalg as well :

../../scipy/sparse/linalg/eigen/tests/test_svds.py:260: in test_svds_parameter_tol
    assert error < accuracy
E   assert 1.4567177559457185e-15 < 1e-15
        A          = <100x100 sparse matrix of type '<class 'numpy.float64'>'
	with 6624 stored elements in Compressed Sparse Column format>
        _          = array([[-0.06798121, -0.06969187, -0.1634784 , ..., -0.08870255,
        -0.08289897, -0.07725935],
       [-0.0130691...722,  0.12174589],
       [ 0.03737577,  0.07600953, -0.04798564, ..., -0.08283926,
        -0.04644228,  0.04356348]])
        accuracies = {'arpack': [1e-15, 1e-10, 1e-10], 'lobpcg': [1e-11, 0.001, 10], 'propack': [1e-12, 1e-06, 0.0001]}
        accuracy   = 1e-15
        err        = <function SVDSCommonTests.test_svds_parameter_tol.<locals>.err at 0x10e2858b0>
        error      = 1.4567177559457185e-15
        k          = 3
        n          = 100
        rng        = Generator(PCG64) at 0x13381E2E0
        s          = array([3.57532733e-01, 1.27553875e-01, 1.19870964e-01, 1.15829350e-01,
       1.12545658e-01, 1.09114116e-01, 1.028058...2.69065672e-04, 1.72149810e-04, 1.65442589e-04,
       7.44147686e-05, 3.32524683e-05, 2.12349114e-06, 1.50715281e-07])
        self       = <scipy.sparse.linalg.eigen.tests.test_svds.Test_SVDS_ARPACK object at 0x10e2c0460>
        tol        = 0.0001
        tols       = [0.0001, 0.01, 1.0]
========================================================================================================== short test summary info ===========================================================================================================
FAILED ../../scipy/sparse/linalg/eigen/tests/test_svds.py::Test_SVDS_ARPACK::test_svds_parameter_tol - assert 1.4567177559457185e-15 < 1e-15

I am running on MacBook Pro Core I7 with :

SciPy/NumPy/Python version information

1.8.0.dev0+1741.b5653b2 1.21.0 sys.version_info(major=3, minor=9, micro=6, releaselevel='final', serial=0)

Does it have something to do with Openblas?

@rgommers
Copy link
Member

@V0lantis there are more linalg issues, but that's different from this crash. Can you please open a new issue?

The kernel panic should be fixed in macOS 12. Which is great, but of course doesn't yet help users now when we release arm64 wheels. So we should ensure that the test suite doesn't trigger this issue. @ogrisel can you confirm that this is the situation:

  • the SciPy test suite does not trigger the bug
  • the scikit-learn test suite does, and hence something needs to be skipped/xfailed there (or that's already done?)

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 30, 2021

I did run the full scipy suite yet. Let me try to do it now... Hopefully it won't crash my laptop while working on something else.

We can probably try to find all the scikit-learn tests that cause this and skip them. But if there are more than a few of them, tracking them all will be very very painful.

And furthermore releasing such a scipy could cause panics on other systems outside of the scikit-learn user base.

However if the wheel filename is specialized to be scipy-1.8.0-cp39-cp39m-macosx_12_0_arm64.whl instead of the usual scipy-1.8.0-cp39-cp39m-macosx_10_9_arm64.whl then it should not be picked by pip on systems with macOS 11 or below.

@ogrisel
Copy link
Contributor Author

ogrisel commented Sep 30, 2021

The panic also happens when running the scipy suite. I ran the tests with pytest -v --pyargs scipy 2>&1 | tee scipy_test.log to record to disk the list of the tests that executed successfully before trigger a system reboot. The result is here:

https://gist.github.com/ogrisel/fe230e33b81f9e910ad521a61eef1cb6

So the culprit is likely the test that comes after scipy/fftpack/tests/test_basic.py::TestOverwrite::test_fftn_ifftn[shape1-axes1-False-complex64] if pytest collection ordering is deterministic.

It's too time consuming for me to try to "bisect" the potentially culprits in the scipy test suite because I have the feeling it would prevent me to work on anything else for several hours.

@ogrisel
Copy link
Contributor Author

ogrisel commented Oct 22, 2021

Both scipy and numpy were installed using homebrew as well.

@sunilshah can you please install threadpoolctl and run python -m threadpoolctl -i scipy? I suspect that both scipy and numpy link to the same openblas dylib in which case the problem disappears (as in the case with conda-forge).

The problem seems only related to wheel packaging.

@ogrisel
Copy link
Contributor Author

ogrisel commented Oct 22, 2021

Here is a version that changes the number of threads interactively during the execution of the code:

>>> from time import perf_counter
... import numpy as np
... from scipy.sparse.linalg import eigsh
... from threadpoolctl import threadpool_limits
... 
... 
... n_samples, n_features = 2000, 10
... rng = np.random.default_rng(0)
... X = rng.normal(size=(n_samples, n_features))
... K = X @ X.T
... 
... for n_threads in range(1, 9):
...     print(f"using blas with {n_threads=}")
...     with threadpool_limits(limits={"blas": n_threads}):
...         for i in range(2):
...             print("running eigsh...")
...             tic = perf_counter()
...             s, _ = eigsh(K, 3, which="LA", tol=0)
...             toc = perf_counter()
...             print(f"computed {s} in {toc - tic:.3f} s")
... 
using blas with n_threads=1
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.034 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.019 s
using blas with n_threads=2
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.044 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.034 s
using blas with n_threads=3
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.084 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.045 s
using blas with n_threads=4
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.022 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.027 s
using blas with n_threads=5
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.093 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.091 s
using blas with n_threads=6
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.521 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.506 s
using blas with n_threads=7
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 1.236 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.938 s
using blas with n_threads=8
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 1.471 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 1.900 s

So there is a performance degradation that starts with n_threads = 5 and increases progressively with the number of threads to be catastrophic with n_threads = 8 (which is the default for OpenBLAS on this machine).

@rgommers since the original kernel panic is fixed in macOS12, do you want to close this issue and relase apple m1 wheel only for macOS 12+?

I can open a dedicated issue for the performance problem that remains with large number of threads.

@sunilshah
Copy link

Both scipy and numpy were installed using homebrew as well.

@sunilshah can you please install threadpoolctl and run python -m threadpoolctl -i scipy? I suspect that both scipy and numpy link to the same openblas dylib in which case the problem disappears (as in the case with conda-forge).

The problem seems only related to wheel packaging.

 %python3 -m threadpoolctl -i scipy
[
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/opt/homebrew/Cellar/openblas/0.3.18/lib/libopenblasp-r0.3.18.dylib",
    "version": "0.3.18",
    "threading_layer": "openmp",
    "architecture": "armv8",
    "num_threads": 8
  },
  {
    "user_api": "openmp",
    "internal_api": "openmp",
    "prefix": "libgomp",
    "filepath": "/opt/homebrew/Cellar/gcc/11.2.0/lib/gcc/11/libgomp.1.dylib",
    "version": null,
    "num_threads": 8
  }
]

% python3 -m threadpoolctl -i numpy      
[
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/opt/homebrew/Cellar/openblas/0.3.18/lib/libopenblasp-r0.3.18.dylib",
    "version": "0.3.18",
    "threading_layer": "openmp",
    "architecture": "armv8",
    "num_threads": 8
  },
  {
    "user_api": "openmp",
    "internal_api": "openmp",
    "prefix": "libgomp",
    "filepath": "/opt/homebrew/Cellar/gcc/11.2.0/lib/gcc/11/libgomp.1.dylib",
    "version": null,
    "num_threads": 8
  }
]


On my configuration, I do not see the same performance change with number for threads that you see in your example.

python3 scipy_perf.py 
using blas with n_threads=1
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.018 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.018 s
using blas with n_threads=2
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.014 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.014 s
using blas with n_threads=3
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.015 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.014 s
using blas with n_threads=4
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.015 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.015 s
using blas with n_threads=5
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.016 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.015 s
using blas with n_threads=6
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.016 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.016 s
using blas with n_threads=7
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.016 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.017 s
using blas with n_threads=8
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.017 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.017 s

@ogrisel
Copy link
Contributor Author

ogrisel commented Oct 25, 2021

I just realized that python -m threadpoolctl -i scipy does not trigger the dynlink to the OpenBLAS from scipy since it only happens when importing the scipy.linalg subpackage. If I do this in the env with the scipy dev wheel that causes the slowdown I indeed see the 2 openblas files linked to that process as expected:

❯ python -m threadpoolctl -i scipy.linalg
[
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/Users/ogrisel/mambaforge/envs/tmp/lib/python3.9/site-packages/numpy/.dylibs/libopenblas64_.0.dylib",
    "version": "0.3.18",
    "threading_layer": "pthreads",
    "architecture": "armv8",
    "num_threads": 8
  },
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/Users/ogrisel/mambaforge/envs/tmp/lib/python3.9/site-packages/scipy/.dylibs/libopenblas.0.dylib",
    "version": "0.3.17",
    "threading_layer": "pthreads",
    "architecture": "armv8",
    "num_threads": 8
  }
]

@rgommers rgommers modified the milestones: 1.7.2, 1.7.3 Nov 6, 2021
@rgommers
Copy link
Member

rgommers commented Nov 14, 2021

I'm trying to look at this again last-minute and figure out whether it's a problem if NumPy 1.22.0 ships the 64-bit libopenblas. The NumPy 1.21.4 wheels are up now and have M1 support, so we can test those:

% conda create -n m1-crash python=3.9
...
% conda activate m1-crash
...
% pip install numpy==1.21.4
...
  Downloading numpy-1.21.4-cp39-cp39-macosx_11_0_arm64.whl (12.4 MB)

% pip install --pre -i https://pypi.anaconda.org/scipy-wheels-nightly/simple scipy
...
  Downloading https://pypi.anaconda.org/scipy-wheels-nightly/simple/scipy/1.8.0.dev0%2B1984.f1c6b54/scipy-1.8.0.dev0%2B1984.f1c6b54-cp39-cp39-macosx_11_0_arm64.whl (91.4 MB)
...
Successfully installed scipy-1.8.0.dev0+1984.f1c6b54

% pip install threadpoolctl
...
Successfully installed threadpoolctl-3.0.0

% python -m threadpoolctl -i scipy.linalg
[
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/Users/rgommers/mambaforge/envs/m1-crash/lib/python3.9/site-packages/numpy/.dylibs/libopenblas.0.dylib",
    "version": "0.3.17",
    "threading_layer": "pthreads",
    "architecture": "armv8",
    "num_threads": 8
  },
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/Users/rgommers/mambaforge/envs/m1-crash/lib/python3.9/site-packages/scipy/.dylibs/libopenblas.0.dylib",
    "version": "0.3.17",
    "threading_layer": "pthreads",
    "architecture": "armv8",
    "num_threads": 8
  }
]

And yes indeed, the code sample in the issue description causes a kernel panic on my arm64 macOS 11.6.

% cd /path/to/site-packages/scipy/linalg
% otool -L _fblas.cpython-39-darwin.so 
_fblas.cpython-39-darwin.so:
	@loader_path/../.dylibs/libopenblas.0.dylib (compatibility version 0.0.0, current version 0.0.0)
	@loader_path/../.dylibs/libgfortran.5.dylib (compatibility version 6.0.0, current version 6.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.60.1)

% cd ../../numpy/linalg
% otool -L _umath_linalg.cpython-39-darwin.so 
_umath_linalg.cpython-39-darwin.so:
	@loader_path/../.dylibs/libopenblas.0.dylib (compatibility version 0.0.0, current version 0.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.60.1)

EDIT 2: trying with a conda-forge env with Python, compilers and openblas in it; then pip install scipy which pulls in the numpy 1.21.4 wheel and builds scipy from source. There are still 2 openblas versions then, but the reproducer does not crash (but it does run really slowly, similar to what @ogrisel found for the scipy-nightly wheel on macOS 12.0):

% otool -L _fblas.cpython-39-darwin.so
_fblas.cpython-39-darwin.so:
	@rpath/libopenblas.0.dylib (compatibility version 0.0.0, current version 0.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.100.5)
(m1-crash2) % python -m threadpoolctl -i scipy.linalg
[
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/Users/rgommers/mambaforge/envs/m1-crash2/lib/python3.9/site-packages/numpy/.dylibs/libopenblas.0.dylib",
    "version": "0.3.17",
    "threading_layer": "pthreads",
    "architecture": "armv8",
    "num_threads": 8
  },
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/Users/rgommers/mambaforge/envs/m1-crash2/lib/libopenblas_vortexp-r0.3.18.dylib",
    "version": "0.3.18",
    "threading_layer": "openmp",
    "architecture": "VORTEX",
    "num_threads": 8
  },
  {
    "user_api": "openmp",
    "internal_api": "openmp",
    "prefix": "libomp",
    "filepath": "/Users/rgommers/mambaforge/envs/m1-crash2/lib/libomp.dylib",
    "version": null,
    "num_threads": 8
  }
]
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 1.062 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.888 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 1.192 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 1.218 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 1.295 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 1.194 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 1.054 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 1.096 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 1.297 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 1.202 s

@rgommers
Copy link
Member

@rgommers since the original kernel panic is fixed in macOS12, do you want to close this issue and relase apple m1 wheel only for macOS 12+?

Hmm, that would be non-ideal - but macOS 12.0 is available since 25 Oct 2021, so it does make sense.

The alternative here would be to remove the shipped libopenblas from the wheel and somehow reuse the NumPy one. But that's perhaps fragile, and would not work at all if we'd say mix a pip-installed scipy with a system/conda-installed numpy (not recommended, but that works fine in practice today).

I am going to mention this plan on the mailing list, then wait a few days and upgrade to 12.0 so I can test locally.

I can open a dedicated issue for the performance problem that remains with large number of threads.

Yes, that sounds good, please do!

And let's keep this issue open until we have actually released some wheels.

@rgommers
Copy link
Member

So there is a performance degradation that starts with n_threads = 5 and increases progressively with the number of threads to be catastrophic with n_threads = 8 (which is the default for OpenBLAS on this machine).

This is still a concern right? We don't have a threadpoolctl dependency for SciPy, so do you think users are really well-served by 12.0 wheels with massive slowdowns when calling linalg functions?

@isuruf
Copy link
Contributor

isuruf commented Nov 14, 2021

The alternative here would be to remove the shipped libopenblas from the wheel and somehow reuse the NumPy one.

If 64-bit indexed openblas numpy wheels are shipped, wouldn't that require scipy to support 64-bit indexed blas operations?

@rgommers
Copy link
Member

If 64-bit indexed openblas numpy wheels are shipped, wouldn't that require scipy to support 64-bit indexed blas operations?

Yes, but we can still decide to not do that for NumPy 1.22.0. RC1 is coming out within a couple of days, that's why I was refreshing my memory on this issue. But either way, I don't think it's healthy to try to reuse the openblas bundled with numpy.

So there is a performance degradation that starts with n_threads = 5 and increases progressively with the number of threads to be catastrophic with n_threads = 8 (which is the default for OpenBLAS on this machine).

This is still a concern right? We don't have a threadpoolctl dependency for SciPy, so do you think users are really well-served by 12.0 wheels with massive slowdowns when calling linalg functions?

We may be able to get away with setting OPENBLAS_NUM_THREADS=4 in numpy/__init__.py and scipy/__init__.py as a temporary hack?

@rgommers
Copy link
Member

Or alternatively, we rebuild arm64 OpenBLAS binaries for max 4 threads?

(build config is at https://github.com/MacPython/openblas-libs/blob/master/.github/workflows/multibuild.yml)

@rgommers
Copy link
Member

Or alternatively, we rebuild arm64 OpenBLAS binaries for max 4 threads?

Which of course is pessimistic - it's not clear to me if the limit should be set to half the number of cores, the number of performance cores, or if even that won't be reliable. There's a number of CPU variants out already:

  • M1 (13"): 4 performance and 4 efficiency cores
  • M1 Pro (14"): 6 or 8 performance and 2 efficiency cores
  • M1 Pro/Max (16"): 8 performance and 2 efficiency cores

@rgommers
Copy link
Member

This is still a concern right? We don't have a threadpoolctl dependency for SciPy, so do you think users are really well-served by 12.0 wheels with massive slowdowns when calling linalg functions?

Tried an install from source, and this slowdown then happens as well (see #14688 (comment)). Just not the crash when building from source even though there are 2 openblas's; picking up the conda-forge openblas is just different enough apparently, or the linking using @rpath/libopenblas instead of @loader_path/../.dylibs/libopenblas helps, or it matter that it's using "threading_layer": "openmp" rather than both versions using pthreads.

@sunilshah
Copy link

sunilshah commented Nov 15, 2021

@rgommers

Are you restricting your comments to Conda installs?

My experience with Homebrew install is quite different from yours on Conda. On my M1 Mac-mini on Mac OS 12.0.1, with pip installed scipy 1.7.2, and numpy 1.21.4, python 3.9.8, I see little slow down when increasing the number of threads to 8 in the example code posted by @ogrisel :

python3 scipy_perf.py 
using blas with n_threads=1
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.028 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.024 s
using blas with n_threads=2
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.029 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.028 s
using blas with n_threads=3
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.033 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.021 s
using blas with n_threads=4
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.027 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.019 s
using blas with n_threads=5
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.027 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.022 s
using blas with n_threads=6
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.023 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.026 s
using blas with n_threads=7
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.024 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.023 s
using blas with n_threads=8
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.025 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.026 s

python3 -m threadpoolctl -i scipy.linalg
[
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/opt/homebrew/lib/python3.9/site-packages/numpy/.dylibs/libopenblas.0.dylib",
    "version": "0.3.17",
    "threading_layer": "pthreads",
    "architecture": "armv8",
    "num_threads": 8
  },
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/opt/homebrew/Cellar/openblas/0.3.18/lib/libopenblasp-r0.3.18.dylib",
    "version": "0.3.18",
    "threading_layer": "openmp",
    "architecture": "armv8",
    "num_threads": 8
  },
  {
    "user_api": "openmp",
    "internal_api": "openmp",
    "prefix": "libgomp",
    "filepath": "/opt/homebrew/Cellar/gcc/11.2.0_1/lib/gcc/11/libgomp.1.dylib",
    "version": null,
    "num_threads": 8
  }
]


IMHO it is best to not hardwire max threads in the code. Because the kernel panic is gone, optimization of threads should be left to the users with some guidance. It is surely very dependent upon the example and the system configuration / test conditions.

@rgommers
Copy link
Member

Are you restricting your comments to Conda installs?

That's what I have tested so far. Thanks for pointing out that Homebrew is different. I still don't fully understand what is going on under the hood here.

Can you add the output of otool -L /PATH/TO/scipy/linalg/_fblas.cpython-*-darwin.so?

IMHO it is best to not hardwire max threads in the code. Because the kernel panic is gone, optimization of threads should be left to the users with some guidance.

I don't think we want the default to be 50x slower because in some circumstances we may get a 30-50% (?) speedup for more threads. The last few cores added are the efficiency cores, so they don't help much anyway (they may even hurt). Once we release wheels, that's what 99% of users will use, not from-source builds.

@rgommers
Copy link
Member

Did some timing of the whole test suite:

% export OPENBLAS_NUM_THREADS=4
% python -c "import scipy as s; s.test(extra_argv=['--durations=20'])"

gives:

8.97s call     tests/test_basic.py::TestLstsq::test_random_complex_exact
5.74s call     tests/test_basic.py::TestLstsq::test_random_exact
5.34s call     tests/test_decomp_ldl.py::test_ldl_type_size_combinations
4.27s call     _lib/tests/test_import_cycles.py::test_modules_importable
3.60s call     optimize/tests/test_optimize.py::test_cobyla_threadsafe
3.02s call     optimize/tests/test_lsq_linear.py::TestTRF::test_large_rank_deficient
3.00s call     optimize/tests/test_lsq_linear.py::TestBVLS::test_large_rank_deficient
2.77s call     stats/tests/test_continuous_basic.py::test_cont_basic[500-200-wrapcauchy-arg106]
1.83s call     optimize/tests/test_least_squares.py::TestTRF::test_with_bounds
1.32s call     io/tests/test_wavfile.py::test_write_roundtrip
1.19s call     tests/test_decomp.py::TestOrdQZWorkspaceSize::test_decompose
0.98s call     optimize/tests/test__differential_evolution.py::TestDifferentialEvolutionSolver::test_parallel
0.98s call     optimize/tests/test__differential_evolution.py::TestDifferentialEvolutionSolver::test_L4
0.95s call     optimize/tests/test__differential_evolution.py::TestDifferentialEvolutionSolver::test_L1
0.92s call     optimize/tests/test_lsq_linear.py::TestTRF::test_sparse_bounds
0.92s call     stats/tests/test_distributions.py::TestStudentizedRange::test_cdf_against_tables
0.91s call     signal/tests/test_signaltools.py::test_filtfilt_gust
0.85s call     signal/tests/test_signaltools.py::test_choose_conv_method
0.77s call     spatial/tests/test_kdtree.py::test_kdtree_box[KDTree]
0.77s call     spatial/tests/test_kdtree.py::test_kdtree_box[cKDTree]
============================================================== short test summary info ==============================================================
... (a few failures left) ...
================= 8 failed, 32064 passed, 2583 skipped, 11134 deselected, 106 xfailed, 8 xpassed, 75 warnings in 241.64s (0:04:01) ==================

With export OPENBLAS_NUM_THREADS=1:

6.17s call     tests/test_basic.py::TestLstsq::test_random_exact
5.57s call     tests/test_decomp_ldl.py::test_ldl_type_size_combinations
4.14s call     tests/test_basic.py::TestLstsq::test_random_complex_exact
4.08s call     _lib/tests/test_import_cycles.py::test_modules_importable
3.93s call     optimize/tests/test_lsq_linear.py::TestTRF::test_large_rank_deficient
3.79s call     optimize/tests/test_lsq_linear.py::TestBVLS::test_large_rank_deficient
3.52s call     optimize/tests/test_optimize.py::test_cobyla_threadsafe
2.78s call     stats/tests/test_continuous_basic.py::test_cont_basic[500-200-wrapcauchy-arg106]
1.27s call     io/tests/test_wavfile.py::test_write_roundtrip
1.13s call     tests/test_decomp.py::TestOrdQZWorkspaceSize::test_decompose
0.99s call     optimize/tests/test__differential_evolution.py::TestDifferentialEvolutionSolver::test_L4
0.96s call     optimize/tests/test_least_squares.py::TestTRF::test_with_bounds
0.96s call     optimize/tests/test__differential_evolution.py::TestDifferentialEvolutionSolver::test_L1
0.92s call     optimize/tests/test_lsq_linear.py::TestTRF::test_sparse_bounds
0.91s call     stats/tests/test_distributions.py::TestStudentizedRange::test_cdf_against_tables
0.91s call     signal/tests/test_signaltools.py::test_filtfilt_gust
0.85s call     signal/tests/test_signaltools.py::test_choose_conv_method
0.78s call     spatial/tests/test_kdtree.py::test_kdtree_box[KDTree]
0.78s call     spatial/tests/test_kdtree.py::test_kdtree_box[cKDTree]
0.70s call     signal/tests/test_filter_design.py::TestBessel::test_fs_param
... (a few failures left) ...
================= 8 failed, 32064 passed, 2583 skipped, 11134 deselected, 106 xfailed, 8 xpassed, 75 warnings in 234.97s (0:03:54) ==================

With unset OPENBLAS_NUM_THREADS:

16.39s call     tests/test_basic.py::TestLstsq::test_random_complex_exact
14.94s call     optimize/tests/test_least_squares.py::TestTRF::test_with_bounds
8.57s call     optimize/tests/test_lsq_linear.py::TestBVLS::test_large_rank_deficient
7.57s call     optimize/tests/test_lsq_linear.py::TestTRF::test_large_rank_deficient
6.57s call     _lib/tests/test_import_cycles.py::test_modules_importable
5.74s call     tests/test_basic.py::TestLstsq::test_random_exact
5.37s call     tests/test_decomp_ldl.py::test_ldl_type_size_combinations
3.50s call     optimize/tests/test_optimize.py::test_cobyla_threadsafe
2.92s call     optimize/tests/test_least_squares.py::TestTRF::test_numerical_jac
2.77s call     stats/tests/test_continuous_basic.py::test_cont_basic[500-200-wrapcauchy-arg106]
1.84s call     optimize/tests/test_least_squares.py::TestTRF::test_bvp
1.32s call     io/tests/test_wavfile.py::test_write_roundtrip
1.17s call     tests/test_decomp.py::TestOrdQZWorkspaceSize::test_decompose
1.13s call     optimize/tests/test_least_squares.py::TestTRF::test_solver_selection
1.11s call     optimize/tests/test__differential_evolution.py::TestDifferentialEvolutionSolver::test_parallel
0.99s call     sparse/linalg/eigen/lobpcg/tests/test_lobpcg.py::test_hermitian
0.98s call     optimize/tests/test_lsq_linear.py::TestTRF::test_sparse_bounds
0.98s call     optimize/tests/test__differential_evolution.py::TestDifferentialEvolutionSolver::test_L4
0.95s call     optimize/tests/test__differential_evolution.py::TestDifferentialEvolutionSolver::test_L1
0.92s call     stats/tests/test_distributions.py::TestStudentizedRange::test_cdf_against_tables
... (a few failures left) ...
================= 8 failed, 32064 passed, 2583 skipped, 11134 deselected, 106 xfailed, 8 xpassed, 75 warnings in 311.94s (0:05:11) ==================

Some comments:

  • This is with the conda-forge build + numpy 1.21.4 wheel on macOS 11.6. Should be repeated with the SciPy pre-release wheels on macOS 12.0
  • need to do this with pytest-xdist not installed, to avoid oversubscription issues when running tests in parallel.
  • Timing may vary, since the slowdowns on linalg code has quite some variation in it.

Overall conclusion: disabling parallelism in OpenBLAS completely seems best in this situation, both for the slowest tests and for overall runtime of the test suite.

@sunilshah
Copy link

sunilshah commented Nov 15, 2021

Are you restricting your comments to Conda installs?

That's what I have tested so far. Thanks for pointing out that Homebrew is different. I still don't fully understand what is going on under the hood here.

As long as what you do does not change the behavior of Homebrew / pip installed Scipy, I personally have no issues.
Making changes to code or install defaults without understanding what is going on seems risky. I am sure you will agree that multi-core performance testing and optimization are nontrivial for scientific computing applications.

Can you add the output of otool -L /PATH/TO/scipy/linalg/_fblas.cpython-*-darwin.so?

otool -L  /opt/homebrew/lib/python3.9/site-packages/scipy/linalg/_fblas.cpython-*-darwin.so
/opt/homebrew/lib/python3.9/site-packages/scipy/linalg/_fblas.cpython-39-darwin.so:
	/opt/homebrew/opt/openblas/lib/libopenblas.0.dylib (compatibility version 0.0.0, current version 0.0.0)
	/opt/homebrew/opt/gcc/lib/gcc/11/libgfortran.5.dylib (compatibility version 6.0.0, current version 6.0.0)
	/opt/homebrew/opt/gcc/lib/gcc/11/libgcc_s.1.1.dylib (compatibility version 1.0.0, current version 1.1.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.100.5)

What does this tell you?

IMHO it is best to not hardwire max threads in the code. Because the kernel panic is gone, optimization of threads should be left to the users with some guidance.

I don't think we want the default to be 50x slower because in some circumstances we may get a 30-50% (?) speedup for more threads. The last few cores added are the efficiency cores, so they don't help much anyway (they may even hurt). Once we release wheels, that's what 99% of users will use, not from-source builds.

It is best to refrain from making changes until you understand the reasons for the slow down in your tests and configuration as they may slow codes that are tuned by others in different circumstances. For example, I have specific tuned application codes that give me 2-10x performance gains on multicore machines. Your conclusion about disabling parallelism completely in OpenBLAS seems a bit drastic based upon the limited test suite results.

In any case, do allow users the option of setting thread count using both threadpoolctl and environment variables.

I agree with @ogrisel that performance issue is distinct from kernel panic which happened even without python/scipy. Wheel packaging for Conda and performance tuning are best done as a separate new issue.

@rgommers
Copy link
Member

Making changes to code or install defaults without understanding what is going on seems risky.

Yes, still trying to get a better understanding of what the root cause is here.

I am sure you will agree that multi-core performance testing and optimization are nontrivial for scientific computing applications.

Yes, no argument there:)

What does this tell you?

I wanted to see (a) if Homebrew openblas has the same version numbers (and it does, all 0's) and (b) how it differs from conda's openblas. The answer to the latter seems to be that, besides different compilers of course, conda uses @rpath for loading while in Homebrew the location is hardcoded to an absolute path.

In any case, do allow users the option of setting thread count using both threadpoolctl and environment variables.

Yes agreed, whatever we do for a default, there should be a way to override it.

Wheel packaging for Conda

There's no such thing. We're building one wheel per Python version, and it goes onto PyPI. And that wheel can then be installed into any Python install/environment.

as a separate new issue

Yes, I think a new issue with a good summary will be useful, because this issue is getting pretty long.

@rgommers
Copy link
Member

I opened a new issue for the performance problems: gh-15050.

Closing this because the problem is fixed in macOS >=12.0.1, thanks @ogrisel, @isuruf and everyone else who provided input!

@sunilshah
Copy link

1.7.3 wheel slows down my performance on M1 to the same as reported on Conda-forge 1.7.2. So, whatever was done does considerably worsens the performance.

On 1.7.3 scipy wheel for PyPI

%python3 scipy_perf.py     
using blas with n_threads=1
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.037 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.026 s
using blas with n_threads=2
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.024 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.021 s
using blas with n_threads=3
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.068 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.046 s
using blas with n_threads=4
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.041 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.019 s
using blas with n_threads=5
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.067 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.071 s
using blas with n_threads=6
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.389 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.364 s
using blas with n_threads=7
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.793 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.777 s
using blas with n_threads=8
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 1.150 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 1.226 s


Downgrading back to 1.7.2,


 %pip3 install -U --no-use-pep517 scipy==1.7.2

python3 scipy_perf.py
using blas with n_threads=1
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.030 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.030 s
using blas with n_threads=2
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.024 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.063 s
using blas with n_threads=3
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.019 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.016 s
using blas with n_threads=4
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.022 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.056 s
using blas with n_threads=5
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.019 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.019 s
using blas with n_threads=6
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.031 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.028 s
using blas with n_threads=7
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.059 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.024 s
using blas with n_threads=8
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.034 s
running eigsh...
computed [2096.59480134 2144.21662824 2188.11799492] in 0.022 s

grlee77 added a commit to grlee77/scikit-image that referenced this issue Nov 30, 2021
Python 3.8-3.10 only (as was done for SciPy)

use Python 3.9 for the Mac OS builds

arm64 case needs to downloads the arm64 version of libomp
skip universal2 for now until universal2 libomp is available

update MACOSX_DEPLOYMENT_TARGET to 12.0 for arm64
This follows SciPy (scipy/scipy#14688)
grlee77 added a commit to grlee77/scikit-image that referenced this issue Nov 30, 2021
Python 3.8-3.10 only (as was done for SciPy)

use Python 3.9 for the Mac OS builds

arm64 case needs to downloads the arm64 version of libomp
skip universal2 for now until universal2 libomp is available

update MACOSX_DEPLOYMENT_TARGET to 12.0 for arm64
This follows SciPy (scipy/scipy#14688)
grlee77 added a commit to grlee77/scikit-image that referenced this issue Nov 30, 2021
Python 3.8-3.10 only (as was done for SciPy)

use Python 3.9 for the Mac OS builds

arm64 case needs to downloads the arm64 version of libomp
skip universal2 for now until universal2 libomp is available

update MACOSX_DEPLOYMENT_TARGET to 12.0 for arm64
This follows SciPy (scipy/scipy#14688)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected
Projects
None yet
Development

No branches or pull requests

7 participants