Apple Silicon / ARM64 notes #2333
Comments
Are there any known methods of building NumPy/SciPy from source using Accelerate to avoid the mismatch? (I think support might have been dropped altogether.) |
When I was looking for it recently, I only found that NumPy dropped support for Accelerate some time ago because it only provided an old LAPACK version. This was pre-M1 as far as I know. |
One thing I noticed while investigating this a bit: it looks like cblas might be Apple's own BLAS implementation (https://developer.apple.com/documentation/accelerate/blas), but I'm not really sure if this is the case or how it relates to the Accelerate framework. This is beyond my knowledge base. Perhaps someone else knows and can elaborate. |
Apple does not seem to be interested in enabling open source developers to utilize their silicon, to the point where people need to reverse engineer its libraries to find out what undocumented instructions they use to perform fast math operations. I find that rather anti-consumer and anti-competitive, to put it mildly. |
That's both interesting and unfortunate. Interesting in the sense that whatever the computational backend for the conda-forge build is, it appears to be different from the PyPI build. (Of course, there are probably a whole host of other factors in the build process that I'm not aware of that could also influence the runtime speed. Maybe a difference in the compiler used could matter if the number of machine instructions at the end of the compilation is different. I don't know.) It's unfortunate that Apple is now making great CPUs but is not making it easy for developers to fully utilize their potential. I hope that changes, or that at least people can find workarounds. Apple's advertising seems mostly geared towards video editors and photographers, but the chips have great potential for certain scientific computing workloads. It's a shame Apple doesn't lean into other parts of the market more. |
I was somewhat interested in the new M1 Max due to the very high memory BW (~8 channels of DDR4), but that interest was soon tempered by the discovery that the CPU cores cannot use more than half of the total BW due to some, yet again undocumented, internal bus bottleneck. |
That does make some sense. And by this, I don't mean that it's ideal, just that it appears consistent with what we know about the chips. The main differences between the M1 Pro and Max are not the CPU itself (unless you count the 8-core binned M1 Pro), but other things like the GPU core count and media encoders. The CPU itself is the same for the two chips. It seems like the extra memory bandwidth advertised for the M1 Max is somehow reserved for other parts of the chip, with the CPU memory bandwidth being about the same as that of the Pro. I don't see whether the author of the linked article used the 24- or 32-core GPU M1 Max model. I wonder how the CPU memory bandwidth would differ between the two models, i.e. whether getting the 24-core model "frees up" more bandwidth for the CPU or whether the total shared bandwidth is just decreased. Who knows. That seems like it would be a very expensive experiment at the very least. Another interesting question would be whether the M1 Pro CPU can fully utilize all 200 GB/s, or whether that's slashed in half as well. It seems like the 400 GB/s marketing claim has to come with this asterisk. It's a real shame that executives and marketing teams at large companies sometimes get in the way of the innovations their engineering teams produce, to the detriment of consumers and developers. EDIT: I also wonder if the memory bandwidth bottleneck is something that is built into the silicon, or if somehow the operating system is making decisions as to how to allocate memory to different parts of the chip. I think it's now possible to install Linux on M1 (https://asahilinux.org/2021/10/progress-report-september-2021/), so I wonder if that would result in memory being allocated to the CPU differently. |
This is pure conjecture on my part, but I would assume that the bandwidth is limited by physical partitioning on the M1 Pro/Max. The CPU cluster probably does not have "enough wires" going to the memory controller to transfer 400 GB/s, so I would think fusing off a couple of cores in the GPU would not affect the CPU BW. Not sure about the Pro; if they just copy-pasted the CPU part, there is a chance the CPU could use all of the BW on that. Edit: The undocumented math instructions I mentioned previously are not executed by the CPU core, but by separate SIMD coprocessors, which are technically not part of the CPU core, even though some caches are shared. But given how the big.LITTLE cores all share the ~1/2 BW limit, I doubt using those coprocessors would make much of a difference. |
Of all the possible explanations I can think of, that makes the most sense to me. Do you know if there are any published benchmarks for specific open source scientific computing packages such as psi4, pyscf, Qiskit, etc. that might shed light on the performance of these machines for specific applications? That might be interesting to see. It's a shame that most of the information regarding the performance of these machines is almost entirely in the context of non-scientific computing workloads. |
For the M1 and its successors? Not that I am aware of. Most of the bottlenecks are usually BLAS/LAPACK (and I/O, but let's ignore that), so it is often enough to test the linear algebra library. |
Talking about I/O in the context of M1 machines, what does the SSD endurance look like on them? Running quantum chemistry on machines with the SSD soldered onto the motherboard sounds generally unwise to me, unless it is particularly write-durable or the scratch files fit inside a ramdisk (so mostly direct algorithms). |
@TiborGY I would not know how to monitor the health of my SSD, so I cannot speak to that. I have a 16GB M1 Mac mini sitting on my desk using most of the cores 24/7, but all of the program memory requirements fit comfortably within 16GB, so little or no swap is being used. |
Some good news for building numpy using the Accelerate framework! From the numpy 1.21.0 release notes at https://numpy.org/doc/stable/release/1.21.0-notes.html: "With the release of macOS 11.3, several different issues that numpy was encountering when using Accelerate Framework’s implementation of BLAS and LAPACK should be resolved. This change enables the Accelerate Framework as an option on macOS. If additional issues are found, please file a bug report against Accelerate using the developer feedback assistant tool (https://developer.apple.com/bug-reporting/). We intend to address issues promptly and plan to continue supporting and updating our BLAS and LAPACK libraries." It might very well be that this is what the conda-forge numpy builds are already using. It is difficult to say. If anyone knows how to build numpy from source explicitly using Accelerate, that would be very much appreciated. |
What exactly did you install? |
The numpy from Miniforge3-MacOSX-arm64 comes with libopenblas. They just hide the actual BLAS library behind a more generic interface like cblas. This way they can easily switch between openblas or mkl, for example. |
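If anyone wants to check which BLAS a given numpy install is actually using, two generic checks along these lines should work (a sketch; the exact extension-module path varies with the numpy and Python versions):

```sh
# print numpy's build-time BLAS/LAPACK configuration
python3 -c "import numpy; numpy.show_config()"

# list the shared libraries numpy's compiled core links against
otool -L "$(python3 -c 'import numpy.core._multiarray_umath as m; print(m.__file__)')"
```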
Ah I see. The limited benchmarking I did that showed better performance from the Miniforge build must be due to something else then. I would look into this more to be more thorough, but there are too many processes running on my machine to get any useful information from them. At any rate, it looks like it should be possible to build numpy from source using Accelerate as a backend now, but I don't see anything in the release notes for numpy > 1.21.0 about changing the BLAS for the macOS-arm64 wheels. I have to imagine that in the not-too-distant future (unless more bugs have been uncovered) the arm64 wheels will be built using Accelerate, since this seems to be the most suitable BLAS for this platform. Let me see if I can find out the exact build that I installed. |
If I run |
I have rerun the benchmarks at https://markus-beuckelmann.de/blog/boosting-numpy-blas.html. The results are: pypi numpy 1.21.3:
conda-forge 1.21.0:
This time around the results are not meaningfully different. Something must have been throwing them off before, so my apologies for the red herring regarding the different builds. They seem to be using the same BLAS. That just makes the prospect of actually using Accelerate for numpy and scipy all the more exciting! I will try to look into how to do this and report back if I find out how to do so. (Or maybe someone else knows, since it seems like it should be possible to do now.) |
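For anyone who wants to reproduce a rough version of this comparison, here is a minimal matmul timing in the spirit of the linked blog post (a sketch, not the blog's exact script; the matrix size is arbitrary):

```sh
python3 - <<'EOF'
import time
import numpy as np

n = 4096
a = np.random.rand(n, n)
b = np.random.rand(n, n)

t0 = time.perf_counter()
a @ b  # dense matmul; runtime is dominated by the BLAS backend
print(f"{n}x{n} matmul took {time.perf_counter() - t0:.2f} s")
EOF
```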
Edit: We are going off topic here, but I am going to answer because I think this could be useful information. I once guesstimated the amount of writes running non-DF CCSD(T) generates to be around 1 to 5 TB per day on a fast 8-14 core machine. Not something that most SSDs can be expected to reliably handle for long. |
For anyone curious about how to build numpy from source using the Accelerate framework: NOTE: you will want to make sure your macOS version is at least 11.3; anything lower is going to be buggy. |
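A minimal sketch of such a source build, assuming numpy >= 1.21 (where Accelerate support was restored) and macOS >= 11.3; NPY_BLAS_ORDER / NPY_LAPACK_ORDER are the numpy.distutils environment variables for picking the BLAS/LAPACK backend:

```sh
# ask numpy's build to prefer Accelerate for BLAS and LAPACK
export NPY_BLAS_ORDER=accelerate
export NPY_LAPACK_ORDER=accelerate

# force a from-source build instead of a prebuilt wheel
pip3 install --no-binary :all: --no-cache-dir numpy

# confirm what the build picked up
python3 -c "import numpy; numpy.show_config()"
```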
Another set of instructions installing numpy+vecLib: |
See also the discussion in conda-forge/numpy-feedstock#253 |
Ok, as promised, the QC deps for Psi4 are now available on conda-forge natively for osx-arm64. There's a c-f tracker for osx-arm64 packages at https://github.com/orgs/psi4/projects/2/views/5
psi4 master
psi4 with #2861
|
There are built psi4 packages available for testing. Details at #2300 (comment) |
This is a short summary of how to get started with PSI4 on Apple Silicon
Overview:
Python/Package management options:
BLAS/LAPACK options:
homebrew:
required brew packages:
cmake eigen numpy
for OpenMP:
libomp
optional:
doxygen jupyterlab pytest gcc
(gcc to get a Fortran compiler)
Note: numpy will come with a non-threading OpenBLAS library (a combined install sketch follows below)
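Putting the package list above into commands (a sketch of the homebrew steps just described):

```sh
brew install cmake eigen numpy
brew install libomp                          # for OpenMP
brew install doxygen jupyterlab pytest gcc   # optional; gcc provides gfortran
```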
psi4 python packages:
pip3 install pydantic pint py-cpuinfo psutil
docs:
git clone https://github.com/psi4/sphinx-psi-theme.git
(pip3 install .)
sphinx-doc from brew (it has a python3.10 dependency)
basic build with Accelerate Framework and homebrew python:
cmake -H. -Bobjdir -DPython_EXECUTABLE=/opt/homebrew/bin/python3 -DCMAKE_INSTALL_PREFIX=<custom>
export CPLUS_INCLUDE_PATH=/opt/homebrew/include
(for libint2 to find a header)
If libomp is found correctly, OpenMP is enabled, but only explicit C-side OpenMP sections are threaded AFAIK, not BLAS.
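For completeness, the configure line above would typically be followed by a build and install step; a sketch assuming the objdir build directory from the command above:

```sh
export CPLUS_INCLUDE_PATH=/opt/homebrew/include   # so libint2 finds its header
cmake -H. -Bobjdir -DPython_EXECUTABLE=/opt/homebrew/bin/python3 -DCMAKE_INSTALL_PREFIX=<custom>
cmake --build objdir -j"$(sysctl -n hw.ncpu)"
cmake --install objdir
```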
caveats: I may have missed a detail. These notes will be updated over time.