
Apple Silicon / ARM64 notes #2333

Open
hokru opened this issue Oct 21, 2021 · 24 comments
Labels
bulletin For things that aren't "issues"

Comments

@hokru
Member

hokru commented Oct 21, 2021

This is a short summary of how to get started with PSI4 on Apple Silicon.

Overview:


homebrew:

required brew packages: cmake eigen numpy (install commands sketched just below)
for OpenMP: libomp
optional: doxygen jupyterlab pytest gcc (gcc to get a Fortran compiler)
Note: numpy will come with a non-threaded OpenBLAS library
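
A minimal sketch of the corresponding install commands, assuming the default /opt/homebrew prefix (package names as listed above; trim the optional set to taste):

# core build dependencies, plus libomp for OpenMP
brew install cmake eigen numpy libomp
# optional extras: docs, notebooks, tests, and gcc for a Fortran compiler
brew install doxygen jupyterlab pytest gcc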

psi4 python packages:

  • pip3 install pydantic pint py-cpuinfo psutil

docs:

  • pip3 install Sphinx nbsphinx python-graphviz sphinx-autodoc-typehints sphinx-automodapi
  • custom theme from git clone https://github.com/psi4/sphinx-psi-theme.git (pip3 install .)
  • don't get sphinx-doc from brew, it has a python3.10 dependency

basic build with Accelerate Framework and homebrew python:

  • cmake -H. -Bobjdir -DPython_EXECUTABLE=/opt/homebrew/bin/python3 -DCMAKE_INSTALL_PREFIX=<custom>
  • export CPLUS_INCLUDE_PATH=/opt/homebrew/include (for libint2 to find a header)
  • If libomp is found correctly, OpenMP is enabled, but AFAIK only explicit C-side OpenMP sections are threaded, not BLAS.
  • builds everything from scratch and wow it's fast! (follow-up build/test commands sketched below)
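
A minimal sketch of the steps that follow the configure line above (the -j value is just an example; the install goes to whatever was passed as CMAKE_INSTALL_PREFIX):

# compile, install, and run the test suite from the objdir build directory
cmake --build objdir -j8
cmake --install objdir
cd objdir && ctest -j8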

caveats

  • Mismatch between numpy (=OpenBLAS) and psi4 (=Accelerate) BLAS libraries.

I may have missed a detail. These notes will be updated over time.

@JonathonMisiewicz JonathonMisiewicz added the bulletin For things that aren't "issues" label Oct 22, 2021
@JoelHBierman

JoelHBierman commented Oct 31, 2021

Are there any known methods of building Numpy/Scipy from source using Accelerate to avoid the mismatch? (I think support might have been dropped altogether.)

@hokru
Member Author

hokru commented Nov 1, 2021

Are there any known methods of building Numpy/Scipy from source using Accelerate to avoid the mismatch? (I think support might have been dropped altogether.)

When I looked recently, I only found that Numpy dropped support for Accelerate some time ago because it only provided an old LAPACK version. This was pre-M1 as far as I know.
Maybe it is possible to build numpy regardless of official support, with some manual intervention.

@JoelHBierman

JoelHBierman commented Nov 1, 2021

One thing I noticed while investigating this a bit: If you run the command np.show_config(), you can see that the Numpy binaries from Conda-forge and pypi are built using different BLAS and LAPACK. The Numpy binary on pypi is built using openblas and the Conda-forge binary is built using something called cblas. I'm not sure what cblas is, but this build seems to be much faster for some numpy functionality than the openblas build on pypi. Just something interesting that might be of use to M1 users.

It looks like cblas might be Apple's own BLAS implementation: https://developer.apple.com/documentation/accelerate/blas, but I'm not really sure if this is the case or how it relates to the Accelerate framework. This is beyond my knowledge base. Perhaps someone else knows and can elaborate.
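
For anyone wanting to run the same check from a terminal, the command-line equivalent of np.show_config() is simply:

# print the BLAS/LAPACK configuration numpy was built against
python3 -c "import numpy; numpy.show_config()"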

@TiborGY
Contributor

TiborGY commented Nov 1, 2021

One thing I noticed while investigating this a bit: If you run the command np.show_config(), you can see that the Numpy binaries from Conda-forge and pypi are built using different BLAS and LAPACK. The Numpy binary on pypi is built using openblas and the Conda-forge binary is built using something called cblas. I'm not sure what cblas is, but this build seems to be much faster for some numpy functionality than the openblas build on pypi. Just something interesting that might be of use to M1 users.

It looks like cblas might be Apple's own BLAS implementation: https://developer.apple.com/documentation/accelerate/blas, but I'm not really sure if this is the case or how it relates to the Accelerate framework. This is beyond my knowledge base. Perhaps someone else knows and can elaborate.

CBLAS is typically just a wrapper written in C to provide a "least common denominator" interface for the Fortran subroutines making up a typical BLAS implementation. The computational backend behind that CBLAS could be pretty much anything, including OpenBLAS.

Apple does not seem to be interested in enabling open source developers to utilize their silicon, to the point where people need to reverse engineer their libraries to find out what undocumented instructions they use to perform fast math operations. I find that rather anti-consumer and anti-competitive, to put it mildly.

@JoelHBierman

JoelHBierman commented Nov 1, 2021

That's both interesting and unfortunate. Interesting in the sense that whatever the computational backend for the Conda-forge build is, it appears to be different from the pypi build. (Of course, there are probably a whole host of other factors in the build process that I'm not aware of that could also influence the runtime speed. Maybe a difference in the compiler used could matter if it produces different machine instructions at the end of compilation. I don't know.) It's unfortunate that Apple is now making great CPUs but is not making it easy for developers to fully utilize their potential. I hope that changes, or that at least people can find workarounds. Apple's advertising seems mostly geared towards video editors and photographers, but the chips have great potential for certain scientific computing workloads. It's a shame Apple doesn't lean into other parts of the market more.

@TiborGY
Contributor

TiborGY commented Nov 1, 2021

I hope that changes or that at least people can find workarounds. Apple's advertising seems mostly geared towards video editors and photographers, but the chips have great potential for certain scientific computing workloads.

I was somewhat interested in the new M1 Max, due to the very high memory BW (~8 channels of DDR4), but that was soon tempered by the discovery that the CPU cores cannot use more than half of the total BW due to some, yet again undocumented, internal bus bottleneck.

@JoelHBierman

JoelHBierman commented Nov 1, 2021

That does make some sense. And by this, I don't mean that it's ideal, just that it appears consistent with what we know about the chips. The main differences between the M1 Pro and Max are not in the CPU itself (unless you count the 8-core binned M1 Pro), but in other things like the GPU core count and media encoders. The CPU itself is the same for the two chips. It seems like the extra memory bandwidth advertised for the M1 Max is somehow reserved for other parts of the chip, with the CPU memory bandwidth being about the same as that of the Pro. I don't see whether the author of the linked article used the 24- or 32-core GPU M1 Max model. I wonder how the CPU memory bandwidth would differ between the two models, i.e. whether getting the 24-core model "frees up" more bandwidth for the CPU or whether the total shared bandwidth is just decreased. Who knows; that seems like it would be a very expensive experiment at the very least. Another interesting question would be whether the M1 Pro CPU can fully utilize all 200 GB/s, or whether that's slashed in half as well. It seems like the 400 GB/s marketing claim has to come with this asterisk. It's a real shame that executives and marketing teams at large companies sometimes get in the way of the innovations their engineering teams produce, to the detriment of consumers and developers.

EDIT: I also wonder if the memory bandwidth bottleneck is something that is built into the silicon, or if somehow the operating system is making decisions about how to allocate memory to different parts of the chip. I think it's now possible to install Linux on M1: https://asahilinux.org/2021/10/progress-report-september-2021/, so I wonder if that would result in memory being allocated to the CPU differently.

@TiborGY
Contributor

TiborGY commented Nov 1, 2021

This is pure conjecture on my part, but I would assume that the bandwidth is limited by physical partitioning on the M1 Pro/Max. The CPU cluster probably does not have "enough wires" going to the memory controller to transfer 400 GB/s, so I would think fusing off a couple of cores in the GPU would not affect the CPU BW. Not sure about the Pro; if they just copy-pasted the CPU part, there is a chance the CPU could use all of the BW on that one.

Edit: The undocumented math instructions I mentioned previously are not executed by the CPU cores but by separate SIMD coprocessors, which are technically not part of the CPU core even though some caches are shared. But given how the big.LITTLE cores all share the ~1/2 BW limit, I doubt using those coprocessors would make much of a difference.

@JoelHBierman

Of all the possible explanations I can think of, that makes the most sense to me. Do you know if there are any published benchmarks for specific open source scientific computing packages such as psi4, pyscf, Qiskit, etc. that might shed light on the performance of these machines for specific applications? That might be interesting to see. It's a shame that most of the information regarding the performance of these machines is almost entirely in the context of non-scientific computing workloads.

@hokru
Member Author

hokru commented Nov 1, 2021

Do you know if there are any published benchmarks for specific open source scientific computing packages such as psi4, pyscf, Qiskit, etc. that might shed light on the performance of these machines for specific applications?

For the M1 and its successors? Not that I am aware of. Most of the bottlenecks are usually BLAS/LAPACK (and I/O, but let's ignore that), so it is often enough to benchmark the linear algebra library.
Between programs there are algorithmic choices/limitations that often make comparisons difficult, if not pointless.

@TiborGY
Contributor

TiborGY commented Nov 1, 2021

Speaking of I/O in the context of M1 machines, what does the SSD endurance look like on them? Running quantum chemistry on machines with the SSD soldered onto the motherboard sounds generally unwise to me, unless it is particularly write-durable or the scratch files fit inside a ramdisk (so mostly direct algorithms).

@JoelHBierman

@TiborGY I would not know how to monitor the health of my SSD, so I cannot speak to that. I have a 16GB M1 Mac mini sitting on my desk using most of the cores 24/7, but all of the program memory requirements fit comfortably within 16GB, so little or no swap is being used.

@JoelHBierman

JoelHBierman commented Nov 4, 2021

Some good news for building numpy using the Accelerate framework! From the numpy 1.21.0 release notes at https://numpy.org/doc/stable/release/1.21.0-notes.html :

"With the release of macOS 11.3, several different issues that numpy was encountering when using Accelerate Framework’s implementation of BLAS and LAPACK should be resolved. This change enables the Accelerate Framework as an option on macOS. If additional issues are found, please file a bug report against Accelerate using the developer feedback assistant tool (https://developer.apple.com/bug-reporting/). We intend to address issues promptly and plan to continue supporting and updating our BLAS and LAPACK libraries."

It might very well be that this is what the conda-forge numpy builds are already using. It is difficult to say. If anyone knows how to build numpy from source explicitly using Accelerate, that would be very much appreciated.

@hokru
Member Author

hokru commented Nov 4, 2021

What exactly did you install?

@hokru
Member Author

hokru commented Nov 4, 2021

The numpy from Miniforge3-MacOSX-arm64 comes with libopenblas. They just hide the actual BLAS library behind a more generic interface like cblas. This way they can easily switch between openblas or mkl, for example.
You can check what is actually being used:

Holgers-MacBook-Air:kruse :~ > otool -L /Users/kruse/miniforge3/lib/libcblas.dylib
/Users/kruse/miniforge3/lib/libcblas.dylib:
	@rpath/libopenblas.0.dylib (compatibility version 0.0.0, current version 0.0.0)
	@rpath/libgfortran.5.dylib (compatibility version 6.0.0, current version 6.0.0)
	@rpath/libomp.dylib (compatibility version 5.0.0, current version 5.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.0.0)
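
The same kind of check can be pointed at numpy's own compiled extension; the module path below (numpy.core._multiarray_umath) is where the BLAS-backed routines live in the numpy 1.21-era builds discussed here, so treat this as a sketch:

# locate numpy's compiled core module and inspect what it links against
otool -L "$(python3 -c 'import numpy.core._multiarray_umath as m; print(m.__file__)')"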

@JoelHBierman

JoelHBierman commented Nov 4, 2021

Ah, I see. The limited benchmarking I did that showed better performance from the miniforge build must be due to something else then. I would look into this more to be thorough, but there are too many processes running on my machine to get any useful information from it right now. At any rate, it looks like it should be possible to build numpy from source using Accelerate as a backend now, but I don't see anything in the release notes for numpy > 1.21.0 about changing the BLAS for the macOS-arm64 wheels. I have to imagine that in the not-too-distant future (unless more bugs have been uncovered) future arm64 wheels will be built using Accelerate, since this seems to be the most suitable BLAS for this platform.

Let me see if I can find out the exact build that I installed.

@JoelHBierman

JoelHBierman commented Nov 4, 2021

If I run conda list, it tells me that I installed the py39h1f3b974_0 NumPy 1.21.0 build from conda-forge.

@JoelHBierman

I have rerun the benchmarks at: https://markus-beuckelmann.de/blog/boosting-numpy-blas.html. The results are:

pypi numpy 1.21.3:

Dotted two 4096x4096 matrices in 0.74 s.
Dotted two vectors of length 524288 in 0.26 ms.
SVD of a 2048x1024 matrix in 0.83 s.
Cholesky decomposition of a 2048x2048 matrix in 0.09 s.
Eigendecomposition of a 2048x2048 matrix in 6.14 s.

conda-forge 1.21.0:

Dotted two 4096x4096 matrices in 0.67 s.
Dotted two vectors of length 524288 in 0.25 ms.
SVD of a 2048x1024 matrix in 0.72 s.
Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
Eigendecomposition of a 2048x2048 matrix in 5.33 s.

This time around the results are not meaningfully different. Something must have been throwing them off before, so my apologies for the red herring regarding the different builds. They seem to be using the same BLAS. That just makes the prospect of actually using Accelerate for numpy and scipy all the more exciting! I will try to look into how to do this and report back if I find out how to do so. (Or maybe someone else knows, since it seems like it should be possible to do now.)
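
As a quick sanity check without the full benchmark script, the matrix-multiply line can be reproduced with a one-liner (timings will of course vary by machine and BLAS backend):

# time a 4096x4096 double-precision matrix product, as in the first benchmark line
python3 -m timeit -s "import numpy as np; a = np.random.random((4096, 4096)); b = np.random.random((4096, 4096))" "a @ b"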

@TiborGY
Contributor

TiborGY commented Nov 5, 2021

@TiborGY I would not know how to monitor the health of my SSD, so I cannot speak to that. I have a 16GB M1 Mac mini sitting on my desk using most of the cores 24/7, but all of the program memory requirements fit comfortably within 16GB, so little or no swap is being used.

Edit: We are going off topic here, but I am going to answer because I think this could be useful information.
I am not just talking about swap here; a lot of quantum chemistry programs create temporary files that they intensively read and write to. These are usually called "conventional integral" or "out-of-core" algorithms for historical reasons.

I once guesstimated the amount of writes that running non-DF CCSD(T) generates to be around 1 to 5 TB per day on a fast 8-14 core machine. Not something that most SSDs can be expected to reliably handle for long.

@JoelHBierman

For anyone curious about how to build numpy from source using the Accelerate framework:
https://stackoverflow.com/questions/69848969/how-to-build-numpy-from-source-linked-to-apple-accelerate-framework/69869531#69869531

NOTE: You will want to make sure your macOS version is at least 11.3. Anything lower is going to be buggy.
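
A rough sketch of one way to do this, using the build-time BLAS/LAPACK selection that numpy gained in 1.21; the environment variable values are the part most likely to need adjusting between numpy versions, so see the linked answer for the authoritative steps:

# sketch only: build numpy 1.21.x from source against Accelerate on macOS >= 11.3
git clone --depth 1 --branch v1.21.3 https://github.com/numpy/numpy.git
cd numpy
pip3 install cython
NPY_BLAS_ORDER=accelerate NPY_LAPACK_ORDER=accelerate pip3 install . --no-build-isolation
# confirm the backend afterwards
python3 -c "import numpy; numpy.show_config()"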

@hokru
Member Author

hokru commented Jan 7, 2022

Another set of instructions for installing numpy with vecLib:
https://developer.apple.com/forums/thread/695963?answerId=697568022#697568022

@ulupo

ulupo commented Jan 25, 2022

See also the discussion in conda-forge/numpy-feedstock#253

@loriab
Member

loriab commented Mar 18, 2023

Ok, as promised, the QC deps for Psi4 are now available on conda-forge natively for osx-arm64. Note that these are cross-compiled on regular osx-64, so they don't get tested. I'd be glad to hear if/how they're working. You can either build psi4 master and still provide your own libint, or use the libint package and build a special branch of psi4.

There's a c-f tracker for osx-arm64 packages at https://github.com/orgs/psi4/projects/2/views/5

psi4 master

conda install gau2grid libxc-c optking qcengine -c conda-forge

psi4 with #2861

  • conda install gau2grid libxc-c optking qcengine conda-forge/label/libint_dev::libint -c conda-forge
  • or conda install gau2grid libxc-c optking qcengine libint -c conda-forge/label/libint_dev -c conda-forge (channel order matters)

@loriab
Member

loriab commented May 1, 2023

There are built psi4 packages available for testing. Details at #2300 (comment)
