pytensor and blas problems on macOS 15 Sequoia with Apple Silicon #1005

Closed
danieltomasz opened this issue Sep 28, 2024 · 29 comments · Fixed by #1056
Labels
bug (Something isn't working), installation, macOS

Comments

@danieltomasz

danieltomasz commented Sep 28, 2024

Describe the issue:

Since updating to macOS 15, I have had a problem using Apple's implementation of BLAS.
Installing pytensor from miniconda3-3.12-24.7.1-0 via `conda create -n voxel-bayes-3.12 -c conda-forge pytensor` seems to install openblas instead of accelerate.

~/.pyenv/versions/miniconda3-3.12-24.7.1-0/bin/conda create -n voxel-bayes-3.12   -c conda-forge  pytensor
Channels:
 - conda-forge
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12

  added / updated specs:
    - pytensor


The following NEW packages will be INSTALLED:

  accelerate         conda-forge/noarch::accelerate-0.34.2-pyhd8ed1ab_0 
  blas               conda-forge/osx-arm64::blas-2.124-openblas 
  blas-devel         conda-forge/osx-arm64::blas-devel-3.9.0-24_osxarm64_openblas 
  brotli-python      conda-forge/osx-arm64::brotli-python-1.1.0-py312hde4cb15_2 
  bzip2              conda-forge/osx-arm64::bzip2-1.0.8-h99b78c6_7 
  ca-certificates    conda-forge/osx-arm64::ca-certificates-2024.8.30-hf0a4a13_0 
  cctools_osx-arm64  conda-forge/osx-arm64::cctools_osx-arm64-1010.6-h4208deb_1 
  certifi            conda-forge/noarch::certifi-2024.8.30-pyhd8ed1ab_0 
  cffi               conda-forge/osx-arm64::cffi-1.17.1-py312h0fad829_0 
  charset-normalizer conda-forge/noarch::charset-normalizer-3.3.2-pyhd8ed1ab_0 
  clang              conda-forge/osx-arm64::clang-17.0.6-default_h360f5da_7 
  clang-17           conda-forge/osx-arm64::clang-17-17.0.6-default_h146c034_7 
  clang_impl_osx-ar~ conda-forge/osx-arm64::clang_impl_osx-arm64-17.0.6-he47c785_19 
  clang_osx-arm64    conda-forge/osx-arm64::clang_osx-arm64-17.0.6-h54d7cd3_19 
  clangxx            conda-forge/osx-arm64::clangxx-17.0.6-default_h360f5da_7 
  clangxx_impl_osx-~ conda-forge/osx-arm64::clangxx_impl_osx-arm64-17.0.6-h50f59cd_19 
  clangxx_osx-arm64  conda-forge/osx-arm64::clangxx_osx-arm64-17.0.6-h54d7cd3_19 
  colorama           conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_0 
  compiler-rt        conda-forge/osx-arm64::compiler-rt-17.0.6-h856b3c1_2 
  compiler-rt_osx-a~ conda-forge/noarch::compiler-rt_osx-arm64-17.0.6-h832e737_2 
  cons               conda-forge/noarch::cons-0.4.6-pyhd8ed1ab_0 
  etuples            conda-forge/noarch::etuples-0.3.9-pyhd8ed1ab_0 
  filelock           conda-forge/noarch::filelock-3.16.1-pyhd8ed1ab_0 
  fsspec             conda-forge/noarch::fsspec-2024.9.0-pyhff2d567_0 
  gmp                conda-forge/osx-arm64::gmp-6.3.0-h7bae524_2 
  gmpy2              conda-forge/osx-arm64::gmpy2-2.1.5-py312h87fada9_2 
  h2                 conda-forge/noarch::h2-4.1.0-pyhd8ed1ab_0 
  hpack              conda-forge/noarch::hpack-4.0.0-pyh9f0ad1d_0 
  huggingface_hub    conda-forge/noarch::huggingface_hub-0.25.1-pyhd8ed1ab_0 
  hyperframe         conda-forge/noarch::hyperframe-6.0.1-pyhd8ed1ab_0 
  icu                conda-forge/osx-arm64::icu-75.1-hfee45f7_0 
  idna               conda-forge/noarch::idna-3.10-pyhd8ed1ab_0 
  jinja2             conda-forge/noarch::jinja2-3.1.4-pyhd8ed1ab_0 
  ld64_osx-arm64     conda-forge/osx-arm64::ld64_osx-arm64-951.9-hc81425b_1 
  libabseil          conda-forge/osx-arm64::libabseil-20240116.2-cxx17_h00cdb27_1 
  libblas            conda-forge/osx-arm64::libblas-3.9.0-24_osxarm64_openblas 
  libcblas           conda-forge/osx-arm64::libcblas-3.9.0-24_osxarm64_openblas 
  libclang-cpp17     conda-forge/osx-arm64::libclang-cpp17-17.0.6-default_h146c034_7 
  libcxx             conda-forge/osx-arm64::libcxx-19.1.0-ha82da77_0 
  libcxx-devel       conda-forge/osx-arm64::libcxx-devel-17.0.6-h86353a2_6 
  libexpat           conda-forge/osx-arm64::libexpat-2.6.3-hf9b8971_0 
  libffi             conda-forge/osx-arm64::libffi-3.4.2-h3422bc3_5 
  libgfortran        conda-forge/osx-arm64::libgfortran-5.0.0-13_2_0_hd922786_3 
  libgfortran5       conda-forge/osx-arm64::libgfortran5-13.2.0-hf226fd6_3 
  libiconv           conda-forge/osx-arm64::libiconv-1.17-h0d3ecfb_2 
  liblapack          conda-forge/osx-arm64::liblapack-3.9.0-24_osxarm64_openblas 
  liblapacke         conda-forge/osx-arm64::liblapacke-3.9.0-24_osxarm64_openblas 
  libllvm17          conda-forge/osx-arm64::libllvm17-17.0.6-h5090b49_2 
  libopenblas        conda-forge/osx-arm64::libopenblas-0.3.27-openmp_h517c56d_1 
  libprotobuf        conda-forge/osx-arm64::libprotobuf-4.25.3-hc39d83c_1 
  libsqlite          conda-forge/osx-arm64::libsqlite-3.46.1-hc14010f_0 
  libtorch           conda-forge/osx-arm64::libtorch-2.4.0-cpu_generic_h4365fe2_1 
  libuv              conda-forge/osx-arm64::libuv-1.49.0-hd74edd7_0 
  libxml2            conda-forge/osx-arm64::libxml2-2.12.7-h01dff8b_4 
  libzlib            conda-forge/osx-arm64::libzlib-1.3.1-hfb2fe0b_1 
  llvm-openmp        conda-forge/osx-arm64::llvm-openmp-18.1.8-hde57baf_1 
  llvm-tools         conda-forge/osx-arm64::llvm-tools-17.0.6-h5090b49_2 
  logical-unificati~ conda-forge/noarch::logical-unification-0.4.6-pyhd8ed1ab_0 
  macosx_deployment~ conda-forge/noarch::macosx_deployment_target_osx-arm64-11.0-h6553868_1 
  markupsafe         conda-forge/osx-arm64::markupsafe-2.1.5-py312h024a12e_1 
  minikanren         conda-forge/noarch::minikanren-1.0.3-pyhd8ed1ab_0 
  mpc                conda-forge/osx-arm64::mpc-1.3.1-h8f1351a_1 
  mpfr               conda-forge/osx-arm64::mpfr-4.2.1-hb693164_3 
  mpmath             conda-forge/noarch::mpmath-1.3.0-pyhd8ed1ab_0 
  multipledispatch   conda-forge/noarch::multipledispatch-0.6.0-pyhd8ed1ab_1 
  ncurses            conda-forge/osx-arm64::ncurses-6.5-h7bae524_1 
  networkx           conda-forge/noarch::networkx-3.3-pyhd8ed1ab_1 
  nomkl              conda-forge/noarch::nomkl-1.0-h5ca1d4c_0 
  numpy              conda-forge/osx-arm64::numpy-1.26.4-py312h8442bc7_0 
  openblas           conda-forge/osx-arm64::openblas-0.3.27-openmp_h560b219_1 
  openssl            conda-forge/osx-arm64::openssl-3.3.2-h8359307_0 
  packaging          conda-forge/noarch::packaging-24.1-pyhd8ed1ab_0 
  pip                conda-forge/noarch::pip-24.2-pyh8b19718_1 
  psutil             conda-forge/osx-arm64::psutil-6.0.0-py312h024a12e_1 
  pycparser          conda-forge/noarch::pycparser-2.22-pyhd8ed1ab_0 
  pysocks            conda-forge/noarch::pysocks-1.7.1-pyha2e5f31_6 
  pytensor           conda-forge/osx-arm64::pytensor-2.25.4-py312h3f593ad_0 
  pytensor-base      conda-forge/osx-arm64::pytensor-base-2.25.4-py312h02baea5_0 
  python             conda-forge/osx-arm64::python-3.12.6-h739c21a_1_cpython 
  python_abi         conda-forge/osx-arm64::python_abi-3.12-5_cp312 
  pytorch            conda-forge/osx-arm64::pytorch-2.4.0-cpu_generic_py312h6bd8f41_1 
  pyyaml             conda-forge/osx-arm64::pyyaml-6.0.2-py312h024a12e_1 
  readline           conda-forge/osx-arm64::readline-8.2-h92ec313_1 
  requests           conda-forge/noarch::requests-2.32.3-pyhd8ed1ab_0 
  safetensors        conda-forge/osx-arm64::safetensors-0.4.5-py312he431725_0 
  scipy              conda-forge/osx-arm64::scipy-1.14.1-py312heb3a901_0 
  setuptools         conda-forge/noarch::setuptools-75.1.0-pyhd8ed1ab_0 
  sigtool            conda-forge/osx-arm64::sigtool-0.1.3-h44b9a77_0 
  six                conda-forge/noarch::six-1.16.0-pyh6c4a22f_0 
  sleef              conda-forge/osx-arm64::sleef-3.7-h7783ee8_0 
  sympy              conda-forge/noarch::sympy-1.13.3-pypyh2585a3b_103 
  tapi               conda-forge/osx-arm64::tapi-1300.6.5-h03f4b80_0 
  tk                 conda-forge/osx-arm64::tk-8.6.13-h5083fa2_1 
  toolz              conda-forge/noarch::toolz-0.12.1-pyhd8ed1ab_0 
  tqdm               conda-forge/noarch::tqdm-4.66.5-pyhd8ed1ab_0 
  typing-extensions  conda-forge/noarch::typing-extensions-4.12.2-hd8ed1ab_0 
  typing_extensions  conda-forge/noarch::typing_extensions-4.12.2-pyha770c72_0 
  tzdata             conda-forge/noarch::tzdata-2024a-h8827d51_1 
  urllib3            conda-forge/noarch::urllib3-2.2.3-pyhd8ed1ab_0 
  wheel              conda-forge/noarch::wheel-0.44.0-pyhd8ed1ab_0 
  xz                 conda-forge/osx-arm64::xz-5.2.6-h57fd34a_0 
  yaml               conda-forge/osx-arm64::yaml-0.2.5-h3422bc3_2 
  zstandard          conda-forge/osx-arm64::zstandard-0.23.0-py312h15fbf35_1 
  zstd               conda-forge/osx-arm64::zstd-1.5.6-hb46c0d2_0 


Proceed ([y]/n)? y

Running the BLAS check:

python $(python -c "import pathlib, pytensor; print(pathlib.Path(pytensor.__file__).parent / 'misc/check_blas.py')")

        Some results that you can compare against. They were 10 executions
        of gemm in float64 with matrices of shape 2000x2000 (M=N=K=2000).
        All memory layout was in C order.

        CPU tested: Xeon E5345(2.33Ghz, 8M L2 cache, 1333Mhz FSB),
                    Xeon E5430(2.66Ghz, 12M L2 cache, 1333Mhz FSB),
                    Xeon E5450(3Ghz, 12M L2 cache, 1333Mhz FSB),
                    Xeon X5560(2.8Ghz, 12M L2 cache, hyper-threads?)
                    Core 2 E8500, Core i7 930(2.8Ghz, hyper-threads enabled),
                    Core i7 950(3.07GHz, hyper-threads enabled)
                    Xeon X5550(2.67GHz, 8M l2 cache?, hyper-threads enabled)


        Libraries tested:
            * numpy with ATLAS from distribution (FC9) package (1 thread)
            * manually compiled numpy and ATLAS with 2 threads
            * goto 1.26 with 1, 2, 4 and 8 threads
            * goto2 1.13 compiled with multiple threads enabled

                          Xeon   Xeon   Xeon  Core2 i7    i7     Xeon   Xeon
        lib/nb threads    E5345  E5430  E5450 E8500 930   950    X5560  X5550

        numpy 1.3.0 blas                                                775.92s
        numpy_FC9_atlas/1 39.2s  35.0s  30.7s 29.6s 21.5s 19.60s
        goto/1            18.7s  16.1s  14.2s 13.7s 16.1s 14.67s
        numpy_MAN_atlas/2 12.0s  11.6s  10.2s  9.2s  9.0s
        goto/2             9.5s   8.1s   7.1s  7.3s  8.1s  7.4s
        goto/4             4.9s   4.4s   3.7s  -     4.1s  3.8s
        goto/8             2.7s   2.4s   2.0s  -     4.1s  3.8s
        openblas/1                                        14.04s
        openblas/2                                         7.16s
        openblas/4                                         3.71s
        openblas/8                                         3.70s
        mkl 11.0.083/1            7.97s
        mkl 10.2.2.025/1                                         13.7s
        mkl 10.2.2.025/2                                          7.6s
        mkl 10.2.2.025/4                                          4.0s
        mkl 10.2.2.025/8                                          2.0s
        goto2 1.13/1                                                     14.37s
        goto2 1.13/2                                                      7.26s
        goto2 1.13/4                                                      3.70s
        goto2 1.13/8                                                      1.94s
        goto2 1.13/16                                                     3.16s

        Test time in float32. There were 10 executions of gemm in
        float32 with matrices of shape 5000x5000 (M=N=K=5000)
        All memory layout was in C order.


        cuda version      8.0    7.5    7.0
        gpu
        M40               0.45s  0.47s
        k80               0.92s  0.96s
        K6000/NOECC       0.71s         0.69s
        P6000/NOECC       0.25s

        Titan X (Pascal)  0.28s
        GTX Titan X       0.45s  0.45s  0.47s
        GTX Titan Black   0.66s  0.64s  0.64s
        GTX 1080          0.35s
        GTX 980 Ti               0.41s
        GTX 970                  0.66s
        GTX 680                         1.57s
        GTX 750 Ti               2.01s  2.01s
        GTX 750                  2.46s  2.37s
        GTX 660                  2.32s  2.32s
        GTX 580                  2.42s
        GTX 480                  2.87s
        TX1                             7.6s (float32 storage and computation)
        GT 610                          33.5s
        
Some PyTensor flags:
    blas__ldflags= -L/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib -llapack -lblas -lcblas -lm -Wl,-rpath,/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib
    compiledir= /Users/daniel/.pytensor/compiledir_macOS-15.0-arm64-arm-64bit-arm-3.12.6-64
    floatX= float64
    device= cpu
Some OS information:
    sys.platform= darwin
    sys.version= 3.12.6 | packaged by conda-forge | (main, Sep 22 2024, 14:07:06) [Clang 17.0.6 ]
    sys.prefix= /Users/daniel/.pyenv/versions/voxel-bayes-3.12
Some environment variables:
    MKL_NUM_THREADS= None
    OMP_NUM_THREADS= None
    GOTO_NUM_THREADS= None

Numpy config: (used when the PyTensor flag "blas__ldflags" is empty)
Build Dependencies:
  blas:
    detection method: pkgconfig
    found: true
    include directory: /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include
    lib directory: /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib
    name: blas
    openblas configuration: unknown
    pc file directory: /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib/pkgconfig
    version: 3.9.0
  lapack:
    detection method: internal
    found: true
    include directory: unknown
    lib directory: unknown
    name: dep4569863840
    openblas configuration: unknown
    pc file directory: unknown
    version: 1.26.4
Compilers:
  c:
    args: -ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -isystem,
      /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include,
      -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4,
      -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix,
      -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include,
      -mmacosx-version-min=11.0
    commands: arm64-apple-darwin20.0.0-clang
    linker: ld64
    linker args: -Wl,-headerpad_max_install_names, -Wl,-dead_strip_dylibs, -Wl,-rpath,/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib,
      -L/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib,
      -ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include,
      -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4,
      -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix,
      -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include,
      -mmacosx-version-min=11.0
    name: clang
    version: 16.0.6
  c++:
    args: -ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -stdlib=libc++,
      -fvisibility-inlines-hidden, -fmessage-length=0, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include,
      -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4,
      -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix,
      -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include,
      -mmacosx-version-min=11.0
    commands: arm64-apple-darwin20.0.0-clang++
    linker: ld64
    linker args: -Wl,-headerpad_max_install_names, -Wl,-dead_strip_dylibs, -Wl,-rpath,/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib,
      -L/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib,
      -ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -stdlib=libc++,
      -fvisibility-inlines-hidden, -fmessage-length=0, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include,
      -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4,
      -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix,
      -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include,
      -mmacosx-version-min=11.0
    name: clang
    version: 16.0.6
  cython:
    commands: cython
    linker: cython
    name: cython
    version: 3.0.8
Machine Information:
  build:
    cpu: aarch64
    endian: little
    family: aarch64
    system: darwin
  cross-compiled: true
  host:
    cpu: arm64
    endian: little
    family: aarch64
    system: darwin
Python Information:
  path: /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/bin/python
  version: '3.12'
SIMD Extensions:
  baseline:
  - NEON
  - NEON_FP16
  - NEON_VFPV4
  - ASIMD
  found:
  - ASIMDHP
  not found:
  - ASIMDFHM

Numpy dot module: numpy
Numpy location: /Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/numpy/__init__.py
Numpy version: 1.26.4

We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).

Total execution time: 31.56s on CPU (with direct PyTensor binding to blas).

Try to run this script a few times. Experience shows that the first time is not as fast as following calls. The difference is not big, but consistent.
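The `blas__ldflags` value in the output above lists only generic library names, so on its own it does not say whether OpenBLAS or Accelerate sits behind them. A minimal sketch for splitting such a flag string into its parts is shown here; `parse_ldflags` is a hypothetical helper written for illustration, not part of PyTensor.

```python
import shlex


def parse_ldflags(ldflags: str) -> dict:
    """Split a linker-flag string into library names, search dirs, and other flags.

    Hypothetical helper for illustration; not a PyTensor API.
    """
    libs, dirs, other = [], [], []
    for token in shlex.split(ldflags):
        if token.startswith("-l"):
            libs.append(token[2:])   # -lblas -> "blas"
        elif token.startswith("-L"):
            dirs.append(token[2:])   # -L/path -> "/path"
        else:
            other.append(token)      # e.g. -Wl,-rpath,... is kept verbatim
    return {"libs": libs, "dirs": dirs, "other": other}


# Value copied from the "Some PyTensor flags" section of the output above.
flags = (
    "-L/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib "
    "-llapack -lblas -lcblas -lm "
    "-Wl,-rpath,/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib"
)
parsed = parse_ldflags(flags)
print("libraries:", parsed["libs"])
print("search dirs:", parsed["dirs"])
```

The search directory is the place to look next: whichever `libblas` file lives there decides the implementation that actually gets linked.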

Running the same command in an env with pip-installed pytensor gives this:

Some PyTensor flags:
    blas__ldflags= 
    compiledir= /Users/daniel/.pytensor/compiledir_macOS-15.0-arm64-arm-64bit-arm-3.12.6-64
    floatX= float64
    device= cpu
Some OS information:
    sys.platform= darwin
    sys.version= 3.12.6 (main, Sep 28 2024, 17:45:34) [Clang 15.0.0 (clang-1500.3.9.4)]
    sys.prefix= /Users/daniel/.pyenv/versions/3.12.6/envs/zotero-3.12.6
Some environment variables:
    MKL_NUM_THREADS= None
    OMP_NUM_THREADS= None
    GOTO_NUM_THREADS= None

Numpy config: (used when the PyTensor flag "blas__ldflags" is empty)
/Users/daniel/.pyenv/versions/3.12.6/envs/zotero-3.12.6/lib/python3.12/site-packages/numpy/__config__.py:155: UserWarning: Install `pyyaml` for better output
  warnings.warn("Install `pyyaml` for better output", stacklevel=1)
{
  "Compilers": {
    "c": {
      "name": "clang",
      "linker": "ld64",
      "version": "14.0.0",
      "commands": "cc",
      "args": "-fno-strict-aliasing, -DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64",
      "linker args": "-fno-strict-aliasing, -DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64"
    },
    "cython": {
      "name": "cython",
      "linker": "cython",
      "version": "3.0.8",
      "commands": "cython"
    },
    "c++": {
      "name": "clang",
      "linker": "ld64",
      "version": "14.0.0",
      "commands": "c++",
      "args": "-DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64",
      "linker args": "-DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64"
    }
  },
  "Machine Information": {
    "host": {
      "cpu": "aarch64",
      "family": "aarch64",
      "endian": "little",
      "system": "darwin"
    },
    "build": {
      "cpu": "aarch64",
      "family": "aarch64",
      "endian": "little",
      "system": "darwin"
    }
  },
  "Build Dependencies": {
    "blas": {
      "name": "openblas64",
      "found": true,
      "version": "0.3.23.dev",
      "detection method": "pkgconfig",
      "include directory": "/opt/arm64-builds/include",
      "lib directory": "/opt/arm64-builds/lib",
      "openblas configuration": "USE_64BITINT=1 DYNAMIC_ARCH=1 DYNAMIC_OLDER= NO_CBLAS= NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= SANDYBRIDGE MAX_THREADS=3",
      "pc file directory": "/usr/local/lib/pkgconfig"
    },
    "lapack": {
      "name": "dep4335021056",
      "found": true,
      "version": "1.26.4",
      "detection method": "internal",
      "include directory": "unknown",
      "lib directory": "unknown",
      "openblas configuration": "unknown",
      "pc file directory": "unknown"
    }
  },
  "Python Information": {
    "path": "/private/var/folders/76/zy5ktkns50v6gt5g8r0sf6sc0000gn/T/cibw-run-q69bfk1p/cp312-macosx_arm64/build/venv/bin/python",
    "version": "3.12"
  },
  "SIMD Extensions": {
    "baseline": [
      "NEON",
      "NEON_FP16",
      "NEON_VFPV4",
      "ASIMD"
    ],
    "found": [
      "ASIMDHP"
    ],
    "not found": [
      "ASIMDFHM"
    ]
  }
}
Numpy dot module: numpy
Numpy location: /Users/daniel/.pyenv/versions/3.12.6/envs/zotero-3.12.6/lib/python3.12/site-packages/numpy/__init__.py
Numpy version: 1.26.4

We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).

Total execution time: 45.75s on CPU (with direct PyTensor binding to blas).

Try to run this script a few times. Experience shows that the first time is not as fast as following calls. The difference is not big, but consistent.
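To put the two reported timings on a common scale, they can be converted to approximate throughput. This is a rough sketch that assumes the usual count of 2·n³ floating-point operations per n×n matrix product; `gemm_gflops` is a hypothetical helper, and the inputs are the two "Total execution time" figures quoted above.

```python
def gemm_gflops(n: int, calls: int, seconds: float) -> float:
    """Approximate throughput in GFLOP/s, counting 2*n**3 flops per n x n gemm.

    Hypothetical helper for illustration only.
    """
    total_flops = 2 * n**3 * calls
    return total_flops / seconds / 1e9


# 10 gemm calls on (5000, 5000) matrices, timings taken from the two runs above.
conda_run = gemm_gflops(5000, 10, 31.56)  # conda env, generic -lblas flags
pip_run = gemm_gflops(5000, 10, 45.75)    # pip env, empty blas__ldflags

print(f"conda env: {conda_run:.1f} GFLOP/s")
print(f"pip env:   {pip_run:.1f} GFLOP/s")
print(f"ratio (pip time / conda time): {45.75 / 31.56:.2f}")
```

Both figures come from single runs, so they are only indicative; the script's own advice to repeat the run a few times applies here too.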

When I specify accelerate the old way, via "libblas=*=*accelerate" when creating the conda environment, running the check fails. I copied the output here: https://discourse.pymc.io/t/pytensor-support-to-apple-accelerate-blas-with-conda-forge-on-macos-15/15131/2

Reproducible code example:

from `python $(python -c "import pathlib, pytensor; print(pathlib.Path(pytensor.__file__).parent / 'misc/check_blas.py')")`
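The shell one-liner above imports pytensor just to find where it is installed. As a sketch of the same lookup without importing the package, the standard library's `importlib.util.find_spec` can locate a file inside any installed top-level package. `script_in_package` is a hypothetical helper name, and `json` is used below only as a stand-in that every Python installation has.

```python
import importlib.util
import pathlib


def script_in_package(package: str, relative: str) -> pathlib.Path:
    """Return the path of a file that ships inside an installed package.

    Hypothetical helper for illustration; it locates the package without
    executing its __init__.py.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"package {package!r} is not installed")
    return pathlib.Path(spec.origin).parent / relative


# Stand-in demonstration with a standard-library package.
print(script_in_package("json", "decoder.py"))
```

With pytensor installed, `script_in_package("pytensor", "misc/check_blas.py")` should give the same path as the one-liner, and that path can then be passed to the interpreter.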

Error message:

No response

PyTensor version information:

conda-forge/osx-arm64::pytensor-2.25.4-py312h3f593ad_0

Context for the issue:

No response

danieltomasz added the bug label Sep 28, 2024
@maresb
Contributor

maresb commented Sep 28, 2024

Thanks a lot @danieltomasz for the very high quality report. @lucianopaz, do you have any thoughts regarding the BLAS selection mechanism?

@danieltomasz
Author

One thing I learned that might be useful: numpy 2.2 installed via pip uses accelerate, while numpy 2.2 installed via the same conda uses openblas (I checked this via numpy.show_config()). I installed it in a separate env just to check, because pytensor doesn't yet support numpy >= 2.0.
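The `numpy.show_config()` check can also be done programmatically. The sketch below reads the BLAS name out of a build-config dictionary with the same layout as the JSON printed earlier in this thread; `blas_name` and `installed_numpy_blas` are hypothetical helper names, and `mode="dicts"` is assumed to be available only in recent NumPy releases, so older versions fall back to "unknown".

```python
def blas_name(config: dict) -> str:
    """Return the BLAS name recorded in a NumPy build-config dictionary."""
    return config.get("Build Dependencies", {}).get("blas", {}).get("name", "unknown")


def installed_numpy_blas() -> str:
    """Ask the NumPy in the current environment, or return 'unknown'.

    Assumes numpy.show_config accepts mode="dicts"; older releases that
    do not will raise TypeError, which is handled here.
    """
    try:
        import numpy as np

        return blas_name(np.show_config(mode="dicts"))
    except (ImportError, TypeError):
        return "unknown"


# Trimmed copy of the pip-environment config shown earlier in this thread.
pip_config = {"Build Dependencies": {"blas": {"name": "openblas64", "found": True}}}
print("pip env (from the report):", blas_name(pip_config))
print("this environment:", installed_numpy_blas())
```

Note that this reports what NumPy was built against, which is not necessarily what PyTensor links its own compiled code to.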

@maresb
Contributor

maresb commented Sep 28, 2024

That is indeed very interesting, thanks @danieltomasz.

The Conda dependency chain is:

conda-forge/osx-arm64/pytensor-2.25.4-py312h3f593ad_0.conda → accelerate, blas
conda-forge/osx-arm64/blas-2.124-openblas.conda → blas-devel 3.9.0
blas-devel 3.9.0 → openblas 0.3.27.*

One way to get more flexibility to help debug this is to instead use the pytensor-base package on conda-forge. That should allow us to specify accelerate without installing openblas. But you'll need to install your own C compilers as well.
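When experimenting with different installs, it helps to see at a glance which BLAS variant the solver picked. In the package plans posted in this thread, the variant appears as the last underscore-separated part of the build string, which is also what a spec such as `libblas=*=*accelerate` matches against. The sketch below extracts it; `parse_plan_entry` and `blas_variant` are hypothetical helpers, and the naming pattern is inferred from the listings here rather than from a documented guarantee.

```python
def parse_plan_entry(entry: str) -> tuple[str, str, str]:
    """Split 'channel/subdir::name-version-build' into (name, version, build)."""
    _, _, dist = entry.rpartition("::")
    name, version, build = dist.rsplit("-", 2)
    return name, version, build


def blas_variant(build: str) -> str:
    """Return the trailing tag of a build string, e.g. 'openblas'.

    Assumes the variant is the last underscore-separated field, as in
    the package plans quoted in this thread.
    """
    return build.rsplit("_", 1)[-1]


# Entries copied from the first package plan in this thread.
plan = [
    "conda-forge/osx-arm64::blas-2.124-openblas",
    "conda-forge/osx-arm64::libblas-3.9.0-24_osxarm64_openblas",
    "conda-forge/osx-arm64::liblapack-3.9.0-24_osxarm64_openblas",
]
for entry in plan:
    name, version, build = parse_plan_entry(entry)
    print(f"{name:10s} {version:8s} variant: {blas_variant(build)}")
```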

@danieltomasz, does this give you something to experiment with? I don't have a Mac myself, so unfortunately I can't directly debug this.

@danieltomasz
Author

When I force "libblas=*=*accelerate":

~/.pyenv/versions/miniconda3-3.12-24.7.1-0/bin/conda create -n voxel-bayes-3.12  -c conda-forge  pytensor "libblas=*=*accelerate" 
Channels:
 - conda-forge
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12

  added / updated specs:
    - libblas[build=*accelerate]
    - pytensor


The following NEW packages will be INSTALLED:

  accelerate         conda-forge/noarch::accelerate-0.34.2-pyhd8ed1ab_0 
  blas               conda-forge/osx-arm64::blas-2.124-accelerate 
  blas-devel         conda-forge/osx-arm64::blas-devel-3.9.0-24_osxarm64_accelerate 
  brotli-python      conda-forge/osx-arm64::brotli-python-1.1.0-py312hde4cb15_2 
  bzip2              conda-forge/osx-arm64::bzip2-1.0.8-h99b78c6_7 
  ca-certificates    conda-forge/osx-arm64::ca-certificates-2024.8.30-hf0a4a13_0 
  cctools_osx-arm64  conda-forge/osx-arm64::cctools_osx-arm64-1010.6-h4208deb_1 
  certifi            conda-forge/noarch::certifi-2024.8.30-pyhd8ed1ab_0 
  cffi               conda-forge/osx-arm64::cffi-1.17.1-py312h0fad829_0 
  charset-normalizer conda-forge/noarch::charset-normalizer-3.3.2-pyhd8ed1ab_0 
  clang              conda-forge/osx-arm64::clang-17.0.6-default_h360f5da_7 
  clang-17           conda-forge/osx-arm64::clang-17-17.0.6-default_h146c034_7 
  clang_impl_osx-ar~ conda-forge/osx-arm64::clang_impl_osx-arm64-17.0.6-he47c785_19 
  clang_osx-arm64    conda-forge/osx-arm64::clang_osx-arm64-17.0.6-h54d7cd3_19 
  clangxx            conda-forge/osx-arm64::clangxx-17.0.6-default_h360f5da_7 
  clangxx_impl_osx-~ conda-forge/osx-arm64::clangxx_impl_osx-arm64-17.0.6-h50f59cd_19 
  clangxx_osx-arm64  conda-forge/osx-arm64::clangxx_osx-arm64-17.0.6-h54d7cd3_19 
  colorama           conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_0 
  compiler-rt        conda-forge/osx-arm64::compiler-rt-17.0.6-h856b3c1_2 
  compiler-rt_osx-a~ conda-forge/noarch::compiler-rt_osx-arm64-17.0.6-h832e737_2 
  cons               conda-forge/noarch::cons-0.4.6-pyhd8ed1ab_0 
  etuples            conda-forge/noarch::etuples-0.3.9-pyhd8ed1ab_0 
  filelock           conda-forge/noarch::filelock-3.16.1-pyhd8ed1ab_0 
  fsspec             conda-forge/noarch::fsspec-2024.9.0-pyhff2d567_0 
  gmp                conda-forge/osx-arm64::gmp-6.3.0-h7bae524_2 
  gmpy2              conda-forge/osx-arm64::gmpy2-2.1.5-py312h87fada9_2 
  h2                 conda-forge/noarch::h2-4.1.0-pyhd8ed1ab_0 
  hpack              conda-forge/noarch::hpack-4.0.0-pyh9f0ad1d_0 
  huggingface_hub    conda-forge/noarch::huggingface_hub-0.25.1-pyhd8ed1ab_0 
  hyperframe         conda-forge/noarch::hyperframe-6.0.1-pyhd8ed1ab_0 
  icu                conda-forge/osx-arm64::icu-75.1-hfee45f7_0 
  idna               conda-forge/noarch::idna-3.10-pyhd8ed1ab_0 
  jinja2             conda-forge/noarch::jinja2-3.1.4-pyhd8ed1ab_0 
  ld64_osx-arm64     conda-forge/osx-arm64::ld64_osx-arm64-951.9-hc81425b_1 
  libabseil          conda-forge/osx-arm64::libabseil-20240116.2-cxx17_h00cdb27_1 
  libblas            conda-forge/osx-arm64::libblas-3.9.0-24_osxarm64_accelerate 
  libcblas           conda-forge/osx-arm64::libcblas-3.9.0-24_osxarm64_accelerate 
  libclang-cpp17     conda-forge/osx-arm64::libclang-cpp17-17.0.6-default_h146c034_7 
  libcxx             conda-forge/osx-arm64::libcxx-19.1.0-ha82da77_0 
  libcxx-devel       conda-forge/osx-arm64::libcxx-devel-17.0.6-h86353a2_6 
  libexpat           conda-forge/osx-arm64::libexpat-2.6.3-hf9b8971_0 
  libffi             conda-forge/osx-arm64::libffi-3.4.2-h3422bc3_5 
  libgfortran        conda-forge/osx-arm64::libgfortran-5.0.0-13_2_0_hd922786_3 
  libgfortran5       conda-forge/osx-arm64::libgfortran5-13.2.0-hf226fd6_3 
  libiconv           conda-forge/osx-arm64::libiconv-1.17-h0d3ecfb_2 
  liblapack          conda-forge/osx-arm64::liblapack-3.9.0-24_osxarm64_accelerate 
  liblapacke         conda-forge/osx-arm64::liblapacke-3.9.0-24_osxarm64_accelerate 
  libllvm17          conda-forge/osx-arm64::libllvm17-17.0.6-h5090b49_2 
  libprotobuf        conda-forge/osx-arm64::libprotobuf-4.25.3-hc39d83c_1 
  libsqlite          conda-forge/osx-arm64::libsqlite-3.46.1-hc14010f_0 
  libtorch           conda-forge/osx-arm64::libtorch-2.4.0-cpu_generic_h4365fe2_1 
  libuv              conda-forge/osx-arm64::libuv-1.49.0-hd74edd7_0 
  libxml2            conda-forge/osx-arm64::libxml2-2.12.7-h01dff8b_4 
  libzlib            conda-forge/osx-arm64::libzlib-1.3.1-hfb2fe0b_1 
  llvm-openmp        conda-forge/osx-arm64::llvm-openmp-18.1.8-hde57baf_1 
  llvm-tools         conda-forge/osx-arm64::llvm-tools-17.0.6-h5090b49_2 
  logical-unificati~ conda-forge/noarch::logical-unification-0.4.6-pyhd8ed1ab_0 
  macosx_deployment~ conda-forge/noarch::macosx_deployment_target_osx-arm64-11.0-h6553868_1 
  markupsafe         conda-forge/osx-arm64::markupsafe-2.1.5-py312h024a12e_1 
  minikanren         conda-forge/noarch::minikanren-1.0.3-pyhd8ed1ab_0 
  mpc                conda-forge/osx-arm64::mpc-1.3.1-h8f1351a_1 
  mpfr               conda-forge/osx-arm64::mpfr-4.2.1-hb693164_3 
  mpmath             conda-forge/noarch::mpmath-1.3.0-pyhd8ed1ab_0 
  multipledispatch   conda-forge/noarch::multipledispatch-0.6.0-pyhd8ed1ab_1 
  ncurses            conda-forge/osx-arm64::ncurses-6.5-h7bae524_1 
  networkx           conda-forge/noarch::networkx-3.3-pyhd8ed1ab_1 
  nomkl              conda-forge/noarch::nomkl-1.0-h5ca1d4c_0 
  numpy              conda-forge/osx-arm64::numpy-1.26.4-py312h8442bc7_0 
  openssl            conda-forge/osx-arm64::openssl-3.3.2-h8359307_0 
  packaging          conda-forge/noarch::packaging-24.1-pyhd8ed1ab_0 
  pip                conda-forge/noarch::pip-24.2-pyh8b19718_1 
  psutil             conda-forge/osx-arm64::psutil-6.0.0-py312h024a12e_1 
  pycparser          conda-forge/noarch::pycparser-2.22-pyhd8ed1ab_0 
  pysocks            conda-forge/noarch::pysocks-1.7.1-pyha2e5f31_6 
  pytensor           conda-forge/osx-arm64::pytensor-2.25.4-py312h3f593ad_0 
  pytensor-base      conda-forge/osx-arm64::pytensor-base-2.25.4-py312h02baea5_0 
  python             conda-forge/osx-arm64::python-3.12.6-h739c21a_1_cpython 
  python_abi         conda-forge/osx-arm64::python_abi-3.12-5_cp312 
  pytorch            conda-forge/osx-arm64::pytorch-2.4.0-cpu_generic_py312h6bd8f41_1 
  pyyaml             conda-forge/osx-arm64::pyyaml-6.0.2-py312h024a12e_1 
  readline           conda-forge/osx-arm64::readline-8.2-h92ec313_1 
  requests           conda-forge/noarch::requests-2.32.3-pyhd8ed1ab_0 
  safetensors        conda-forge/osx-arm64::safetensors-0.4.5-py312he431725_0 
  scipy              conda-forge/osx-arm64::scipy-1.14.1-py312heb3a901_0 
  setuptools         conda-forge/noarch::setuptools-75.1.0-pyhd8ed1ab_0 
  sigtool            conda-forge/osx-arm64::sigtool-0.1.3-h44b9a77_0 
  six                conda-forge/noarch::six-1.16.0-pyh6c4a22f_0 
  sleef              conda-forge/osx-arm64::sleef-3.7-h7783ee8_0 
  sympy              conda-forge/noarch::sympy-1.13.3-pypyh2585a3b_103 
  tapi               conda-forge/osx-arm64::tapi-1300.6.5-h03f4b80_0 
  tk                 conda-forge/osx-arm64::tk-8.6.13-h5083fa2_1 
  toolz              conda-forge/noarch::toolz-0.12.1-pyhd8ed1ab_0 
  tqdm               conda-forge/noarch::tqdm-4.66.5-pyhd8ed1ab_0 
  typing-extensions  conda-forge/noarch::typing-extensions-4.12.2-hd8ed1ab_0 
  typing_extensions  conda-forge/noarch::typing_extensions-4.12.2-pyha770c72_0 
  tzdata             conda-forge/noarch::tzdata-2024a-h8827d51_1 
  urllib3            conda-forge/noarch::urllib3-2.2.3-pyhd8ed1ab_0 
  wheel              conda-forge/noarch::wheel-0.44.0-pyhd8ed1ab_0 
  xz                 conda-forge/osx-arm64::xz-5.2.6-h57fd34a_0 
  yaml               conda-forge/osx-arm64::yaml-0.2.5-h3422bc3_2 
  zstandard          conda-forge/osx-arm64::zstandard-0.23.0-py312h15fbf35_1 
  zstd               conda-forge/osx-arm64::zstd-1.5.6-hb46c0d2_0 

It installs successfully, but gives me errors (segmentation faults)

from `python $(python -c "import pathlib, pytensor; print(pathlib.Path(pytensor.__file__).parent / 'misc/check_blas.py')")`
WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
Traceback (most recent call last):
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/vm.py", line 1227, in make_all
    node.op.make_thunk(node, storage_map, compute_map, [], impl=impl)
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/op.py", line 119, in make_thunk
    return self.make_c_thunk(node, storage_map, compute_map, no_recycling)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/op.py", line 84, in make_c_thunk
    outputs = cl.make_thunk(
              ^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1182, in make_thunk
    cthunk, module, in_storage, out_storage, error_storage = self.__compile__(
                                                             ^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1103, in __compile__
    thunk, module = self.cthunk_factory(
                    ^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1627, in cthunk_factory
    module = cache.module_from_key(key=key, lnk=self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/cmodule.py", line 1255, in module_from_key
    module = lnk.compile_cmodule(location)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1528, in compile_cmodule
    module = c_compiler.compile_str(
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/cmodule.py", line 2654, in compile_str
    raise CompileError(
pytensor.link.c.exceptions.CompileError: Compilation failed (return status=1):
/Users/daniel/.pyenv/versions/voxel-bayes-3.12/bin/clang++ -dynamiclib -g -O3 -fno-math-errno -Wno-unused-label -Wno-unused-variable -Wno-write-strings -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -fPIC -undefined dynamic_lookup -I/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/numpy/core/include -I/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include/python3.12 -I/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/c_code -L/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib -fvisibility=hidden -o /Users/daniel/.pytensor/compiledir_macOS-15.0-arm64-arm-64bit-arm-3.12.6-64/tmp4ndb3uui/mbe23404cc39ec1a668b1ae18701f267b8ee61fabc03b6968263aa4f888d9dec6.so /Users/daniel/.pytensor/compiledir_macOS-15.0-arm64-arm-64bit-arm-3.12.6-64/tmp4ndb3uui/mod.cpp
clang++: error: unable to execute command: Segmentation fault: 11
clang++: error: linker command failed due to signal (use -v to see invocation)


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/misc/check_blas.py", line 274, in <module>
    t, impl = execute(
              ^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/misc/check_blas.py", line 57, in execute
    f = pytensor.function([], updates=[(c, 0.4 * c + 0.8 * dot(a, b))])
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/compile/function/__init__.py", line 318, in function
    fn = pfunc(
         ^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/compile/function/pfunc.py", line 465, in pfunc
    return orig_function(
           ^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/compile/function/types.py", line 1762, in orig_function
    fn = m.create(defaults)
         ^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/compile/function/types.py", line 1654, in create
    _fn, _i, _o = self.linker.make_thunk(
                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/basic.py", line 245, in make_thunk
    return self.make_all(
           ^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/vm.py", line 1236, in make_all
    raise_with_op(fgraph, node)
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/utils.py", line 524, in raise_with_op
    raise exc_value.with_traceback(exc_trace)
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/vm.py", line 1227, in make_all
    node.op.make_thunk(node, storage_map, compute_map, [], impl=impl)
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/op.py", line 119, in make_thunk
    return self.make_c_thunk(node, storage_map, compute_map, no_recycling)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/op.py", line 84, in make_c_thunk
    outputs = cl.make_thunk(
              ^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1182, in make_thunk
    cthunk, module, in_storage, out_storage, error_storage = self.__compile__(
                                                             ^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1103, in __compile__
    thunk, module = self.cthunk_factory(
                    ^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1627, in cthunk_factory
    module = cache.module_from_key(key=key, lnk=self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/cmodule.py", line 1255, in module_from_key
    module = lnk.compile_cmodule(location)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1528, in compile_cmodule
    module = c_compiler.compile_str(
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/cmodule.py", line 2654, in compile_str
    raise CompileError(
pytensor.link.c.exceptions.CompileError: Compilation failed (return status=1):
/Users/daniel/.pyenv/versions/voxel-bayes-3.12/bin/clang++ -dynamiclib -g -O3 -fno-math-errno -Wno-unused-label -Wno-unused-variable -Wno-write-strings -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -fPIC -undefined dynamic_lookup -I/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/numpy/core/include -I/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include/python3.12 -I/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/pytensor/link/c/c_code -L/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib -fvisibility=hidden -o /Users/daniel/.pytensor/compiledir_macOS-15.0-arm64-arm-64bit-arm-3.12.6-64/tmp4ndb3uui/mbe23404cc39ec1a668b1ae18701f267b8ee61fabc03b6968263aa4f888d9dec6.so /Users/daniel/.pytensor/compiledir_macOS-15.0-arm64-arm-64bit-arm-3.12.6-64/tmp4ndb3uui/mod.cpp
clang++: error: unable to execute command: Segmentation fault: 11
clang++: error: linker command failed due to signal (use -v to see invocation)

Apply node that caused the error: Gemm{inplace}(<Matrix(float64, shape=(?, ?))>, 0.8, <Matrix(float64, shape=(?, ?))>, <Matrix(float64, shape=(?, ?))>, 0.4)
Toposort index: 0
Inputs types: [TensorType(float64, shape=(None, None)), TensorType(float64, shape=()), TensorType(float64, shape=(None, None)), TensorType(float64, shape=(None, None)), TensorType(float64, shape=())]

HINT: Use a linker other than the C linker to print the inputs' shapes and strides.
HINT: Re-running with most PyTensor optimizations disabled could provide a back-trace showing when this node was created. This can be done by setting the PyTensor flag 'optimizer=fast_compile'. If that does not work, PyTensor optimizations can be disabled with 'optimizer=None'.
HINT: Use the PyTensor flag `exception_verbosity=high` for a debug print-out and storage map footprint of this Apply node.
zsh: command not found: from  

Some of those errors are discussed here: https://discourse.pymc.io/t/environment-not-working-anymore-on-macos/14210
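The check above can also be launched from Python rather than through shell command substitution, which sidesteps copy-paste problems such as the stray `from` that zsh complains about at the end of the log. A minimal sketch, assuming only the standard library; `script_in_package` and `run_packaged_script` are hypothetical helper names, not part of pytensor:

```python
import importlib.util
import pathlib
import subprocess
import sys


def script_in_package(package: str, relative_path: str) -> pathlib.Path:
    """Return the path of a file shipped inside an installed package.

    find_spec locates the top-level package without importing it.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"{package!r} was not found in this environment")
    return pathlib.Path(spec.origin).parent / relative_path


def run_packaged_script(package: str, relative_path: str) -> int:
    """Run the script with the current interpreter and return its exit code."""
    script = script_in_package(package, relative_path)
    return subprocess.run([sys.executable, str(script)], check=False).returncode


# Usage (requires pytensor in the active environment):
# run_packaged_script("pytensor", "misc/check_blas.py")
```

Using `sys.executable` keeps the run inside the same environment as the interpreter that located the script, which matters when several pyenv and conda prefixes are on the path.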

@maresb
Contributor

maresb commented Sep 28, 2024

@danieltomasz, could you please try using the pytensor-base package instead of pytensor?

@danieltomasz
Author

/.pyenv/versions/miniconda3-3.12-24.7.1-0/bin/conda create -n voxel-bayes-3.12  -c conda-forge pytensor-base
Channels:
 - conda-forge
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12

  added / updated specs:
    - pytensor-base


The following NEW packages will be INSTALLED:

  bzip2              conda-forge/osx-arm64::bzip2-1.0.8-h99b78c6_7 
  ca-certificates    conda-forge/osx-arm64::ca-certificates-2024.8.30-hf0a4a13_0 
  cons               conda-forge/noarch::cons-0.4.6-pyhd8ed1ab_0 
  etuples            conda-forge/noarch::etuples-0.3.9-pyhd8ed1ab_0 
  filelock           conda-forge/noarch::filelock-3.16.1-pyhd8ed1ab_0 
  libblas            conda-forge/osx-arm64::libblas-3.9.0-24_osxarm64_openblas 
  libcblas           conda-forge/osx-arm64::libcblas-3.9.0-24_osxarm64_openblas 
  libcxx             conda-forge/osx-arm64::libcxx-19.1.0-ha82da77_0 
  libexpat           conda-forge/osx-arm64::libexpat-2.6.3-hf9b8971_0 
  libffi             conda-forge/osx-arm64::libffi-3.4.2-h3422bc3_5 
  libgfortran        conda-forge/osx-arm64::libgfortran-5.0.0-13_2_0_hd922786_3 
  libgfortran5       conda-forge/osx-arm64::libgfortran5-13.2.0-hf226fd6_3 
  liblapack          conda-forge/osx-arm64::liblapack-3.9.0-24_osxarm64_openblas 
  libopenblas        conda-forge/osx-arm64::libopenblas-0.3.27-openmp_h517c56d_1 
  libsqlite          conda-forge/osx-arm64::libsqlite-3.46.1-hc14010f_0 
  libzlib            conda-forge/osx-arm64::libzlib-1.3.1-hfb2fe0b_1 
  llvm-openmp        conda-forge/osx-arm64::llvm-openmp-18.1.8-hde57baf_1 
  logical-unificati~ conda-forge/noarch::logical-unification-0.4.6-pyhd8ed1ab_0 
  minikanren         conda-forge/noarch::minikanren-1.0.3-pyhd8ed1ab_0 
  multipledispatch   conda-forge/noarch::multipledispatch-0.6.0-pyhd8ed1ab_1 
  ncurses            conda-forge/osx-arm64::ncurses-6.5-h7bae524_1 
  numpy              conda-forge/osx-arm64::numpy-1.26.4-py312h8442bc7_0 
  openssl            conda-forge/osx-arm64::openssl-3.3.2-h8359307_0 
  pip                conda-forge/noarch::pip-24.2-pyh8b19718_1 
  pytensor-base      conda-forge/osx-arm64::pytensor-base-2.25.4-py312h02baea5_0 
  python             conda-forge/osx-arm64::python-3.12.6-h739c21a_1_cpython 
  python_abi         conda-forge/osx-arm64::python_abi-3.12-5_cp312 
  readline           conda-forge/osx-arm64::readline-8.2-h92ec313_1 
  scipy              conda-forge/osx-arm64::scipy-1.14.1-py312heb3a901_0 
  setuptools         conda-forge/noarch::setuptools-75.1.0-pyhd8ed1ab_0 
  six                conda-forge/noarch::six-1.16.0-pyh6c4a22f_0 
  tk                 conda-forge/osx-arm64::tk-8.6.13-h5083fa2_1 
  toolz              conda-forge/noarch::toolz-0.12.1-pyhd8ed1ab_0 
  tzdata             conda-forge/noarch::tzdata-2024a-h8827d51_1 
  wheel              conda-forge/noarch::wheel-0.44.0-pyhd8ed1ab_0 
  xz                 conda-forge/osx-arm64::xz-5.2.6-h57fd34a_0 

@maresb
Contributor

maresb commented Sep 28, 2024

Ok, so that's installing numpy with openblas. And what happens now if you try and force accelerate?

@danieltomasz
Author

@maresb, as I wrote here, it installs fine, but testing it gives a segmentation fault: #1005 (comment)

@danieltomasz
Author

Also, here is the output from numpy when I force accelerate in conda with `~/.pyenv/versions/miniconda3-3.12-24.7.1-0/bin/conda create -n voxel-bayes-3.12 -c conda-forge pytensor-base "libblas=*=*accelerate"`:

>>> import numpy as np
>>> np.show_config()
/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/numpy/__config__.py:155: UserWarning: Install `pyyaml` for better output
  warnings.warn("Install `pyyaml` for better output", stacklevel=1)
{
  "Compilers": {
    "c": {
      "name": "clang",
      "linker": "ld64",
      "version": "16.0.6",
      "commands": "arm64-apple-darwin20.0.0-clang",
      "args": "-ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4, -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix, -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -mmacosx-version-min=11.0",
      "linker args": "-Wl,-headerpad_max_install_names, -Wl,-dead_strip_dylibs, -Wl,-rpath,/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib, -L/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib, -ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4, -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix, -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -mmacosx-version-min=11.0"
    },
    "cython": {
      "name": "cython",
      "linker": "cython",
      "version": "3.0.8",
      "commands": "cython"
    },
    "c++": {
      "name": "clang",
      "linker": "ld64",
      "version": "16.0.6",
      "commands": "arm64-apple-darwin20.0.0-clang++",
      "args": "-ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -stdlib=libc++, -fvisibility-inlines-hidden, -fmessage-length=0, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4, -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix, -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -mmacosx-version-min=11.0",
      "linker args": "-Wl,-headerpad_max_install_names, -Wl,-dead_strip_dylibs, -Wl,-rpath,/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib, -L/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib, -ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -stdlib=libc++, -fvisibility-inlines-hidden, -fmessage-length=0, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4, -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix, -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -mmacosx-version-min=11.0"
    }
  },
  "Machine Information": {
    "host": {
      "cpu": "arm64",
      "family": "aarch64",
      "endian": "little",
      "system": "darwin"
    },
    "build": {
      "cpu": "aarch64",
      "family": "aarch64",
      "endian": "little",
      "system": "darwin"
    },
    "cross-compiled": true
  },
  "Build Dependencies": {
    "blas": {
      "name": "blas",
      "found": true,
      "version": "3.9.0",
      "detection method": "pkgconfig",
      "include directory": "/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include",
      "lib directory": "/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib",
      "openblas configuration": "unknown",
      "pc file directory": "/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib/pkgconfig"
    },
    "lapack": {
      "name": "dep4569863840",
      "found": true,
      "version": "1.26.4",
      "detection method": "internal",
      "include directory": "unknown",
      "lib directory": "unknown",
      "openblas configuration": "unknown",
      "pc file directory": "unknown"
    }
  },
  "Python Information": {
    "path": "/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/bin/python",
    "version": "3.12"
  },
  "SIMD Extensions": {
    "baseline": [
      "NEON",
      "NEON_FP16",
      "NEON_VFPV4",
      "ASIMD"
    ],
    "found": [
      "ASIMDHP"
    ],
    "not found": [
      "ASIMDFHM"
    ]
  }
}

@danieltomasz
Author

danieltomasz commented Sep 28, 2024

When installing only numpy with forced accelerate

Python 3.12.6 | packaged by conda-forge | (main, Sep 22 2024, 14:07:06) [Clang 17.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> print(np.__version__)
2.1.1
>>> np.show_config()
/Users/daniel/.pyenv/versions/voxel-bayes-3.12/lib/python3.12/site-packages/numpy/__config__.py:155: UserWarning: Install `pyyaml` for better output
  warnings.warn("Install `pyyaml` for better output", stacklevel=1)
{
  "Compilers": {
    "c": {
      "name": "clang",
      "linker": "ld64",
      "version": "17.0.6",
      "commands": "arm64-apple-darwin20.0.0-clang",
      "args": "-ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1725411805471/work=/usr/local/src/conda/numpy-2.1.1, -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix, -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -mmacosx-version-min=11.0, -mmacosx-version-min=11.0",
      "linker args": "-Wl,-headerpad_max_install_names, -Wl,-dead_strip_dylibs, -Wl,-rpath,/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib, -L/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib, -ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1725411805471/work=/usr/local/src/conda/numpy-2.1.1, -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix, -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -mmacosx-version-min=11.0, -mmacosx-version-min=11.0"
    },
    "cython": {
      "name": "cython",
      "linker": "cython",
      "version": "3.0.11",
      "commands": "cython"
    },
    "c++": {
      "name": "clang",
      "linker": "ld64",
      "version": "17.0.6",
      "commands": "arm64-apple-darwin20.0.0-clang++",
      "args": "-ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -stdlib=libc++, -fvisibility-inlines-hidden, -fmessage-length=0, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1725411805471/work=/usr/local/src/conda/numpy-2.1.1, -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix, -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -mmacosx-version-min=11.0, -mmacosx-version-min=11.0",
      "linker args": "-Wl,-headerpad_max_install_names, -Wl,-dead_strip_dylibs, -Wl,-rpath,/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib, -L/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib, -ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -stdlib=libc++, -fvisibility-inlines-hidden, -fmessage-length=0, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1725411805471/work=/usr/local/src/conda/numpy-2.1.1, -fdebug-prefix-map=/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12=/usr/local/src/conda-prefix, -D_FORTIFY_SOURCE=2, -isystem, /Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include, -mmacosx-version-min=11.0, -mmacosx-version-min=11.0"
    }
  },
  "Machine Information": {
    "host": {
      "cpu": "arm64",
      "family": "aarch64",
      "endian": "little",
      "system": "darwin"
    },
    "build": {
      "cpu": "aarch64",
      "family": "aarch64",
      "endian": "little",
      "system": "darwin"
    },
    "cross-compiled": true
  },
  "Build Dependencies": {
    "blas": {
      "name": "blas",
      "found": true,
      "version": "3.9.0",
      "detection method": "pkgconfig",
      "include directory": "/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include",
      "lib directory": "/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib",
      "openblas configuration": "unknown",
      "pc file directory": "/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib/pkgconfig"
    },
    "lapack": {
      "name": "lapack",
      "found": true,
      "version": "3.9.0",
      "detection method": "pkgconfig",
      "include directory": "/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/include",
      "lib directory": "/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib",
      "openblas configuration": "unknown",
      "pc file directory": "/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/lib/pkgconfig"
    }
  },
  "Python Information": {
    "path": "/Users/daniel/.pyenv/versions/miniconda3-3.12-24.7.1-0/envs/voxel-bayes-3.12/bin/python",
    "version": "3.12"
  },
  "SIMD Extensions": {
    "baseline": [
      "NEON",
      "NEON_FP16",
      "NEON_VFPV4",
      "ASIMD"
    ],
    "found": [
      "ASIMDHP"
    ],
    "not found": [
      "ASIMDFHM"
    ]
  }
}

I need to leave, but I might try something out of the box, such as checking whether installing pytensor via "pixi" pulls in accelerate (something in my setup may affect how conda links libraries, so a different package manager might help). Maybe someone with Apple Silicon can try to replicate this in the meantime.

@maresb
Contributor

maresb commented Sep 28, 2024

Thanks so much for all the diagnosis @danieltomasz!

For when you find some more time, I wonder if lower versions of NumPy might work? For example <2?

@danieltomasz
Author

danieltomasz commented Sep 29, 2024

Unfortunately, neither pixi, nor changing the Python version to 3.11, nor asking for a lower version of numpy provides the accelerate libraries (openblas is the default). When I installed numpy via pip, it installed numpy 2.2 with accelerate, but adding pytensor to this environment downgrades numpy to one that uses openblas64

@maresb
Contributor

maresb commented Sep 29, 2024

but adding pytensor to this environment downgrades numpy to one that uses openblas64

Thanks @danieltomasz for getting back to me!

Are you able to find some earlier conda-forge version of numpy that works with accelerate on your system?

@danieltomasz
Author

danieltomasz commented Sep 29, 2024

Hi @maresb, I think conda and numpy worked fine earlier (the latest numpy version <2 is from February). I cannot pinpoint the exact moment, but it was probably the update to MacOS 15 about 2 weeks ago that changed things (I also think conda may have recently changed the clang compiler it uses with the Python it ships, but I am not sure about this).

What could be worth checking:

  1. Whether someone with Apple Silicon who is still on MacOS 14 can install pytensor with accelerate
  2. Whether other people on MacOS 15 and Apple Silicon can reproduce this behaviour

@danieltomasz
Author

danieltomasz commented Sep 29, 2024

This is just my intuition: forcing BLAS to accelerate installs fine, but it then produces the error at run time because of compiler problems on MacOS 15, whereas everything works with openblas.

Also, there were updates to accelerate in MacOS 15 (https://developer.apple.com/documentation/accelerate/blas/), and these discussions might be relevant: conda-forge/blas-feedstock#103 and conda-forge/numpy-feedstock#253

@danieltomasz
Author

Quick update: when I install via pip

pip install -U --no-binary :all: numpy pytensor

numpy seems to use accelerate, but pytensor fails to do so.
The result of running

python $(python -c "import pathlib, pytensor; print(pathlib.Path(pytensor.__file__).parent / 'misc/check_blas.py')")

is below

WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.

        Some results that you can compare against. They were 10 executions
        of gemm in float64 with matrices of shape 2000x2000 (M=N=K=2000).
        All memory layout was in C order.

        CPU tested: Xeon E5345(2.33Ghz, 8M L2 cache, 1333Mhz FSB),
                    Xeon E5430(2.66Ghz, 12M L2 cache, 1333Mhz FSB),
                    Xeon E5450(3Ghz, 12M L2 cache, 1333Mhz FSB),
                    Xeon X5560(2.8Ghz, 12M L2 cache, hyper-threads?)
                    Core 2 E8500, Core i7 930(2.8Ghz, hyper-threads enabled),
                    Core i7 950(3.07GHz, hyper-threads enabled)
                    Xeon X5550(2.67GHz, 8M l2 cache?, hyper-threads enabled)


        Libraries tested:
            * numpy with ATLAS from distribution (FC9) package (1 thread)
            * manually compiled numpy and ATLAS with 2 threads
            * goto 1.26 with 1, 2, 4 and 8 threads
            * goto2 1.13 compiled with multiple threads enabled

                          Xeon   Xeon   Xeon  Core2 i7    i7     Xeon   Xeon
        lib/nb threads    E5345  E5430  E5450 E8500 930   950    X5560  X5550

        numpy 1.3.0 blas                                                775.92s
        numpy_FC9_atlas/1 39.2s  35.0s  30.7s 29.6s 21.5s 19.60s
        goto/1            18.7s  16.1s  14.2s 13.7s 16.1s 14.67s
        numpy_MAN_atlas/2 12.0s  11.6s  10.2s  9.2s  9.0s
        goto/2             9.5s   8.1s   7.1s  7.3s  8.1s  7.4s
        goto/4             4.9s   4.4s   3.7s  -     4.1s  3.8s
        goto/8             2.7s   2.4s   2.0s  -     4.1s  3.8s
        openblas/1                                        14.04s
        openblas/2                                         7.16s
        openblas/4                                         3.71s
        openblas/8                                         3.70s
        mkl 11.0.083/1            7.97s
        mkl 10.2.2.025/1                                         13.7s
        mkl 10.2.2.025/2                                          7.6s
        mkl 10.2.2.025/4                                          4.0s
        mkl 10.2.2.025/8                                          2.0s
        goto2 1.13/1                                                     14.37s
        goto2 1.13/2                                                      7.26s
        goto2 1.13/4                                                      3.70s
        goto2 1.13/8                                                      1.94s
        goto2 1.13/16                                                     3.16s

        Test time in float32. There were 10 executions of gemm in
        float32 with matrices of shape 5000x5000 (M=N=K=5000)
        All memory layout was in C order.


        cuda version      8.0    7.5    7.0
        gpu
        M40               0.45s  0.47s
        k80               0.92s  0.96s
        K6000/NOECC       0.71s         0.69s
        P6000/NOECC       0.25s

        Titan X (Pascal)  0.28s
        GTX Titan X       0.45s  0.45s  0.47s
        GTX Titan Black   0.66s  0.64s  0.64s
        GTX 1080          0.35s
        GTX 980 Ti               0.41s
        GTX 970                  0.66s
        GTX 680                         1.57s
        GTX 750 Ti               2.01s  2.01s
        GTX 750                  2.46s  2.37s
        GTX 660                  2.32s  2.32s
        GTX 580                  2.42s
        GTX 480                  2.87s
        TX1                             7.6s (float32 storage and computation)
        GT 610                          33.5s

Some PyTensor flags:
    blas__ldflags=
    compiledir= /Users/daniel/.pytensor/compiledir_macOS-15.1-arm64-arm-64bit-arm-3.12.7-64
    floatX= float64
    device= cpu
Some OS information:
    sys.platform= darwin
    sys.version= 3.12.7 (main, Oct 11 2024, 01:24:59) [Clang 16.0.0 (clang-1600.0.26.3)]
    sys.prefix= /Users/daniel/.pyenv/versions/3.12.7
Some environment variables:
    MKL_NUM_THREADS= None
    OMP_NUM_THREADS= None
    GOTO_NUM_THREADS= None

Numpy config: (used when the PyTensor flag "blas__ldflags" is empty)
Build Dependencies:
  blas:
    detection method: system
    found: true
    include directory: unknown
    lib directory: unknown
    name: accelerate
    openblas configuration: unknown
    pc file directory: unknown
    version: unknown
  lapack:
    detection method: internal
    found: true
    include directory: unknown
    lib directory: unknown
    name: dep4409437856
    openblas configuration: unknown
    pc file directory: unknown
    version: 1.26.4
Compilers:
  c:
    commands: cc
    linker: ld64
    name: clang
    version: 16.0.0
  c++:
    commands: c++
    linker: ld64
    name: clang
    version: 16.0.0
  cython:
    commands: cython
    linker: cython
    name: cython
    version: 3.0.11
Machine Information:
  build:
    cpu: aarch64
    endian: little
    family: aarch64
    system: darwin
  host:
    cpu: aarch64
    endian: little
    family: aarch64
    system: darwin
Python Information:
  path: /Users/daniel/.pyenv/versions/3.12.7/bin/python3.12
  version: '3.12'
SIMD Extensions:
  baseline:
  - NEON
  - NEON_FP16
  - NEON_VFPV4
  - ASIMD
  found:
  - ASIMDHP
  not found:
  - ASIMDFHM

Numpy dot module: numpy
Numpy location: /Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/numpy/__init__.py
Numpy version: 1.26.4

We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).

Total execution time: 13.11s on CPU (with direct PyTensor binding to blas).

Try to run this script a few times. Experience shows that the first time is not as fast as following calls. The difference is not big, but consistent.

I am now on MacOS 15.1

@lucianopaz
Contributor

lucianopaz commented Oct 30, 2024

@danieltomasz, the phrase Numpy config: (used when the PyTensor flag "blas__ldflags" is empty) is left over from the old Theano code and we haven't updated it. Currently, numpy's config information is somewhat outdated in light of the newer build chain that they use. For that reason, we had to rely on something different. To get a better picture of what's going on, please check out the branch from this PR and run the following:

import logging

logger = logging.getLogger("pytensor.link.c.cmodule")
logger.setLevel(logging.DEBUG)

import pytensor

After the last import, you should see all of the detailed logs from cmodule. I would like to ask you to paste all the output you get here.

I would like to see what errors pytensor runs into when it tries to determine the default_blas_flags. You'll see that pytensor first tries to link against MKL (which will obviously fail on M* chips), and it should log some information about not finding the libraries. The important thing to me is what happens when it tries to find blas and cblas. Both of these should be linkable from the Accelerate framework that macOS provides, via clang++'s search directories.
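The search order described here (MKL first, then the generic BLAS candidates, then empty flags) can be pictured as a first-match loop. The following is only a sketch under stated assumptions: `pick_blas_flags`, `CANDIDATES`, and `fake_can_link` are hypothetical names, and the flag values are placeholders, not the flags pytensor actually tests.

```python
from __future__ import annotations

from typing import Callable, Iterable, Sequence


def pick_blas_flags(
    candidates: Iterable[tuple[str, Sequence[str]]],
    can_link: Callable[[Sequence[str]], bool],
) -> tuple[str | None, list[str]]:
    """Return the first (label, flags) pair accepted by `can_link`.

    If no candidate links, return (None, []), which corresponds to
    "leave the flags empty" in the debug log.
    """
    for label, flags in candidates:
        if can_link(flags):
            return label, list(flags)
    return None, []


# Placeholder candidates; labels loosely follow the debug log messages.
CANDIDATES = [
    ("mkl", ["-lmkl_core"]),
    ("lapack+blas", ["-llapack", "-lblas"]),
    ("blas", ["-lblas"]),
    ("openblas", ["-lopenblas"]),
    ("accelerate", ["-framework", "Accelerate"]),  # illustrative extra entry
]


def fake_can_link(flags: Sequence[str]) -> bool:
    # Stand-in for a real link test: pretend only Accelerate is available.
    return "Accelerate" in flags


label, flags = pick_blas_flags(CANDIDATES, fake_can_link)
print(label, flags)
```

With a checker that rejects everything, the function returns `(None, [])`, which is the situation in which the NumPy C-API fallback warning would be expected.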

@danieltomasz
Author

The environment wasn't completely clean, but I uninstalled pytensor and numpy and then installed them again via

pip install --no-binary :all: numpy git+https://github.com/pymc-devs/pytensor.git@b314ca67e841b6fc0aac5ea7b5bcc11700565b1e

Output from pytensor

DEBUG (pytensor.link.c.cmodule): Will search for BLAS libraries in the following directories:
/Library/Developer/CommandLineTools/usr/lib/clang/16
/Users/daniel/.pyenv/versions/3.12.7/lib
DEBUG (pytensor.link.c.cmodule): Checking MKL flags with intel threading
DEBUG (pytensor.link.c.cmodule): Required file 'mkl_core' not found
DEBUG (pytensor.link.c.cmodule): Required file mkl_core not found
DEBUG (pytensor.link.c.cmodule): Checking MKL flags with GNU OpenMP threading
DEBUG (pytensor.link.c.cmodule): Required file 'mkl_core' not found
DEBUG (pytensor.link.c.cmodule): Required file mkl_core not found
DEBUG (pytensor.link.c.cmodule): Checking Lapack + blas
DEBUG (pytensor.link.c.cmodule): Required file 'lapack' not found
DEBUG (pytensor.link.c.cmodule): Required file lapack not found
DEBUG (pytensor.link.c.cmodule): Checking blas alone
DEBUG (pytensor.link.c.cmodule): Required file 'blas' not found
DEBUG (pytensor.link.c.cmodule): Required file blas not found
DEBUG (pytensor.link.c.cmodule): Checking openblas
DEBUG (pytensor.link.c.cmodule): Required file 'openblas' not found
DEBUG (pytensor.link.c.cmodule): Required file openblas not found
DEBUG (pytensor.link.c.cmodule): Failed to identify blas ldflags. Will leave them empty.
WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.

And in the same session

>>> import numpy as np
>>> np.show_config()
Build Dependencies:
  blas:
    detection method: system
    found: true
    include directory: unknown
    lib directory: unknown
    name: accelerate
    openblas configuration: unknown
    pc file directory: unknown
    version: unknown
  lapack:
    detection method: internal
    found: true
    include directory: unknown
    lib directory: unknown
    name: dep4409437856
    openblas configuration: unknown
    pc file directory: unknown
    version: 1.26.4
Compilers:
  c:
    commands: cc
    linker: ld64
    name: clang
    version: 16.0.0
  c++:
    commands: c++
    linker: ld64
    name: clang
    version: 16.0.0
  cython:
    commands: cython
    linker: cython
    name: cython
    version: 3.0.11
Machine Information:
  build:
    cpu: aarch64
    endian: little
    family: aarch64
    system: darwin
  host:
    cpu: aarch64
    endian: little
    family: aarch64
    system: darwin
Python Information:
  path: /Users/daniel/.pyenv/versions/3.12.7/bin/python3.12
  version: '3.12'
SIMD Extensions:
  baseline:
  - NEON
  - NEON_FP16
  - NEON_VFPV4
  - ASIMD
  found:
  - ASIMDHP
  not found:
  - ASIMDFHM

@lucianopaz
Contributor

Thanks @danieltomasz, the logs say that we couldn’t find a blas library in the search directories. I can think of a couple of dumb causes, but I’ll have to ask you to run a few more tests.

  1. Can you check what path to an executable you get as pytensor.config.cxx? Is it the system clang or is it the conda clang?
  2. Can you try to run that cxx executable in a terminal as cxx -print-search-dirs? What directories do you get in the libraries entry? Is the conda env lib path included?
  3. Can you verify if there is any file that has the name blas in the conda env lib directory? If there is, what’s the file name extension?
  4. Can you try to run pytensor.link.c.cmodule.try_blas_flag(["-framework", "Accelerate"]) and see if you get something?
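Checks 2 and 3 can be scripted. The sketch below uses only the standard library; `parse_search_dirs`, `compiler_search_dirs`, and `blas_like_files` are hypothetical helper names, and the parser assumes the `key: =dir1:dir2` layout that clang prints on macOS.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def parse_search_dirs(output: str) -> dict[str, list[str]]:
    """Parse `<compiler> -print-search-dirs` output into {key: [dirs]}."""
    result: dict[str, list[str]] = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a "key:" prefix
        value = value.strip().lstrip("=")
        # Directories are assumed to be colon-separated (POSIX convention).
        result[key.strip()] = [d for d in value.split(":") if d]
    return result


def compiler_search_dirs(cxx: str) -> dict[str, list[str]]:
    """Run the compiler and parse its search directories."""
    proc = subprocess.run(
        [cxx, "-print-search-dirs"], capture_output=True, text=True, check=True
    )
    return parse_search_dirs(proc.stdout)


def blas_like_files(directory: str) -> list[str]:
    """List file names containing 'blas' in `directory` (empty if missing)."""
    path = Path(directory)
    if not path.is_dir():
        return []
    return sorted(p.name for p in path.iterdir() if "blas" in p.name.lower())


# Parsing sample text in the same layout as clang's output.
sample = (
    "programs: =/Library/Developer/CommandLineTools/usr/bin\n"
    "libraries: =/Library/Developer/CommandLineTools/usr/lib/clang/16\n"
)
dirs = parse_search_dirs(sample)
print(dirs["libraries"])
```

To use it against a real compiler, call `compiler_search_dirs` with the path reported by `pytensor.config.cxx`, then pass each entry under `"libraries"` to `blas_like_files`.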

@danieltomasz
Author

Hi @lucianopaz, thanks for all the comments!
In the case above I installed Python via pyenv (CPython 3.12.7).
The result of 1 points to the pyenv shim /Users/daniel/.pyenv/shims/clang++

❯ /Users/daniel/.pyenv/shims/clang++ --version
Apple clang version 16.0.0 (clang-1600.0.26.4)
Target: arm64-apple-darwin24.1.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
❯ /Users/daniel/.pyenv/shims/clang++ -print-search-dirs
programs: =/Library/Developer/CommandLineTools/usr/bin
libraries: =/Library/Developer/CommandLineTools/usr/lib/clang/16

Regarding 2 and 3:
Earlier in this thread I tried installing pytensor via miniconda (also managed via pyenv). It either installed openblas or, when I tried to force accelerate via

~/.pyenv/versions/miniconda3-3.12-24.7.1-0/bin/conda create -n voxel-bayes-3.12  -c conda-forge pytensor-base  "libblas=*=*accelerate" 

accelerate was installed, but then the following error occurs: #1005 (comment). This also happens with the newer version of miniconda.
I checked whether my setup might be the cause, but I was getting similar errors with a pixi conda install.

Regarding 4, in the pyenv installed cpython:

>>> pytensor.link.c.cmodule.try_blas_flag(["-framework", "Accelerate"])
'-framework Accelerate'
>>>

It would be great if someone else on an Apple processor could confirm whether this is peculiar to my setup or something more general. (I started having this problem after updating to MacOS 15. MacOS 15 ships accelerate with blas 3.11; I wonder if this might be the problem.)

@lucianopaz
Contributor

That last thing that you tried means that we could add those flags as a check and Mac would link to Accelerate. I'll open a small patch PR so that you can try it out.
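For readers who want to run a comparable check outside pytensor, here is a rough standalone sketch: it compiles, links, and runs a tiny program that calls `cblas_ddot` with a given set of linker flags. `can_link` and `TEST_SOURCE` are hypothetical names, and this is not pytensor's implementation of `try_blas_flag`.

```python
import subprocess
import tempfile
from pathlib import Path

# Tiny C++ program that calls one CBLAS routine and checks the result.
TEST_SOURCE = """
extern "C" double cblas_ddot(int n, const double *x, int incx,
                             const double *y, int incy);
int main() {
    double x[2] = {1.0, 2.0};
    return cblas_ddot(2, x, 1, x, 1) == 5.0 ? 0 : 1;
}
"""


def can_link(cxx, flags, timeout=60):
    """Return True if `cxx` builds and runs the probe with `flags`."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "blas_probe.cpp"
        exe = Path(tmp) / "blas_probe"
        src.write_text(TEST_SOURCE)
        try:
            build = subprocess.run(
                [cxx, str(src), "-o", str(exe), *flags],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            if build.returncode != 0:
                return False  # compile or link step failed
            run = subprocess.run([str(exe)], capture_output=True, timeout=timeout)
        except (OSError, subprocess.TimeoutExpired):
            # Compiler missing, not executable, or the probe hung.
            return False
        return run.returncode == 0


# Example call (result depends on the machine):
# can_link("clang++", ["-framework", "Accelerate"])
```

Running the probe as well as linking it catches cases where linking succeeds but the library cannot be loaded at run time.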

@lucianopaz
Contributor

@danieltomasz, try this PR out. It should set blas__ldflags to the Accelerate framework.

@danieltomasz
Author

@lucianopaz seems promising

Python 3.12.7 (main, Oct 31 2024, 00:25:36) [Clang 16.0.0 (clang-1600.0.26.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import logging
>>> logger = logging.getLogger("pytensor.link.c.cmodule")
>>> logger.setLevel(logging.DEBUG)
>>> import pytensor
DEBUG (pytensor.link.c.cmodule): Will search for BLAS libraries in the following directories:
/Library/Developer/CommandLineTools/usr/lib/clang/16
/Users/daniel/.pyenv/versions/3.12.7/lib
DEBUG (pytensor.link.c.cmodule): Checking MKL flags with intel threading
DEBUG (pytensor.link.c.cmodule): Required file 'mkl_core' not found
DEBUG (pytensor.link.c.cmodule): Required file mkl_core not found
DEBUG (pytensor.link.c.cmodule): Checking MKL flags with GNU OpenMP threading
DEBUG (pytensor.link.c.cmodule): Required file 'mkl_core' not found
DEBUG (pytensor.link.c.cmodule): Required file mkl_core not found
DEBUG (pytensor.link.c.cmodule): Checking Accelerate framework
INFO (pytensor.link.c.cmodule): g++ -march=native selected lines: ['"/Library/Developer/CommandLineTools/usr/bin/clang" -cc1 -triple arm64-apple-macosx15.0.0 -Wundef-prefix=TARGET_OS_ -Wdeprecated-objc-isa-usage -Werror=deprecated-objc-isa-usage -Werror=implicit-function-declaration -E -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name - -mrelocation-model pic -pic-level 2 -mframe-pointer=non-leaf -fno-strict-return -ffp-contract=on -fno-rounding-math -funwind-tables=1 -fobjc-msgsend-selector-stubs -target-sdk-version=15.1 -fvisibility-inlines-hidden-static-local-var -fno-modulemap-allow-subdirectory-search -target-cpu apple-m1 -target-feature +neon -target-feature +v8.5a -target-feature +zcm -target-feature +zcz -target-abi darwinpcs -debugger-tuning=lldb -target-linker-version 1115.7.3 -v -fcoverage-compilation-dir=/Users/daniel -resource-dir /Library/Developer/CommandLineTools/usr/lib/clang/16 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -internal-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/local/include -internal-isystem /Library/Developer/CommandLineTools/usr/lib/clang/16/include -internal-externc-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -internal-externc-isystem /Library/Developer/CommandLineTools/usr/include -Wno-reorder-init-list -Wno-implicit-int-float-conversion -Wno-c99-designator -Wno-final-dtor-non-final-class -Wno-extra-semi-stmt -Wno-misleading-indentation -Wno-quoted-include-in-framework-header -Wno-implicit-fallthrough -Wno-enum-enum-conversion -Wno-enum-float-conversion -Wno-elaborated-enum-base -Wno-reserved-identifier -Wno-gnu-folding-constant -fdebug-compilation-dir=/Users/daniel -ferror-limit 19 -stack-protector 1 -fstack-check -mdarwin-stkchk-strong-link -fblocks -fencode-extended-block-signature -fregister-global-dtors-with-atexit -fgnuc-version=4.2.1 -fmax-type-align=16 -fcommon 
-clang-vendor-feature=+disableNonDependentMemberExprInCurrentInstantiation -fno-odr-hash-protocols -clang-vendor-feature=+enableAggressiveVLAFolding -clang-vendor-feature=+revert09abecef7bbf -clang-vendor-feature=+thisNoAlignAttr -clang-vendor-feature=+thisNoNullAttr -clang-vendor-feature=+disableAtImportPrivateFrameworkInImplementationError -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o - -x c -']
INFO (pytensor.link.c.cmodule): g++ default lines: ['"/Library/Developer/CommandLineTools/usr/bin/clang" -cc1 -triple arm64-apple-macosx15.0.0 -Wundef-prefix=TARGET_OS_ -Wdeprecated-objc-isa-usage -Werror=deprecated-objc-isa-usage -Werror=implicit-function-declaration -E -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name - -mrelocation-model pic -pic-level 2 -mframe-pointer=non-leaf -fno-strict-return -ffp-contract=on -fno-rounding-math -funwind-tables=1 -fobjc-msgsend-selector-stubs -target-sdk-version=15.1 -fvisibility-inlines-hidden-static-local-var -fno-modulemap-allow-subdirectory-search -target-cpu apple-m1 -target-feature +v8.5a -target-feature +aes -target-feature +crc -target-feature +dotprod -target-feature +fp-armv8 -target-feature +fp16fml -target-feature +lse -target-feature +ras -target-feature +rcpc -target-feature +rdm -target-feature +sha2 -target-feature +sha3 -target-feature +neon -target-feature +zcm -target-feature +zcz -target-feature +fullfp16 -target-abi darwinpcs -debugger-tuning=lldb -target-linker-version 1115.7.3 -v -fcoverage-compilation-dir=/Users/daniel -resource-dir /Library/Developer/CommandLineTools/usr/lib/clang/16 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -internal-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/local/include -internal-isystem /Library/Developer/CommandLineTools/usr/lib/clang/16/include -internal-externc-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -internal-externc-isystem /Library/Developer/CommandLineTools/usr/include -Wno-reorder-init-list -Wno-implicit-int-float-conversion -Wno-c99-designator -Wno-final-dtor-non-final-class -Wno-extra-semi-stmt -Wno-misleading-indentation -Wno-quoted-include-in-framework-header -Wno-implicit-fallthrough -Wno-enum-enum-conversion -Wno-enum-float-conversion -Wno-elaborated-enum-base -Wno-reserved-identifier -Wno-gnu-folding-constant 
-fdebug-compilation-dir=/Users/daniel -ferror-limit 19 -stack-protector 1 -fstack-check -mdarwin-stkchk-strong-link -fblocks -fencode-extended-block-signature -fregister-global-dtors-with-atexit -fgnuc-version=4.2.1 -fmax-type-align=16 -fcommon -clang-vendor-feature=+disableNonDependentMemberExprInCurrentInstantiation -fno-odr-hash-protocols -clang-vendor-feature=+enableAggressiveVLAFolding -clang-vendor-feature=+revert09abecef7bbf -clang-vendor-feature=+thisNoAlignAttr -clang-vendor-feature=+thisNoNullAttr -clang-vendor-feature=+disableAtImportPrivateFrameworkInImplementationError -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o - -x c -']
INFO (pytensor.link.c.cmodule): g++ -march=native equivalent flags: ['-march=apple-m1']

@danieltomasz
Author

danieltomasz commented Oct 30, 2024

However, with the flags above, the results of

python $(python -c "import pathlib, pytensor; print(pathlib.Path(pytensor.__file__).parent / 'misc/check_blas.py')")

still default to the error I posted above

       Some results that you can compare against. They were 10 executions
        of gemm in float64 with matrices of shape 2000x2000 (M=N=K=2000).
        All memory layout was in C order.

        CPU tested: Xeon E5345(2.33Ghz, 8M L2 cache, 1333Mhz FSB),
                    Xeon E5430(2.66Ghz, 12M L2 cache, 1333Mhz FSB),
                    Xeon E5450(3Ghz, 12M L2 cache, 1333Mhz FSB),
                    Xeon X5560(2.8Ghz, 12M L2 cache, hyper-threads?)
                    Core 2 E8500, Core i7 930(2.8Ghz, hyper-threads enabled),
                    Core i7 950(3.07GHz, hyper-threads enabled)
                    Xeon X5550(2.67GHz, 8M l2 cache?, hyper-threads enabled)


        Libraries tested:
            * numpy with ATLAS from distribution (FC9) package (1 thread)
            * manually compiled numpy and ATLAS with 2 threads
            * goto 1.26 with 1, 2, 4 and 8 threads
            * goto2 1.13 compiled with multiple threads enabled

                          Xeon   Xeon   Xeon  Core2 i7    i7     Xeon   Xeon
        lib/nb threads    E5345  E5430  E5450 E8500 930   950    X5560  X5550

        numpy 1.3.0 blas                                                775.92s
        numpy_FC9_atlas/1 39.2s  35.0s  30.7s 29.6s 21.5s 19.60s
        goto/1            18.7s  16.1s  14.2s 13.7s 16.1s 14.67s
        numpy_MAN_atlas/2 12.0s  11.6s  10.2s  9.2s  9.0s
        goto/2             9.5s   8.1s   7.1s  7.3s  8.1s  7.4s
        goto/4             4.9s   4.4s   3.7s  -     4.1s  3.8s
        goto/8             2.7s   2.4s   2.0s  -     4.1s  3.8s
        openblas/1                                        14.04s
        openblas/2                                         7.16s
        openblas/4                                         3.71s
        openblas/8                                         3.70s
        mkl 11.0.083/1            7.97s
        mkl 10.2.2.025/1                                         13.7s
        mkl 10.2.2.025/2                                          7.6s
        mkl 10.2.2.025/4                                          4.0s
        mkl 10.2.2.025/8                                          2.0s
        goto2 1.13/1                                                     14.37s
        goto2 1.13/2                                                      7.26s
        goto2 1.13/4                                                      3.70s
        goto2 1.13/8                                                      1.94s
        goto2 1.13/16                                                     3.16s

        Test time in float32. There were 10 executions of gemm in
        float32 with matrices of shape 5000x5000 (M=N=K=5000)
        All memory layout was in C order.


        cuda version      8.0    7.5    7.0
        gpu
        M40               0.45s  0.47s
        k80               0.92s  0.96s
        K6000/NOECC       0.71s         0.69s
        P6000/NOECC       0.25s

        Titan X (Pascal)  0.28s
        GTX Titan X       0.45s  0.45s  0.47s
        GTX Titan Black   0.66s  0.64s  0.64s
        GTX 1080          0.35s
        GTX 980 Ti               0.41s
        GTX 970                  0.66s
        GTX 680                         1.57s
        GTX 750 Ti               2.01s  2.01s
        GTX 750                  2.46s  2.37s
        GTX 660                  2.32s  2.32s
        GTX 580                  2.42s
        GTX 480                  2.87s
        TX1                             7.6s (float32 storage and computation)
        GT 610                          33.5s

Some PyTensor flags:
    blas__ldflags= -framework Accelerate -rpath /Users/daniel/.pyenv/versions/3.12.7/lib
    compiledir= /Users/daniel/.pytensor/compiledir_macOS-15.1-arm64-arm-64bit-arm-3.12.7-64
    floatX= float64
    device= cpu
Some OS information:
    sys.platform= darwin
    sys.version= 3.12.7 (main, Oct 31 2024, 00:25:36) [Clang 16.0.0 (clang-1600.0.26.4)]
    sys.prefix= /Users/daniel/.pyenv/versions/3.12.7
Some environment variables:
    MKL_NUM_THREADS= None
    OMP_NUM_THREADS= None
    GOTO_NUM_THREADS= None

Numpy config: (used when the PyTensor flag "blas__ldflags" is empty)
/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/numpy/__config__.py:155: UserWarning: Install `pyyaml` for better output
  warnings.warn("Install `pyyaml` for better output", stacklevel=1)
{
  "Compilers": {
    "c": {
      "name": "clang",
      "linker": "ld64",
      "version": "14.0.0",
      "commands": "cc",
      "args": "-fno-strict-aliasing, -DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64",
      "linker args": "-fno-strict-aliasing, -DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64"
    },
    "cython": {
      "name": "cython",
      "linker": "cython",
      "version": "3.0.8",
      "commands": "cython"
    },
    "c++": {
      "name": "clang",
      "linker": "ld64",
      "version": "14.0.0",
      "commands": "c++",
      "args": "-DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64",
      "linker args": "-DBLAS_SYMBOL_SUFFIX=64_, -DHAVE_BLAS_ILP64"
    }
  },
  "Machine Information": {
    "host": {
      "cpu": "aarch64",
      "family": "aarch64",
      "endian": "little",
      "system": "darwin"
    },
    "build": {
      "cpu": "aarch64",
      "family": "aarch64",
      "endian": "little",
      "system": "darwin"
    }
  },
  "Build Dependencies": {
    "blas": {
      "name": "openblas64",
      "found": true,
      "version": "0.3.23.dev",
      "detection method": "pkgconfig",
      "include directory": "/opt/arm64-builds/include",
      "lib directory": "/opt/arm64-builds/lib",
      "openblas configuration": "USE_64BITINT=1 DYNAMIC_ARCH=1 DYNAMIC_OLDER= NO_CBLAS= NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= SANDYBRIDGE MAX_THREADS=3",
      "pc file directory": "/usr/local/lib/pkgconfig"
    },
    "lapack": {
      "name": "dep4335021056",
      "found": true,
      "version": "1.26.4",
      "detection method": "internal",
      "include directory": "unknown",
      "lib directory": "unknown",
      "openblas configuration": "unknown",
      "pc file directory": "unknown"
    }
  },
  "Python Information": {
    "path": "/private/var/folders/76/zy5ktkns50v6gt5g8r0sf6sc0000gn/T/cibw-run-q69bfk1p/cp312-macosx_arm64/build/venv/bin/python",
    "version": "3.12"
  },
  "SIMD Extensions": {
    "baseline": [
      "NEON",
      "NEON_FP16",
      "NEON_VFPV4",
      "ASIMD"
    ],
    "found": [
      "ASIMDHP"
    ],
    "not found": [
      "ASIMDFHM"
    ]
  }
}
Numpy dot module: numpy
Numpy location: /Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/numpy/__init__.py
Numpy version: 1.26.4
Traceback (most recent call last):
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/tensor/blas.py", line 428, in _ldflags
    assert t0 == "-"
           ^^^^^^^^^
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/vm.py", line 1227, in make_all
    node.op.make_thunk(node, storage_map, compute_map, [], impl=impl)
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/op.py", line 119, in make_thunk
    return self.make_c_thunk(node, storage_map, compute_map, no_recycling)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/op.py", line 84, in make_c_thunk
    outputs = cl.make_thunk(
              ^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1182, in make_thunk
    cthunk, module, in_storage, out_storage, error_storage = self.__compile__(
                                                             ^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1103, in __compile__
    thunk, module = self.cthunk_factory(
                    ^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1614, in cthunk_factory
    key = self.cmodule_key()
          ^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1266, in cmodule_key
    compile_args=self.compile_args(),
                 ^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 947, in compile_args
    ret += x.c_compile_args(c_compiler=c_compiler)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/tensor/blas.py", line 496, in c_compile_args
    return ldflags(libs=False, flags=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/tensor/blas.py", line 359, in ldflags
    return _ldflags(
           ^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/tensor/blas.py", line 430, in _ldflags
    raise ValueError(f'invalid token "{t}" in ldflags_str: "{ldflags_str}"')
ValueError: invalid token "Accelerate" in ldflags_str: "-framework Accelerate -rpath /Users/daniel/.pyenv/versions/3.12.7/lib"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/misc/check_blas.py", line 274, in <module>
    t, impl = execute(
              ^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/misc/check_blas.py", line 57, in execute
    f = pytensor.function([], updates=[(c, 0.4 * c + 0.8 * dot(a, b))])
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/compile/function/__init__.py", line 318, in function
    fn = pfunc(
         ^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/compile/function/pfunc.py", line 465, in pfunc
    return orig_function(
           ^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/compile/function/types.py", line 1757, in orig_function
    fn = m.create(defaults)
         ^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/compile/function/types.py", line 1649, in create
    _fn, _i, _o = self.linker.make_thunk(
                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/basic.py", line 245, in make_thunk
    return self.make_all(
           ^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/vm.py", line 1236, in make_all
    raise_with_op(fgraph, node)
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/utils.py", line 524, in raise_with_op
    raise exc_value.with_traceback(exc_trace)
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/vm.py", line 1227, in make_all
    node.op.make_thunk(node, storage_map, compute_map, [], impl=impl)
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/op.py", line 119, in make_thunk
    return self.make_c_thunk(node, storage_map, compute_map, no_recycling)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/op.py", line 84, in make_c_thunk
    outputs = cl.make_thunk(
              ^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1182, in make_thunk
    cthunk, module, in_storage, out_storage, error_storage = self.__compile__(
                                                             ^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1103, in __compile__
    thunk, module = self.cthunk_factory(
                    ^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1614, in cthunk_factory
    key = self.cmodule_key()
          ^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 1266, in cmodule_key
    compile_args=self.compile_args(),
                 ^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/link/c/basic.py", line 947, in compile_args
    ret += x.c_compile_args(c_compiler=c_compiler)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/tensor/blas.py", line 496, in c_compile_args
    return ldflags(libs=False, flags=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/tensor/blas.py", line 359, in ldflags
    return _ldflags(
           ^^^^^^^^^
  File "/Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/pytensor/tensor/blas.py", line 430, in _ldflags
    raise ValueError(f'invalid token "{t}" in ldflags_str: "{ldflags_str}"')
ValueError: invalid token "Accelerate" in ldflags_str: "-framework Accelerate -rpath /Users/daniel/.pyenv/versions/3.12.7/lib"
Apply node that caused the error: Gemm{inplace}(<Matrix(float64, shape=(?, ?))>, 0.8, <Matrix(float64, shape=(?, ?))>, <Matrix(float64, shape=(?, ?))>, 0.4)
Toposort index: 0
Inputs types: [TensorType(float64, shape=(None, None)), TensorType(float64, shape=()), TensorType(float64, shape=(None, None)), TensorType(float64, shape=(None, None)), TensorType(float64, shape=())]

HINT: Use a linker other than the C linker to print the inputs' shapes and strides.
HINT: Re-running with most PyTensor optimizations disabled could provide a back-trace showing when this node was created. This can be done by setting the PyTensor flag 'optimizer=fast_compile'. If that does not work, PyTensor optimizations can be disabled with 'optimizer=None'.
HINT: Use the PyTensor flag `exception_verbosity=high` for a debug print-out and storage map footprint of this Apply node.

Results of

PYTENSOR_FLAGS='optimizer=None,exception_verbosity=high'  python $(python -c "import pathlib, pytensor; print(pathlib.Path(pytensor.__file__).parent / 'misc/check_blas.py')")
We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).

Total execution time: 11.61s on ERROR, unable to tell if PyTensor used the cpu:
[dot(<Matrix(float64, shape=(?, ?))>, <Matrix(float64, shape=(?, ?))>), ExpandDims{axes=[0, 1]}(0.8), Mul(ExpandDims{axes=[0, 1]}.0, dot.0), ExpandDims{axes=[0, 1]}(0.4), Mul(ExpandDims{axes=[0, 1]}.0, <Matrix(float64, shape=(?, ?))>), Add(Mul.0, Mul.0)].

@lucianopaz
Contributor

Awesome @danieltomasz! I can reproduce that problem locally now. The latest commit to the PR I mentioned before should have fixed it. Let me know if it works for you. If it does, I'll try to set up a test on Mac ARM in our CI matrix so that this can be verified.
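The traceback above suggests that `_ldflags` splits the flag string on whitespace and expects every token to start with `-`, so the argument `Accelerate` of `-framework` is rejected. One way to handle such two-token options is sketched below. `group_ldflags` and `TWO_TOKEN_OPTIONS` are hypothetical names, and this is not necessarily how the PR resolves the problem.

```python
# Options that consume the following token as their argument (assumed set).
TWO_TOKEN_OPTIONS = {"-framework", "-rpath"}


def group_ldflags(ldflags_str):
    """Split a linker flag string into tuples, keeping option arguments attached."""
    tokens = ldflags_str.split()
    groups = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in TWO_TOKEN_OPTIONS:
            if i + 1 >= len(tokens):
                raise ValueError(f'option "{tok}" is missing its argument')
            groups.append((tok, tokens[i + 1]))
            i += 2
        elif tok.startswith("-"):
            groups.append((tok,))
            i += 1
        else:
            # A bare word that is not the argument of a known option.
            raise ValueError(
                f'invalid token "{tok}" in ldflags_str: "{ldflags_str}"'
            )
    return groups


parsed = group_ldflags(
    "-framework Accelerate -rpath /Users/daniel/.pyenv/versions/3.12.7/lib"
)
print(parsed)
```

Single-token flags such as `-lblas` or `-L/usr/lib` pass through unchanged as one-element tuples, so existing flag strings would keep working under this scheme.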

@danieltomasz
Author

Thanks @lucianopaz, everything seems to work ok now with cpython and pip install!

Python 3.12.7 (main, Oct 31 2024, 00:49:16) [Clang 16.0.0 (clang-1600.0.26.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import logging
>>> logger = logging.getLogger("pytensor.link.c.cmodule")
>>> logger.setLevel(logging.DEBUG)
>>> import pytensor
DEBUG (pytensor.link.c.cmodule): Will search for BLAS libraries in the following directories:
/Library/Developer/CommandLineTools/usr/lib/clang/16
/Users/daniel/.pyenv/versions/3.12.7/lib
DEBUG (pytensor.link.c.cmodule): Checking MKL flags with intel threading
DEBUG (pytensor.link.c.cmodule): Required file 'mkl_core' not found
DEBUG (pytensor.link.c.cmodule): Required file mkl_core not found
DEBUG (pytensor.link.c.cmodule): Checking MKL flags with GNU OpenMP threading
DEBUG (pytensor.link.c.cmodule): Required file 'mkl_core' not found
DEBUG (pytensor.link.c.cmodule): Required file mkl_core not found
DEBUG (pytensor.link.c.cmodule): Checking Accelerate framework
INFO (pytensor.link.c.cmodule): g++ -march=native selected lines: ['"/Library/Developer/CommandLineTools/usr/bin/clang" -cc1 -triple arm64-apple-macosx15.0.0 -Wundef-prefix=TARGET_OS_ -Wdeprecated-objc-isa-usage -Werror=deprecated-objc-isa-usage -Werror=implicit-function-declaration -E -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name - -mrelocation-model pic -pic-level 2 -mframe-pointer=non-leaf -fno-strict-return -ffp-contract=on -fno-rounding-math -funwind-tables=1 -fobjc-msgsend-selector-stubs -target-sdk-version=15.1 -fvisibility-inlines-hidden-static-local-var -fno-modulemap-allow-subdirectory-search -target-cpu apple-m1 -target-feature +neon -target-feature +v8.5a -target-feature +zcm -target-feature +zcz -target-abi darwinpcs -debugger-tuning=lldb -target-linker-version 1115.7.3 -v -fcoverage-compilation-dir=/Users/daniel/blogspot-downloader -resource-dir /Library/Developer/CommandLineTools/usr/lib/clang/16 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -internal-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/local/include -internal-isystem /Library/Developer/CommandLineTools/usr/lib/clang/16/include -internal-externc-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -internal-externc-isystem /Library/Developer/CommandLineTools/usr/include -Wno-reorder-init-list -Wno-implicit-int-float-conversion -Wno-c99-designator -Wno-final-dtor-non-final-class -Wno-extra-semi-stmt -Wno-misleading-indentation -Wno-quoted-include-in-framework-header -Wno-implicit-fallthrough -Wno-enum-enum-conversion -Wno-enum-float-conversion -Wno-elaborated-enum-base -Wno-reserved-identifier -Wno-gnu-folding-constant -fdebug-compilation-dir=/Users/daniel/blogspot-downloader -ferror-limit 19 -stack-protector 1 -fstack-check -mdarwin-stkchk-strong-link -fblocks -fencode-extended-block-signature -fregister-global-dtors-with-atexit -fgnuc-version=4.2.1 
-fmax-type-align=16 -fcommon -clang-vendor-feature=+disableNonDependentMemberExprInCurrentInstantiation -fno-odr-hash-protocols -clang-vendor-feature=+enableAggressiveVLAFolding -clang-vendor-feature=+revert09abecef7bbf -clang-vendor-feature=+thisNoAlignAttr -clang-vendor-feature=+thisNoNullAttr -clang-vendor-feature=+disableAtImportPrivateFrameworkInImplementationError -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o - -x c -']
INFO (pytensor.link.c.cmodule): g++ default lines: ['"/Library/Developer/CommandLineTools/usr/bin/clang" -cc1 -triple arm64-apple-macosx15.0.0 -Wundef-prefix=TARGET_OS_ -Wdeprecated-objc-isa-usage -Werror=deprecated-objc-isa-usage -Werror=implicit-function-declaration -E -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name - -mrelocation-model pic -pic-level 2 -mframe-pointer=non-leaf -fno-strict-return -ffp-contract=on -fno-rounding-math -funwind-tables=1 -fobjc-msgsend-selector-stubs -target-sdk-version=15.1 -fvisibility-inlines-hidden-static-local-var -fno-modulemap-allow-subdirectory-search -target-cpu apple-m1 -target-feature +v8.5a -target-feature +aes -target-feature +crc -target-feature +dotprod -target-feature +fp-armv8 -target-feature +fp16fml -target-feature +lse -target-feature +ras -target-feature +rcpc -target-feature +rdm -target-feature +sha2 -target-feature +sha3 -target-feature +neon -target-feature +zcm -target-feature +zcz -target-feature +fullfp16 -target-abi darwinpcs -debugger-tuning=lldb -target-linker-version 1115.7.3 -v -fcoverage-compilation-dir=/Users/daniel/blogspot-downloader -resource-dir /Library/Developer/CommandLineTools/usr/lib/clang/16 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -internal-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/local/include -internal-isystem /Library/Developer/CommandLineTools/usr/lib/clang/16/include -internal-externc-isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -internal-externc-isystem /Library/Developer/CommandLineTools/usr/include -Wno-reorder-init-list -Wno-implicit-int-float-conversion -Wno-c99-designator -Wno-final-dtor-non-final-class -Wno-extra-semi-stmt -Wno-misleading-indentation -Wno-quoted-include-in-framework-header -Wno-implicit-fallthrough -Wno-enum-enum-conversion -Wno-enum-float-conversion -Wno-elaborated-enum-base -Wno-reserved-identifier 
-Wno-gnu-folding-constant -fdebug-compilation-dir=/Users/daniel/blogspot-downloader -ferror-limit 19 -stack-protector 1 -fstack-check -mdarwin-stkchk-strong-link -fblocks -fencode-extended-block-signature -fregister-global-dtors-with-atexit -fgnuc-version=4.2.1 -fmax-type-align=16 -fcommon -clang-vendor-feature=+disableNonDependentMemberExprInCurrentInstantiation -fno-odr-hash-protocols -clang-vendor-feature=+enableAggressiveVLAFolding -clang-vendor-feature=+revert09abecef7bbf -clang-vendor-feature=+thisNoAlignAttr -clang-vendor-feature=+thisNoNullAttr -clang-vendor-feature=+disableAtImportPrivateFrameworkInImplementationError -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o - -x c -']
INFO (pytensor.link.c.cmodule): g++ -march=native equivalent flags: ['-march=apple-m1']

and

 python $(python -c "import pathlib, pytensor; print(pathlib.Path(pytensor.__file__).parent / 'misc/check_blas.py')")

        Some results that you can compare against. They were 10 executions
        of gemm in float64 with matrices of shape 2000x2000 (M=N=K=2000).
        All memory layout was in C order.

        CPU tested: Xeon E5345(2.33Ghz, 8M L2 cache, 1333Mhz FSB),
                    Xeon E5430(2.66Ghz, 12M L2 cache, 1333Mhz FSB),
                    Xeon E5450(3Ghz, 12M L2 cache, 1333Mhz FSB),
                    Xeon X5560(2.8Ghz, 12M L2 cache, hyper-threads?)
                    Core 2 E8500, Core i7 930(2.8Ghz, hyper-threads enabled),
                    Core i7 950(3.07GHz, hyper-threads enabled)
                    Xeon X5550(2.67GHz, 8M l2 cache?, hyper-threads enabled)


        Libraries tested:
            * numpy with ATLAS from distribution (FC9) package (1 thread)
            * manually compiled numpy and ATLAS with 2 threads
            * goto 1.26 with 1, 2, 4 and 8 threads
            * goto2 1.13 compiled with multiple threads enabled

                          Xeon   Xeon   Xeon  Core2 i7    i7     Xeon   Xeon
        lib/nb threads    E5345  E5430  E5450 E8500 930   950    X5560  X5550

        numpy 1.3.0 blas                                                775.92s
        numpy_FC9_atlas/1 39.2s  35.0s  30.7s 29.6s 21.5s 19.60s
        goto/1            18.7s  16.1s  14.2s 13.7s 16.1s 14.67s
        numpy_MAN_atlas/2 12.0s  11.6s  10.2s  9.2s  9.0s
        goto/2             9.5s   8.1s   7.1s  7.3s  8.1s  7.4s
        goto/4             4.9s   4.4s   3.7s  -     4.1s  3.8s
        goto/8             2.7s   2.4s   2.0s  -     4.1s  3.8s
        openblas/1                                        14.04s
        openblas/2                                         7.16s
        openblas/4                                         3.71s
        openblas/8                                         3.70s
        mkl 11.0.083/1            7.97s
        mkl 10.2.2.025/1                                         13.7s
        mkl 10.2.2.025/2                                          7.6s
        mkl 10.2.2.025/4                                          4.0s
        mkl 10.2.2.025/8                                          2.0s
        goto2 1.13/1                                                     14.37s
        goto2 1.13/2                                                      7.26s
        goto2 1.13/4                                                      3.70s
        goto2 1.13/8                                                      1.94s
        goto2 1.13/16                                                     3.16s

        Test time in float32. There were 10 executions of gemm in
        float32 with matrices of shape 5000x5000 (M=N=K=5000)
        All memory layout was in C order.


        cuda version      8.0    7.5    7.0
        gpu
        M40               0.45s  0.47s
        k80               0.92s  0.96s
        K6000/NOECC       0.71s         0.69s
        P6000/NOECC       0.25s

        Titan X (Pascal)  0.28s
        GTX Titan X       0.45s  0.45s  0.47s
        GTX Titan Black   0.66s  0.64s  0.64s
        GTX 1080          0.35s
        GTX 980 Ti               0.41s
        GTX 970                  0.66s
        GTX 680                         1.57s
        GTX 750 Ti               2.01s  2.01s
        GTX 750                  2.46s  2.37s
        GTX 660                  2.32s  2.32s
        GTX 580                  2.42s
        GTX 480                  2.87s
        TX1                             7.6s (float32 storage and computation)
        GT 610                          33.5s

Some PyTensor flags:
    blas__ldflags= -framework Accelerate -rpath /Users/daniel/.pyenv/versions/3.12.7/lib
    compiledir= /Users/daniel/.pytensor/compiledir_macOS-15.1-arm64-arm-64bit-arm-3.12.7-64
    floatX= float64
    device= cpu
Some OS information:
    sys.platform= darwin
    sys.version= 3.12.7 (main, Oct 31 2024, 00:49:16) [Clang 16.0.0 (clang-1600.0.26.4)]
    sys.prefix= /Users/daniel/.pyenv/versions/3.12.7
Some environment variables:
    MKL_NUM_THREADS= None
    OMP_NUM_THREADS= None
    GOTO_NUM_THREADS= None

Numpy config: (used when the PyTensor flag "blas__ldflags" is empty)
Build Dependencies:
  blas:
    detection method: system
    found: true
    include directory: unknown
    lib directory: unknown
    name: accelerate
    openblas configuration: unknown
    pc file directory: unknown
    version: unknown
  lapack:
    detection method: internal
    found: true
    include directory: unknown
    lib directory: unknown
    name: dep4405705904
    openblas configuration: unknown
    pc file directory: unknown
    version: 1.26.4
Compilers:
  c:
    args: -I/opt/homebrew/opt/openblas/include
    commands: gcc
    linker: ld64
    linker args: -L/opt/homebrew/opt/openblas/lib, -I/opt/homebrew/opt/openblas/include
    name: clang
    version: 16.0.0
  c++:
    commands: c++
    linker: ld64
    linker args: -L/opt/homebrew/opt/openblas/lib
    name: clang
    version: 16.0.0
  cython:
    commands: cython
    linker: cython
    name: cython
    version: 3.0.11
Machine Information:
  build:
    cpu: aarch64
    endian: little
    family: aarch64
    system: darwin
  host:
    cpu: aarch64
    endian: little
    family: aarch64
    system: darwin
Python Information:
  path: /Users/daniel/.pyenv/versions/3.12.7/bin/python3.12
  version: '3.12'
SIMD Extensions:
  baseline:
  - NEON
  - NEON_FP16
  - NEON_VFPV4
  - ASIMD
  found:
  - ASIMDHP
  not found:
  - ASIMDFHM

Numpy dot module: numpy
Numpy location: /Users/daniel/.pyenv/versions/3.12.7/lib/python3.12/site-packages/numpy/__init__.py
Numpy version: 1.26.4

We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).

Total execution time: 16.22s on CPU (with direct PyTensor binding to blas).

Try to run this script a few times. Experience shows that the first time is not as fast as following calls. The difference is not big, but consistent.

@lucianopaz
Contributor

@danieltomasz, this should now be fixed by #1056. If you want, you can run the current pytensor main branch and check whether it works. I had to make a number of extra changes to ensure that compilation actually used the BLAS symbols.

@danieltomasz
Author

danieltomasz commented Nov 4, 2024

Nice. After running

 python $(python -c "import pathlib, pytensor; print(pathlib.Path(pytensor.__file__).parent / 'misc/check_blas.py')")

the flags are different now:

blas__ldflags= -framework Accelerate -Wl,-rpath,/Users/daniel/.pyenv/versions/3.12.7/lib

And the run time is shorter (down to around 10-13s from 14-16s):

We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).

Total execution time: 10.02s on CPU (with direct PyTensor binding to blas).
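
For anyone comparing these numbers, here is a rough sketch that converts the two reported totals (16.22s before, 10.02s after) into approximate throughput. It assumes the usual count of about 2·n³ floating-point operations per square gemm call and ignores lower-order terms; `gemm_gflops` is a made-up helper name, not something from PyTensor.

```python
def gemm_gflops(n: int, calls: int, seconds: float) -> float:
    """Approximate throughput of `calls` square n x n gemm calls, in GFLOP/s.

    Assumption: one gemm costs about 2 * n**3 floating-point operations
    (n**3 multiplications plus roughly n**3 additions).
    """
    flops_per_call = 2 * n**3
    return calls * flops_per_call / seconds / 1e9


# Totals reported by check_blas.py in this thread (10 calls, 5000x5000 matrices).
before = gemm_gflops(5000, 10, 16.22)
after = gemm_gflops(5000, 10, 10.02)

print(f"before: {before:.1f} GFLOP/s")        # before: 154.1 GFLOP/s
print(f"after:  {after:.1f} GFLOP/s")         # after:  249.5 GFLOP/s
print(f"speedup: {16.22 / 10.02:.2f}x")       # speedup: 1.62x
```

These are single-run figures, so the ratio is only indicative; as the script itself notes, the first run tends to be slower than later ones.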

@lucianopaz
Contributor

Yes, I changed the flags to align them with the other BLAS flag specs that we use. And the execution time should be shorter because it's actually linking to Accelerate now. Before, it was failing to do so because of things that were happening downstream.
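
To make the flag change easier to spot, here is a minimal sketch that classifies a `blas__ldflags` string. The helper `describe_blas_ldflags` is hypothetical (it is not part of PyTensor), and the two sample strings are the ones posted in this thread. In a live session the string could presumably be read from `pytensor.config.blas__ldflags` instead.

```python
import shlex


def describe_blas_ldflags(flags: str) -> dict:
    """Summarize a blas__ldflags string (hypothetical helper, not a PyTensor API)."""
    tokens = shlex.split(flags)
    # "-framework Accelerate" arrives as two consecutive tokens.
    uses_accelerate = any(
        a == "-framework" and b == "Accelerate"
        for a, b in zip(tokens, tokens[1:])
    )
    # "-Wl,-rpath,<dir>" hands the rpath to the linker via the compiler driver.
    rpath_via_wl = any(t.startswith("-Wl,-rpath,") for t in tokens)
    # A bare "-rpath <dir>" is the form reported before the fix.
    bare_rpath = "-rpath" in tokens
    return {
        "uses_accelerate": uses_accelerate,
        "rpath_via_wl": rpath_via_wl,
        "bare_rpath": bare_rpath,
    }


# Flag strings copied from the check_blas.py output in this thread.
old_flags = "-framework Accelerate -rpath /Users/daniel/.pyenv/versions/3.12.7/lib"
new_flags = "-framework Accelerate -Wl,-rpath,/Users/daniel/.pyenv/versions/3.12.7/lib"

print(describe_blas_ldflags(old_flags))
print(describe_blas_ldflags(new_flags))
```

Note that both strings request Accelerate, so the flag text alone does not prove the library was actually linked; the timing from `check_blas.py` is still the practical check.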
