
1.12 on M1 Pro Chip not using all the CPU cores. (device="cpu") #77938

Closed
PkuCuipy opened this issue May 20, 2022 · 3 comments

Comments

@PkuCuipy commented May 20, 2022

🐛 Describe the bug

After updating to the nightly build of PyTorch 1.12, I ran a performance test comparing 'mps' against 'cpu', as shown below:

import torch
from tqdm import trange

DTYPE = torch.float32
MAT_SIZE = 5000
DEVICE = ["cpu", "mps"][0]      # it's CPU now

mat = torch.randn([MAT_SIZE, MAT_SIZE], dtype=DTYPE, device=DEVICE)

for i in trange(N_ITER := 100):
    mat @= mat                  # <--- Main Computation HERE
    print(mat[0, 0], end="")    # avoid sync-issue when using 'mps'

It's true that "mps" is somewhat faster than "cpu" on this M1 Pro chip.
However, I soon noticed that not all 10 CPU cores are utilized when device="cpu".
Specifically, Activity Monitor.app shows that the process only uses ≈200% CPU.

After further experiments, I found some interesting facts:

  1. As mentioned above, with device="cpu", version 1.12 does not use all the CPU cores on the M1 Pro chip.
  2. When switching back to version 1.11, device="cpu" does take advantage of all the CPU cores.
  3. Although 2. is true, 1.11 is actually slower than 1.12! I.e., device="cpu" on 1.12 uses less CPU and less power yet gets better performance.
  4. Although 1. is true, manually running N (e.g. 2) instances of this script drops the performance of each instance to about 1/N of the original (while more CPU cores are indeed scheduled and more watts are consumed).

I'm wondering about the reasons behind 1.–4., and am not sure whether this is a bug in PyTorch or a mistake in my experiments or my understanding.
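
For reference, a minimal way to check how many intra-op threads PyTorch is configured to use (just a quick sanity check, separate from the timing script above):

import torch

# Number of threads PyTorch uses for intra-op parallelism (e.g. CPU matmul)
print("intra-op threads:", torch.get_num_threads())

# The thread count can also be set explicitly to see how it affects core usage:
# torch.set_num_threads(10)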

Versions

PyTorch version: 1.12.0.dev20220518
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 12.3.1 (arm64)
GCC version: Could not collect
Clang version: 13.1.6 (clang-1316.0.21.2.5)
CMake version: Could not collect
Libc version: N/A

Python version: 3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:25:14) [Clang 12.0.1 ] (64-bit runtime)
Python platform: macOS-12.3.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] torch==1.12.0.dev20220518
[pip3] torchlibrosa==0.0.9
[pip3] torchvision==0.9.0a0
[conda] numpy 1.21.6 py39h690d673_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
[conda] pytorch 1.12.0.dev20220518 py3.9_0 pytorch-nightly
[conda] torchlibrosa 0.0.9 pypi_0 pypi
[conda] torchvision 0.9.1 py39h0a40b5a_0_cpu https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge

@psobolewskiPhD

Sounds like "cpu" now uses the AMX matrix coprocessor via the Accelerate library.
See: https://stackoverflow.com/questions/67587455/accelerate-framework-uses-only-one-core-on-mac-m1/67590869#67590869
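
If you want to confirm which BLAS your PyTorch build links against (Accelerate vs. OpenBLAS), something like this should show it (the exact fields in the output depend on the build):

import torch

# Prints the build configuration; look for the BLAS/LAPACK entries
# to see which library the CPU backend was built against.
print(torch.__config__.show())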

@PkuCuipy (Author)

> Sounds like "cpu" now uses the AMX matrix coprocessor via the Accelerate library.
> See: https://stackoverflow.com/questions/67587455/accelerate-framework-uses-only-one-core-on-mac-m1/67590869#67590869

Thanks a lot! This seems to clear up my confusion perfectly!

@psobolewskiPhD

You can try to test this further by installing numpy with Accelerate BLAS via conda-forge:
conda-forge/numpy-feedstock#253
You should see a similar effect vs. OpenBLAS.
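
A rough way to compare the two numpy builds would be something like the following (timings are only illustrative; np.show_config() reports which BLAS is linked):

import time
import numpy as np

# Report which BLAS numpy was built against (Accelerate vs. OpenBLAS)
np.show_config()

N = 5000
a = np.random.randn(N, N).astype(np.float32)
b = np.random.randn(N, N).astype(np.float32)

a @ b  # warm-up

t0 = time.perf_counter()
for _ in range(10):
    a @ b
print(f"average matmul time: {(time.perf_counter() - t0) / 10:.3f} s")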
