aarch64 linux: torch.compile performance is 2x slower with nightly torch wheel compared to the wheel built with 'build_aarch64_wheel.py' script #1774
Comments
It's indeed the libomp.
cc @malfet
@snadampal So, should we package libomp from Debian in our build scripts?
I would love to help get away from these conda-packaged deps and instead use something more OS-native (i.e., build in a container using what's provided via apt, yum, dnf, etc.).
The issue is observed for the pytorch 2.3 release candidate wheels as well. In the release wheel package, I see we are already packaging both the omp libraries, but it looks like all the libraries are linked to the llvm libomp.
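One way to confirm which OpenMP runtime a given wheel's libraries actually resolve to is to run ldd over torch's bundled shared objects. A minimal sketch (Linux-only, assumes the wheel under test is installed in the active environment):

```python
import glob
import os
import subprocess

import torch

# torch ships its shared libraries next to the package in torch/lib.
lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
for so in sorted(glob.glob(os.path.join(lib_dir, "*.so*"))):
    ldd = subprocess.run(["ldd", so], capture_output=True, text=True).stdout
    # Matches both libomp (LLVM) and libgomp (GNU).
    deps = [line.strip() for line in ldd.splitlines() if "omp" in line]
    if deps:
        print(os.path.basename(so), "->", "; ".join(deps))
```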
@bryantbiggs I think this is the general direction the build process is going towards: no Anaconda, just use what's in the pypa docker images.
I will check if the new scripts can be updated to remove the conda dependency. Otherwise we anyway have a fallback option: the old scripts at https://github.com/pytorch/builder/blob/main/aarch64_linux/aarch64_wheel_ci_build.py are native manylinux OS builds. They are being maintained, so we can switch to them for the CD.
One small correction to my previous statement: we are packaging only the conda libomp. Coming to the solution, it's the same.
Looking at how the wheel-building scripts are integrated into the nightly wheel workflow, everything happens inside the manylinux docker image, and the image is missing many packages, including OpenBLAS. Any other thoughts?
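For reference, a quick sketch of checking what a stock manylinux image ships; the tag below is the public pypa aarch64 image, which may differ from the exact image the nightly workflow pulls:

```python
import subprocess

# Run inside the manylinux container and grep the installed packages.
# Swap the tag for quay.io/pypa/manylinux_2_28_aarch64 after the
# docker upgrade discussed later in this thread.
subprocess.check_call([
    "docker", "run", "--rm", "quay.io/pypa/manylinux2014_aarch64",
    "bash", "-c",
    "yum list installed 2>/dev/null | grep -i openblas || echo 'no openblas package'",
])
```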
@snadampal In the original script I built OpenBLAS from source, because the one that comes with the OS was lacking OpenMP integration. And the only reason to use conda was to install cmake and ninja, which were missing from PyPI at the time. Now one can (and should) completely eliminate the conda dependency for wheel builds.
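For context, a sketch of what building OpenBLAS from source with OpenMP integration typically looks like; the tag, target, and install prefix below are illustrative assumptions, not the exact values the builder scripts use:

```python
import subprocess

OPENBLAS_TAG = "v0.3.25"  # illustrative; pin to whatever the CD pins

subprocess.check_call([
    "git", "clone", "--depth", "1", "-b", OPENBLAS_TAG,
    "https://github.com/OpenMathLib/OpenBLAS.git",
])
# USE_OPENMP=1 threads OpenBLAS through the compiler's OpenMP runtime
# (libgomp when built with gcc) instead of OpenBLAS's own pthread pool.
subprocess.check_call(
    ["make", "-j8", "USE_OPENMP=1", "TARGET=ARMV8"], cwd="OpenBLAS")
subprocess.check_call(
    ["make", "install", "PREFIX=/opt/OpenBLAS"], cwd="OpenBLAS")
```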
Right now the scripts are using conda.
It's becoming more involved than I initially thought: manylinux 2_28 comes with gcc-12, with which pytorch compilation is failing on aarch64.
Yeah, I'll see if I can take a look. I'm still working my way through building from source without conda for the CUDA-based build.
I have upgraded the docker image to manylinux 2_28 and removed the conda dependency completely; everything is installed from manylinux or pypi. This solves the libomp performance issues. Here is the draft PR: I had to disable building the pytorch tests, via …
I have fixed the pytorch test build issue; in fact it seems to be a known issue (pytorch/pytorch#99278), and there was a PR for it too (pytorch/pytorch#99468). With this PR, the torch build works fine in the manylinux 2_28 docker with the gcc-12 toolchain.
For now, I'm using the gcc-11 toolchain on manylinux 2_28, so I'm not blocked on the PyTorch test build PR mentioned above.
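A hedged sketch of pinning gcc-11 inside a manylinux 2_28 (AlmaLinux 8 based) container via the standard gcc-toolset packages; the package name and paths follow the usual RHEL 8 layout, but verify them against the actual image:

```python
import os
import subprocess

subprocess.check_call(["dnf", "install", "-y", "gcc-toolset-11"])

# Equivalent of `source /opt/rh/gcc-toolset-11/enable` for this process,
# so the subsequent PyTorch build picks up gcc-11 instead of gcc-12.
toolset_bin = "/opt/rh/gcc-toolset-11/root/usr/bin"
os.environ["PATH"] = toolset_bin + os.pathsep + os.environ["PATH"]
os.environ["CC"] = os.path.join(toolset_bin, "gcc")
os.environ["CXX"] = os.path.join(toolset_bin, "g++")

subprocess.check_call(["gcc", "--version"])  # should now report gcc 11.x
```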
I'm observing that, compared to the default llvm libomp, …
Looks like there is no clear winner for …
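A minimal sketch of the kind of A/B measurement this comparison implies: install one wheel per environment (llvm libomp vs gnu libgomp) and run the same script in each. The model and shapes below are arbitrary stand-ins for the torchbench workloads discussed in this thread:

```python
import time

import torch

def bench(fn, x, warmup=10, iters=50):
    # Warmup also absorbs the one-time torch.compile compilation cost.
    for _ in range(warmup):
        fn(x)
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - start) / iters * 1e3  # ms/iter

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).eval()
x = torch.randn(64, 1024)

with torch.no_grad():
    print(f"eager:    {bench(model, x):.2f} ms/iter")
    print(f"compiled: {bench(torch.compile(model), x):.2f} ms/iter")
```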
In the current version of the scripts, torch libraries are linked to llvm openmp because the conda openblas-openmp package is linked to it. To switch to gnu libgomp, we are building openblas from source instead of installing it from conda. In essence it reverts #1462. Fixes #1774 (cherry picked from commit b57d3a8)
In the current version of the CD scripts, torch libraries are linked to llvm openmp because the conda openblas-openmp package is linked to it. To switch to gnu libgomp, we are building openblas from source instead of installing it from conda. Building the OpenBLAS shared library instead of the static library to be able to discover LAPACK support in OpenBLAS. Cherry-picked from #1803. Fixes: #1774
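A quick sanity check that LAPACK support was actually discovered in the resulting wheel; `torch.__config__.show()` prints the recorded build configuration, and `torch.linalg.qr` exercises the CPU LAPACK path. This is an after-the-fact verification sketch, not part of the CD scripts:

```python
import torch

# The build configuration records whether BLAS/LAPACK were found.
print(torch.__config__.show())

# QR decomposition goes through LAPACK on CPU; it raises at runtime
# if the wheel was built without LAPACK support.
a = torch.randn(128, 128)
q, r = torch.linalg.qr(a)
print("QR reconstruction ok:", torch.allclose(q @ r, a, atol=1e-5))
```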
For torchbench benchmarks with the dynamo backend, the aarch64 linux nightly wheel performance is 2x slower compared to the wheel I've built using the pytorch/builder/build_aarch64_wheel.py script for the same pytorch commit.
The difference seems to be coming from the https://github.com/pytorch/builder/blob/main/aarch64_linux/aarch64_ci_build.sh script used for the nightly builds. I suspect it's the libomp.
How to reproduce?
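The original reproduction details were truncated in this thread; below is a sketch of one plausible way to set up the comparison. The index URL is the standard PyTorch nightly CPU index; everything else is an assumption, not the reporter's exact commands:

```python
import subprocess

# 1) Install the nightly CPU wheel on the aarch64 host.
subprocess.check_call([
    "pip", "install", "--pre", "torch",
    "--index-url", "https://download.pytorch.org/whl/nightly/cpu",
])

# 2) Run a torch.compile workload (e.g. the benchmark sketch above, or a
#    torchbench model with the dynamo backend) and record ms/iter.

# 3) In a fresh environment, install a wheel built from the same commit
#    with pytorch/builder's build_aarch64_wheel.py and rerun the same
#    workload. Per this report the nightly wheel is ~2x slower, and the
#    delta tracks which OpenMP runtime the torch libraries link against
#    (compare with the ldd check earlier in the thread).
```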