aarch64 linux: torch.compile performance is 2x slower with nightly torch wheel compared to the wheel built with 'build_aarch64_wheel.py' script #1774
Comments
It's indeed the libomp.
cc @malfet
@snadampal So, should we package libomp from Debian in our build scripts?
I would love to help get away from these conda-packaged deps and instead use something more OS-native (i.e., build in a container using what's provided via apt, yum, dnf, etc.).
The issue is observed for the pytorch 2.3 release candidate wheels as well. In the release wheel package, I see we are already packaging both the omp libraries, but it looks like all the libraries are linked to the llvm libomp.
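One way to confirm which OpenMP runtime a given wheel's libraries actually resolve to is to run ldd over torch's bundled shared objects. A minimal sketch (Linux-only, assumes the wheel under test is installed in the active environment):

```python
import glob
import os
import subprocess

import torch

# torch ships its shared libraries next to the package in torch/lib.
lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
for so in sorted(glob.glob(os.path.join(lib_dir, "*.so*"))):
    ldd = subprocess.run(["ldd", so], capture_output=True, text=True).stdout
    # Matches both libomp (LLVM) and libgomp (GNU).
    deps = [line.strip() for line in ldd.splitlines() if "omp" in line]
    if deps:
        print(os.path.basename(so), "->", "; ".join(deps))
```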
@bryantbiggs I think this is the general direction the build process is going towards: no Anaconda, just use what's in the pypa docker images.
I will check if the new scripts can be updated to remove the conda dependency. Otherwise we anyway have a fallback option: the old scripts at https://github.com/pytorch/builder/blob/main/aarch64_linux/aarch64_wheel_ci_build.py are native manylinux OS builds. They are being maintained, so we can switch to them for the CD.
One small correction to my previous statement: we are packaging only the conda libomp. Coming to the solution, it's the same.
Looking at how the wheel-building scripts are integrated into the nightly wheel workflow, everything happens inside the manylinux docker image, and the image is missing many packages, including OpenBLAS. Any other thoughts?
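For reference, a quick sketch of checking what a stock manylinux image ships; the tag below is the public pypa aarch64 image, which may differ from the exact image the nightly workflow pulls:

```python
import subprocess

# Run inside the manylinux container and grep the installed packages.
# Swap the tag for quay.io/pypa/manylinux_2_28_aarch64 after the
# docker upgrade discussed later in this thread.
subprocess.check_call([
    "docker", "run", "--rm", "quay.io/pypa/manylinux2014_aarch64",
    "bash", "-c",
    "yum list installed 2>/dev/null | grep -i openblas || echo 'no openblas package'",
])
```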
@snadampal In the original script I built OpenBLAS from source, because the one that comes with the OS was lacking OpenMP integration. And the only reason to use conda was to install cmake and ninja, which were missing from PyPI at the time. Now one can (and should) completely eliminate the conda dependency for wheel builds.
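For context, a sketch of what building OpenBLAS from source with OpenMP integration typically looks like; the tag, target, and install prefix below are illustrative assumptions, not the exact values the builder scripts use:

```python
import subprocess

OPENBLAS_TAG = "v0.3.25"  # illustrative; pin to whatever the CD pins

subprocess.check_call([
    "git", "clone", "--depth", "1", "-b", OPENBLAS_TAG,
    "https://github.com/OpenMathLib/OpenBLAS.git",
])
# USE_OPENMP=1 threads OpenBLAS through the compiler's OpenMP runtime
# (libgomp when built with gcc) instead of OpenBLAS's own pthread pool.
subprocess.check_call(
    ["make", "-j8", "USE_OPENMP=1", "TARGET=ARMV8"], cwd="OpenBLAS")
subprocess.check_call(
    ["make", "install", "PREFIX=/opt/OpenBLAS"], cwd="OpenBLAS")
```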
Right now the scripts are using conda.
It's becoming more involved than I initially thought: manylinux 2_28 comes with gcc-12, with which pytorch compilation is failing on aarch64.
Yeah, I'll see if I can take a look. I'm still working my way through building from source without conda for the CUDA-based build.
I have upgraded the docker image to manylinux 2_28 and removed the conda dependency completely; everything is installed from manylinux or pypi. This solves the libomp performance issues. Here is the draft PR: I had to disable building the pytorch tests, via …
I have fixed the pytorch test build issue; in fact it seems to be a known issue (pytorch/pytorch#99278), and there was a PR for it too (pytorch/pytorch#99468). With this PR, the torch build works fine in the manylinux 2_28 docker with the gcc-12 toolchain.
For now, I'm using the gcc-11 toolchain on manylinux 2_28, so I'm not blocked on the PyTorch test build PR mentioned above.
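A hedged sketch of pinning gcc-11 inside a manylinux 2_28 (AlmaLinux 8 based) container via the standard gcc-toolset packages; the package name and paths follow the usual RHEL 8 layout, but verify them against the actual image:

```python
import os
import subprocess

subprocess.check_call(["dnf", "install", "-y", "gcc-toolset-11"])

# Equivalent of `source /opt/rh/gcc-toolset-11/enable` for this process,
# so the subsequent PyTorch build picks up gcc-11 instead of gcc-12.
toolset_bin = "/opt/rh/gcc-toolset-11/root/usr/bin"
os.environ["PATH"] = toolset_bin + os.pathsep + os.environ["PATH"]
os.environ["CC"] = os.path.join(toolset_bin, "gcc")
os.environ["CXX"] = os.path.join(toolset_bin, "g++")

subprocess.check_call(["gcc", "--version"])  # should now report gcc 11.x
```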
I'm observing that, compared to the default llvm libomp, …
Looks like there is no clear winner for …
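A minimal sketch of the kind of A/B measurement this comparison implies: install one wheel per environment (llvm libomp vs gnu libgomp) and run the same script in each. The model and shapes below are arbitrary stand-ins for the torchbench workloads discussed in this thread:

```python
import time

import torch

def bench(fn, x, warmup=10, iters=50):
    # Warmup also absorbs the one-time torch.compile compilation cost.
    for _ in range(warmup):
        fn(x)
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - start) / iters * 1e3  # ms/iter

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).eval()
x = torch.randn(64, 1024)

with torch.no_grad():
    print(f"eager:    {bench(model, x):.2f} ms/iter")
    print(f"compiled: {bench(torch.compile(model), x):.2f} ms/iter")
```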
In the current version of the scripts, torch libraries are linked to llvm openmp because the conda openblas-openmp package is linked to it. To switch to gnu libgomp, we are building openblas from source instead of installing it from conda. In essence it reverts #1462. Fixes #1774 (cherry picked from commit b57d3a8)
In the current version of the CD scripts, torch libraries are linked to llvm openmp because the conda openblas-openmp package is linked to it. To switch to gnu libgomp, we are building openblas from source instead of installing it from conda. Building the OpenBLAS shared library instead of the static library to be able to discover LAPACK support in OpenBLAS. Cherry-picked from #1803. Fixes: #1774
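A quick sanity check that LAPACK support was actually discovered in the resulting wheel; `torch.__config__.show()` prints the recorded build configuration, and `torch.linalg.qr` exercises the CPU LAPACK path. This is an after-the-fact verification sketch, not part of the CD scripts:

```python
import torch

# The build configuration records whether BLAS/LAPACK were found.
print(torch.__config__.show())

# QR decomposition goes through LAPACK on CPU; it raises at runtime
# if the wheel was built without LAPACK support.
a = torch.randn(128, 128)
q, r = torch.linalg.qr(a)
print("QR reconstruction ok:", torch.allclose(q @ r, a, atol=1e-5))
```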
For torchbench benchmarks with the dynamo backend, the aarch64 linux nightly wheel performance is 2x slower compared to the wheel I've built using the pytorch/builder/build_aarch64_wheel.py script for the same pytorch commit.
The difference seems to be coming from the https://github.com/pytorch/builder/blob/main/aarch64_linux/aarch64_ci_build.sh script used for the nightly builds. I suspect it's the libomp.
How to reproduce?
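The original reproduction details were truncated in this thread; below is a sketch of one plausible way to set up the comparison. The index URL is the standard PyTorch nightly CPU index; everything else is an assumption, not the reporter's exact commands:

```python
import subprocess

# 1) Install the nightly CPU wheel on the aarch64 host.
subprocess.check_call([
    "pip", "install", "--pre", "torch",
    "--index-url", "https://download.pytorch.org/whl/nightly/cpu",
])

# 2) Run a torch.compile workload (e.g. the benchmark sketch above, or a
#    torchbench model with the dynamo backend) and record ms/iter.

# 3) In a fresh environment, install a wheel built from the same commit
#    with pytorch/builder's build_aarch64_wheel.py and rerun the same
#    workload. Per this report the nightly wheel is ~2x slower, and the
#    delta tracks which OpenMP runtime the torch libraries link against
#    (compare with the ldd check earlier in the thread).
```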