[aarch64] Add CUDA 12.4 build script for ARM wheel #1775
Conversation
cc @snadampal
@tinglvv is this for Grace, Jetson, or ARM SBSA?
Hi, this is built for ARM SBSA, which is also used by Grace. The main purpose is to run this ARM+CUDA wheel on the Grace CPU. I have not tested on Jetson yet.
Hi, sorry, do you mean testing with CPU-aarch64? I simply added CUDA support on top of the existing workflow for CPU-aarch64. I have tested on Grace with the CUDA-related operations from test_ops.py. Sure, I can try with your PR and let you know.
ldconfig
}

function prune_124 {
What does NVPrune do?
@nWEIdia it removes GPU architectures from libraries that are never used, in order to lower the binary size. This workflow can be dangerous if libraries depend on heuristics and can select kernels from the same GPU family (we've seen issues before where sm_61 was dropped, causing all kinds of issues on GTX cards).
I don't think the pruning is useful anymore, as we are now using the CUDA dependencies from PyPI.
However, we might want to keep it here and follow up with a cleanup in a separate PR.
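For readers unfamiliar with the pruning step being discussed, here is a minimal dry-run sketch of what a `prune_124`-style helper could look like. The architecture list, library path, and the `prune_lib` helper name are illustrative assumptions, not the actual script; the command is echoed rather than executed so it can be inspected safely.

```shell
#!/bin/sh
# Hypothetical sketch of a prune step: nvprune keeps device code only for
# the requested architectures and drops the rest, shrinking the binary.
# GENCODE and the library path below are illustrative, not the real script.
GENCODE="-gencode arch=compute_90,code=sm_90"

prune_lib() {
  # Dry run: print the nvprune invocation instead of running it.
  echo nvprune $GENCODE "$1" -o "$1"
}

prune_lib /usr/local/cuda-12.4/lib64/libcublas_static.a
```

Note this is exactly the kind of step the comment warns about: if a runtime heuristic later tries to pick a kernel for an architecture that was pruned away, loading fails, which is why dropping the step may be safer now that CUDA dependencies come from PyPI.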
@snadampal I'm unsure whether this request relates to your PR or to this change. If your PR needs additional testing, could you please tag us there? Tests pass, so can we merge it, @atalman @malfet, or is your review pending?
Hi @ptrblck, my PR migrates the whole aarch64 CI to manylinux 2_28 and removes the conda packages. My request was to make sure these CUDA additions work there as well, so that there are no surprises when we make the switch.
Hi @snadampal, I am using a few packages from the conda environment: https://github.com/pytorch/builder/pull/1775/files#diff-3f5d59f85dd25cceac14ee2e14f8eb83547a68fbf087fc6dcc9471c368ba0ac2R88-R90. If conda is removed, is there a place to get these libs?
Hi @snadampal, there are major changes in the files, which may delay the merge of this PR. The workflow for the cuda-aarch64 wheel does not exist yet, so the goal here was not to have a perfect workflow but to establish one first. We would test your PR as the next step, as we iterate and resolve issues after the merge. Hope this works for you. Thanks.
Hi @tinglvv, the libraries you pointed out above are taken care of in the new CI scripts (without conda), so I think CUDA should work fine there as well. I'm fine with rebasing mine if this PR goes first, but I request that you test CUDA again in my PR.
Every time a package is moved out of conda, an angel gets its wings 😅 |
"/usr/local/cuda/lib64/libcudnn_ops_train.so.8",
"/opt/conda/envs/aarch64_env/lib/libopenblas.so.0",
"/opt/conda/envs/aarch64_env/lib/libgfortran.so.5",
"/opt/conda/envs/aarch64_env/lib/libgomp.so.1",
Currently the scripts are packaging libomp.so.
Did you check the inference performance for any models? Is there any performance difference observed with libgomp vs libomp in the current wheels?
I'm observing around a 10% performance drop for eager mode inference with libgomp compared to libomp.
If you don't have a strong preference, I suggest keeping libomp till we know better.
For more details, check my comment here: #1774 (comment)
Can you please do a follow-up on the libgomp to libomp migration in your PR #1781? It is not a trivial change and is certainly out of scope for this PR. I would like to underline that this PR is for enabling CUDA. libgomp has been used with PyTorch for a long time: it is reliable, and nothing is wrong with it functionality-wise. Moreover, I do not want Ting to waste her time debugging dependencies due to libomp in this PR.
Hi @Aidyn-A, currently the aarch64 wheels are linked to libomp, not libgomp.
https://pypi.org/project/torch/#files
My point was: why change it now, without a strong reason?
I have another PR to switch the wheels from libomp to libgomp, but it is currently blocked due to the 10% regression.
From what I am seeing, it comes with:
/usr/local/lib/python3.8/dist-packages/torch.libs/libgomp-0f9e2209.so.1.0.0
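A quick way to settle which OpenMP runtime a given wheel bundles is to list the archive contents before installing it. Below is a minimal sketch: the `wheel_omp_libs` helper name is made up, and it builds a tiny fake wheel as a demo fixture (real wheels would be inspected the same way, since a wheel is just a zip file).

```shell
#!/bin/sh
# Hypothetical helper: list the OpenMP runtime(s) bundled in a wheel.
# A wheel is a zip archive, so the stdlib zipfile CLI can list it.
wheel_omp_libs() {
  python3 -m zipfile -l "$1" | grep -oE 'lib(g)?omp[^ /]*\.so[^ ]*' | sort -u
}

# Demo fixture: a fake wheel containing a bundled libgomp, mirroring the
# torch.libs layout quoted above. File names here are illustrative.
mkdir -p torch.libs
touch torch.libs/libgomp-0f9e2209.so.1.0.0
python3 -m zipfile -c demo.whl torch.libs

wheel_omp_libs demo.whl
```

Running the same helper on a downloaded `torch-*.whl` from PyPI would show whether libomp or libgomp ships inside.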
I hope you are checking either the torch 2.2 or the nightly aarch64-linux wheel, because I am seeing:
libomp-b8e5bcfb.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./../../torch.libs/libomp-b8e5bcfb.so (0x0000ffffa8c30000)
complete list:
linux-vdso.so.1 (0x0000ffffb298d000)
libtorch_cpu.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./libtorch_cpu.so (0x0000ffffaa960000)
libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffffaa930000)
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffffaa780000)
/lib/ld-linux-aarch64.so.1 (0x0000ffffb2954000)
libc10.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./libc10.so (0x0000ffffaa680000)
librt.so.1 => /lib/aarch64-linux-gnu/librt.so.1 (0x0000ffffaa660000)
libdl.so.2 => /lib/aarch64-linux-gnu/libdl.so.2 (0x0000ffffaa640000)
libopenblasp-r0-f658af2e.3.25.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./../../torch.libs/libopenblasp-r0-f658af2e.3.25.so (0x0000ffffa8e10000)
libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffffa8d70000)
libomp-b8e5bcfb.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./../../torch.libs/libomp-b8e5bcfb.so (0x0000ffffa8c30000)
libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000ffffa8c10000)
libarm_compute-7362313d.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./../../torch.libs/libarm_compute-7362313d.so (0x0000ffffa8170000)
libarm_compute_graph-15f701fb.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./../../torch.libs/libarm_compute_graph-15f701fb.so (0x0000ffffa8030000)
libarm_compute_core-0793f69d.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./../../torch.libs/libarm_compute_core-0793f69d.so (0x0000ffffa7fe0000)
libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000ffffa7db0000)
libgfortran-105e6576.so.5.0.0 => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./../../torch.libs/libgfortran-105e6576.so.5.0.0 (0x0000ffffa7c50000)
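For an installed package, the equivalent check is to run `ldd` on `libtorch_cpu.so` and filter for the OpenMP runtime, as in the listing above. A small sketch, with an assumed `omp_runtime` helper that reads `ldd` output from stdin (the demo line is one captured from the listing):

```shell
#!/bin/sh
# Hypothetical helper: given `ldd` output on stdin, report which OpenMP
# runtime(s) the shared object links (libomp vs libgomp).
omp_runtime() {
  grep -oE 'lib(g)?omp[^ /]*\.so[^ ]*' | sort -u
}

# Demo on a single captured ldd line; in practice you would pipe in
#   ldd "$(python3 -c 'import torch, os; print(os.path.dirname(torch.__file__))')/lib/libtorch_cpu.so"
echo 'libomp-b8e5bcfb.so => /x/torch.libs/libomp-b8e5bcfb.so (0x0000ffffa8c30000)' | omp_runtime
```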
By the way, libomp wasn't intentionally chosen for the aarch64-linux wheels; I think it was picked up from the conda environment. If we all agree that libgomp is what is recommended for PyTorch, then I'm fine with switching to it now. In fact, I've already suggested moving to GNU OpenMP (#1774) but am waiting mainly because of the 10% regression with it in eager mode.
Now we have more data on libgomp vs libomp (please check #1774 (comment)), and I'm fine with switching to libgomp for aarch64 Linux.
Hi @tinglvv, the current CI doesn't cover the aarch64-linux workflow; all the current scripts are only for CD.
I see, that sounds great, thanks for the clarification. Will test it again in your PR.
Hi, I tested with the new Dockerfile in this PR. I downgraded to gcc-11 here: https://github.com/pytorch/builder/pull/1775/files#diff-83169a88982e6f1445d5ac8f76ef9e2dde63bfd0f502f3e536484328ff51d851R4. My workflow was
Please create a pull request against pytorch to validate that it works; see pytorch/pytorch#123747 for an example.
Hi @tinglvv, is this targeted for PyTorch 2.3.x?
Hi, we aim to merge this soon, so I believe it is targeting 2.3.x.
Sorry, but the 2.3 feature development cut-off date was mid-March, so no, this should not affect/target 2.3, but 2.4 sounds like a good goal.
Rebasing #124112. Too many conflicting files, so starting a new PR.
Test pytorch/builder#1775 (merged) for the ARM wheel addition.
Test pytorch/builder#1828 (merged) for setting MAX_JOBS.
Current issue to follow up: #126980
Co-authored-by: Aidyn-A <[email protected]>
Pull Request resolved: #126174
Approved by: https://github.com/nWEIdia, https://github.com/atalman
Add cuda_aarch64 ARM wheel build script with CUDA 12.4.
Reference #1302.
cc @ptrblck