Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[aarch64] Add CUDA 12.4 build script for ARM wheel #1775
[aarch64] Add CUDA 12.4 build script for ARM wheel #1775
Changes from 4 commits
babd9f5
91df612
e7d419b
0f2b149
7d80355
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently the scripts are packaging libomp.so
did you check the inference performance for any models? is there any performance difference observed with libgomp vs libomp in the current wheels?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm observing around 10% performance drop for eager mode inference with
libgomp
compared tolibomp
.if you don't have a strong preference, I suggest keeping
libomp
till we know better.for more details, check my comment here: #1774 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please a follow-up on
libgomp
tolibomp
migration in your PR #1781, as it is not trivial change and certainly out of scope of this PR. I would like to underline that this PR is for enabling CUDA. Thelibgomp
has been used with PyTorch for long time: it is reliable and nothing is wrong with it functionality-wise. Moreover, I do not want Ting to waste her time on debugging dependencies due tolibomp
in this PR.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Aidyn-A , currently the aarch64 wheels are linked to
libomp
notlibgomp
.https://pypi.org/project/torch/#files
My point was why to change it now? without having a strong reason.
I have another PR to switch wheels from
libomp
tolibgomp
but is currently blocked due to the 10% regression.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I am seeing, it comes with:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope you are checking either torch 2.2 or nightly aarch64-linux wheel, because I am seeing,
libomp-b8e5bcfb.so => /home/ubuntu/.local/lib/python3.10/site-packages/torch/lib/./../../torch.libs/libomp-b8e5bcfb.so (0x0000ffffa8c30000)
complete list:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw,
libomp
wasn't intentionally chosen for aarch64-linux wheels, I think it was picked up from the conda environment. If we all agree thatlibgomp
is what is recommended for PyTorch, then I'm fine with switching to it now. In fact I've already suggested moving to gnu omp (#1774) but waiting mainly because of the 10% regressions with it for the eager mode.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now we have more data on
libgomp
vslibomp
(please check #1774 (comment)), and i'm fine with switching tolibgomp
for aarch64 linux.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does NVPrune do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nWEIdia it removes GPU architectures from libraries, which are never used to lower the binary size. This workflow can be dangerous if libraries depend on heuristics and can select kernels from the same GPU family (we've seen issues before where sm_61 was dropped causing all kinds of issues on GTX cards).
I don't think the pruning is useful anymore, as we are using the CUDA dependencies from PyPI now.
However, we might want to keep it here and follow up with a cleanup in a separate PR.