Suboptimal lowering of vget_high_XXX on AArch64 #58323
Comments
@llvm/issue-subscribers-backend-aarch64 |
I'd like to see whether I can implement this, but my experience with LLVM is zero (apart from compiling and some packaging). Cc. @kbeyls, @davemgreen, @Tarinn. @kbeyls said on IRC that sometimes EXT might be faster than DUP, but didn't remember when or why. I came up with this test case in
|
Do you mean the DUP which is a MOV, as in the GISel version of https://godbolt.org/z/jxEnWa1jz? If so I think that sounds good. There are tablegen patterns for "extract_subvector" that probably want to use DUPi64 as opposed to the EXT instructions. It looks like they are defined in a couple of places at the moment. The ones here look incorrect and are probably unused/dead:
There are ones in the ExtPat<> multiclass too. Hopefully one of the two can be updated to use the new instruction, and the other removed? |
Yes, that's what's meant here, as far as I understand. I added armv7 with neon as a separate tab too, but I don't observe any neon instructions inserted for that target. https://godbolt.org/z/qP6bs4sq6
I'll have to study that file a bit more to understand what exactly needs changing, will continue on Monday! Thanks for the pointers! |
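For readers without the Compiler Explorer links at hand, a function of roughly this shape is enough to exercise the lowering being discussed. This is only a sketch and not the test case linked above: the name `high_half_u8` and the non-NEON fallback are assumptions added so the snippet also builds on other hosts.

```c
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: copy the upper 8 bytes of a 16-byte vector to out. */
void high_half_u8(const uint8_t in[16], uint8_t out[8]) {
#if defined(__ARM_NEON)
    uint8x16_t b = vld1q_u8(in);   /* load a full 128-bit vector        */
    uint8x8_t a = vget_high_u8(b); /* the intrinsic whose lowering is   */
                                   /* in question (EXT vs. DUP/MOV)     */
    vst1_u8(out, a);               /* store the 64-bit result           */
#else
    /* Scalar fallback with the same result, for non-NEON hosts. */
    memcpy(out, in + 8, 8);
#endif
}
```

Compiling this for an AArch64 target with optimizations and inspecting the assembly should show which instruction the backend picks for the `vget_high_u8` call.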
Yeah - Arm has a different register layout to AArch64, where the Q0 register is made up of D0 and D1 so the extract should be as cheap as just using the second register. (In that case it also looks like it is splitting the load into two registers). |
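The register aliasing described here can be sketched as a toy model in C. The type `a32_regfile` and the helper names are hypothetical and only illustrate the 32-bit Arm layout, where Qn overlays D(2n) and D(2n+1). On AArch64, by contrast, Dn names the low 64 bits of Vn and the high half has no register name of its own, which is why an instruction is needed there.

```c
#include <stdint.h>

/* Toy model (assumption): the 32-bit Arm NEON register file as 32 D registers. */
typedef struct {
    uint64_t d[32];
} a32_regfile;

/* Index of the D register that holds the low half of Qn. */
static unsigned q_low_d(unsigned qn) { return 2 * qn; }

/* Index of the D register that holds the high half of Qn. */
static unsigned q_high_d(unsigned qn) { return 2 * qn + 1; }

/* "Extracting" the high half of Qn is just reading another register. */
static uint64_t read_q_high(const a32_regfile *rf, unsigned qn) {
    return rf->d[q_high_d(qn)];
}
```

In this model the high half of Q0 is simply D1, so no data movement is required to use it as a 64-bit vector.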
Clang/LLVM lowers a = vget_high_XXX(b) NEON intrinsics to EXT Va.16B, Vb.16B, Vb.16B, #8 instead of the DUP Va.1D, Vb.D[1] suggested by ARM NEON intrinsics guide. This lowering has two drawbacks:
- EXT Va.16B, Vb.16B, Vb.16B, #8 generates two 64-bit microoperations instead of one.
- EXT Va.16B, Vb.16B, Vb.16B, #8 leaves the upper part of the destination register initialized (but unused), thereby preventing its power-gating.
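The difference in the destination register can be illustrated with a small byte-level model in C. This is a sketch of the architectural results only: it says nothing about micro-op counts or power-gating, which depend on the core. The type `vreg128` and the function names are hypothetical, and the DUP is read here as the scalar form that writes the 64-bit D register and zeroes the upper half.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one 128-bit vector register, byte 0 = lowest. */
typedef struct {
    uint8_t b[16];
} vreg128;

/* Model of EXT Vd.16B, Vn.16B, Vm.16B, #imm (imm in 0..15):
 * take 16 consecutive bytes, starting at byte imm, from the
 * 32-byte concatenation with Vn in the low half and Vm in the high half. */
static vreg128 model_ext(vreg128 vn, vreg128 vm, unsigned imm) {
    uint8_t cat[32];
    vreg128 vd;
    memcpy(cat, vn.b, 16);
    memcpy(cat + 16, vm.b, 16);
    memcpy(vd.b, cat + imm, 16);
    return vd;
}

/* Model of the scalar DUP of element D[1]: the low 64 bits receive
 * the high half of Vn, and the upper 64 bits are written as zero. */
static vreg128 model_dup_d1(vreg128 vn) {
    vreg128 vd;
    memset(vd.b, 0, sizeof vd.b);
    memcpy(vd.b, vn.b + 8, 8);
    return vd;
}
```

With Vn = Vm = b and imm = 8, `model_ext` returns the high half of b in its low 8 bytes and the low half of b in its upper 8 bytes, whereas `model_dup_d1` returns the same low 8 bytes with the upper 8 bytes zeroed.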