Rob Pike endianness trick not getting optimized anymore #49708
Comments
Please can you post standalone reproducers via godbolt links?
https://godbolt.org/z/zfrc65YfW: GCC 11.1 and Clang 11 are fine, Clang 12 is off.
Looks like the SLPVectorizer is acting up again for the 32/64-bit cases.
Godbolt: https://godbolt.org/z/4b4dcf9T9
Not to dismiss the bug, but I'm curious if using __builtin_bswap64() is an option? Or is that not portable enough across other compilers? Note that we may have made SLP less likely to avoid this pattern with the fix for bug 50256 (https://reviews.llvm.org/rG49950cb1f6f6)... although this isn't quite the same: there are no loads to combine here; this is just truncates, shifts, and stores.
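For reference, the bswap-based alternative mentioned above might look roughly like the sketch below. The helper name store64_to_BE_bswap and the little-endian-host assumption are illustrative, not from the repo; __builtin_bswap64 is the GCC/Clang intrinsic, which is exactly the portability trade-off being asked about.

#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: byte-swap the value first, then do a single 8-byte
 * memcpy, which the compiler lowers to one mov/movbe store.  Relies on the
 * GCC/Clang-specific __builtin_bswap64 and assumes a little-endian host. */
static void store64_to_BE_bswap(uint64_t v, unsigned char *out) {
#if defined(__GNUC__) || defined(__clang__)
    uint64_t be = __builtin_bswap64(v);
    memcpy(out, &be, sizeof(be));
#else
    /* Portable fallback: the shift-per-byte pattern under discussion. */
    for (int i = 0; i < 8; ++i)
        out[i] = (unsigned char)(v >> (56 - 8 * i));
#endif
}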
I'm sympathetic to the bug report. I think we should be making a much better effort to match bswap.
I guess you could look at this from two different perspectives:
If we're looking at SLP vectorization itself, I see the following issues:
I've gone through and tweaked the AVX2 truncate and non-uniform shift costs to more closely match the 'worst case scenario' that we're supposed to keep to (we really need some way to automate this...). This will prevent the regression on pre-AVX512 targets, but on Skylake etc., with better trunc/shift ops, SLP will still produce the partial-vectorization nightmare which the load/store combines can't recover from. So we're probably at the limit of what we can achieve with cost-model accuracy alone. If we vectorized the entire thing, we might be able to handle a store(trunc(lshr(broadcast(),c))) pattern in store combining, but that would require us finally fixing [Bug #31572]. Also, I'm not sure whether this should be a 12.x blocker: this is just a performance regression, not a correctness regression (not sure what the exact policy is for this?).
Extended Description
I have the following repo with Rob Pike's endianness tricks:
https://github.com/SoapGentoo/portable-endianness
It seems Clang 12 broke the 64-bit store routines.
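The C source itself is not quoted here; the 64-bit routines in that repo follow the usual shift-a-byte-and-store pattern, roughly as sketched below (a paraphrase consistent with the assembly that follows, not a verbatim copy of the repo).

#include <stdint.h>

/* Paraphrased sketch of Rob Pike's trick: peel each byte off with a shift
 * and store it individually.  This is endianness-agnostic C that the
 * compiler is expected to collapse into a single (byte-swapped) store. */
void store64_to_LE(uint64_t v, unsigned char *out) {
    out[0] = (unsigned char)(v);
    out[1] = (unsigned char)(v >> 8);
    out[2] = (unsigned char)(v >> 16);
    out[3] = (unsigned char)(v >> 24);
    out[4] = (unsigned char)(v >> 32);
    out[5] = (unsigned char)(v >> 40);
    out[6] = (unsigned char)(v >> 48);
    out[7] = (unsigned char)(v >> 56);
}

void store64_to_BE(uint64_t v, unsigned char *out) {
    out[0] = (unsigned char)(v >> 56);
    out[1] = (unsigned char)(v >> 48);
    out[2] = (unsigned char)(v >> 40);
    out[3] = (unsigned char)(v >> 32);
    out[4] = (unsigned char)(v >> 24);
    out[5] = (unsigned char)(v >> 16);
    out[6] = (unsigned char)(v >> 8);
    out[7] = (unsigned char)(v);
}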
Clang 11.0.1:
store64_to_LE(unsigned long, unsigned char*): # @store64_to_LE(unsigned long, unsigned char*)
mov qword ptr [rsi], rdi
ret
store64_to_BE(unsigned long, unsigned char*): # @store64_to_BE(unsigned long, unsigned char*)
movbe qword ptr [rsi], rdi
ret
Clang 12.0.0:
.LCPI8_0:
.quad 8 # 0x8
.quad 16 # 0x10
.quad 24 # 0x18
.quad 32 # 0x20
.LCPI8_1:
.byte 0 # 0x0
.byte 8 # 0x8
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
store64_to_LE(unsigned long, unsigned char*): # @store64_to_LE(unsigned long, unsigned char*)
mov byte ptr [rsi], dil
vmovq xmm0, rdi
vpbroadcastq ymm0, xmm0
vpsrlvq ymm0, ymm0, ymmword ptr [rip + .LCPI8_0]
vextracti128 xmm1, ymm0, 1
vmovdqa xmm2, xmmword ptr [rip + .LCPI8_1] # xmm2 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
vpshufb xmm1, xmm1, xmm2
vpshufb xmm0, xmm0, xmm2
vpunpcklwd xmm0, xmm0, xmm1 # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
vmovd dword ptr [rsi + 1], xmm0
mov rax, rdi
shr rax, 40
mov byte ptr [rsi + 5], al
mov rax, rdi
shr rax, 48
mov byte ptr [rsi + 6], al
shr rdi, 56
mov byte ptr [rsi + 7], dil
vzeroupper
ret
.LCPI11_0:
.quad 56 # 0x38
.quad 48 # 0x30
.quad 40 # 0x28
.quad 32 # 0x20
.LCPI11_1:
.byte 0 # 0x0
.byte 8 # 0x8
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
store64_to_BE(unsigned long, unsigned char*): # @store64_to_BE(unsigned long, unsigned char*)
mov rax, rdi
vmovq xmm0, rdi
vpbroadcastq ymm0, xmm0
vpsrlvq ymm0, ymm0, ymmword ptr [rip + .LCPI11_0]
vextracti128 xmm1, ymm0, 1
vmovdqa xmm2, xmmword ptr [rip + .LCPI11_1] # xmm2 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
vpshufb xmm1, xmm1, xmm2
vpshufb xmm0, xmm0, xmm2
vpunpcklwd xmm0, xmm0, xmm1 # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
vmovd dword ptr [rsi], xmm0
mov rcx, rdi
shr rcx, 24
mov byte ptr [rsi + 4], cl
mov rcx, rdi
shr rcx, 16
mov byte ptr [rsi + 5], cl
mov byte ptr [rsi + 6], ah
mov byte ptr [rsi + 7], al
vzeroupper
ret
I consider detecting this sort of pattern very important in order to write performant, portable code.