Rob Pike endianness trick not getting optimized anymore #49708

Open
SoapGentoo opened this issue May 16, 2021 · 8 comments
Labels: bugzilla (Issues migrated from bugzilla), missed-optimization

Comments

@SoapGentoo

Bugzilla Link: 50364
Version: 12.0
OS: Linux
Blocks: #48661
CC: @alexey-bataev, @efriedma-quic, @LebedevRI, @RKSimon, @zygoloid, @rotateright, @tstellar

Extended Description

I have the following repo with Rob Pike's endianness tricks:
https://github.com/SoapGentoo/portable-endianness

It seems Clang 12 broke the 64-bit store routines.
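
For context, the 64-bit store routines follow Rob Pike's byte-at-a-time pattern; a minimal sketch (the exact code lives in the linked repository):

#include <stdint.h>

// Sketch of the pattern only; see the repository above for the real code.
// Each byte is peeled off with a shift and a truncating store, and the
// compiler is expected to fold each function into a single 8-byte store
// (plus a byte swap for the big-endian variant).
void store64_to_LE(uint64_t v, unsigned char* out) {
    out[0] = (unsigned char)(v >> 0);
    out[1] = (unsigned char)(v >> 8);
    out[2] = (unsigned char)(v >> 16);
    out[3] = (unsigned char)(v >> 24);
    out[4] = (unsigned char)(v >> 32);
    out[5] = (unsigned char)(v >> 40);
    out[6] = (unsigned char)(v >> 48);
    out[7] = (unsigned char)(v >> 56);
}

void store64_to_BE(uint64_t v, unsigned char* out) {
    out[0] = (unsigned char)(v >> 56);
    out[1] = (unsigned char)(v >> 48);
    out[2] = (unsigned char)(v >> 40);
    out[3] = (unsigned char)(v >> 32);
    out[4] = (unsigned char)(v >> 24);
    out[5] = (unsigned char)(v >> 16);
    out[6] = (unsigned char)(v >> 8);
    out[7] = (unsigned char)(v >> 0);
}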

Clang 11.0.1:
store64_to_LE(unsigned long, unsigned char*): # @store64_to_LE(unsigned long, unsigned char*)
    mov qword ptr [rsi], rdi
    ret
store64_to_BE(unsigned long, unsigned char*): # @store64_to_BE(unsigned long, unsigned char*)
    movbe qword ptr [rsi], rdi
    ret

Clang 12.0.0:
.LCPI8_0:
    .quad 8 # 0x8
    .quad 16 # 0x10
    .quad 24 # 0x18
    .quad 32 # 0x20
.LCPI8_1:
    .byte 0 # 0x0
    .byte 8 # 0x8
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
store64_to_LE(unsigned long, unsigned char*): # @store64_to_LE(unsigned long, unsigned char*)
    mov byte ptr [rsi], dil
    vmovq xmm0, rdi
    vpbroadcastq ymm0, xmm0
    vpsrlvq ymm0, ymm0, ymmword ptr [rip + .LCPI8_0]
    vextracti128 xmm1, ymm0, 1
    vmovdqa xmm2, xmmword ptr [rip + .LCPI8_1] # xmm2 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
    vpshufb xmm1, xmm1, xmm2
    vpshufb xmm0, xmm0, xmm2
    vpunpcklwd xmm0, xmm0, xmm1 # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
    vmovd dword ptr [rsi + 1], xmm0
    mov rax, rdi
    shr rax, 40
    mov byte ptr [rsi + 5], al
    mov rax, rdi
    shr rax, 48
    mov byte ptr [rsi + 6], al
    shr rdi, 56
    mov byte ptr [rsi + 7], dil
    vzeroupper
    ret
.LCPI11_0:
    .quad 56 # 0x38
    .quad 48 # 0x30
    .quad 40 # 0x28
    .quad 32 # 0x20
.LCPI11_1:
    .byte 0 # 0x0
    .byte 8 # 0x8
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
store64_to_BE(unsigned long, unsigned char*): # @store64_to_BE(unsigned long, unsigned char*)
    mov rax, rdi
    vmovq xmm0, rdi
    vpbroadcastq ymm0, xmm0
    vpsrlvq ymm0, ymm0, ymmword ptr [rip + .LCPI11_0]
    vextracti128 xmm1, ymm0, 1
    vmovdqa xmm2, xmmword ptr [rip + .LCPI11_1] # xmm2 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
    vpshufb xmm1, xmm1, xmm2
    vpshufb xmm0, xmm0, xmm2
    vpunpcklwd xmm0, xmm0, xmm1 # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
    vmovd dword ptr [rsi], xmm0
    mov rcx, rdi
    shr rcx, 24
    mov byte ptr [rsi + 4], cl
    mov rcx, rdi
    shr rcx, 16
    mov byte ptr [rsi + 5], cl
    mov byte ptr [rsi + 6], ah
    mov byte ptr [rsi + 7], al
    vzeroupper
    ret

I consider detecting this sort of pattern very important for writing performant, portable code.

@LebedevRI
Member

Please can you post standalone reproducers via godbolt links?

@SoapGentoo
Author

https://godbolt.org/z/zfrc65YfW

GCC 11.1 and Clang 11 are fine; Clang 12 is off.

@LebedevRI
Member

Looks like SLPVectorizer is acting up again for the 32/64-bit cases.
https://godbolt.org/z/csrh5x1aM

@RKSimon
Collaborator

RKSimon commented May 16, 2021

Godbolt: https://godbolt.org/z/4b4dcf9T9

@rotateright
Contributor

Not to dismiss the bug, but I'm curious if using __builtin_bswap64() is an option? Or is that not portable enough to other compilers?

Note that we may have made SLP less likely to avoid this pattern with the fix for bug 50256 ( https://reviews.llvm.org/rG49950cb1f6f6 )...although this isn't quite the same - there are no loads to combine here; this is just truncates, shifts, and stores.
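
For reference, the __builtin_bswap64() alternative would look roughly like this on a little-endian target (a sketch only; store64_to_BE_bswap is a hypothetical name, not code from the repository):

#include <stdint.h>
#include <string.h>

// Hypothetical sketch: byte-swap the value, then do an unaligned-safe
// 8-byte store via memcpy, which compilers can fold into a single
// byte-swapped store (e.g. movbe on x86 targets that have it).
void store64_to_BE_bswap(uint64_t v, unsigned char* out) {
    uint64_t swapped = __builtin_bswap64(v);
    memcpy(out, &swapped, sizeof(swapped));
}

The portability concern is that __builtin_bswap64 is a GCC/Clang extension, so other compilers need their own equivalent or the plain shift pattern as a fallback.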

@LebedevRI
Member

> Not to dismiss the bug, but I'm curious if using __builtin_bswap64() is an option? Or is that not portable enough to other compilers?
>
> Note that we may have made SLP less likely to avoid this pattern with the fix for bug 50256 ( https://reviews.llvm.org/rG49950cb1f6f6 )...although this isn't quite the same - there are no loads to combine here; this is just truncates, shifts, and stores.

I'm sympathetic to the bug report. We could have ended up with this pattern after loop unrolling, where the load was templated over the type size.

I think we should be making a much better effort to match bswap intrinsics, and we certainly shouldn't prevent them from being detected.
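
Such a pattern can arise, for example, from a size-generic helper along these lines (a hypothetical illustration, not code from the repository; once the loop is fully unrolled it becomes exactly the shift/truncate/store chain discussed above, including a degenerate "v >> 0" term for the first byte):

#include <stddef.h>
#include <stdint.h>

// Hypothetical sketch of a size-generic little-endian store whose loop
// gets fully unrolled into per-byte shifts, truncations, and stores.
template <typename T>
void store_to_LE(T v, unsigned char* out) {
    for (size_t i = 0; i < sizeof(T); ++i)
        out[i] = (unsigned char)(v >> (8 * i));
}

// Explicit instantiation for the 64-bit case discussed in this issue.
template void store_to_LE<uint64_t>(uint64_t, unsigned char*);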

@efriedma-quic
Collaborator

I guess you could look at this from two different perspectives:

  1. SLP vectorization is messing up the pattern; we should avoid vectorization here.
  2. SLP vectorization should be improved to vectorize this pattern well. Then we can convert the resulting sequence to use scalar registers if appropriate.

If we're looking at SLP vectorization itself, I see the following issues:

  1. There's a general issue that SLP misses patterns involving zero (here, "val >> 0").
  2. We should recognize that the shifts are equivalent to a shuffle of some sort.
  3. The cost modeling is a bit dubious. See https://godbolt.org/z/3P1oWW6nc.

@RKSimon
Collaborator

RKSimon commented May 20, 2021

I've gone through and tweaked the AVX2 truncate and non-uniform shift costs to more closely match the 'worst-case scenario' that we're supposed to keep to (we really need some way to automate this...).

This will prevent the regression on pre-AVX512 targets, but on Skylake etc. with better trunc/shift ops the SLP will still produce the partial vectorization nightmare which the load/store combines can't recover from.

So we're probably at the limit of what we can achieve with cost model accuracy alone.

If we vectorized the entire thing, we might be able to handle a store(trunc(lshr(broadcast(),c))) pattern in store combining - but that would require us to finally fix [Bug #31572].

Also, I'm not sure whether this should be a 12.x blocker - this is just a performance regression, not a correctness regression (I'm not sure what the exact policy is for this).

@llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 11, 2021