Rob Pike endianness trick not getting optimized anymore #49708

Open
SoapGentoo opened this issue May 16, 2021 · 8 comments
Labels: bugzilla (Issues migrated from bugzilla), missed-optimization

Comments

@SoapGentoo

Bugzilla Link: 50364
Version: 12.0
OS: Linux
Blocks: #48661
CC: @alexey-bataev, @efriedma-quic, @LebedevRI, @RKSimon, @zygoloid, @rotateright, @tstellar

Extended Description

I have the following repo with Rob Pike's endianness tricks:
https://github.com/SoapGentoo/portable-endianness

It seems Clang 12 broke the 64-bit store routines.
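
For context, the 64-bit store routines follow Rob Pike's byte-at-a-time pattern; a minimal sketch (the exact code lives in the linked repository):

#include <stdint.h>

// Sketch of the pattern only; see the repository above for the real code.
// Each byte is peeled off with a shift and a truncating store, and the
// compiler is expected to fold each function into a single 8-byte store
// (plus a byte swap for the big-endian variant).
void store64_to_LE(uint64_t v, unsigned char* out) {
    out[0] = (unsigned char)(v >> 0);
    out[1] = (unsigned char)(v >> 8);
    out[2] = (unsigned char)(v >> 16);
    out[3] = (unsigned char)(v >> 24);
    out[4] = (unsigned char)(v >> 32);
    out[5] = (unsigned char)(v >> 40);
    out[6] = (unsigned char)(v >> 48);
    out[7] = (unsigned char)(v >> 56);
}

void store64_to_BE(uint64_t v, unsigned char* out) {
    out[0] = (unsigned char)(v >> 56);
    out[1] = (unsigned char)(v >> 48);
    out[2] = (unsigned char)(v >> 40);
    out[3] = (unsigned char)(v >> 32);
    out[4] = (unsigned char)(v >> 24);
    out[5] = (unsigned char)(v >> 16);
    out[6] = (unsigned char)(v >> 8);
    out[7] = (unsigned char)(v >> 0);
}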

Clang 11.0.1:
store64_to_LE(unsigned long, unsigned char*): # @store64_to_LE(unsigned long, unsigned char*)
    mov qword ptr [rsi], rdi
    ret
store64_to_BE(unsigned long, unsigned char*): # @store64_to_BE(unsigned long, unsigned char*)
    movbe qword ptr [rsi], rdi
    ret

Clang 12.0.0:
.LCPI8_0:
    .quad 8 # 0x8
    .quad 16 # 0x10
    .quad 24 # 0x18
    .quad 32 # 0x20
.LCPI8_1:
    .byte 0 # 0x0
    .byte 8 # 0x8
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
store64_to_LE(unsigned long, unsigned char*): # @store64_to_LE(unsigned long, unsigned char*)
    mov byte ptr [rsi], dil
    vmovq xmm0, rdi
    vpbroadcastq ymm0, xmm0
    vpsrlvq ymm0, ymm0, ymmword ptr [rip + .LCPI8_0]
    vextracti128 xmm1, ymm0, 1
    vmovdqa xmm2, xmmword ptr [rip + .LCPI8_1] # xmm2 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
    vpshufb xmm1, xmm1, xmm2
    vpshufb xmm0, xmm0, xmm2
    vpunpcklwd xmm0, xmm0, xmm1 # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
    vmovd dword ptr [rsi + 1], xmm0
    mov rax, rdi
    shr rax, 40
    mov byte ptr [rsi + 5], al
    mov rax, rdi
    shr rax, 48
    mov byte ptr [rsi + 6], al
    shr rdi, 56
    mov byte ptr [rsi + 7], dil
    vzeroupper
    ret
.LCPI11_0:
    .quad 56 # 0x38
    .quad 48 # 0x30
    .quad 40 # 0x28
    .quad 32 # 0x20
.LCPI11_1:
    .byte 0 # 0x0
    .byte 8 # 0x8
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
    .zero 1
store64_to_BE(unsigned long, unsigned char*): # @store64_to_BE(unsigned long, unsigned char*)
    mov rax, rdi
    vmovq xmm0, rdi
    vpbroadcastq ymm0, xmm0
    vpsrlvq ymm0, ymm0, ymmword ptr [rip + .LCPI11_0]
    vextracti128 xmm1, ymm0, 1
    vmovdqa xmm2, xmmword ptr [rip + .LCPI11_1] # xmm2 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
    vpshufb xmm1, xmm1, xmm2
    vpshufb xmm0, xmm0, xmm2
    vpunpcklwd xmm0, xmm0, xmm1 # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
    vmovd dword ptr [rsi], xmm0
    mov rcx, rdi
    shr rcx, 24
    mov byte ptr [rsi + 4], cl
    mov rcx, rdi
    shr rcx, 16
    mov byte ptr [rsi + 5], cl
    mov byte ptr [rsi + 6], ah
    mov byte ptr [rsi + 7], al
    vzeroupper
    ret

I consider detecting this sort of pattern very important for writing performant, portable code.

@LebedevRI
Member

Please can you post standalone reproducers via godbolt links?

@SoapGentoo
Author

https://godbolt.org/z/zfrc65YfW

GCC 11.1 and Clang 11 are fine; Clang 12 is off.

@LebedevRI
Member

Looks like SLPVectorizer is acting up again for the 32/64-bit cases.
https://godbolt.org/z/csrh5x1aM

@RKSimon
Collaborator

RKSimon commented May 16, 2021

Godbolt: https://godbolt.org/z/4b4dcf9T9

@rotateright
Contributor

Not to dismiss the bug, but I'm curious if using __builtin_bswap64() is an option? Or is that not portable enough to other compilers?

Note that we may have made SLP less likely to avoid this pattern with the fix for bug 50256 ( https://reviews.llvm.org/rG49950cb1f6f6 )...although this isn't quite the same - there are no loads to combine here; this is just truncates, shifts, and stores.
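
For reference, the __builtin_bswap64() alternative would look roughly like this on a little-endian target (a sketch only; store64_to_BE_bswap is a hypothetical name, not code from the repository):

#include <stdint.h>
#include <string.h>

// Hypothetical sketch: byte-swap the value, then do an unaligned-safe
// 8-byte store via memcpy, which compilers can fold into a single
// byte-swapped store (e.g. movbe on x86 targets that have it).
void store64_to_BE_bswap(uint64_t v, unsigned char* out) {
    uint64_t swapped = __builtin_bswap64(v);
    memcpy(out, &swapped, sizeof(swapped));
}

The portability concern is that __builtin_bswap64 is a GCC/Clang extension, so other compilers need their own equivalent or the plain shift pattern as a fallback.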

@LebedevRI
Member

> Not to dismiss the bug, but I'm curious if using __builtin_bswap64() is an option? Or is that not portable enough to other compilers?
>
> Note that we may have made SLP less likely to avoid this pattern with the fix for bug 50256 ( https://reviews.llvm.org/rG49950cb1f6f6 )...although this isn't quite the same - there are no loads to combine here; this is just truncates, shifts, and stores.

I'm sympathetic to the bug report. We could have ended up with this pattern after loop unrolling, where the load was templated over the type size.

I think we should be making a much better effort to match bswap intrinsics, and we certainly shouldn't prevent them from being detected.
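
Such a pattern can arise, for example, from a size-generic helper along these lines (a hypothetical illustration, not code from the repository; once the loop is fully unrolled it becomes exactly the shift/truncate/store chain discussed above, including a degenerate "v >> 0" term for the first byte):

#include <stddef.h>
#include <stdint.h>

// Hypothetical sketch of a size-generic little-endian store whose loop
// gets fully unrolled into per-byte shifts, truncations, and stores.
template <typename T>
void store_to_LE(T v, unsigned char* out) {
    for (size_t i = 0; i < sizeof(T); ++i)
        out[i] = (unsigned char)(v >> (8 * i));
}

// Explicit instantiation for the 64-bit case discussed in this issue.
template void store_to_LE<uint64_t>(uint64_t, unsigned char*);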

@efriedma-quic
Collaborator

I guess you could look at this from two different perspectives:

  1. SLP vectorization is messing up the pattern; we should avoid vectorization here.
  2. SLP vectorization should be improved to vectorize this pattern well. Then we can convert the resulting sequence to use scalar registers if appropriate.

If we're looking at SLP vectorization itself, I see the following issues:

  1. There's a general issue that SLP misses patterns involving zero (here, "val >> 0").
  2. We should recognize that the shifts are equivalent to a shuffle of some sort.
  3. The cost modeling is a bit dubious. See https://godbolt.org/z/3P1oWW6nc.

@RKSimon
Collaborator

RKSimon commented May 20, 2021

I've gone through and tweaked the AVX2 truncate and non-uniform shift costs to more closely match the 'worst-case scenario' that we're supposed to keep to (we really need some way to automate this...).

This will prevent the regression on pre-AVX512 targets, but on Skylake etc. with better trunc/shift ops the SLP will still produce the partial vectorization nightmare which the load/store combines can't recover from.

So we're probably at the limit of what we can achieve with cost model accuracy alone.

If we vectorized the entire thing, we might be able to handle a store(trunc(lshr(broadcast(),c))) pattern in store combining - but that would require us to finally fix [Bug #31572].

Also, I'm not sure whether this should be a 12.x blocker - this is just a performance regression, not a correctness regression (I'm not sure what the exact policy is for this).

@llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 11, 2021