Unnecessary sign extension in Vector512.Create(sbyte) and Vector512.Create(short)
#108820
Comments
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
This likely needs some non-trivial perf measurement to ensure that it doesn't introduce new dependency chains, particularly when the underlying byte was built up by a prior byte-wide instruction (which causes a merge operation into the larger 32/64-bit register).
Fewer bytes and/or fewer instructions do not always mean faster code. We need to ensure it's measured and considered in the context of the entire application, not just theoretical cycle counts.
My measurement says otherwise, at least on the Golden Cove microarchitecture.
Benchmark Code
Benchmark Disassembly
.NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
; BenchmarkPlayground.SignExtensionBenchmarks.IntegerAddLatency()
mov rax,[rcx+8]
mov rdx,[rcx+10]
mov rcx,[rcx+18]
xor r8d,r8d
nop
M00_L00:
add rax,rdx
add rax,rdx
add rax,rdx
add rax,rdx
add rax,rdx
add rax,rdx
add rax,rdx
add rax,rdx
add rax,rdx
add rax,rdx
add rax,rdx
add rax,rdx
add rax,rdx
add rax,rdx
add rax,rdx
add rax,rdx
add rdx,rcx
add r8d,10
cmp r8d,100000
jl short M00_L00
ret
; Total bytes of code 81
.NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
; BenchmarkPlayground.SignExtensionBenchmarks.SignExtendLatency()
mov rax,[rcx+8]
mov rdx,[rcx+10]
mov rcx,[rcx+18]
xor r8d,r8d
nop
M00_L00:
movsx r10,al
add rdx,r10
movsx r10,dl
add rdx,r10
movsx r10,dl
add rdx,r10
movsx r10,dl
add rdx,r10
movsx r10,dl
add rdx,r10
movsx r10,dl
add rdx,r10
movsx r10,dl
add rdx,r10
movsx r10,dl
add rdx,r10
movsx r10,dl
add rdx,r10
movsx r10,dl
add rdx,r10
movsx r10,dl
add rdx,r10
movsx r10,dl
add rdx,r10
movsx r10,dl
add rdx,r10
movsx r10,dl
add rdx,r10
movsx r10,dl
add rdx,r10
movsx r10,dl
add rdx,r10
add rax,rcx
add r8d,10
cmp r8d,100000
jl short M00_L00
mov rax,rdx
ret
; Total bytes of code 148
.NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
; BenchmarkPlayground.SignExtensionBenchmarks.ZeroExtendLatency()
mov rax,[rcx+8]
mov rdx,[rcx+10]
mov rcx,[rcx+18]
xor r8d,r8d
nop
M00_L00:
movzx r10d,al
add rdx,r10
movzx r10d,dl
add rdx,r10
movzx r10d,dl
add rdx,r10
movzx r10d,dl
add rdx,r10
movzx r10d,dl
add rdx,r10
movzx r10d,dl
add rdx,r10
movzx r10d,dl
add rdx,r10
movzx r10d,dl
add rdx,r10
movzx r10d,dl
add rdx,r10
movzx r10d,dl
add rdx,r10
movzx r10d,dl
add rdx,r10
movzx r10d,dl
add rdx,r10
movzx r10d,dl
add rdx,r10
movzx r10d,dl
add rdx,r10
movzx r10d,dl
add rdx,r10
movzx r10d,dl
add rdx,r10
add rax,rcx
add r8d,10
cmp r8d,100000
jl short M00_L00
mov rax,rdx
ret
; Total bytes of code 148
The zero extension seems to be treated as a register renaming operation.
This is why I didn't propose to eliminate
I did say "many" not "all". These types of changes are beginning to get near microarchitecture-specific levels and are rarely meaningful to real-world perf for an end-to-end application scenario.
You will inversely find, as an example, that
You will also find that microbenchmarks like the above are not representative of actual instruction latency. Rather, they may only be representative in particular scenarios.
The official architecture optimization manuals, as well as various other well-known in-depth guides (such as those by Agner Fog), go into more detail and cover additional considerations like decoder delays, dependency chains, instruction fusion, and many other considerations where the use of
In cases where removing
Description
When I was playing around with Vector512<sbyte>, I noticed that Vector512.Create(sbyte) first sign-extends the value, then broadcasts it to the whole vector register. Here, the sign extension is unnecessary, as the upper 56 bits of rsi would be completely ignored anyway. A similar thing happens in Vector512.Create(short) as well.

According to uops.info, on recent Intel CPUs, from the Golden Cove and Gracemont microarchitectures onward, the movzx instruction is handled as a register rename, so it does not require a clock cycle to execute. The same is true for earlier microarchitectures if the destination register is different from the source register. This is significantly faster than movsx, which takes one clock cycle in any case. And movsx can be executed on fewer execution ports than movzx.

Possible optimization:
- Use movzx instead (Intel CPUs only).
  - movsx rsi, si → vpbroadcastw zmm0, esi can be movzx rax, si → vpbroadcastw zmm0, eax.
  - movzx wouldn't take any longer than movsx anyway.
- Vector512.Create((sbyte)(rsi >> -8)) would just be shr rsi, 56 → vpbroadcastb zmm0, esi.

A known workaround for this problem is to first use unsigned types to broadcast, then re-interpret the vector register to be one with signed data.
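The workaround described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names (BroadcastWorkaround, CreateViaUnsigned) are hypothetical and not part of any library, and whether the JIT actually omits the extension for this pattern should be confirmed by inspecting the generated disassembly. Only Vector512.Create, AsSByte, and AsInt16 are existing APIs.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical helper class; the names here are illustrative only.
public static class BroadcastWorkaround
{
    // Broadcast through the unsigned element type, then reinterpret
    // the same 512 bits as signed bytes. No lane values change.
    public static Vector512<sbyte> CreateViaUnsigned(sbyte value)
        => Vector512.Create(unchecked((byte)value)).AsSByte();

    // Same idea for 16-bit elements.
    public static Vector512<short> CreateViaUnsigned(short value)
        => Vector512.Create(unchecked((ushort)value)).AsInt16();
}
```

Because the cast to the unsigned type keeps the low 8 (or 16) bits unchanged and the reinterpretation does not touch the data, the result should compare equal to the direct Vector512.Create(sbyte) or Vector512.Create(short) call for every input; only the generated code is expected to differ.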
Configuration
Compiler Explorer
Regression?
Unknown
Data
uops.info
Analysis