
Unnecessary sign extension in Vector512.Create(sbyte) and Vector512.Create(short) #108820

Closed · Fixed by #108824
MineCake147E opened this issue Oct 13, 2024 · 4 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), in-pr (there is an active PR which will close this issue when it is merged), tenet-performance (performance-related issue)
Milestone: 10.0.0

Comments

@MineCake147E (Contributor) commented Oct 13, 2024

Description

While experimenting with Vector512<sbyte>, I noticed that Vector512.Create(sbyte) first sign-extends the value and then broadcasts it to the whole vector register.

static Vector512<sbyte> Create1(ulong value) => 
    Vector512.Create((sbyte)value);
Program:Create1(ulong):System.Runtime.Intrinsics.Vector512`1[byte] (FullOpts):
       movsx    rsi, sil
       vpbroadcastb zmm0, esi
       vmovups  zmmword ptr [rdi], zmm0
       mov      rax, rdi
       vzeroupper 
       ret

Here the sign extension is unnecessary: vpbroadcastb reads only the low 8 bits of the source register, so the upper 56 bits of rsi are ignored anyway.

The same thing happens with Vector512.Create(short).

static Vector512<short> Create2(ulong value) => 
    Vector512.Create((short)value);
Program:Create2(ulong):System.Runtime.Intrinsics.Vector512`1[short] (FullOpts):
       movsx    rsi, si
       vpbroadcastw zmm0, esi
       vmovups  zmmword ptr [rdi], zmm0
       mov      rax, rdi
       vzeroupper 
       ret      

According to uops.info, on recent Intel CPUs (from the Golden Cove and Gracemont microarchitectures onward) movzx is handled entirely by register renaming, so it requires no execution cycle at all. The same is true on earlier Intel microarchitectures when the destination register differs from the source register. This is significantly faster than movsx, which always takes one cycle, and movsx can also execute on fewer ports than movzx.

Possible optimizations:

  • The clock cycle wasted on the unnecessary sign extension can be saved by using movzx instead (on Intel CPUs).
    • For example, movsx rsi, si + vpbroadcastw zmm0, esi can become movzx rax, si + vpbroadcastw zmm0, eax.
    • Even on AMD CPUs, movzx takes no longer than movsx.
  • If the value to be broadcast is first produced as a value 32 bits or wider, eliminating the conversion instruction itself would save instruction decoding resources.
    • For example, Vector512.Create((sbyte)(value >> 56)) would compile to just shr rsi, 56 + vpbroadcastb zmm0, esi.
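To illustrate the second point, here is a minimal sketch of that pattern (the helper name `CreateFromTopByte` is mine, not from the runtime); it assumes the byte of interest already sits in the top 8 bits of a 64-bit value:

```csharp
using System;
using System.Runtime.Intrinsics;

static class BroadcastSketch
{
    // Sketch only: vpbroadcastb reads just the low 8 bits of its GPR
    // source, so once the target byte has been shifted into the low
    // position, no separate sign- or zero-extension is needed.
    public static Vector512<sbyte> CreateFromTopByte(ulong value) =>
        Vector512.Create((sbyte)(value >> 56));
}
```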

A known workaround for this problem is to broadcast using the corresponding unsigned type first, then reinterpret the vector register as one with signed elements.

static Vector512<sbyte> Create3(ulong value) => 
    Vector512.Create((byte)value).AsSByte();

static Vector512<short> Create4(ulong value) => 
    Vector512.Create((ushort)value).AsInt16();
Program:Create3(ulong):System.Runtime.Intrinsics.Vector512`1[byte] (FullOpts):
       movzx    rsi, sil
       vpbroadcastb zmm0, esi
       vmovups  zmmword ptr [rdi], zmm0
       mov      rax, rdi
       vzeroupper 
       ret      

Program:Create4(ulong):System.Runtime.Intrinsics.Vector512`1[short] (FullOpts):
       movzx    rsi, si
       vpbroadcastw zmm0, esi
       vmovups  zmmword ptr [rdi], zmm0
       mov      rax, rdi
       vzeroupper 
       ret      

Configuration

Compiler Explorer

Regression?

Unknown

Data

uops.info

Analysis

@MineCake147E MineCake147E added the tenet-performance Performance related issue label Oct 13, 2024
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

xtqqczze added a commit to xtqqczze/dotnet-runtime that referenced this issue Oct 13, 2024
@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Oct 13, 2024
@tannergooding (Member) commented Oct 14, 2024

This likely needs some non-trivial perf measurement to ensure that it doesn't introduce new dependency chains, particularly when the underlying byte was built up by a prior byte-wide instruction (which causes a merge operation into the larger 32/64-bit register).

movzx/movsx is often desirable explicitly to break such dependencies and is unlikely to be the performance bottleneck in practice. On many pieces of hardware, movsx is likewise simply a register rename operation.


Fewer bytes and/or fewer instructions do not always mean faster. We need to ensure it's measured and considered in the context of the entire application, not just theoretical cycle counts.

@MineCake147E (Contributor, Author) commented

On many pieces of hardware, movsx is likewise simply a register rename operation.

My measurements say otherwise, at least on the Golden Cove microarchitecture.

Benchmark Code
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;

namespace BenchmarkPlayground
{
    [SimpleJob(runtimeMoniker: RuntimeMoniker.HostProcess)]
    [DisassemblyDiagnoser(maxDepth: int.MaxValue)]
    public class SignExtensionBenchmarks
    {
        [GlobalSetup]
        public void Setup()
        {
            Span<nuint> k = [0, 0, 0];
            RandomNumberGenerator.Fill(MemoryMarshal.AsBytes(k));
            l0 = k[0];
            l1 = k[1] | 1;
            l2 = k[2];
        }

        nuint l0, l1 = 1, l2;

        [SkipLocalsInit]
        [Benchmark(Baseline = true, OperationsPerInvoke = 1 << 20)]
        public nuint IntegerAddLatency()
        {
            var v0 = l0;
            var v1 = l1;
            var v2 = l2;
            for (int i = 0; i < 1 << 20; i += 16)
            {
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v1 += v2;   // Make sure that RyuJIT doesn't optimize all additions above away.
            }
            return v0;
        }

        [SkipLocalsInit]
        [Benchmark(OperationsPerInvoke = 1 << 20)]
        public ulong SignExtendLatency()
        {
            var v0 = (nint)l0;
            var v1 = (nint)l1;
            var v2 = (nint)l2;
            sbyte h;
            for (int i = 0; i < 1 << 20; i += 16)
            {
                h = (sbyte)v0;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                v0 += v2;   // Make sure that RyuJIT doesn't optimize all additions above away.
            }
            return (nuint)v1;
        }

        [SkipLocalsInit]
        [Benchmark(OperationsPerInvoke = 1 << 20)]
        public ulong ZeroExtendLatency()
        {
            var v0 = l0;
            var v1 = l1;
            var v2 = l2;
            byte h;
            for (int i = 0; i < 1 << 20; i += 16)
            {
                h = (byte)v0;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                v0 += v2;   // Make sure that RyuJIT doesn't optimize all additions above away.
            }
            return v1;
        }
    }
}

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4317/23H2/2023Update/SunValley3)
Intel Xeon w5-2455X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 9.0.100-rc.2.24474.11
  [Host]     : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI


| Method            | Mean      | Error     | StdDev    | Ratio | Code Size |
|-------------------|----------:|----------:|----------:|------:|----------:|
| IntegerAddLatency | 0.2531 ns | 0.0007 ns | 0.0005 ns |  1.00 |      81 B |
| SignExtendLatency | 0.4914 ns | 0.0011 ns | 0.0009 ns |  1.94 |     148 B |
| ZeroExtendLatency | 0.2535 ns | 0.0008 ns | 0.0007 ns |  1.00 |     148 B |
Benchmark Disassembly

.NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkPlayground.SignExtensionBenchmarks.IntegerAddLatency()
       mov       rax,[rcx+8]
       mov       rdx,[rcx+10]
       mov       rcx,[rcx+18]
       xor       r8d,r8d
       nop
M00_L00:
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rdx,rcx
       add       r8d,10
       cmp       r8d,100000
       jl        short M00_L00
       ret
; Total bytes of code 81

.NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkPlayground.SignExtensionBenchmarks.SignExtendLatency()
       mov       rax,[rcx+8]
       mov       rdx,[rcx+10]
       mov       rcx,[rcx+18]
       xor       r8d,r8d
       nop
M00_L00:
       movsx     r10,al
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       add       rax,rcx
       add       r8d,10
       cmp       r8d,100000
       jl        short M00_L00
       mov       rax,rdx
       ret
; Total bytes of code 148

.NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkPlayground.SignExtensionBenchmarks.ZeroExtendLatency()
       mov       rax,[rcx+8]
       mov       rdx,[rcx+10]
       mov       rcx,[rcx+18]
       xor       r8d,r8d
       nop
M00_L00:
       movzx     r10d,al
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       add       rax,rcx
       add       r8d,10
       cmp       r8d,100000
       jl        short M00_L00
       mov       rax,rdx
       ret
; Total bytes of code 148

The zero extension appears to be handled as a register renaming operation.
In contrast, the sign extension turns out to be significantly slower than the zero extension, meaning it is an actual operation beyond register renaming.


This likely needs some non-trivial perf measurement to ensure that it doesn't introduce new dependency chains, particularly when the underlying byte was built up by a prior byte-wide instruction (which causes a merge operation into the larger 32/64-bit register).

movzx/movsx is often desirable explicitly to break such dependencies

This is why I didn't propose eliminating movzx entirely in all cases.
Replacing movsx with movzx solves the aforementioned latency issue on Intel CPUs without changing the result of Vector512.Create(sbyte) or Vector512.Create(short), and without introducing a new dependency problem.
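The equivalence claimed here can be checked directly. A small sketch (mine, not from the issue) showing that the sign-extended and zero-extended broadcasts produce bit-identical vectors, because the broadcast consumes only the low 8 bits of the scalar:

```csharp
using System;
using System.Runtime.Intrinsics;

// Sketch: sign-extending or zero-extending the scalar before the
// broadcast yields the same vector, since only the low byte is read.
ulong value = 0xDEAD_BEEF_CAFE_F0ABul;
Vector512<sbyte> signExtended = Vector512.Create((sbyte)value);
Vector512<sbyte> zeroExtended = Vector512.Create((byte)value).AsSByte();
Console.WriteLine(signExtended == zeroExtended); // True
```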

@tannergooding (Member) commented

My measurement says otherwise, at least in Golden Cove microarchitecture.

I did say "many" not "all".

These types of changes are getting close to microarchitecture-specific tuning and are rarely meaningful to real-world performance in an end-to-end application scenario. Conversely, you will find that movzx is not free and has real cost on many CPUs, including several released in the last 10 years (Skylake is one notable example), while AMD Zen processors, as an example, do have zero-cost movzx and movsx.

You will also find that microbenchmarks like the above are not representative of actual instruction latency; rather, they may be representative only in particular scenarios. The official architecture optimization manuals, as well as various other well-known in-depth guides (such as Agner Fog's), go into more detail and cover additional considerations like decoder delays, dependency chains, instruction fusion, and many other cases where the use of movzx and movsx improves performance and/or allows the CPU to produce better internal microcode.


In cases where removing movzx/movsx allows the broadcast to take the memory operand directly, removing the sign extension should be beneficial across all hardware. In other cases, however, it is very much a hit-or-miss scenario, and it is better/safer (and follows the more general official guidance across microarchitectures) to preserve the designated and appropriate extension before consuming the full register width. This may be elidable in a subset of scenarios where we know the register was not previously a partial mutation, but that would likely be a minor peephole optimization and not something we'd want to broadly model in direct IR.
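For the memory-operand case mentioned above, a hedged sketch (the helper name `BroadcastFirst` is mine) of the shape where the scalar comes straight from memory, so the JIT can in principle fold the load into the broadcast's memory operand and no scalar extension instruction is needed at all:

```csharp
using System;
using System.Runtime.Intrinsics;

static class MemoryBroadcastSketch
{
    // Sketch: broadcasting a value loaded directly from memory; with a
    // folded memory operand (e.g. vpbroadcastb zmm0, byte ptr [...]),
    // there is no scalar register to extend in the first place.
    public static Vector512<sbyte> BroadcastFirst(sbyte[] data) =>
        Vector512.Create(data[0]);
}
```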

@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Oct 15, 2024
@JulieLeeMSFT JulieLeeMSFT added this to the 10.0.0 milestone Oct 15, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Dec 26, 2024