
Unnecessary sign extension in Vector512.Create(sbyte) and Vector512.Create(short) #108820

Closed · Fixed by #108824
MineCake147E opened this issue Oct 13, 2024 · 4 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), in-pr (there is an active PR which will close this issue when it is merged), tenet-performance (performance-related issue)
Milestone: 10.0.0

Comments

@MineCake147E (Contributor) commented Oct 13, 2024

Description

While experimenting with Vector512<sbyte>, I noticed that Vector512.Create(sbyte) first sign-extends the value and then broadcasts it to the whole vector register.

static Vector512<sbyte> Create1(ulong value) => 
    Vector512.Create((sbyte)value);
Program:Create1(ulong):System.Runtime.Intrinsics.Vector512`1[byte] (FullOpts):
       movsx    rsi, sil
       vpbroadcastb zmm0, esi
       vmovups  zmmword ptr [rdi], zmm0
       mov      rax, rdi
       vzeroupper 
       ret

Here the sign extension is unnecessary: vpbroadcastb reads only the low 8 bits of the source register, so the upper 56 bits of rsi are ignored anyway.

The same thing happens with Vector512.Create(short).

static Vector512<short> Create2(ulong value) => 
    Vector512.Create((short)value);
Program:Create2(ulong):System.Runtime.Intrinsics.Vector512`1[short] (FullOpts):
       movsx    rsi, si
       vpbroadcastw zmm0, esi
       vmovups  zmmword ptr [rdi], zmm0
       mov      rax, rdi
       vzeroupper 
       ret      

According to uops.info, on recent Intel CPUs (from the Golden Cove and Gracemont microarchitectures onward) movzx is handled entirely by register renaming, so it requires no execution cycle at all. The same is true on earlier Intel microarchitectures when the destination register differs from the source register. This is significantly faster than movsx, which always takes one cycle, and movsx can also execute on fewer ports than movzx.

Possible optimizations:

  • The clock cycle wasted on the unnecessary sign extension can be saved by using movzx instead (on Intel CPUs).
    • For example, movsx rsi, si + vpbroadcastw zmm0, esi can become movzx rax, si + vpbroadcastw zmm0, eax.
    • Even on AMD CPUs, movzx takes no longer than movsx.
  • If the value to be broadcast is first produced as a value 32 bits or wider, eliminating the conversion instruction itself would save instruction decoding resources.
    • For example, Vector512.Create((sbyte)(value >> 56)) would compile to just shr rsi, 56 + vpbroadcastb zmm0, esi.
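To illustrate the second point, here is a minimal sketch of that pattern (the helper name `CreateFromTopByte` is mine, not from the runtime); it assumes the byte of interest already sits in the top 8 bits of a 64-bit value:

```csharp
using System;
using System.Runtime.Intrinsics;

static class BroadcastSketch
{
    // Sketch only: vpbroadcastb reads just the low 8 bits of its GPR
    // source, so once the target byte has been shifted into the low
    // position, no separate sign- or zero-extension is needed.
    public static Vector512<sbyte> CreateFromTopByte(ulong value) =>
        Vector512.Create((sbyte)(value >> 56));
}
```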

A known workaround for this problem is to broadcast using the corresponding unsigned type first, then reinterpret the vector register as one with signed elements.

static Vector512<sbyte> Create3(ulong value) => 
    Vector512.Create((byte)value).AsSByte();

static Vector512<short> Create4(ulong value) => 
    Vector512.Create((ushort)value).AsInt16();
Program:Create3(ulong):System.Runtime.Intrinsics.Vector512`1[byte] (FullOpts):
       movzx    rsi, sil
       vpbroadcastb zmm0, esi
       vmovups  zmmword ptr [rdi], zmm0
       mov      rax, rdi
       vzeroupper 
       ret      

Program:Create4(ulong):System.Runtime.Intrinsics.Vector512`1[short] (FullOpts):
       movzx    rsi, si
       vpbroadcastw zmm0, esi
       vmovups  zmmword ptr [rdi], zmm0
       mov      rax, rdi
       vzeroupper 
       ret      

Configuration

Compiler Explorer

Regression?

Unknown

Data

uops.info

Analysis

@MineCake147E MineCake147E added the tenet-performance Performance related issue label Oct 13, 2024
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

xtqqczze added a commit to xtqqczze/dotnet-runtime that referenced this issue Oct 13, 2024
@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Oct 13, 2024
@tannergooding (Member) commented Oct 14, 2024

This likely needs some non-trivial perf measurement to ensure that it doesn't introduce new dependency chains, particularly when the underlying byte was built up by a prior byte-wide instruction (which causes a merge operation into the larger 32/64-bit register).

movzx/movsx is often desirable explicitly to break such dependencies and is unlikely to be the performance bottleneck in practice. On many pieces of hardware, movsx is likewise simply a register rename operation.


Fewer bytes and/or fewer instructions do not always mean faster. We need to ensure it's measured and considered in the context of the entire application, not just theoretical cycle counts.

@MineCake147E (Contributor, Author) commented

On many pieces of hardware, movsx is likewise simply a register rename operation.

My measurements say otherwise, at least on the Golden Cove microarchitecture.

Benchmark Code
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;

namespace BenchmarkPlayground
{
    [SimpleJob(runtimeMoniker: RuntimeMoniker.HostProcess)]
    [DisassemblyDiagnoser(maxDepth: int.MaxValue)]
    public class SignExtensionBenchmarks
    {
        [GlobalSetup]
        public void Setup()
        {
            Span<nuint> k = [0, 0, 0];
            RandomNumberGenerator.Fill(MemoryMarshal.AsBytes(k));
            l0 = k[0];
            l1 = k[1] | 1;
            l2 = k[2];
        }

        nuint l0, l1 = 1, l2;

        [SkipLocalsInit]
        [Benchmark(Baseline = true, OperationsPerInvoke = 1 << 20)]
        public nuint IntegerAddLatency()
        {
            var v0 = l0;
            var v1 = l1;
            var v2 = l2;
            for (int i = 0; i < 1 << 20; i += 16)
            {
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v0 += v1;
                v1 += v2;   // Make sure that RyuJIT doesn't optimize all additions above away.
            }
            return v0;
        }

        [SkipLocalsInit]
        [Benchmark(OperationsPerInvoke = 1 << 20)]
        public ulong SignExtendLatency()
        {
            var v0 = (nint)l0;
            var v1 = (nint)l1;
            var v2 = (nint)l2;
            sbyte h;
            for (int i = 0; i < 1 << 20; i += 16)
            {
                h = (sbyte)v0;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                h = (sbyte)v1;
                v1 += h;
                v0 += v2;   // Make sure that RyuJIT doesn't optimize all additions above away.
            }
            return (nuint)v1;
        }

        [SkipLocalsInit]
        [Benchmark(OperationsPerInvoke = 1 << 20)]
        public ulong ZeroExtendLatency()
        {
            var v0 = l0;
            var v1 = l1;
            var v2 = l2;
            byte h;
            for (int i = 0; i < 1 << 20; i += 16)
            {
                h = (byte)v0;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                h = (byte)v1;
                v1 += h;
                v0 += v2;   // Make sure that RyuJIT doesn't optimize all additions above away.
            }
            return v1;
        }
    }
}

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4317/23H2/2023Update/SunValley3)
Intel Xeon w5-2455X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 9.0.100-rc.2.24474.11
  [Host]     : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI


| Method            | Mean      | Error     | StdDev    | Ratio | Code Size |
|-------------------|----------:|----------:|----------:|------:|----------:|
| IntegerAddLatency | 0.2531 ns | 0.0007 ns | 0.0005 ns |  1.00 |      81 B |
| SignExtendLatency | 0.4914 ns | 0.0011 ns | 0.0009 ns |  1.94 |     148 B |
| ZeroExtendLatency | 0.2535 ns | 0.0008 ns | 0.0007 ns |  1.00 |     148 B |
Benchmark Disassembly

.NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkPlayground.SignExtensionBenchmarks.IntegerAddLatency()
       mov       rax,[rcx+8]
       mov       rdx,[rcx+10]
       mov       rcx,[rcx+18]
       xor       r8d,r8d
       nop
M00_L00:
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rax,rdx
       add       rdx,rcx
       add       r8d,10
       cmp       r8d,100000
       jl        short M00_L00
       ret
; Total bytes of code 81

.NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkPlayground.SignExtensionBenchmarks.SignExtendLatency()
       mov       rax,[rcx+8]
       mov       rdx,[rcx+10]
       mov       rcx,[rcx+18]
       xor       r8d,r8d
       nop
M00_L00:
       movsx     r10,al
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       movsx     r10,dl
       add       rdx,r10
       add       rax,rcx
       add       r8d,10
       cmp       r8d,100000
       jl        short M00_L00
       mov       rax,rdx
       ret
; Total bytes of code 148

.NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; BenchmarkPlayground.SignExtensionBenchmarks.ZeroExtendLatency()
       mov       rax,[rcx+8]
       mov       rdx,[rcx+10]
       mov       rcx,[rcx+18]
       xor       r8d,r8d
       nop
M00_L00:
       movzx     r10d,al
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       movzx     r10d,dl
       add       rdx,r10
       add       rax,rcx
       add       r8d,10
       cmp       r8d,100000
       jl        short M00_L00
       mov       rax,rdx
       ret
; Total bytes of code 148

The zero extension appears to be handled as a register renaming operation.
In contrast, the sign extension turns out to be significantly slower than the zero extension, meaning it is an actual operation beyond register renaming.


This likely needs some non-trivial perf measurement to ensure that it doesn't introduce new dependency chains, particularly when the underlying byte was built up by a prior byte-wide instruction (which causes a merge operation into the larger 32/64-bit register).

movzx/movsx is often desirable explicitly to break such dependencies

This is why I didn't propose eliminating movzx entirely in all cases.
Replacing movsx with movzx solves the aforementioned latency issue on Intel CPUs without changing the result of Vector512.Create(sbyte) or Vector512.Create(short), and without introducing a new dependency problem.
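The equivalence claimed here can be checked directly. A small sketch (mine, not from the issue) showing that the sign-extended and zero-extended broadcasts produce bit-identical vectors, because the broadcast consumes only the low 8 bits of the scalar:

```csharp
using System;
using System.Runtime.Intrinsics;

// Sketch: sign-extending or zero-extending the scalar before the
// broadcast yields the same vector, since only the low byte is read.
ulong value = 0xDEAD_BEEF_CAFE_F0ABul;
Vector512<sbyte> signExtended = Vector512.Create((sbyte)value);
Vector512<sbyte> zeroExtended = Vector512.Create((byte)value).AsSByte();
Console.WriteLine(signExtended == zeroExtended); // True
```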

@tannergooding (Member) commented

My measurement says otherwise, at least in Golden Cove microarchitecture.

I did say "many" not "all".

These types of changes are getting close to microarchitecture-specific tuning and are rarely meaningful to real-world performance in an end-to-end application scenario. Conversely, you will find that movzx is not free and has real cost on many CPUs, including several released in the last 10 years (Skylake is one notable example), while AMD Zen processors, as an example, do have zero-cost movzx and movsx.

You will also find that microbenchmarks like the above are not representative of actual instruction latency; rather, they may be representative only in particular scenarios. The official architecture optimization manuals, as well as various other well-known in-depth guides (such as Agner Fog's), go into more detail and cover additional considerations like decoder delays, dependency chains, instruction fusion, and many other cases where the use of movzx and movsx improves performance and/or allows the CPU to produce better internal microcode.


In cases where removing movzx/movsx allows the broadcast to take the memory operand directly, removing the sign extension should be beneficial across all hardware. In other cases, however, it is very much a hit-or-miss scenario, and it is better/safer (and follows the more general official guidance across microarchitectures) to preserve the designated and appropriate extension before consuming the full register width. This may be elidable in a subset of scenarios where we know the register was not previously a partial mutation, but that would likely be a minor peephole optimization and not something we'd want to broadly model in direct IR.
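For the memory-operand case mentioned above, a hedged sketch (the helper name `BroadcastFirst` is mine) of the shape where the scalar comes straight from memory, so the JIT can in principle fold the load into the broadcast's memory operand and no scalar extension instruction is needed at all:

```csharp
using System;
using System.Runtime.Intrinsics;

static class MemoryBroadcastSketch
{
    // Sketch: broadcasting a value loaded directly from memory; with a
    // folded memory operand (e.g. vpbroadcastb zmm0, byte ptr [...]),
    // there is no scalar register to extend in the first place.
    public static Vector512<sbyte> BroadcastFirst(sbyte[] data) =>
        Vector512.Create(data[0]);
}
```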

@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Oct 15, 2024
@JulieLeeMSFT JulieLeeMSFT added this to the 10.0.0 milestone Oct 15, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Dec 26, 2024