Vector{128,256} operations that use MmShuffle fall back to method call #2121

gfoidl · 2022-05-18T12:13:55Z

Prerequisites

I have written a descriptive issue title
I have verified that I am running the latest version of ImageSharp
I have verified if the problem exist in both DEBUG and RELEASE mode
I have searched open and closed issues to ensure it has not already been reported

ImageSharp version

Current main branch

Other ImageSharp packages and versions

none

Environment (Operating system, version and so on)

all .NET supported

.NET Framework version

all

Description

While working on #1762 I recognized that methods that use

ImageSharp/src/ImageSharp/Common/Helpers/SimdUtils.Shuffle.cs

Lines 236 to 238 in c661ab1

    
           [MethodImpl(InliningOptions.ShortMethod)] 
        
           public static byte MmShuffle(byte p3, byte p2, byte p1, byte p0) 
        
               => (byte)((p3 << 6) | (p2 << 4) | (p1 << 2) | p0);

won't emit platform-intrinsics, rather fallback to a method call as the value isn't a constant.

E.g. Vp8Encoding.FTransformPass1SSE2 looks after inlining the vector constants like

       push      rdi
       push      rsi
       push      rbx
       sub       rsp,60
       vzeroupper
       xor       eax,eax
       mov       [rsp+50],rax
       mov       [rsp+58],rax
       mov       rsi,rdx
       mov       rdi,r8
       mov       rbx,r9
       vmovupd   xmm0,[rcx]
       lea       rcx,[rsp+40]
       vmovapd   [rsp+30],xmm0
       lea       rdx,[rsp+30]
       mov       r8d,0B1
       call      System.Runtime.Intrinsics.X86.Sse2.ShuffleHigh(System.Runtime.Intrinsics.Vector128`1<Int16>, Byte)
       vmovupd   xmm0,[rsi]
       lea       rcx,[rsp+50]
       vmovapd   [rsp+30],xmm0
       lea       rdx,[rsp+30]
       mov       r8d,0B1
       call      System.Runtime.Intrinsics.X86.Sse2.ShuffleHigh(System.Runtime.Intrinsics.Vector128`1<Int16>, Byte)
       vmovapd   xmm0,[rsp+40]
       vpunpcklqdq xmm0,xmm0,[rsp+50]
       vmovapd   xmm1,[rsp+40]
       vpunpckhqdq xmm1,xmm1,[rsp+50]
       vpaddw    xmm2,xmm0,xmm1
       vpmaddwd  xmm3,xmm2,[7FF7D3640A20]
       vpmaddwd  xmm2,xmm2,[7FF7D3640A30]
       vpsubw    xmm0,xmm0,xmm1
       vpmaddwd  xmm1,xmm0,[7FF7D3640A40]
       vpaddd    xmm1,xmm1,[7FF7D3640A50]
       vpsrad    xmm1,xmm1,9
       vpmaddwd  xmm0,xmm0,[7FF7D3640A60]
       vpaddd    xmm0,xmm0,[7FF7D3640A70]
       vpsrad    xmm0,xmm0,9
       vpackssdw xmm0,xmm1,xmm0
       vpackssdw xmm1,xmm3,xmm2
       vpunpcklwd xmm2,xmm1,xmm0
       vpunpckhwd xmm0,xmm1,xmm0
       vpunpckhdq xmm1,xmm2,xmm0
       vpunpckldq xmm0,xmm2,xmm0
       vmovupd   [rdi],xmm0
       mov       rcx,rbx
       vmovapd   [rsp+20],xmm1
       lea       rdx,[rsp+20]
       mov       r8d,4E
       call      System.Runtime.Intrinsics.X86.Sse2.Shuffle(System.Runtime.Intrinsics.Vector128`1<Int32>, Byte)
       nop
       add       rsp,60
       pop       rbx
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 245

If instead SimdUtils.Shuffle.MmShuffle the constant is given as literal, then the code boils down to:

       vzeroupper
       vpshufhw  xmm0,[rdx],0B1
       vpshufhw  xmm1,[rcx],0B1
       vpunpcklqdq xmm2,xmm1,xmm0
       vpunpckhqdq xmm0,xmm1,xmm0
       vpaddw    xmm1,xmm2,xmm0
       vpmaddwd  xmm3,xmm1,[7FF7D365DAC0]
       vpmaddwd  xmm1,xmm1,[7FF7D365DAD0]
       vpsubw    xmm0,xmm2,xmm0
       vpmaddwd  xmm2,xmm0,[7FF7D365DAE0]
       vpaddd    xmm2,xmm2,[7FF7D365DAF0]
       vpsrad    xmm2,xmm2,9
       vpmaddwd  xmm0,xmm0,[7FF7D365DB00]
       vpaddd    xmm0,xmm0,[7FF7D365DB10]
       vpsrad    xmm0,xmm0,9
       vpackssdw xmm0,xmm2,xmm0
       vpackssdw xmm1,xmm3,xmm1
       vpunpcklwd xmm2,xmm1,xmm0
       vpunpckhwd xmm0,xmm1,xmm0
       vpunpckhdq xmm1,xmm2,xmm0
       vpunpckldq xmm0,xmm2,xmm0
       vmovupd   [r8],xmm0
       vpshufd   xmm0,xmm1,4E
       vmovupd   [r9],xmm0
       ret
; Total bytes of code 127

This is a de-facto a JIT-limitation, recorded in dotnet/runtime#9989 and dotnet/runtime#38003, also noticed in #1517 (comment)

As this is quite a difference in code-gen, hence in perf it should show up too, I propose to change to typing the shuffle literals explicetely. E.g.

-Vector128<short> shuf01_p = Sse2.ShuffleHigh(row01, SimdUtils.Shuffle.MmShuffle(2, 3, 0, 1));   // or any similar pattern
+Vector128<short> shuf01_p = Sse2.ShuffleHigh(row01, 0xB1);  // MmShuffle(2, 3, 0, 1)

If OK I'd like to tackle this in one shot with #1762 (as I'm touching these pieces anyway).
Edit: I was too eager, the PR is out...

Steps to Reproduce

Look at dissassembly of any method that uses SimdUtils.Shuffle.MmShuffle.

Images

No response

The text was updated successfully, but these errors were encountered:

JimBobSquarePants · 2022-05-18T13:57:07Z

@gfoidl This has been on my mind for a long while and I was actually thinking about it the other day.

Ideally I'd like to be able to keep using MMShuffle in some manner as it helps visual the operation and helps porting.
I was thinking maybe we could generate a bunch of constants for that using either T4 or source generators. Maybe even a full collection like the following.

https://github.com/john-h-k/MathSharp/blob/706c02a979c9b7e5b55bf5503c999059a0cb2330/sources/MathSharp/Utils/ShuffleValues.cs#L20

tannergooding · 2022-05-18T14:02:09Z

Another option is to use binary literals. 0b10_11_00_01 is a bit more readable and 0xB1 but less than MmShuffle(2, 3, 0, 1)

Defining constants like public const int MmShuffle_2301 = 0b10_11_00_01 would also be viable and allow the optimization.

-- Getting the JIT to handle this properly is still on my radar, but its not "simple" and probably won't make .NET 7 unfortunately.

gfoidl · 2022-05-18T14:10:51Z

The approach with the constant looks good to me. It's self describing and the optimization kicks in.

Cf. SixLabors#2121 (comment)

gfoidl · 2022-05-18T14:39:02Z

How could things like

ImageSharp/src/ImageSharp/Common/Helpers/Shuffle/IComponentShuffle.cs

Lines 113 to 119 in c661ab1

    
           internal readonly struct WZYXShuffle4 : IShuffle4 
        
           { 
        
               public byte Control 
        
               { 
        
                   [MethodImpl(InliningOptions.ShortMethod)] 
        
                   get => SimdUtils.Shuffle.MmShuffle(0, 1, 2, 3); 
        
               }

be handled in order for the JIT to don't emit the software fallback?

Typing a literal or the constant approach (as outlined above) won't work here as it's no compile time constant.
Or falls this under the category "trade-off that has to be taken"?

tannergooding · 2022-05-18T14:55:58Z

be handled in order for the JIT to don't emit the software fallback?

This pattern could be changed to effectively expose Shuffle(op1), Shuffle(op1, op2) methods instead and those could themselves do Isa.Shuffle(op1, MmShuffle_0123) which would get optimized as expected.

The current issue in the JIT is that we are making a decision during "import" on whether or not a given call should be an intrinsic or left as a call. For most APIs this is "always intrinsic", for the handful of APIs that require a constant operand (because the underlying instruction encoding requires it), we currently just say "if the operand is constant, intrinsic; otherwise leave as call".

This misses some cases, particularly where an input becomes constant after inlining. However, enabling handling for that case requires that we either generate some "small fallback" (not feasible in many cases), emit the 256-case jump table "inline" (which would overall be worse for throughput), or update the JIT to support converting the intrinsic back into a call very late (around rationalization or lowering). The last option is the "best", but its also incredibly complicated to handle and its not bubbled up to the point where I can do that work yet.

gfoidl · 2022-05-18T15:02:15Z

@tannergooding thanks for the insights -- much appreciated!

gfoidl added the needs triage label May 18, 2022

gfoidl mentioned this issue May 18, 2022

Used inline SIMD vectors if they are constants #2122

Merged

4 tasks

JimBobSquarePants added area:performance and removed needs triage labels May 18, 2022

JimBobSquarePants added this to the 3.0.0 milestone May 18, 2022

gfoidl added a commit to gfoidl/ImageSharp that referenced this issue May 18, 2022

Switched to const by MMShuffle-pattern instead of the comment approach

e3655a9

Cf. SixLabors#2121 (comment)

JimBobSquarePants mentioned this issue Feb 21, 2023

Use compile time constants over method for all MMShuffle operations. #2365

Merged

4 tasks

JimBobSquarePants closed this as completed in #2365 Feb 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vector{128,256} operations that use MmShuffle fall back to method call #2121

Vector{128,256} operations that use MmShuffle fall back to method call #2121

gfoidl commented May 18, 2022 •

edited

Loading

JimBobSquarePants commented May 18, 2022

tannergooding commented May 18, 2022

gfoidl commented May 18, 2022

gfoidl commented May 18, 2022

tannergooding commented May 18, 2022 •

edited

Loading

gfoidl commented May 18, 2022

Vector{128,256} operations that use MmShuffle fall back to method call #2121

Vector{128,256} operations that use MmShuffle fall back to method call #2121

Comments

gfoidl commented May 18, 2022 • edited Loading

Prerequisites

ImageSharp version

Other ImageSharp packages and versions

Environment (Operating system, version and so on)

.NET Framework version

Description

Steps to Reproduce

Images

JimBobSquarePants commented May 18, 2022

tannergooding commented May 18, 2022

gfoidl commented May 18, 2022

gfoidl commented May 18, 2022

tannergooding commented May 18, 2022 • edited Loading

gfoidl commented May 18, 2022

gfoidl commented May 18, 2022 •

edited

Loading

tannergooding commented May 18, 2022 •

edited

Loading