Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vector{128,256} operations that use MmShuffle fall back to method call #2121

Closed
4 tasks done
gfoidl opened this issue May 18, 2022 · 6 comments · Fixed by #2365
Closed
4 tasks done

Vector{128,256} operations that use MmShuffle fall back to method call #2121

gfoidl opened this issue May 18, 2022 · 6 comments · Fixed by #2365

Comments

@gfoidl
Copy link
Contributor

gfoidl commented May 18, 2022

Prerequisites

  • I have written a descriptive issue title
  • I have verified that I am running the latest version of ImageSharp
  • I have verified if the problem exist in both DEBUG and RELEASE mode
  • I have searched open and closed issues to ensure it has not already been reported

ImageSharp version

Current main branch

Other ImageSharp packages and versions

none

Environment (Operating system, version and so on)

all .NET supported

.NET Framework version

all

Description

While working on #1762 I recognized that methods that use

[MethodImpl(InliningOptions.ShortMethod)]
public static byte MmShuffle(byte p3, byte p2, byte p1, byte p0)
=> (byte)((p3 << 6) | (p2 << 4) | (p1 << 2) | p0);
won't emit platform-intrinsics, rather fallback to a method call as the value isn't a constant.

E.g. Vp8Encoding.FTransformPass1SSE2 looks after inlining the vector constants like

       push      rdi
       push      rsi
       push      rbx
       sub       rsp,60
       vzeroupper
       xor       eax,eax
       mov       [rsp+50],rax
       mov       [rsp+58],rax
       mov       rsi,rdx
       mov       rdi,r8
       mov       rbx,r9
       vmovupd   xmm0,[rcx]
       lea       rcx,[rsp+40]
       vmovapd   [rsp+30],xmm0
       lea       rdx,[rsp+30]
       mov       r8d,0B1
       call      System.Runtime.Intrinsics.X86.Sse2.ShuffleHigh(System.Runtime.Intrinsics.Vector128`1<Int16>, Byte)
       vmovupd   xmm0,[rsi]
       lea       rcx,[rsp+50]
       vmovapd   [rsp+30],xmm0
       lea       rdx,[rsp+30]
       mov       r8d,0B1
       call      System.Runtime.Intrinsics.X86.Sse2.ShuffleHigh(System.Runtime.Intrinsics.Vector128`1<Int16>, Byte)
       vmovapd   xmm0,[rsp+40]
       vpunpcklqdq xmm0,xmm0,[rsp+50]
       vmovapd   xmm1,[rsp+40]
       vpunpckhqdq xmm1,xmm1,[rsp+50]
       vpaddw    xmm2,xmm0,xmm1
       vpmaddwd  xmm3,xmm2,[7FF7D3640A20]
       vpmaddwd  xmm2,xmm2,[7FF7D3640A30]
       vpsubw    xmm0,xmm0,xmm1
       vpmaddwd  xmm1,xmm0,[7FF7D3640A40]
       vpaddd    xmm1,xmm1,[7FF7D3640A50]
       vpsrad    xmm1,xmm1,9
       vpmaddwd  xmm0,xmm0,[7FF7D3640A60]
       vpaddd    xmm0,xmm0,[7FF7D3640A70]
       vpsrad    xmm0,xmm0,9
       vpackssdw xmm0,xmm1,xmm0
       vpackssdw xmm1,xmm3,xmm2
       vpunpcklwd xmm2,xmm1,xmm0
       vpunpckhwd xmm0,xmm1,xmm0
       vpunpckhdq xmm1,xmm2,xmm0
       vpunpckldq xmm0,xmm2,xmm0
       vmovupd   [rdi],xmm0
       mov       rcx,rbx
       vmovapd   [rsp+20],xmm1
       lea       rdx,[rsp+20]
       mov       r8d,4E
       call      System.Runtime.Intrinsics.X86.Sse2.Shuffle(System.Runtime.Intrinsics.Vector128`1<Int32>, Byte)
       nop
       add       rsp,60
       pop       rbx
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 245

If instead SimdUtils.Shuffle.MmShuffle the constant is given as literal, then the code boils down to:

       vzeroupper
       vpshufhw  xmm0,[rdx],0B1
       vpshufhw  xmm1,[rcx],0B1
       vpunpcklqdq xmm2,xmm1,xmm0
       vpunpckhqdq xmm0,xmm1,xmm0
       vpaddw    xmm1,xmm2,xmm0
       vpmaddwd  xmm3,xmm1,[7FF7D365DAC0]
       vpmaddwd  xmm1,xmm1,[7FF7D365DAD0]
       vpsubw    xmm0,xmm2,xmm0
       vpmaddwd  xmm2,xmm0,[7FF7D365DAE0]
       vpaddd    xmm2,xmm2,[7FF7D365DAF0]
       vpsrad    xmm2,xmm2,9
       vpmaddwd  xmm0,xmm0,[7FF7D365DB00]
       vpaddd    xmm0,xmm0,[7FF7D365DB10]
       vpsrad    xmm0,xmm0,9
       vpackssdw xmm0,xmm2,xmm0
       vpackssdw xmm1,xmm3,xmm1
       vpunpcklwd xmm2,xmm1,xmm0
       vpunpckhwd xmm0,xmm1,xmm0
       vpunpckhdq xmm1,xmm2,xmm0
       vpunpckldq xmm0,xmm2,xmm0
       vmovupd   [r8],xmm0
       vpshufd   xmm0,xmm1,4E
       vmovupd   [r9],xmm0
       ret
; Total bytes of code 127

This is a de-facto a JIT-limitation, recorded in dotnet/runtime#9989 and dotnet/runtime#38003, also noticed in #1517 (comment)

As this is quite a difference in code-gen, hence in perf it should show up too, I propose to change to typing the shuffle literals explicetely. E.g.

-Vector128<short> shuf01_p = Sse2.ShuffleHigh(row01, SimdUtils.Shuffle.MmShuffle(2, 3, 0, 1));   // or any similar pattern
+Vector128<short> shuf01_p = Sse2.ShuffleHigh(row01, 0xB1);  // MmShuffle(2, 3, 0, 1)

If OK I'd like to tackle this in one shot with #1762 (as I'm touching these pieces anyway).
Edit: I was too eager, the PR is out...

Steps to Reproduce

Look at dissassembly of any method that uses SimdUtils.Shuffle.MmShuffle.

Images

No response

@JimBobSquarePants
Copy link
Member

@gfoidl This has been on my mind for a long while and I was actually thinking about it the other day.

Ideally I'd like to be able to keep using MMShuffle in some manner as it helps visual the operation and helps porting.
I was thinking maybe we could generate a bunch of constants for that using either T4 or source generators. Maybe even a full collection like the following.

https://github.com/john-h-k/MathSharp/blob/706c02a979c9b7e5b55bf5503c999059a0cb2330/sources/MathSharp/Utils/ShuffleValues.cs#L20

@tannergooding
Copy link
Contributor

Another option is to use binary literals. 0b10_11_00_01 is a bit more readable and 0xB1 but less than MmShuffle(2, 3, 0, 1)

Defining constants like public const int MmShuffle_2301 = 0b10_11_00_01 would also be viable and allow the optimization.

-- Getting the JIT to handle this properly is still on my radar, but its not "simple" and probably won't make .NET 7 unfortunately.

@gfoidl
Copy link
Contributor Author

gfoidl commented May 18, 2022

The approach with the constant looks good to me. It's self describing and the optimization kicks in.

@gfoidl
Copy link
Contributor Author

gfoidl commented May 18, 2022

How could things like

internal readonly struct WZYXShuffle4 : IShuffle4
{
public byte Control
{
[MethodImpl(InliningOptions.ShortMethod)]
get => SimdUtils.Shuffle.MmShuffle(0, 1, 2, 3);
}
be handled in order for the JIT to don't emit the software fallback?

Typing a literal or the constant approach (as outlined above) won't work here as it's no compile time constant.
Or falls this under the category "trade-off that has to be taken"?

@tannergooding
Copy link
Contributor

tannergooding commented May 18, 2022

be handled in order for the JIT to don't emit the software fallback?

This pattern could be changed to effectively expose Shuffle(op1), Shuffle(op1, op2) methods instead and those could themselves do Isa.Shuffle(op1, MmShuffle_0123) which would get optimized as expected.

The current issue in the JIT is that we are making a decision during "import" on whether or not a given call should be an intrinsic or left as a call. For most APIs this is "always intrinsic", for the handful of APIs that require a constant operand (because the underlying instruction encoding requires it), we currently just say "if the operand is constant, intrinsic; otherwise leave as call".

This misses some cases, particularly where an input becomes constant after inlining. However, enabling handling for that case requires that we either generate some "small fallback" (not feasible in many cases), emit the 256-case jump table "inline" (which would overall be worse for throughput), or update the JIT to support converting the intrinsic back into a call very late (around rationalization or lowering). The last option is the "best", but its also incredibly complicated to handle and its not bubbled up to the point where I can do that work yet.

@gfoidl
Copy link
Contributor Author

gfoidl commented May 18, 2022

@tannergooding thanks for the insights -- much appreciated!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants