-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The new SIMD support could use some improvements #13811
Comments
The first issue is tracked by https://github.com/dotnet/coreclr/issues/17225. We don't currently supporting emitting constants for the For the second issue, there may be something that could be improved in the register allocator. We have a few issues related to extraneous movaps that are inserted for varying code patterns. |
CC. @CarolEidt, @echesakovMSFT |
@tannergooding “you need to manually create static readonly variables for the time being” it became more than 2 times slower with static readonly variables, from 2.977ms to 6.491ms (for 1M pixels). Because now the code has a lot of memory references in the body of the loop. Take a look: https://gist.github.com/Const-me/5434ced8c37f91af56f358871d92bef5 |
Ah, I see. This is a case where loop-hosting would be beneficial. I don't believe we have a bug tracking that for the Hardware Intrinsics today. |
Right. For reference, here’s C++ code which does the same thing, processing same about of data: I guess I’ll stay with C++ for a while, for code like this. |
👍 for better loop hoisting. private static readonly Vector128<byte> myConst = Vector128.Create((byte)42);
static int Foo1(Vector128<byte> a, Vector128<byte> ugly_const)
{
return Sse2.MoveMask(Sse2.Add(a, ugly_const));
}
static int Foo2(Vector128<byte> v, int n)
{
int result = 0;
var local = myConst;
for (int i = 0; i < n; i++)
{
result += Foo1(v, local);
}
return result;
} But ideally could be just (with better loop hoisting): private static readonly Vector128<byte> myConst = Vector128.Create((byte)42);
static int Foo1(Vector128<byte> a)
{
return Sse2.MoveMask(Sse2.Add(a, myConst));
}
static int Foo2(Vector128<byte> v, int n)
{
int result = 0;
for (int i = 0; i < n; i++)
{
result += Foo1(v);
}
return result;
} |
OK so my initial 8x difference seems to be unrelated to any of that. It’s just memory access costs. Here’s result for larger arrays:
.NET source: I know 2 possible explanations.
|
I had an error in my benchmark. It included time it took JIT to make the x86 code. Here’s corrected numbers.
As you see, C# performance is pretty comparable to C++, which is awesome. But this is only true when manually loading all the invariants outside of the loop, and passing them as methods arguments. Here’s the complete code: https://gist.github.com/Const-me/3cb028a44abe2f207d147c167563b297 For SIMD algorithms more complex than this one, the requirement complicates code a lot, unfortunately. Please, consider doing something in the runtime (or maybe in the compiler) to fix the performance of the original straightforward version. |
We now do a bit of hoisting but the invariant bits are fragmented across statements and it appears the hoisting algorithm is not set up for this. For example we hoist the
So one follow-up here is to examine why the jit is unable to hoist a statement whose only loop dependence is on an earlier statement that is loop invariant. We could also look at changing things upstream so the trees are not fragmented. Fragmentation comes from C# and the IL, where we see st/ld pairs in some of the paths, eg runtime/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs Lines 262 to 268 in 543037d
produces a st/load into
A bit surprised roslyn doesn't optimize away this local, perhaps stores to locals are considered observable and so don't get elided. We could also try fixing in source via nested calls, or see if the jit can peephole stloc/ldloc if we know the ldloc is the only use. Since we do an IL prescan we could potentially know this. So also worth following up on a potential importer opt to possibly reduce temp usage. I am going to keep this in 5.0 as it impacts HW intrinsic CQ. May even have broader impact on CQ if we look at importer temp elimination or fixing hoisting to handle cases like this. |
@AndyAyersMS I'm going to move this out of 5.0 since it doesn't look like we're going to be able to get to it at this point. |
Moving to future as it looks like hoisting of |
Latest assembly code produced: ; Assembly listing for method ConsoleApp15._13811:convertToGrayscale(long,long,int)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 14 single block inlinees; 0 inlinees without PGO data
; invoked as altjit
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 10, 28 ) long -> rcx
; V01 arg1 [V01,T02] ( 5, 14 ) long -> rdx
; V02 arg2 [V02,T04] ( 3, 3 ) int -> r8 single-def
; V03 loc0 [V03,T03] ( 3, 6 ) long -> rax single-def
; V04 loc1 [V04,T10] ( 4, 16 ) simd16 -> registers ld-addr-op
; V05 loc2 [V05,T11] ( 4, 16 ) simd16 -> registers ld-addr-op
; V06 loc3 [V06,T12] ( 4, 16 ) simd16 -> registers ld-addr-op
; V07 loc4 [V07,T42] ( 2, 8 ) simd16 -> mm6
;* V08 loc5 [V08 ] ( 0, 0 ) simd16 -> zero-ref
;# V09 OutArgs [V09 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
; V10 tmp1 [V10,T17] ( 2, 16 ) simd16 -> mm5 "impAppendStmt"
; V11 tmp2 [V11,T13] ( 4, 16 ) simd16 -> mm5 "Inline stloc first use temp"
; V12 tmp3 [V12,T14] ( 4, 16 ) simd16 -> mm6 "Inline stloc first use temp"
;* V13 tmp4 [V13 ] ( 0, 0 ) long -> zero-ref "Inlining Arg"
; V14 tmp5 [V14,T32] ( 3, 12 ) simd16 -> mm7 "Inline stloc first use temp"
; V15 tmp6 [V15,T18] ( 2, 16 ) simd16 -> mm8 "Inlining Arg"
; V16 tmp7 [V16,T19] ( 2, 16 ) simd16 -> mm7 "Inlining Arg"
; V17 tmp8 [V17,T33] ( 3, 12 ) simd16 -> mm8 "Inline stloc first use temp"
; V18 tmp9 [V18,T20] ( 2, 16 ) simd16 -> mm9 "Inlining Arg"
; V19 tmp10 [V19,T21] ( 2, 16 ) simd16 -> mm8 "Inlining Arg"
; V20 tmp11 [V20,T05] ( 4, 32 ) simd16 -> mm5 "Inlining Arg"
; V21 tmp12 [V21,T06] ( 4, 32 ) simd16 -> mm6 "Inlining Arg"
; V22 tmp13 [V22,T34] ( 3, 12 ) simd16 -> mm9 "Inline stloc first use temp"
; V23 tmp14 [V23,T22] ( 2, 16 ) simd16 -> mm7 "Inlining Arg"
; V24 tmp15 [V24,T23] ( 2, 16 ) simd16 -> mm8 "Inlining Arg"
; V25 tmp16 [V25,T24] ( 2, 16 ) simd16 -> mm5 "Inlining Arg"
; V26 tmp17 [V26,T01] ( 3, 24 ) long -> r8 "Inlining Arg"
; V27 tmp18 [V27,T15] ( 4, 16 ) simd16 -> mm6 "Inline stloc first use temp"
; V28 tmp19 [V28,T16] ( 4, 16 ) simd16 -> mm7 "Inline stloc first use temp"
;* V29 tmp20 [V29 ] ( 0, 0 ) long -> zero-ref "Inlining Arg"
; V30 tmp21 [V30,T35] ( 3, 12 ) simd16 -> mm8 "Inline stloc first use temp"
; V31 tmp22 [V31,T25] ( 2, 16 ) simd16 -> mm9 "Inlining Arg"
; V32 tmp23 [V32,T26] ( 2, 16 ) simd16 -> mm8 "Inlining Arg"
; V33 tmp24 [V33,T36] ( 3, 12 ) simd16 -> mm9 "Inline stloc first use temp"
; V34 tmp25 [V34,T27] ( 2, 16 ) simd16 -> mm10 "Inlining Arg"
; V35 tmp26 [V35,T28] ( 2, 16 ) simd16 -> mm9 "Inlining Arg"
; V36 tmp27 [V36,T07] ( 4, 32 ) simd16 -> mm6 "Inlining Arg"
; V37 tmp28 [V37,T08] ( 4, 32 ) simd16 -> mm7 "Inlining Arg"
; V38 tmp29 [V38,T37] ( 3, 12 ) simd16 -> mm10 "Inline stloc first use temp"
; V39 tmp30 [V39,T29] ( 2, 16 ) simd16 -> mm7 "Inlining Arg"
; V40 tmp31 [V40,T30] ( 2, 16 ) simd16 -> mm8 "Inlining Arg"
; V41 tmp32 [V41,T31] ( 2, 16 ) simd16 -> mm6 "Inlining Arg"
; V42 cse0 [V42,T09] ( 5, 17 ) simd16 -> mm1 "CSE - moderate"
; V43 cse1 [V43,T38] ( 3, 9 ) simd16 -> mm0 "CSE - moderate"
; V44 cse2 [V44,T39] ( 3, 9 ) simd16 -> mm2 "CSE - moderate"
; V45 cse3 [V45,T40] ( 3, 9 ) simd16 -> mm3 "CSE - moderate"
; V46 cse4 [V46,T41] ( 3, 9 ) simd16 -> mm4 "CSE - moderate"
;
; Lcl frame size = 88
G_M6610_IG01:
sub rsp, 88
vzeroupper
vmovaps qword ptr [rsp+40H], xmm6
vmovaps qword ptr [rsp+30H], xmm7
vmovaps qword ptr [rsp+20H], xmm8
vmovaps qword ptr [rsp+10H], xmm9
vmovaps qword ptr [rsp], xmm10
;; size=41 bbWeight=4 PerfScore 45.00
G_M6610_IG02:
shl r8d, 2
movsxd rax, r8d
add rax, rcx
cmp rcx, rax
jae G_M6610_IG04
vmovupd xmm0, xmmword ptr [reloc @RWD00]
vmovupd xmm1, xmmword ptr [reloc @RWD16]
vmovupd xmm2, xmmword ptr [reloc @RWD32]
vmovupd xmm3, xmmword ptr [reloc @RWD48]
vmovupd xmm4, xmmword ptr [reloc @RWD64]
align [0 bytes for IG03]
;; size=64 bbWeight=1 PerfScore 17.25
G_M6610_IG03:
vmovdqu xmm5, xmmword ptr [rcx]
vmovdqu xmm6, xmmword ptr [rcx+16]
vmovaps xmm7, xmm0
vpand xmm8, xmm5, xmm7
vpand xmm7, xmm6, xmm7
vpackusdw xmm7, xmm8, xmm7
vpslldq xmm7, xmm7, 1
vmovaps xmm8, xmm1
vpand xmm9, xmm5, xmm8
vpand xmm8, xmm6, xmm8
vpackusdw xmm8, xmm9, xmm8
vpsrldq xmm5, xmm5, 1
vpsrldq xmm6, xmm6, 1
vmovaps xmm9, xmm1
vpand xmm5, xmm5, xmm9
vpand xmm6, xmm6, xmm9
vpackusdw xmm5, xmm5, xmm6
vpmulhuw xmm7, xmm7, xmm2
vpmulhuw xmm8, xmm8, xmm3
vpmulhuw xmm5, xmm5, xmm4
vpaddusw xmm6, xmm7, xmm8
vpaddusw xmm5, xmm6, xmm5
vpsrlw xmm5, xmm5, 8
lea r8, [rcx+32]
vmovdqu xmm6, xmmword ptr [r8]
vmovdqu xmm7, xmmword ptr [r8+16]
vmovaps xmm8, xmm0
vpand xmm9, xmm6, xmm8
vpand xmm8, xmm7, xmm8
vpackusdw xmm8, xmm9, xmm8
vpslldq xmm8, xmm8, 1
vmovaps xmm9, xmm1
vpand xmm10, xmm6, xmm9
vpand xmm9, xmm7, xmm9
vpackusdw xmm9, xmm10, xmm9
vpsrldq xmm6, xmm6, 1
vpsrldq xmm7, xmm7, 1
vmovaps xmm10, xmm1
vpand xmm6, xmm6, xmm10
vpand xmm7, xmm7, xmm10
vpackusdw xmm6, xmm6, xmm7
vpmulhuw xmm7, xmm8, xmm2
vpmulhuw xmm8, xmm9, xmm3
vpmulhuw xmm6, xmm6, xmm4
vpaddusw xmm7, xmm7, xmm8
vpaddusw xmm6, xmm7, xmm6
vpsrlw xmm6, xmm6, 8
vpackuswb xmm5, xmm5, xmm6
vmovdqu xmmword ptr [rdx], xmm5
add rcx, 64
add rdx, 16
cmp rcx, rax
jb G_M6610_IG03
;; size=271 bbWeight=4 PerfScore 264.33
G_M6610_IG04:
vmovaps xmm6, qword ptr [rsp+40H]
vmovaps xmm7, qword ptr [rsp+30H]
vmovaps xmm8, qword ptr [rsp+20H]
vmovaps xmm9, qword ptr [rsp+10H]
vmovaps xmm10, qword ptr [rsp]
add rsp, 88
ret
;; size=39 bbWeight=1 PerfScore 21.25
RWD00 dq 000000FF000000FFh, 000000FF000000FFh
RWD16 dq 0000FF000000FF00h, 0000FF000000FF00h
RWD32 dq 4C854C854C854C85h, 4C854C854C854C85h
RWD48 dq 962C962C962C962Ch, 962C962C962C962Ch
RWD64 dq 1D4E1D4E1D4E1D4Eh, 1D4E1D4E1D4E1D4Eh
; Total bytes of code 373, prolog size 36, PerfScore 389.33, instruction count 78, allocated bytes for code 415 (MethodHash=8dbde62d) for method ConsoleApp15._13811:convertToGrayscale(long,long,int)
; ============================================================
|
Here's the source code of a function which computes brightness of RGBA images. Very typical SIMD code for image processing I traditionally do in C++. Decided to try the new C# for a change.
First of all, nice job inlining the code. The complete
convertToGrayscale
is compiled into a single function without any calls.But there’s still room for improvements. When looking at the disassembly in VS2019 debugger, I noticed 2 things.
The code only uses first 6 out of 16 available vector registers. There're enough unused registers for all the invariant constants involved in the algorithm.
If you’ll manage to apply these 2 optimizations, I would expect up to 2x speedup, because it will be twice as few instructions in the body of the loop.
category:cq
theme:loop-opt
skill-level:expert
cost:large
The text was updated successfully, but these errors were encountered: