-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: Evaluate GDV call args early #75634
Conversation
The normal evaluation order for a callvirt is the following: 1. The 'this' arg is evaluated 2. The arguments are evaluated 3. 'this' is null-checked 4. The call is performed Step 1 and 2 happen as part of the IL instructions that load the arguments, while step 3 and 4 happen as part of the callvirt IL instruction. For GDV the guards needs to dereference 'this'. We were doing this too early, causing step 3 to happen before step 2. The fix is to spill all arguments for GDVs to temps. Fix dotnet#75607
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsThe normal evaluation order for a callvirt is the following:
Step 1 and 2 happen as part of the IL instructions that load the For GDV the guards need to dereference 'this'. We were doing this too Fix #75607
|
Mostly positive diffs since we avoid generating code twice to evaluate these args, and inlining normally does not do much with args that have side effects anyway. Since the critical path does not get shorter these improvements probably do not translate to anything TP wise. There's a few large regressions that I need to drill into tomorrow. |
} | ||
else | ||
{ | ||
spill = (argNode->gtFlags & GTF_SIDE_EFFECT) != 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't sufficient. We also have to spill all args with GTF_GLOB_REF
before we spill these. Unfortunately I don't think GTF_GLOB_REF
is accurate at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
((argNode->gtFlags & GTF_GLOB_REF) != 0) || gtHasLocalsWithAddrOp(argNode)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, unfortunately I suppose that's going to be on the expensive side TP and CQ wise, but I'll try it out.
Arguments with global references or order side effects may need to happen before a later argument with side effects is evaluated. Since we do not have accurate GTF_GLOB_REF at this point use gtHasLocalsWithAddrOp which does a full tree walk checking for lvHasLdAddrOp.
If we have struct args with side effects, do we now end up making extra copies in some places? Perhaps we have this problem already with GDV since the "last use" struct copy avoidance in morph is so simplistic. On the other hand, the diffs suggest that forcing these spills actually helps in most cases. I suppose this relates to #75751? |
When I spot checked the diffs it's because the change means we do not have to generate code to evaluate arguments twice anymore. This does make the code smaller, but it does not make the critical path shorter (and in some cases may lead to extra shuffling).
Yes, there are some cases of this, and it can be excacerbated due to promotion. For example in private static void SwapIfGreater(Span<T> keys, Comparison<T> comparer, int i, int j)
{
Debug.Assert(i != j);
if (comparer(keys[i], keys[j]) > 0)
{
T key = keys[i];
keys[i] = keys[j];
keys[j] = key;
}
} Base:; Assembly listing for method System.Collections.Generic.ArraySortHelper`1[System.ValueTuple`2[ushort,ushort]]:SwapIfGreater(System.Span`1[System.ValueTuple`2[ushort,ushort]],System.Comparison`1[System.ValueTuple`2[ushort,ushort]],int,int)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; partially interruptible
; with Dynamic PGO: edge weights are valid, and fgCalledCount is 51
; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 4, 8 ) byref -> rcx ld-addr-op single-def
; V01 arg1 [V01,T03] ( 4, 3 ) ref -> rdx class-hnd single-def
; V02 arg2 [V02,T01] ( 6, 4 ) int -> r8 single-def
; V03 arg3 [V03,T02] ( 6, 4 ) int -> r9 single-def
;* V04 loc0 [V04 ] ( 0, 0 ) struct ( 8) zero-ref
; V05 OutArgs [V05 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
; V06 tmp1 [V06,T12] ( 3, 2 ) int -> r8 "guarded devirt return temp"
; V07 tmp2 [V07,T13] ( 3, 2 ) long -> r11 "guarded devirt call target temp"
;* V08 tmp3 [V08 ] ( 0, 0 ) ref -> zero-ref class-hnd single-def "guarded devirt this exact temp"
;* V09 tmp4 [V09 ] ( 0, 0 ) struct ( 8) zero-ref ld-addr-op "Inlining Arg"
;* V10 tmp5 [V10 ] ( 0, 0 ) struct ( 8) zero-ref "Inlining Arg"
;* V11 tmp6 [V11 ] ( 0, 0 ) ushort -> zero-ref "Inlining Arg"
; V12 tmp7 [V12,T10] ( 5, 3 ) byref -> rax single-def V20._reference(offs=0x00) P-INDEP "field V00._reference (fldOffset=0x0)"
; V13 tmp8 [V13,T11] ( 5, 3 ) int -> r10 V20._length(offs=0x08) P-INDEP "field V00._length (fldOffset=0x8)"
; V14 tmp9 [V14,T16] ( 2, 1.02) ushort -> rdx V04.Item1(offs=0x00) P-INDEP "field V04.Item1 (fldOffset=0x0)"
; V15 tmp10 [V15,T17] ( 2, 1.02) ushort -> rcx V04.Item2(offs=0x02) P-INDEP "field V04.Item2 (fldOffset=0x2)"
; V16 tmp11 [V16,T14] ( 2, 2 ) ushort -> r8 V09.Item1(offs=0x00) P-INDEP "field V09.Item1 (fldOffset=0x0)"
;* V17 tmp12 [V17 ] ( 0, 0 ) ushort -> zero-ref V09.Item2(offs=0x02) P-INDEP "field V09.Item2 (fldOffset=0x2)"
; V18 tmp13 [V18,T15] ( 2, 2 ) ushort -> rcx V10.Item1(offs=0x00) P-INDEP "field V10.Item1 (fldOffset=0x0)"
;* V19 tmp14 [V19 ] ( 0, 0 ) ushort -> zero-ref V10.Item2(offs=0x02) P-INDEP "field V10.Item2 (fldOffset=0x2)"
;* V20 tmp15 [V20 ] ( 0, 0 ) struct (16) zero-ref "Promoted implicit byref"
; V21 tmp16 [V21,T04] ( 2, 4 ) byref -> r8 single-def "BlockOp address local"
; V22 tmp17 [V22,T05] ( 2, 4 ) byref -> rax single-def "BlockOp address local"
; V23 tmp18 [V23,T18] ( 2, 0 ) ref -> rcx single-def "argument with side effect"
; V24 tmp19 [V24,T19] ( 2, 0 ) int -> rdx "argument with side effect"
; V25 tmp20 [V25,T06] ( 3, 3.06) byref -> rcx single-def "BlockOp address local"
; V26 tmp21 [V26,T07] ( 3, 3.06) byref -> rdi single-def "BlockOp address local"
; V27 cse0 [V27,T08] ( 6, 3.02) byref -> rsi "CSE - aggressive"
; V28 cse1 [V28,T09] ( 6, 3.02) byref -> rdi "CSE - moderate"
;
; Lcl frame size = 40
G_M40445_IG01: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
push rdi
push rsi
sub rsp, 40
;; size=6 bbWeight=1 PerfScore 2.25
G_M40445_IG02: ; gcrefRegs=00000004 {rdx}, byrefRegs=00000002 {rcx}, byref, isz
; gcrRegs +[rdx]
; byrRegs +[rcx]
mov rax, bword ptr [rcx]
; byrRegs +[rax]
mov r10d, dword ptr [rcx+08H]
mov r11, qword ptr [rdx+18H]
mov rcx, 0xD1FFAB1E ; code for <>c:<Canonicalize>b__138_0
; byrRegs -[rcx]
cmp r11, rcx
jne SHORT G_M40445_IG06
cmp r8d, r10d
jae SHORT G_M40445_IG07
mov edx, r8d
; gcrRegs -[rdx]
lea rsi, bword ptr [rax+4*rdx]
; byrRegs +[rsi]
mov r8, rsi
; byrRegs +[r8]
movzx r8, word ptr [r8]
; byrRegs -[r8]
cmp r9d, r10d
jae SHORT G_M40445_IG07
mov r10d, r9d
lea rdi, bword ptr [rax+4*r10]
; byrRegs +[rdi]
mov rax, rdi
movzx rcx, word ptr [rax]
sub r8d, ecx
;; size=66 bbWeight=1 PerfScore 16.25
G_M40445_IG03: ; gcrefRegs=00000000 {}, byrefRegs=000000C0 {rsi rdi}, byref, isz
; byrRegs -[rax]
test r8d, r8d
jle SHORT G_M40445_IG05
;; size=5 bbWeight=1 PerfScore 1.25
G_M40445_IG04: ; gcrefRegs=00000000 {}, byrefRegs=000000C0 {rsi rdi}, byref
mov rcx, rsi
; byrRegs +[rcx]
movzx rdx, word ptr [rcx]
movzx rcx, word ptr [rcx+02H]
; byrRegs -[rcx]
mov r9d, dword ptr [rdi]
mov dword ptr [rsi], r9d
mov word ptr [rdi], dx
mov word ptr [rdi+02H], cx
;; size=23 bbWeight=0.51 PerfScore 4.72
G_M40445_IG05: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
; byrRegs -[rsi rdi]
add rsp, 40
pop rsi
pop rdi
ret
;; size=7 bbWeight=1 PerfScore 2.25
G_M40445_IG06: ; gcVars=0000000000000000 {}, gcrefRegs=00000004 {rdx}, byrefRegs=00000001 {rax}, gcvars, byref, isz
; gcrRegs +[rdx]
; byrRegs +[rax]
mov rcx, gword ptr [rdx+08H]
; gcrRegs +[rcx]
cmp r8d, r10d
jae SHORT G_M40445_IG07
mov edx, r8d
; gcrRegs -[rdx]
lea rsi, bword ptr [rax+4*rdx]
; byrRegs +[rsi]
mov edx, dword ptr [rsi]
cmp r9d, r10d
jae SHORT G_M40445_IG07
mov r8d, r9d
lea rdi, bword ptr [rax+4*r8]
; byrRegs +[rdi]
mov r8d, dword ptr [rdi]
call r11
; gcrRegs -[rcx]
; byrRegs -[rax]
; gcr arg pop 0
mov r8d, eax
jmp SHORT G_M40445_IG03
;; size=41 bbWeight=0 PerfScore 0.00
G_M40445_IG07: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
; byrRegs -[rsi rdi]
call CORINFO_HELP_RNGCHKFAIL
; gcr arg pop 0
int3
;; size=6 bbWeight=0 PerfScore 0.00
; Total bytes of code 154, prolog size 6, PerfScore 42.12, instruction count 51, allocated bytes for code 154 (MethodHash=fc1a6202) for method System.Collections.Generic.ArraySortHelper`1[System.ValueTuple`2[ushort,ushort]]:SwapIfGreater(System.Span`1[System.ValueTuple`2[ushort,ushort]],System.Comparison`1[System.ValueTuple`2[ushort,ushort]],int,int)
; ============================================================ Diff:; Assembly listing for method System.Collections.Generic.ArraySortHelper`1[System.ValueTuple`2[ushort,ushort]]:SwapIfGreater(System.Span`1[System.ValueTuple`2[ushort,ushort]],System.Comparison`1[System.ValueTuple`2[ushort,ushort]],int,int)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; partially interruptible
; with Dynamic PGO: edge weights are valid, and fgCalledCount is 51
; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 4, 8 ) byref -> rcx ld-addr-op single-def
; V01 arg1 [V01,T05] ( 4, 3 ) ref -> rdx class-hnd single-def
; V02 arg2 [V02,T01] ( 4, 4 ) int -> r8 single-def
; V03 arg3 [V03,T02] ( 4, 4 ) int -> r9 single-def
;* V04 loc0 [V04 ] ( 0, 0 ) struct ( 8) zero-ref
; V05 OutArgs [V05 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
; V06 tmp1 [V06,T12] ( 3, 2 ) int -> r10 "guarded devirt return temp"
;* V07 tmp2 [V07 ] ( 0, 0 ) struct ( 8) zero-ref "guarded devirt arg temp"
;* V08 tmp3 [V08 ] ( 0, 0 ) struct ( 8) zero-ref "guarded devirt arg temp"
; V09 tmp4 [V09,T13] ( 3, 2 ) long -> r9 "guarded devirt call target temp"
;* V10 tmp5 [V10 ] ( 0, 0 ) ref -> zero-ref class-hnd single-def "guarded devirt this exact temp"
;* V11 tmp6 [V11 ] ( 0, 0 ) struct ( 8) zero-ref ld-addr-op "Inlining Arg"
;* V12 tmp7 [V12 ] ( 0, 0 ) ushort -> zero-ref "Inlining Arg"
; V13 tmp8 [V13,T10] ( 3, 3 ) byref -> rax single-def V23._reference(offs=0x00) P-INDEP "field V00._reference (fldOffset=0x0)"
; V14 tmp9 [V14,T11] ( 3, 3 ) int -> rcx V23._length(offs=0x08) P-INDEP "field V00._length (fldOffset=0x8)"
; V15 tmp10 [V15,T17] ( 2, 1.02) ushort -> rax V04.Item1(offs=0x00) P-INDEP "field V04.Item1 (fldOffset=0x0)"
; V16 tmp11 [V16,T18] ( 2, 1.02) ushort -> rcx V04.Item2(offs=0x02) P-INDEP "field V04.Item2 (fldOffset=0x2)"
; V17 tmp12 [V17,T14] ( 3, 2 ) ushort -> r10 V07.Item1(offs=0x00) P-INDEP "field V07.Item1 (fldOffset=0x0)"
; V18 tmp13 [V18,T19] ( 2, 1 ) ushort -> r8 V07.Item2(offs=0x02) P-INDEP "field V07.Item2 (fldOffset=0x2)"
; V19 tmp14 [V19,T15] ( 3, 2 ) ushort -> rax V08.Item1(offs=0x00) P-INDEP "field V08.Item1 (fldOffset=0x0)"
; V20 tmp15 [V20,T20] ( 2, 1 ) ushort -> rcx V08.Item2(offs=0x02) P-INDEP "field V08.Item2 (fldOffset=0x2)"
; V21 tmp16 [V21,T16] ( 2, 2 ) ushort -> r10 V11.Item1(offs=0x00) P-INDEP "field V11.Item1 (fldOffset=0x0)"
;* V22 tmp17 [V22 ] ( 0, 0 ) ushort -> zero-ref V11.Item2(offs=0x02) P-INDEP "field V11.Item2 (fldOffset=0x2)"
;* V23 tmp18 [V23 ] ( 0, 0 ) struct (16) zero-ref "Promoted implicit byref"
; V24 tmp19 [V24,T03] ( 3, 6 ) byref -> r8 single-def "BlockOp address local"
; V25 tmp20 [V25,T04] ( 3, 6 ) byref -> rcx single-def "BlockOp address local"
; V26 tmp21 [V26,T21] ( 3, 0 ) struct ( 8) [rsp+30H] do-not-enreg[SF] "by-value struct argument"
; V27 tmp22 [V27,T22] ( 3, 0 ) struct ( 8) [rsp+28H] do-not-enreg[SF] "by-value struct argument"
; V28 tmp23 [V28,T23] ( 2, 0 ) ref -> rdx single-def "argument with side effect"
; V29 tmp24 [V29,T06] ( 3, 3.06) byref -> r8 single-def "BlockOp address local"
; V30 tmp25 [V30,T07] ( 3, 3.06) byref -> rdi single-def "BlockOp address local"
; V31 cse0 [V31,T08] ( 4, 3.02) byref -> rsi "CSE - aggressive"
; V32 cse1 [V32,T09] ( 4, 3.02) byref -> rdi "CSE - moderate"
;
; Lcl frame size = 56
G_M40445_IG01: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
push rdi
push rsi
sub rsp, 56
;; size=6 bbWeight=1 PerfScore 2.25
G_M40445_IG02: ; gcrefRegs=00000004 {rdx}, byrefRegs=00000002 {rcx}, byref, isz
; gcrRegs +[rdx]
; byrRegs +[rcx]
mov rax, bword ptr [rcx]
; byrRegs +[rax]
mov ecx, dword ptr [rcx+08H]
; byrRegs -[rcx]
cmp r8d, ecx
jae G_M40445_IG07
mov r8d, r8d
lea rsi, bword ptr [rax+4*r8]
; byrRegs +[rsi]
mov r8, rsi
; byrRegs +[r8]
movzx r10, word ptr [r8]
movzx r8, word ptr [r8+02H]
; byrRegs -[r8]
cmp r9d, ecx
jae G_M40445_IG07
mov ecx, r9d
lea rdi, bword ptr [rax+4*rcx]
; byrRegs +[rdi]
mov rcx, rdi
; byrRegs +[rcx]
movzx rax, word ptr [rcx]
; byrRegs -[rax]
movzx rcx, word ptr [rcx+02H]
; byrRegs -[rcx]
mov r9, qword ptr [rdx+18H]
mov r11, 0xD1FFAB1E ; code for <>c:<Canonicalize>b__138_0
cmp r9, r11
jne SHORT G_M40445_IG06
movzx r10, r10w
movzx rdx, ax
; gcrRegs -[rdx]
sub r10d, edx
;; size=89 bbWeight=1 PerfScore 20.75
G_M40445_IG03: ; gcrefRegs=00000000 {}, byrefRegs=000000C0 {rsi rdi}, byref, isz
test r10d, r10d
jle SHORT G_M40445_IG05
;; size=5 bbWeight=1 PerfScore 1.25
G_M40445_IG04: ; gcrefRegs=00000000 {}, byrefRegs=000000C0 {rsi rdi}, byref
mov r8, rsi
; byrRegs +[r8]
movzx rax, word ptr [r8]
movzx rcx, word ptr [r8+02H]
mov edx, dword ptr [rdi]
mov dword ptr [rsi], edx
mov word ptr [rdi], ax
mov word ptr [rdi+02H], cx
;; size=23 bbWeight=0.51 PerfScore 4.72
G_M40445_IG05: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
; byrRegs -[rsi rdi r8]
add rsp, 56
pop rsi
pop rdi
ret
;; size=7 bbWeight=1 PerfScore 2.25
G_M40445_IG06: ; gcVars=0000000000000000 {}, gcrefRegs=00000004 {rdx}, byrefRegs=000000C0 {rsi rdi}, gcvars, byref, isz
; gcrRegs +[rdx]
; byrRegs +[rsi rdi]
mov rdx, gword ptr [rdx+08H]
mov word ptr [rsp+30H], r10w
mov word ptr [rsp+32H], r8w
mov word ptr [rsp+28H], ax
mov word ptr [rsp+2AH], cx
mov rcx, rdx
; gcrRegs +[rcx]
mov edx, dword ptr [rsp+30H]
; gcrRegs -[rdx]
mov r8d, dword ptr [rsp+28H]
call r9
; gcrRegs -[rcx]
; gcr arg pop 0
mov r10d, eax
jmp SHORT G_M40445_IG03
;; size=46 bbWeight=0 PerfScore 0.00
G_M40445_IG07: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
; byrRegs -[rsi rdi]
call CORINFO_HELP_RNGCHKFAIL
; gcr arg pop 0
int3
;; size=6 bbWeight=0 PerfScore 0.00
; Total bytes of code 182, prolog size 6, PerfScore 49.42, instruction count 52, allocated bytes for code 182 (MethodHash=fc1a6202) for method System.Collections.Generic.ArraySortHelper`1[System.ValueTuple`2[ushort,ushort]]:SwapIfGreater(System.Span`1[System.ValueTuple`2[ushort,ushort]],System.Comparison`1[System.ValueTuple`2[ushort,ushort]],int,int)
; ============================================================ Notice how we have to reassemble the struct on stack in the fallback case too. We could potentially check if all argument side effects were NullReferenceException and skip the new logic in these cases. It is probably a quite common case (though would not help this particular case, we were actually reordering a NullReferenceException with an IndexOutOfRangeException before). |
@jakobbotsch this is ready to merge, right? |
I want to check if #76671 and #76017 fix some of the regressions, so I'm waiting to get those in first. |
/azp run runtime-coreclr pgo, runtime-coreclr libraries-pgo |
Azure Pipelines successfully started running 2 pipeline(s). |
The normal evaluation order for a callvirt is the following:
Step 1 and 2 happen as part of the IL instructions that load the
arguments, while step 3 and 4 happen as part of the callvirt IL
instruction.
For GDV the guards need to dereference 'this'. We were doing this too
early, causing step 3 to happen before step 2. The fix is to spill
arguments for GDVs more aggressively when side effects are encountered.
Fix #75607