JIT: Fix placement of GT_START_NOGC for tailcalls in face of bulk copy with write barrier calls #105551

Merged 2 commits on Jul 26, 2024

jakobbotsch
Member

@jakobbotsch jakobbotsch commented Jul 26, 2024

When the JIT generates code for a tailcall, it must write the arguments into the incoming parameter area. Since the GC-ness of the tailcall's arguments may not match the GC-ness of the parameters, we have to disable GC before we start writing them. This is done by finding the earliest GT_PUTARG_STK node and placing the start of the NOGC region right before it.

In addition, there is logic to handle potential overlap between the arguments and parameters. For example, if the call has an operand that uses one of the parameters, then we must make sure we do not overwrite that parameter with the tailcall argument before that use. To do so, we may need to introduce copies from the parameter locals to locals on the stack frame.
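The overlap problem can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the JIT's actual logic: parameter slots are a Python list, argument `i` is always stored into slot `i`, and each argument simply reads one existing parameter slot. The function names are hypothetical.

```python
def store_args_naive(params, arg_sources):
    """Store argument i into slot i, reading each source slot at store time."""
    slots = list(params)
    for i, src in enumerate(arg_sources):
        slots[i] = slots[src]
    return slots


def store_args_with_temps(params, arg_sources):
    """Same as above, but first copy any slot that would be overwritten
    before it is read (src < i means slot src was already stored to)."""
    slots = list(params)
    temps = {src: slots[src] for i, src in enumerate(arg_sources) if src < i}
    for i, src in enumerate(arg_sources):
        slots[i] = temps[src] if src in temps else slots[src]
    return slots


# Tailcall that swaps its two parameters: callee(p1, p0).
params = ["a", "b"]
swapped = [1, 0]
naive = store_args_naive(params, swapped)        # slot 0 is clobbered before it is read
fixed = store_args_with_temps(params, swapped)   # temp copy preserves the old slot 0
```

In the naive version the second store reads a slot that the first store has already overwritten, which is the situation the temporary copies exist to avoid.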

This used to work fine. However, with #101761 we started transforming block copies into managed calls in certain scenarios. It was possible for the JIT to decide to introduce a copy to a local and for this transformation to then kick in. This would cause us to end up with the managed helper call after the start of the NOGC region. In checked builds this would hit an assert during GC scan; in release builds, it would end up with corrupted data.

The fix here is to make sure we insert the GT_START_NOGC after all the potential temporary copies we may introduce as part of the tailcall logic.
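The ordering requirement can be stated as a simple check over a linear node list. The sketch below is a toy model, not JIT code: nodes are plain strings, the labels other than `CORINFO_HELP_BULK_WRITEBARRIER` are made up for illustration, and `calls_inside_nogc` is a hypothetical helper.

```python
def calls_inside_nogc(lir):
    """Return the call nodes that execute after START_NOGC in a linear node list."""
    if "START_NOGC" not in lir:
        return []
    start = lir.index("START_NOGC")
    return [node for node in lir[start + 1:] if node.startswith("CALL ")]


# Before the fix: the temporary copy (turned into a helper call) lands
# inside the NOGC region.
before_fix = [
    "START_NOGC",
    "CALL CORINFO_HELP_BULK_WRITEBARRIER",
    "PUTARG_STK",
    "TAILCALL",
]

# After the fix: START_NOGC comes after every temporary copy.
after_fix = [
    "CALL CORINFO_HELP_BULK_WRITEBARRIER",
    "START_NOGC",
    "PUTARG_STK",
    "TAILCALL",
]

offending = calls_inside_nogc(before_fix)   # the helper call is flagged
clean = calls_inside_nogc(after_fix)        # nothing is flagged
```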

There was an additional assumption that the first PUTARG_STK operand was the earliest one in execution order. That is not guaranteed, so this change stops relying on that as well by introducing a new LIR::FirstNode and using that to determine the earliest PUTARG_STK node.
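As a rough illustration of why operand order is not enough, the sketch below picks the earliest of several nodes by walking the linear order. It is a simplified Python model and makes no claim about how `LIR::FirstNode` is actually implemented; the node labels and both function names are hypothetical.

```python
def first_node(lir, nodes):
    """Return whichever of `nodes` appears first in the linear order `lir`."""
    wanted = set(nodes)
    for node in lir:
        if node in wanted:
            return node
    raise ValueError("none of the given nodes are in the range")


def insert_start_nogc(lir, putarg_stks):
    """Place START_NOGC right before the earliest PUTARG_STK in execution order."""
    idx = lir.index(first_node(lir, putarg_stks))
    return lir[:idx] + ["START_NOGC"] + lir[idx:]


# The call lists its operands as #1, #2, but #2 executes first.
lir = ["COPY_TO_TEMP", "PUTARG_STK#2", "PUTARG_STK#1", "TAILCALL"]
operands = ["PUTARG_STK#1", "PUTARG_STK#2"]

earliest = first_node(lir, operands)
with_nogc = insert_start_nogc(lir, operands)
```

Taking `operands[0]` here would start the NOGC region after `PUTARG_STK#2` has already executed; walking the execution order avoids that.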

Fix #102370
Fix #104123
Fix #105441

I will backport this to preview 7. For preview 6, a workaround of setting DOTNET_TailCallOpt=0 and DOTNET_ReadyToRun=0 can be utilized.
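One way to apply the workaround when launching a process from a script is to set both variables in the child environment. This is a minimal sketch: the helper name `workaround_env` and the `MyApp.dll` path in the comment are hypothetical; only the two variable names and values come from the comment above.

```python
import os


def workaround_env(base_env=None):
    """Return a copy of the environment with the preview 6 workaround applied."""
    env = dict(os.environ if base_env is None else base_env)
    env["DOTNET_TailCallOpt"] = "0"
    env["DOTNET_ReadyToRun"] = "0"
    return env


# Example launch (not executed here; MyApp.dll is a placeholder):
#   import subprocess
#   subprocess.run(["dotnet", "MyApp.dll"], env=workaround_env(), check=True)
```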

Codegen diff in a test case:

@@ -1,99 +1,99 @@
 ; Assembly listing for method Program:Foo(System.ValueTuple`2[System.Action`1[Program+LargeStruct],Program+LargeStruct]) (Tier1)
 ; Emitting BLENDED_CODE for X64 with AVX - Unix
 ; Tier1 code
 ; optimized code
 ; rbp based frame
 ; fully interruptible
 ; Final local variable assignments
 ;
 ;  V00 arg0         [V00,T01] (  2,  2   )  struct (104) [rbp+0x10]  do-not-enreg[SF] single-def <System.ValueTuple`2[System.Action`1[Program+LargeStruct],Program+LargeStruct]>
 ;# V01 OutArgs      [V01    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;  V02 rat0         [V02,T00] (  3,  6   )  struct (104) [rbp-0x68]  do-not-enreg[SF] must-init "Fast tail call lowering is creating a new local variable" <System.ValueTuple`2[System.Action`1[Program+LargeStruct],Program+LargeStruct]>
 ;
 ; Lcl frame size = 112
 
 G_M54833_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
        push     rbp
        sub      rsp, 112
        lea      rbp, [rsp+0x70]
        xor      eax, eax
        mov      qword ptr [rbp-0x68], rax
        vxorps   xmm8, xmm8, xmm8
        vmovdqu  ymmword ptr [rbp-0x60], ymm8
        vmovdqu  ymmword ptr [rbp-0x40], ymm8
        vmovdqu  ymmword ptr [rbp-0x20], ymm8
                                                 ;; size=36 bbWeight=1 PerfScore 9.33
 G_M54833_IG02:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
-       nop
-                                                ;; size=1 bbWeight=1 PerfScore 0.25
-G_M54833_IG03:        ; bbWeight=1, nogc, extend
        lea      rdi, bword ptr [rbp-0x68]
        ; byrRegs +[rdi]
        lea      rsi, [rbp+0x10]
        mov      edx, 104
        call     [CORINFO_HELP_BULK_WRITEBARRIER]
        ; byrRegs -[rdi]
        ; gcr arg pop 0
+       nop
+                                                ;; size=20 bbWeight=1 PerfScore 4.50
+G_M54833_IG03:        ; bbWeight=1, nogc, extend
        lea      rdi, [rbp+0x10]
        lea      rsi, [rbp+0x18]
        mov      rcx, gword ptr [rsi]
        ; gcrRegs +[rcx]
        mov      gword ptr [rbp+0x10], rcx
        add      rsi, 8
        add      rdi, 8
        mov      rcx, gword ptr [rsi]
        mov      gword ptr [rbp+0x18], rcx
        add      rsi, 8
        add      rdi, 8
        mov      rcx, gword ptr [rsi]
        mov      gword ptr [rbp+0x20], rcx
        add      rsi, 8
        add      rdi, 8
        mov      rcx, gword ptr [rsi]
        mov      gword ptr [rbp+0x28], rcx
        add      rsi, 8
        add      rdi, 8
        mov      rcx, gword ptr [rsi]
        mov      gword ptr [rbp+0x30], rcx
        add      rsi, 8
        add      rdi, 8
        mov      rcx, gword ptr [rsi]
        mov      gword ptr [rbp+0x38], rcx
        add      rsi, 8
        add      rdi, 8
        mov      rcx, gword ptr [rsi]
        mov      gword ptr [rbp+0x40], rcx
        add      rsi, 8
        add      rdi, 8
        mov      rcx, gword ptr [rsi]
        mov      gword ptr [rbp+0x48], rcx
        add      rsi, 8
        add      rdi, 8
        mov      rcx, gword ptr [rsi]
        mov      gword ptr [rbp+0x50], rcx
        add      rsi, 8
        add      rdi, 8
        mov      rcx, gword ptr [rsi]
        mov      gword ptr [rbp+0x58], rcx
        add      rsi, 8
        add      rdi, 8
        mov      rcx, gword ptr [rsi]
        mov      gword ptr [rbp+0x60], rcx
        add      rsi, 8
        add      rdi, 8
        mov      rcx, gword ptr [rsi]
        mov      gword ptr [rbp+0x68], rcx
        mov      rdi, gword ptr [rbp-0x68]
        ; gcrRegs +[rdi]
        mov      rdi, gword ptr [rdi+0x08]
        mov      rax, gword ptr [rbp-0x68]
        ; gcrRegs +[rax]
-                                                ;; size=211 bbWeight=1 PerfScore 50.75
+                                                ;; size=192 bbWeight=1 PerfScore 46.50
 G_M54833_IG04:        ; bbWeight=1, epilog, nogc, extend
        add      rsp, 112
        pop      rbp
        tail.jmp [rax+0x18]System.Action`1[Program+LargeStruct]:Invoke(Program+LargeStruct):this
                                                 ;; size=9 bbWeight=1 PerfScore 2.75
 
 ; Total bytes of code 257, prolog size 36, PerfScore 63.08, instruction count 68, allocated bytes for code 257 (MethodHash=bbe929ce) for method Program:Foo(System.ValueTuple`2[System.Action`1[Program+LargeStruct],Program+LargeStruct]) (Tier1)
 ; ============================================================

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 26, 2024
@jakobbotsch
Member Author

jakobbotsch commented Jul 26, 2024

cc @dotnet/jit-contrib PTAL @EgorBo

superpmi-diffs/replay failing because the windows-arm64 collection failed (the build step timed out right before finishing). I kicked off a new run.

Diffs

Member

@AndyAyersMS AndyAyersMS left a comment

LGTM.

Any idea why the TailCallOpt config is set up how it is? Seems like we could use an enable range there to allow bisecting in cases like we've seen recently.

@hoyosjs
Member

hoyosjs commented Jul 26, 2024

/backport to release/9.0-preview7

Contributor

Started backporting to release/9.0-preview7: https://github.com/dotnet/runtime/actions/runs/10115236657

@AndyAyersMS AndyAyersMS merged commit 99c9f5b into dotnet:main Jul 26, 2024
101 of 108 checks passed
@JulieLeeMSFT JulieLeeMSFT added this to the 9.0.0 milestone Jul 26, 2024
@jakobbotsch jakobbotsch deleted the fix-102370 branch July 26, 2024 19:34
@jakobbotsch
Member Author

> LGTM.
>
> Any idea why the TailCallOpt config is set up how it is? Seems like we could use an enable range there to allow bisecting in cases like we've seen recently.

No idea.. I also don't know why we have both DOTNET_FastTailCalls and DOTNET_TailCallOpt.

For bisections related to optimizations I usually find that DOTNET_JitOnlyOptimizeRange will do the job, although sometimes it's nice to have something more fine-grained.

@github-actions github-actions bot locked and limited conversation to collaborators Aug 27, 2024