Don't clone a large tree in impRuntimeLookupToTree #81472

EgorBo · 2023-02-01T11:29:16Z

Don't clone the whole tree here (it ends up being CSE'd anyway) in importer, e.g.:

ghost · 2023-02-01T11:29:27Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

Don't clone the whole tree here (it ends up being CSE'd anyway) in importer, e.g.:

Author:	EgorBo
Assignees:	EgorBo
Labels:	`area-CodeGen-coreclr`
Milestone:	-

EgorBo · 2023-02-01T13:32:35Z

no TP diffs

EgorBo · 2023-02-06T11:09:25Z

Thanks to @jakobbotsch's fix for SPMI replay, I do see diffs now. I don't know when I'll land #81635 so I suggest we merge this PR. It seems to show some nice diffs and TP improvements: https://dev.azure.com/dnceng-public/public/_build/results?buildId=160683&view=ms.vss-build-web.run-extensions-tab e.g. -22kb for aspnet collection

PTAL @dotnet/jit-contrib

jakobbotsch · 2023-02-06T11:23:58Z

Why does CSE not get these cases? Is the 'use' here in a cold block so CSE does not want to increase register pressure for the benefit of the cold block? (related: #75253)

EgorBo · 2023-02-06T11:29:57Z

Why does CSE not get these cases? Is the 'use' here in a cold block so CSE does not want to increase register pressure for the benefit of the cold block? (related: #75253)

use's block is always taken if def's block is not cold. It's:

if ((tmp = IND(X)) !=0)
    return tmp;
else
    return HELPER(); // expected to always be cold (has artificial weight of 0.2)

EgorBo · 2023-02-06T11:33:08Z

Another potential optimization here is to never expand runtime lookups in cold blocks at all - I was thinking about it but hopefully I'll do that in #81635

jakobbotsch · 2023-02-06T11:33:59Z

Then I'd be interested to see why CSE doesn't get this case for us, it would probably avoid some of the regressions I see in the diffs, e.g.:

@@ -8,10 +8,11 @@
 ; Final local variable assignments
 ;
 ;  V00 this         [V00,T00] (  3,  3   )   byref  ->  rcx         this single-def
-;* V01 TypeCtx      [V01    ] (  0,  0   )    long  ->  zero-ref    single-def
+;  V01 TypeCtx      [V01,T01] (  3,  3   )    long  ->  rdx         single-def
 ;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
-;* V03 tmp1         [V03    ] (  0,  0   )   byref  ->  zero-ref    "bubbling QMark1"
-;* V04 tmp2         [V04    ] (  0,  0   )    long  ->  zero-ref    "spilling Runtime Lookup tree"
+;  V03 tmp1         [V03,T02] (  2,  4   )   byref  ->  rcx         single-def "bubbling QMark1"
+;* V04 tmp2         [V04    ] (  0,  0   )    long  ->  zero-ref    "fgMakeTemp is creating a new local variable"
+;* V05 tmp3         [V05    ] (  0,  0   )    long  ->  zero-ref    "spilling Runtime Lookup tree"
 ;
 ; Lcl frame size = 0
 
@@ -19,14 +20,15 @@ G_M20711_IG01:        ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000000 {}
 						;; size=0 bbWeight=1 PerfScore 0.00
 G_M20711_IG02:        ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000002 {rcx}, byref
        ; byrRegs +[rcx]
-       mov      rax, gword ptr [rcx+10H]
+       add      rcx, 8
+       mov      rax, gword ptr [rcx+08H]
        ; gcrRegs +[rax]
-						;; size=4 bbWeight=1 PerfScore 2.00
+						;; size=8 bbWeight=1 PerfScore 2.25
 G_M20711_IG03:        ; bbWeight=1, epilog, nogc, extend
        ret      
 						;; size=1 bbWeight=1 PerfScore 1.00

EgorBo · 2023-02-06T11:36:17Z

I'll check. I didn't expect any size diffs at all - I was mostly interested on impact on JIT TP since I removed a tree.

Btw, we do a similar thing when we xpand CASTCLASS/ISINST - we save a type handle to a local right in the importer.

jakobbotsch · 2023-02-06T11:37:08Z

src/coreclr/jit/importer.cpp

+    GenTree* handleForResult =
+        opts.OptimizationEnabled() ? fgInsertCommaFormTemp(&handleForNullCheck) : gtCloneExpr(handleForNullCheck);


Do we need a comma here? Can we create a new statement given that we are in the importer?

Good question, will check!

EgorBo · 2023-02-06T14:21:26Z

@jakobbotsch so a separate statement seems to produce smaller diffs and leads to overall PerfScore regression.

The improvements/regressions seem all just (un)fortunate CSE/loop hoisting decisions, e.g. here

it was decided not to hoist it from the loop. A lot of improvements are in cold blocks like you mentioned, e.g.

(presumably it decided not to do CSE for a block with weght 0.25) -- Although, that weight is a bit off, the fast path is expected to have the same weight as the nullcheck. But that's because QMARKs don't have ability to specify weights (I'll fix it in my PR)

Overall, the PR is a size/perfscore improvement + TP wins

jakobbotsch · 2023-02-06T14:36:54Z

@jakobbotsch so a separate statement seems to produce smaller diffs and leads to overall PerfScore regression.

That's a bit surprising, seems like deficiencies in downstream phases?

it was decided not to hoist it from the loop.

Decided or it hit a limitation due to the new IR?

Overall the change makes sense to me given that the value is reused in the hot code path, so it seems fine to me. Would just like to understand why CSE does not get these on its own and with all of the extra heuristics it uses to try to make a good decision.

EgorBo · 2023-02-06T14:52:50Z

@jakobbotsch so a separate statement seems to produce smaller diffs and leads to overall PerfScore regression.

That's a bit surprising, seems like deficiencies in downstream phases?

it was decided not to hoist it from the loop.

Decided or it hit a limitation due to the new IR?

Overall the change makes sense to me given that the value is reused in the hot code path, so it seems fine to me. Would just like to understand why CSE does not get these on its own and with all of the extra heuristics it uses to try to make a good decision.

Can't find that diff, analyzed around 20 of them already, I'm seeing regressions like this:

Not sure what the source, probably forward sub didn't handle it well, closer to lower we have a tree like this:

Probably worth handling in EarlyProp (or introduce a new Forward sub late phase)

AndyAyersMS · 2023-02-06T16:02:25Z

Hoisting currently can't handle assignments so splitting things across statements (see eg #35735) will block opportunties.

EgorBo · 2023-02-06T17:33:19Z

The change in this PR will be overwritten with #81635 at some point, I'm just wondering if this results in perf improvements (but mainly I did it for TP wins)

EgorBo · 2023-02-10T10:14:56Z

Closing to do it in #81472

Don't clone a large tree in impRuntimeLookupToTree

bc3b9ef

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 1, 2023

ghost assigned EgorBo Feb 1, 2023

EgorBo closed this Feb 1, 2023

EgorBo reopened this Feb 5, 2023

EgorBo added 3 commits February 6, 2023 00:38

Merge branch 'main' of github.com:dotnet/runtime into jit-tp-clone-nc

8e3a496

Update importer.cpp

1ba2515

minimize regressions

22b61ad

EgorBo marked this pull request as ready for review February 6, 2023 11:07

jakobbotsch reviewed Feb 6, 2023

View reviewed changes

EgorBo closed this Feb 10, 2023

EgorBo deleted the jit-tp-clone-nc branch February 10, 2023 10:15

ghost locked as resolved and limited conversation to collaborators Mar 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't clone a large tree in impRuntimeLookupToTree #81472

Don't clone a large tree in impRuntimeLookupToTree #81472

EgorBo commented Feb 1, 2023

ghost commented Feb 1, 2023

EgorBo commented Feb 1, 2023

EgorBo commented Feb 6, 2023 •

edited

Loading

jakobbotsch commented Feb 6, 2023

EgorBo commented Feb 6, 2023

EgorBo commented Feb 6, 2023

jakobbotsch commented Feb 6, 2023

EgorBo commented Feb 6, 2023

jakobbotsch Feb 6, 2023

EgorBo Feb 6, 2023

EgorBo commented Feb 6, 2023

jakobbotsch commented Feb 6, 2023

EgorBo commented Feb 6, 2023

AndyAyersMS commented Feb 6, 2023

EgorBo commented Feb 6, 2023

EgorBo commented Feb 10, 2023

		GenTree* handleForResult =
		opts.OptimizationEnabled() ? fgInsertCommaFormTemp(&handleForNullCheck) : gtCloneExpr(handleForNullCheck);

Don't clone a large tree in impRuntimeLookupToTree #81472

Don't clone a large tree in impRuntimeLookupToTree #81472

Conversation

EgorBo commented Feb 1, 2023

ghost commented Feb 1, 2023

EgorBo commented Feb 1, 2023

EgorBo commented Feb 6, 2023 • edited Loading

jakobbotsch commented Feb 6, 2023

EgorBo commented Feb 6, 2023

EgorBo commented Feb 6, 2023

jakobbotsch commented Feb 6, 2023

EgorBo commented Feb 6, 2023

jakobbotsch Feb 6, 2023

Choose a reason for hiding this comment

EgorBo Feb 6, 2023

Choose a reason for hiding this comment

EgorBo commented Feb 6, 2023

jakobbotsch commented Feb 6, 2023

EgorBo commented Feb 6, 2023

AndyAyersMS commented Feb 6, 2023

EgorBo commented Feb 6, 2023

EgorBo commented Feb 10, 2023

EgorBo commented Feb 6, 2023 •

edited

Loading