Don't clone a large tree in impRuntimeLookupToTree #81472
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
no TP diffs
Thanks to @jakobbotsch's fix for SPMI replay, I do see diffs now. I don't know when I'll land #81635 so I suggest we merge this PR. It seems to show some nice diffs and TP improvements: https://dev.azure.com/dnceng-public/public/_build/results?buildId=160683&view=ms.vss-build-web.run-extensions-tab e.g. PTAL @dotnet/jit-contrib
Why does CSE not get these cases? Is the 'use' here in a cold block, so CSE does not want to increase register pressure for the benefit of the cold block? (related: #75253)
The use's block is always taken if the def's block is not cold. It's:

    if ((tmp = IND(X)) != 0)
        return tmp;
    else
        return HELPER(); // expected to always be cold (has artificial weight of 0.2)
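To make the shape above concrete, here is a minimal C++ sketch of the two forms being compared: one that evaluates the indirection twice (the "cloned" form, which relies on a later phase such as CSE to merge the loads) and one that saves it to a temp first. All names here (`Handle`, `Counters`, `LoadSlot`, `SlowHelper`, `LookupCloned`, `LookupWithTemp`) are hypothetical stand-ins for illustration; none of this is JIT code.

```cpp
#include <cassert>
#include <cstdint>

using Handle = std::uintptr_t; // hypothetical stand-in for a runtime handle

// Counts how often each path runs so the two shapes can be compared.
struct Counters {
    int loads = 0;
    int helperCalls = 0;
};

// Models IND(X): a single load from the lookup slot.
Handle LoadSlot(const Handle* slot, Counters& c) { ++c.loads; return *slot; }

// Models HELPER(): the slow path, expected to be cold.
Handle SlowHelper(Counters& c) { ++c.helperCalls; return 0x1234; }

// Cloned form: the indirection appears twice on the fast path.
Handle LookupCloned(const Handle* slot, Counters& c) {
    if (LoadSlot(slot, c) != 0) {
        return LoadSlot(slot, c);
    }
    return SlowHelper(c);
}

// Temp form: the indirection is evaluated once and reused.
Handle LookupWithTemp(const Handle* slot, Counters& c) {
    Handle tmp = LoadSlot(slot, c);
    if (tmp != 0) {
        return tmp;
    }
    return SlowHelper(c);
}

// Usage: both forms return the same value; only the load count differs.
void SelfCheck() {
    Handle filled = 0x42;
    Counters cloned, temp;
    assert(LookupCloned(&filled, cloned) == LookupWithTemp(&filled, temp));
    assert(cloned.loads == 2 && temp.loads == 1);
}
```

In this toy model the temp form does one load on the fast path without depending on any later optimization, which is the property the discussion is about.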
Another potential optimization here is to never expand runtime lookups in cold blocks at all - I was thinking about it but hopefully I'll do that in #81635
Then I'd be interested to see why CSE doesn't get this case for us; it would probably avoid some of the regressions I see in the diffs, e.g.:
@@ -8,10 +8,11 @@
; Final local variable assignments
;
; V00 this [V00,T00] ( 3, 3 ) byref -> rcx this single-def
-;* V01 TypeCtx [V01 ] ( 0, 0 ) long -> zero-ref single-def
+; V01 TypeCtx [V01,T01] ( 3, 3 ) long -> rdx single-def
;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
-;* V03 tmp1 [V03 ] ( 0, 0 ) byref -> zero-ref "bubbling QMark1"
-;* V04 tmp2 [V04 ] ( 0, 0 ) long -> zero-ref "spilling Runtime Lookup tree"
+; V03 tmp1 [V03,T02] ( 2, 4 ) byref -> rcx single-def "bubbling QMark1"
+;* V04 tmp2 [V04 ] ( 0, 0 ) long -> zero-ref "fgMakeTemp is creating a new local variable"
+;* V05 tmp3 [V05 ] ( 0, 0 ) long -> zero-ref "spilling Runtime Lookup tree"
;
; Lcl frame size = 0
@@ -19,14 +20,15 @@ G_M20711_IG01: ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000000 {}
;; size=0 bbWeight=1 PerfScore 0.00
G_M20711_IG02: ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000002 {rcx}, byref
; byrRegs +[rcx]
- mov rax, gword ptr [rcx+10H]
+ add rcx, 8
+ mov rax, gword ptr [rcx+08H]
; gcrRegs +[rax]
- ;; size=4 bbWeight=1 PerfScore 2.00
+ ;; size=8 bbWeight=1 PerfScore 2.25
G_M20711_IG03: ; bbWeight=1, epilog, nogc, extend
ret
;; size=1 bbWeight=1 PerfScore 1.00
I'll check. I didn't expect any size diffs at all - I was mostly interested in the impact on JIT TP since I removed a tree. Btw, we do a similar thing when we expand CASTCLASS/ISINST - we save a type handle to a local right in the importer.
GenTree* handleForResult =
    opts.OptimizationEnabled() ? fgInsertCommaFormTemp(&handleForNullCheck) : gtCloneExpr(handleForNullCheck);
Do we need a comma here? Can we create a new statement given that we are in the importer?
Good question, will check!
@jakobbotsch so a separate statement seems to produce smaller diffs and leads to an overall PerfScore regression. The improvements/regressions all seem to be just (un)fortunate CSE/loop hoisting decisions, e.g. here it was decided not to hoist it from the loop. A lot of improvements are in cold blocks like you mentioned, e.g.
Overall, the PR is a size/perfscore improvement + TP wins
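As a rough illustration of the clone-versus-temp choice in the reviewed line, the sketch below models it on a toy expression tree. This is an assumption-laden model, not the JIT's `GenTree` API: `Node`, `Leaf`, `Unary`, `Binary`, `CountNodes`, `Clone`, `ReuseByClone`, and `ReuseByCommaTemp` are all hypothetical names, and the op strings are only labels. It shows that cloning doubles the node count, while the comma-temp form adds a fixed number of nodes regardless of tree size.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy IR node (hypothetical; NOT the JIT's GenTree).
struct Node {
    std::string op;
    std::vector<std::unique_ptr<Node>> kids;
};
using NodePtr = std::unique_ptr<Node>;

NodePtr Leaf(std::string op) {
    auto n = std::make_unique<Node>();
    n->op = std::move(op);
    return n;
}
NodePtr Unary(std::string op, NodePtr a) {
    auto n = Leaf(std::move(op));
    n->kids.push_back(std::move(a));
    return n;
}
NodePtr Binary(std::string op, NodePtr a, NodePtr b) {
    auto n = Leaf(std::move(op));
    n->kids.push_back(std::move(a));
    n->kids.push_back(std::move(b));
    return n;
}

std::size_t CountNodes(const Node& n) {
    std::size_t total = 1;
    for (const auto& k : n.kids) total += CountNodes(*k);
    return total;
}

NodePtr Clone(const Node& n) {
    auto c = Leaf(n.op);
    for (const auto& k : n.kids) c->kids.push_back(Clone(*k));
    return c;
}

// Option A: leave the first use alone and hand back a full deep copy.
NodePtr ReuseByClone(const NodePtr& use) { return Clone(*use); }

// Option B: rewrite the first use into COMMA(ASG(tmp, tree), tmp)
// and hand back a fresh read of the temp for the second use.
NodePtr ReuseByCommaTemp(NodePtr& use, const std::string& tmpName) {
    NodePtr store = Binary("ASG", Leaf(tmpName), std::move(use));
    use = Binary("COMMA", std::move(store), Leaf(tmpName));
    return Leaf(tmpName);
}

// Usage: a 7-node lookup tree, reused once each way.
void CompareNodeCounts() {
    auto makeTree = [] {
        return Unary("IND", Binary("ADD",
                   Unary("IND", Binary("ADD", Leaf("ctx"), Leaf("8"))),
                   Leaf("16")));
    };
    NodePtr useA = makeTree();
    NodePtr copyA = ReuseByClone(useA);
    NodePtr useB = makeTree();
    NodePtr readB = ReuseByCommaTemp(useB, "tmp");
    // Clone: 2n nodes in total. Comma temp: n + 5 nodes in total.
    assert(CountNodes(*useA) + CountNodes(*copyA) == 14);
    assert(CountNodes(*useB) + CountNodes(*readB) == 12);
}
```

The gap grows with the size of the tree being reused, which is consistent with the PR title's emphasis on a large tree; whether a separate statement or a comma is the better carrier for the store is the open question in the thread and is not modeled here.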
That's a bit surprising, seems like deficiencies in downstream phases?
Decided or it hit a limitation due to the new IR? Overall the change makes sense to me given that the value is reused in the hot code path, so it seems fine to me. Would just like to understand why CSE does not get these on its own and with all of the extra heuristics it uses to try to make a good decision.
Can't find that diff, analyzed around 20 of them already, I'm seeing regressions like this:
Probably worth handling in EarlyProp (or introduce a new Forward sub late phase)
Hoisting currently can't handle assignments, so splitting things across statements (see e.g. #35735) will block opportunities.
The change in this PR will be overwritten with #81635 at some point, I'm just wondering if this results in perf improvements (but mainly I did it for TP wins)
Closing to do it in #81472
Don't clone the whole tree here (it ends up being CSE'd anyway) in importer, e.g.: