JIT: Slightly refactor handling of conditions/conditional branches in the backend #82020
Slightly refactor how compares are represented and processed in the backend in preparation of lifting the remaining cmov restrictions.

* Change all compare nodes to always produce values and never set flags. That is, GT_EQ/GT_NE/GT_LT/GT_LE/GT_GE/GT_GT/GT_TEST_EQ/GT_TEST_NE are now all "normal" nodes.
* Stop optimizing compares by looking for a user JTRUE node. Instead, we optimize the compare when we see it (for example, EQ/NE(AND(x, y), 0) is turned into GT_TEST_EQ/GT_TEST_NE(x, y)), but we do not change the user JTRUE node to a JCC node. We now never create SETCC nodes when we see the compare.
* Handle the transformation from JTRUE to JCC when we actually get to the JTRUE. The transformation that now happens is:
  1. The compare node is turned into a GT_CMP node (for GT_EQ through GT_GT) or a GT_TEST node (for GT_TEST_EQ/GT_TEST_NE). These nodes are always void-typed and do not produce values; they only set the CPU flags, and correspond directly to low-level target instructions (like cmp/test). GT_CMP already existed; GT_TEST is added as part of this PR.
  2. The JTRUE is turned into a JCC with the condition from the compare.
  For xarch/arm64 we now handle optimizing EQ/NE(node_that_sets_zf, 0) at this stage too.
* Introduce new xarch backend-only GT_BITTEST_EQ/GT_BITTEST_NE nodes that semantically correspond to (x & (1 << y)) != 0. The corresponding "set flags" node is GT_BT and already exists. We would previously transform (x & (1 << y)) != 0 into GT_BT + GT_SETCC (when the user was not a JTRUE). This is now transformed into GT_BITTEST_EQ/GT_BITTEST_NE instead, which are completely normal nodes. The processing of JTRUE handles the transformation into BT + JCC (in the same way as described above).

What I was after is moving all handling of the conditions to when we process the JTRUE node, in anticipation of doing the same for GT_SELECT.
For GT_SELECT we are going to have to reorder the condition with the true/false values, and when we were handling some of these transformations as part of the compare that seemed dangerous.
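The value-producing vs. flags-setting split described above can be sketched in plain C++. This is a semantic illustration only, not JIT code, and the function names are made up:

```cpp
#include <cassert>
#include <cstdint>

// Value-producing form: what a GT_BITTEST_NE node computes. It is a
// "normal" node yielding 0/1, which xarch can emit as bt + setc without
// materializing the shifted mask.
bool BitTestNe(uint32_t x, uint32_t y)
{
    return (x & (1u << y)) != 0;
}

// Branching form: under the new scheme the compare stays a normal
// value-producing node until lowering reaches the JTRUE user, at which
// point it becomes a flags-only BT followed by a JCC.
int BranchOnBit(uint32_t x, uint32_t y)
{
    if ((x & (1u << y)) != 0) // lowered to bt x, y ; jc ...
    {
        return 1;
    }
    return 0;
}
```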
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak

Issue Details: Slightly refactor how compares are represented and processed in the backend in preparation of lifting the remaining cmov restrictions.
What I was after is moving all handling of the condition to when we process the JTRUE node, in anticipation of doing the same for GT_SELECT. For GT_SELECT we are going to have to reorder the condition with the true/false values, and handling this as part of the compares in some cases seemed dangerous.

Some example LIR diffs:

Generating: N033 (???,???) [000136] ----------- IL_OFFSET void INLRT @ 0x00A[E-] REG NA
Generating: N035 ( 1, 1) [000009] ----------- t9 = LCL_VAR int V02 loc1 u:1 ecx REG ecx $141
Generating: N037 ( 1, 1) [000011] -c--------- t11 = CNS_INT int 16 REG NA $42
┌──▌ t9 int
├──▌ t11 int
- Generating: N039 ( 3, 3) [000012] N------N-U- ▌ GE void REG NA $142
+ Generating: N039 ( 3, 3) [000012] -------N-U- ▌ CMP void REG NA
IN0005: cmp ecx, 16
- Generating: N041 ( 5, 5) [000013] ----------- ▌ JTRUE void REG NA $VN.Void
+ Generating: N041 (???,???) [000144] ----------- JCC void cond=UGE REG NA
IN0006: jae L_M20112_BB03

┌──▌ t112 ubyte
├──▌ t54 int
- Generating: N061 ( 22, 14) [000060] ---XGO----- ▌ BT void REG NA
+ Generating: N061 ( 22, 14) [000060] ---XGO----- t60 = ▌ BITTEST_NE int REG eax
IN0009: bt eax, esi
- Generating: N063 (???,???) [000144] ----------- t144 = SETCC int cond=C REG eax
IN000a: setb al
IN000b: movzx eax, al
┌──▌ t144 int
Generating: N065 ( 23, 15) [000061] ---XGO----- ▌ RETURN int REG NA <l:$1c8, c:$1c7>
Looks like I need to update a few peeps, and there's an arm32 issue.
Given this is a very common node, avoid this allocation
Some diffs are because we treat

@@ -18,15 +18,15 @@
G_M16052_IG01: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
;; size=0 bbWeight=1 PerfScore 0.00
G_M16052_IG02: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
+ xor eax, eax
bt rcx, rdx
setb al
- movzx rax, al
- ;; size=10 bbWeight=1 PerfScore 1.75
+ ;; size=9 bbWeight=1 PerfScore 1.75
G_M16052_IG03: ; bbWeight=1, epilog, nogc, extend
ret
;; size=1 bbWeight=1 PerfScore 1.00
-; Total bytes of code 11, prolog size 0, PerfScore 3.85, instruction count 4, allocated bytes for code 11 (MethodHash=2849c14b) for method Microsoft.CodeAnalysis.BitVector:IsTrue(ulong,int):bool
+; Total bytes of code 10, prolog size 0, PerfScore 3.75, instruction count 4, allocated bytes for code 10 (MethodHash=2849c14b) for method Microsoft.CodeAnalysis.BitVector:IsTrue(ulong,int):bool
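For reference, the operation in the diff above is a simple bit test. A hedged C++ sketch of the semantics (not the actual Roslyn BitVector source):

```cpp
#include <cassert>
#include <cstdint>

// Semantics of the bt-based sequence above: test bit `bit` of `bits`.
// Both the old and new sequences compute this; the new one zeroes the
// destination up front (xor eax, eax; bt; setb) instead of zero-extending
// afterwards (bt; setb; movzx), saving one byte.
bool IsTrue(uint64_t bits, int bit)
{
    // bt with a register operand takes the bit offset modulo the operand
    // size, hence the & 63 here.
    return (bits >> (bit & 63)) & 1;
}
```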
Some other diffs are because the new "staged" approach composes better, so some optimizations can now enable new optimizations:

@@ -29,14 +29,9 @@ G_M26981_IG05: ; bbWeight=0.50, gcVars=0000000000000000 {}, gcrefRegs=000
blsr eax, ecx
setne al
movzx rax, al
- mov edx, 1
- xor r8d, r8d
- test eax, eax
- cmove edx, r8d
- mov eax, edx
cmp ecx, 1
je SHORT G_M26981_IG07
- ;; size=32 bbWeight=0.50 PerfScore 2.12
+ ;; size=16 bbWeight=0.50 PerfScore 1.50
G_M26981_IG06: ; bbWeight=4, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
inc eax
sar ecx, 1
@@ -47,7 +42,7 @@ G_M26981_IG07: ; bbWeight=0.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byr
ret
;; size=1 bbWeight=0.50 PerfScore 0.50
-; Total bytes of code 49, prolog size 0, PerfScore 17.40, instruction count 19, allocated bytes for code 49 (MethodHash=e65b969a) for method ILCompiler.Reflection.ReadyToRun.GcInfoTypes:CeilOfLog2(int):uint
+; Total bytes of code 33, prolog size 0, PerfScore 15.18, instruction count 14, allocated bytes for code 33 (MethodHash=e65b969a) for method ILCompiler.Reflection.ReadyToRun.GcInfoTypes:CeilOfLog2(int):uint

We have the following LIR:

[000073] ----------- IL_OFFSET void INLRT @ 0x00C[E-]
N001 ( 1, 1) [000004] ----------- t4 = LCL_VAR int V00 arg0 u:1 $80
N002 ( 1, 1) [000005] ----------- t5 = LCL_VAR int V00 arg0 u:1 $80
N003 ( 1, 1) [000006] ----------- t6 = CNS_INT int -1 $41
┌──▌ t5 int
├──▌ t6 int
N004 ( 3, 3) [000007] ----------- t7 = ▌ ADD int $101
┌──▌ t4 int
├──▌ t7 int
N005 ( 5, 5) [000008] ----------- t8 = ▌ AND int $102
N006 ( 1, 1) [000009] ----------- t9 = CNS_INT int 0 $40
┌──▌ t8 int
├──▌ t9 int
N007 ( 7, 7) [000010] J------N--- t10 = ▌ NE int $103
N008 ( 1, 1) [000012] ----------- t12 = CNS_INT int 1 $42
N009 ( 1, 1) [000035] ----------- t35 = CNS_INT int 0 $40
┌──▌ t10 int
├──▌ t12 int
├──▌ t35 int
N010 ( 10, 10) [000067] ----------- t67 = ▌ SELECT int
┌──▌ t67 int

We first transform the … With this PR we instead end up with …
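Spelled out in C++ (an illustrative sketch of the simplification, not the JIT's code): the LIR above selects 1/0 based on a relop that itself already produces 1/0, so the SELECT can fold away:

```cpp
#include <cassert>
#include <cstdint>

// Before folding: SELECT(NE(AND(x, x - 1), 0), 1, 0) -- a 1/0 select fed
// by a relop that already yields 1/0.
uint32_t BeforeFold(uint32_t x)
{
    bool ne = (x & (x - 1)) != 0;
    return ne ? 1u : 0u; // redundant select
}

// After the staged lowering composes, the relop's value is used directly
// (the blsr + setne sequence in the diff above).
uint32_t AfterFold(uint32_t x)
{
    return (x & (x - 1)) != 0;
}
```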
Some of the other improvements are because we now support dead code elimination for some arithmetic + JCC pairs that we emit in some cases. In those cases, even before this change, we are emitting:

N023 ( 13, 13) [000087] ----GO----- t87 = ▌ XOR int <l:$500, c:$501>
N026 ( 17, 17) [000090] ----GO----- JCC void cond=UNE

We did not handle this in …
…o be tracked

Otherwise we may track it after lowering and invalidate the assumptions made by the optimization. Fix dotnet#82020
// The emitter's general logic handles op1/op2 for bt reversed. As a
// small hack we reverse it in codegen instead of special casing the
// emitter throughout.
std::swap(op1, op2);
This is preexisting to this PR; the old genCodeForBT had this handling too.
GTF_RELOP_SJUMP_OPT = 0x04000000, // GT_<relop> -- Swap signed jl/jge with js/jns during emitter, reuses flags
                                  // from previous instruction.
We handle this peephole fully when we see the flag setting instruction now, so we don't need this flag to communicate it along.
// Arguments:
//    other - The range to move from.
//
LIR::ReadOnlyRange& LIR::ReadOnlyRange::operator=(ReadOnlyRange&& other) |
We already have the move constructor on LIR::ReadOnlyRange and I needed this version to write some logic cleanly.
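A minimal sketch of the pattern being described, with illustrative names (not the actual LIR::ReadOnlyRange implementation): a type that already has a move constructor gains the matching move assignment.

```cpp
#include <cassert>
#include <utility>

struct Node { int value; }; // stand-in for a LIR node

class Range
{
    Node* m_first = nullptr;
    Node* m_last  = nullptr;

public:
    Range() = default;
    Range(Node* first, Node* last) : m_first(first), m_last(last) {}

    // Move constructor: steal the endpoints and leave `other` empty.
    Range(Range&& other) : m_first(other.m_first), m_last(other.m_last)
    {
        other.m_first = nullptr;
        other.m_last  = nullptr;
    }

    // The matching move assignment, mirroring the move constructor.
    Range& operator=(Range&& other)
    {
        m_first = other.m_first;
        m_last  = other.m_last;
        other.m_first = nullptr;
        other.m_last  = nullptr;
        return *this;
    }

    bool IsEmpty() const { return m_first == nullptr; }
};
```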
@@ -2122,7 +2122,6 @@ void Compiler::fgComputeLifeLIR(VARSET_TP& life, BasicBlock* block, VARSET_VALAR
             case GT_STORE_BLK:
             case GT_STORE_DYN_BLK:
             case GT_JCMP:
-            case GT_CMP:
This is what now allows us to do DCE for GT_CMP nodes if they don't have GTF_SET_FLAGS set. Even before this PR we allowed that for the GT_EQ family of nodes, which were exactly the same as GT_CMP when void-typed with GTF_SET_FLAGS set.
@@ -1673,7 +1673,7 @@ int LinearScan::ComputeOperandDstCount(GenTree* operand)
         // Stores and void-typed operands may be encountered when processing call nodes, which contain
         // pointers to argument setup stores.
         assert(operand->OperIsStore() || operand->OperIsBlkOp() || operand->OperIsPutArgStk() ||
-               operand->OperIsCompare() || operand->OperIs(GT_CMP) || operand->TypeGet() == TYP_VOID);
+               operand->TypeIs(TYP_VOID));
This OperIs(GT_CMP) was seemingly only here because of code in LowerModPow2 that created a non-void typed GT_CMP, but GT_CMP does not produce a value so that wasn't expected. The OperIsCompare() check here is unnecessary as those nodes always produce a value now.
/azp run runtime-coreclr libraries-jitstress

Azure Pipelines successfully started running 1 pipeline(s).
relop->gtType = TYP_VOID;
relop->gtFlags |= GTF_SET_FLAGS;

if (relop->OperIs(GT_EQ, GT_NE, GT_LT, GT_LE, GT_GE, GT_GT))
With this code, a JTRUE(LT) is always turned into a JCC(CMP) with the JCC containing a condition code within gtCondition. In codegen for JCC, we don't need to check the child for the condition. I'm happy with that.

In my and-chains patch, I let optimize-bools create a chain of the form NE(AND(LT,EQ),0). LowerConstCompare() turns it into TEST_NE(LT,EQ) and then a new function lowerTestCompare() turns it into CCMP_LT(EQ). A JTRUE node is allowed to have either a standard compare child or a CCMP child, e.g. JTRUE(CCMP_LT(EQ)).
When combining both worlds, the lowering stage gets a little complicated, but everything else feels simple:
The CCMP_LT has a gtCondition, plus there is also a CCMP node. lowerTestCompare() turns TEST_NE(LT,EQ) into CCMP_LT(CMP). Then LowerJTrue() would turn it into CCMP(CMP), giving us JCC(CCMP(CMP)).
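The chain being discussed evaluates both relops without branching. A small C++ sketch of the semantics that cmp + ccmp would encode on arm64 (illustrative only, not JIT code):

```cpp
#include <cassert>

// NE(AND(LT(a,b), EQ(c,d)), 0): both compares are evaluated and combined
// branch-free, which is what a cmp followed by a conditional compare (ccmp)
// expresses -- the second compare's flags only matter when the first
// condition held.
bool AndChain(int a, int b, int c, int d)
{
    // Deliberately non-short-circuiting `&` to mirror the flattened chain.
    return ((a < b) & (c == d)) != 0;
}
```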
Right. After this PR the plan is that we then introduce GT_SELECTCC, which is to GT_SELECT as GT_JCC is to GT_JTRUE (with the additional caveat that we never see GT_JTRUE in codegen, because we have an invariant that there is nothing between the relop and GT_JTRUE).

When lowering GT_SELECT and GT_JTRUE we will invoke a common function to transform compare chains into CMP + CCMP + ..., followed up by the transformation into GT_SELECTCC and GT_JCC. For GT_SELECT this function also needs to verify that reordering the compare to happen before GT_SELECT is safe (if it isn't we'll leave GT_SELECT as GT_SELECT instead of transforming into GT_SELECTCC). This should put everything on the same plan and I think it will unify the last of the handling.
FWIW, it sounds like CCMP_LT would not be necessary with generic CCMP. I.e. we can transform TEST_NE(LT,EQ) into SETCC(CCMP(CMP(x, y))) and teach LowerJTrue to pass through SETCC nodes.

I actually considered using this approach instead of having GT_TEST_NE/GT_TEST_EQ/GT_BITTEST_NE/GT_BITTEST_EQ, but it creates an additional node and will require some bespoke interference logic for GT_SELECT (but I think that logic will be necessary on arm64 in that case regardless).

I can try to see what it would look like in a follow-up.
> we can transform TEST_NE(LT,EQ) into SETCC(CCMP(CMP(x, y)))

That's better. Reduces the number of node types.

> When lowering GT_SELECT and GT_JTRUE we will invoke a common function to transform compare chains into CMP + CCMP + ...

If the chain parsing is also done at the CMP and AND nodes, then we should be able to only check the parent and child at each stage, so probably no need to recursively go through the chain.
So, my question then is what to do with my and chains patch? I should wait until this patch is merged before continuing with it. Should I also wait for the GT_SELECTCC or any other patches - I don't think I need to, they should be mostly independent, but will merge conflict.
I think waiting for this one is a good idea, I hope we can merge it by today or tomorrow. Not sure how much this one will conflict with your PR.
I don't think you need to wait for GT_SELECTCC either. I will probably move the xarch backend to GT_SELECTCC as the first step and leave arm64 alone, so then they shouldn't conflict.
> For GT_SELECT this function also needs to verify that reordering the compare to happen before GT_SELECT is safe (if it isn't we'll leave GT_SELECT as GT_SELECT instead of transforming into GT_SELECTCC). This should put everything on the same plan and I think it will unify the last of the handling.

This will drop all the containment within SELECT? I.e. SELECT will always load flags from a register, but SELECTCC will require the previous node to be a compare (like JCC) and always rely on the condition flags being set.
Yes, exactly.
cc @dotnet/jit-contrib PTAL @BruceForstall

This turned out to be a bit more than a 'slight' refactoring, but I think it's an overall simplification of how conditions are handled, and it should allow us to get some more principled handling for SELECT and the work that @a74nh is doing in the future. Note that I ran outerloop, jitstress, libraries-jitstress and Fuzzlyn on this (I deleted some of the comments if you're wondering where I did that).
LGTM. Looks like a nice simplification.

cc @shushanhf -- the LA backend might need some fixups for this.
Thanks very much.
Slightly refactor how compares are represented and processed in the backend in preparation of lifting the remaining cmov restrictions.

* Change all compare nodes to always produce values and never set flags. That is, GT_EQ/GT_NE/GT_LT/GT_LE/GT_GE/GT_GT/GT_TEST_EQ/GT_TEST_NE are now all "normal" nodes, always.
* Stop optimizing compares by looking for a user JTRUE node. Instead, we optimize the compare when we see it (for example, EQ/NE(AND(x, y), 0) is turned into GT_TEST_EQ/GT_TEST_NE(x, y)), but we do not change the user JTRUE node to a JCC node. We now never create SETCC nodes when we see the compare, and we always turn JTRUE nodes into JCC nodes with a proper condition.
* Handle the transformation from JTRUE to JCC when we actually get to the JTRUE. The transformation that now happens is:
  1. The compare node is turned into a GT_CMP node (for GT_EQ through GT_GT) or a GT_TEST node (for GT_TEST_EQ/GT_TEST_NE). These nodes are void-typed, only set the CPU flags, and correspond directly to low-level target instructions (like cmp/test).
  2. The JTRUE is turned into a JCC with the condition from the compare.
  For arm64/xarch, we now handle optimizing EQ/NE(node_that_sets_zero_flag, 0) at this stage too.
* Introduce new xarch backend-only GT_BITTEST_EQ/GT_BITTEST_NE nodes that semantically correspond to (x & (1 << y)) != 0. The corresponding "set flags" node is GT_BT and already exists. We would previously transform (x & (1 << y)) != 0 into GT_BT + GT_SETCC (when the user was not a JTRUE). This is now transformed into GT_BITTEST_EQ/GT_BITTEST_NE instead, which are completely normal nodes. The processing of JTRUE handles the transformation into BT + JCC.
* Change some of the emitter compare peepholes to work with this scheme. This requires some lookahead when we see them, but allows us to get rid of an "ugly" flag.
* Allow liveness to DCE GT_CMP/GT_TEST when they have had GTF_SET_FLAGS unset (due to FG optimizations). We already allowed this for GT_EQ, so this showed up as some regressions.

Some example LIR diffs: