JIT: Remove BBJ_NONE #94239
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details: Next step for #93020, per conversation on #93772. Replacing BBJ_NONE with BBJ_ALWAYS to the next block helps limit our use of implicit fall-through. I've added a small peephole optimization to skip emitting unconditional branches to the next block during codegen.
Failures look like #91757.
CC @dotnet/jit-contrib, @AndyAyersMS PTAL. I tried to rein in the asmdiffs as much as possible without adding too many weird edge cases. The code size increases in the
Very interesting. I would not have expected massive code size improvements from something like this, and I'd like to understand this aspect a bit better (especially the min opts cases). Can we pick a few examples for case studies? I will need some time to go through the changes -- will try to get you a first pass later today.
What is this from? You need to pass

For this change: it looks like the "jump to next BB" optimization is kicking in a few places in MinOpts. I don't see an easy way to make the behavior the same as before, but we should ensure that this doesn't impact debugging.
Ah, thanks for catching that. I'm rerunning the PerfScore measurement now; will update with the results here.
By "jump to next BB optimization," do you mean the peephole optimization for jumping to the next block during codegen, or one of the flowgraph optimizations (like
Sure thing. There are a couple of methods in the
I don't fully understand the requirements for deciding to clone a loop, but I'm guessing some slightly different decisions made in

In this case, do we trust the perfScore improvements, or does this amount of cloning seem extreme? I'll update shortly with an example of a code size improvement in MinOpts.
I took a look at an example similar to the one Jakob screenshotted above; see
The conditions of these
The
Normally, the JIT would convert these jumps to the next block into

I imagine it would be easy to disable this optimization in MinOpts the same way we disable the layout optimization phase, if we find it interferes with debugging unoptimized code.
I've seen cloning "unexpectedly" kick in from changes like this. It is something we'll just have to tolerate for now. It would be helpful to see how much of the code growth comes in methods where more cloning happens, and whether aside from that there are other things going on that would be worth understanding. However it may be a little bit painful to gather this data. For instance you could add output to the jit's disasm footer line indicating the number of cloned loops, then update jit-analyze or similar to parse this and aggregate data separately for methods where # of clones matches vs those where # of clones differs...
For now you should try and see if you can get to zero diffs (or close) for MinOpts; we can always come back later to see if this optimization can be safely enabled in modes where we are generally (and intentionally) not optimizing the code. For instance, we might want to turn it on for Tier0 but not MinOpts or Debuggable Code.
Thanks for pointing this out. I'll share some metrics on loop cloning for the collections with the biggest size regressions for FullOpts.
Sure thing, I'll try disabling the optimization for MinOpts. FYI that while disabling it will probably reduce/remove the size improvements, I expect us to get some pretty big diffs in the opposite direction for MinOpts, as all the blocks that used to be
Interesting ... so we seemingly have lost something important by removing

One idea is to look at the associated IL offset info. If the block-end IL offset is valid and different from the last statement's IL offset, and also different from the next block's start IL offset, then the jump may be significant for source-level debugging, and so it should result in some instruction (though perhaps emitting a

If we can't get this to zero diff for MinOpts/Debuggable code, we'll have to verify behavior with the debugger tests. I suppose we could also try and look at the debug info we generate for some of these methods (say, pass
src/coreclr/jit/codegenlinear.cpp (outdated)
@@ -737,7 +737,9 @@ void CodeGen::genCodeForBBlist()
    {
        // Peephole optimization: If this block jumps to the next one, skip emitting the jump
        // (unless we are jumping between hot/cold sections, or if we need the jump for EH reasons)
-       const bool skipJump = block->JumpsToNext() && !(block->bbFlags & BBF_KEEP_BBJ_ALWAYS) &&
+       // (Skip this optimization in MinOpts)
+       const bool skipJump = !compiler->opts.MinOpts() && block->JumpsToNext() &&
You probably want compiler->optimizationsEnabled() here.
Left some notes.

Not sure about some of the changes to fgReorderBlocks, but that method is complex enough that I'd need to walk through some cases to understand it better.

I still think the goal here for now should be to try and minimize diffs. If there are opportunities to improve, we can do those as follow-ups. What we should do is pick a few relatively small / simple methods for case studies and make sure we understand what is leading to the diffs there. Happy to help with this.
// If that happens, make sure a NOP is emitted as the last instruction in the block.
emitNop = true;
break;
Did we ever hit the case in the old code where we had BBJ_NONE on the last block?
I added an assert(false) to that case in the old code to see if I could get it to hit during a SuperPMI replay, and it never hit across all collections. Also in the new code, I added an assert that BBJ_ALWAYS has a jump before trying to emit the jump, so that we never have a BBJ_ALWAYS that "falls into" nothing at the end of the block list -- that also never hit.
src/coreclr/jit/fgbasic.cpp (outdated)
@@ -3157,7 +3155,7 @@ unsigned Compiler::fgMakeBasicBlocks(const BYTE* codeAddr, IL_OFFSET codeSize, F
    jmpDist = (sz == 1) ? getI1LittleEndian(codeAddr) : getI4LittleEndian(codeAddr);

-   if ((jmpDist == 0) && (opcode == CEE_BR || opcode == CEE_BR_S) && opts.DoEarlyBlockMerging())
+   if ((jmpDist == 0) && (jmpKind == BBJ_ALWAYS) && opts.DoEarlyBlockMerging())
Does this handle the CEE_LEAVE cases like the old code did? Suspect you may want to change this back to what it was before.
I think so, since jmpKind is set to BBJ_ALWAYS only when the opcode is CEE_BR or CEE_BR_S, though I can change this back for simplicity.
@@ -1432,19 +1428,6 @@ void Compiler::fgDebugCheckTryFinallyExits()
        }
    }
}
Can you also update the comment above this code since case (d) is no longer possible? Instead of compacting (e) and (f) maybe just do something like
// ~~(d) via a fallthrough to an empty block to (b)~~ [no longer possible]
src/coreclr/jit/fgehopt.cpp (outdated)
@@ -1472,7 +1455,7 @@ void Compiler::fgDebugCheckTryFinallyExits()
        block->bbNum, succBlock->bbNum);
    }

-   allTryExitsValid = allTryExitsValid & thisExitValid;
+   allTryExitsValid = allTryExitsValid && thisExitValid;
Nit: since these bools are almost always true, & is possibly a bit more efficient than &&.
const bool fallThroughIsTruePred = BlockSetOps::IsMember(this, jti.m_truePreds, jti.m_fallThroughPred->bbNum);
const bool predJumpsToNext = jti.m_fallThroughPred->KindIs(BBJ_ALWAYS) && jti.m_fallThroughPred->JumpsToNext();
This code can likely be simplified quite a bit now that there are no implicit fall throughs. That is, there is no longer any reason to set jti.m_fallThroughPred. Can you leave a todo comment here and maybe a note in the meta-issue that we should revisit this?
Sure thing.
// TODO: Now that block has a jump to bNext,
// can we relax this requirement?
assert(!fgInDifferentRegions(block, bNext));
Probably not, unless the block whose IR is moving is empty or can't cause an exception.
@@ -2750,6 +2733,13 @@ void Compiler::optRedirectBlock(BasicBlock* blk, BlockToBlockMap* redirectMap, R
        break;

+   case BBJ_ALWAYS:
+       // Fall-through successors are assumed correct and are not modified
Is this new logic really necessary?
I think so. That comment is copied from the notes in the doc comment for Compiler::optRedirectBlock; I added this check in to emulate the no-op behavior it previously had for BBJ_NONE. If I remove it, we hit assert(h->HasJumpTo(t) || !h->KindIs(BBJ_ALWAYS)) in Compiler::optCanonicalizeLoopCore.
The logic seems wrong with this added code here -- I would expect optRedirectBlock to always redirect a BBJ_ALWAYS based on the map. The behavior now doesn't match the documentation. Maybe some update to redirectMap is needed somewhere?
To preserve the original behavior of this method, which was to not redirect BBJ_NONE, would it be ok to use BBF_NONE_QUIRK here instead to determine if we should redirect a BBJ_ALWAYS? This seems to work locally, e.g.:

if (blk->JumpsToNext() && ((blk->bbFlags & BBF_NONE_QUIRK) != 0)) // Functionally equivalent to BBJ_NONE
It would be better, even though I generally think quirks should not affect behavior in this way. It seems like there is some form of bug here around how the redirection happens or how the map is constructed.
src/coreclr/jit/optimizer.cpp (outdated)
-   {
-       preHead = BasicBlock::bbNewBasicBlock(this, BBJ_ALWAYS, entry);
-   }
+   BasicBlock* preHead = BasicBlock::New(this, BBJ_ALWAYS, (isTopEntryLoop ? top : entry));
I think we can just always branch to entry now, the logic before was trying to optimize the case for top-entry loops and fall through, but we don't need to do that anymore.
It's cool to see how much code is getting deleted (and how much more will probably follow).

I agree with Andy that you should get as close to zero diffs as possible with this change, putting in temporary workarounds if necessary for cases we can later remove.

In addition:

- Comment unrelated to this change: I don't like how HasJump() determines if we have a jump using bbJumpDest != nullptr. Note that bbJumpDest is part of a union. We should never access that value without first checking bbJumpKind (or asserting that bbJumpKind is one where we expect bbJumpDest to be set). It would probably be useful to have functions:

  HasJumpDest() => KindIs(BBJ_ALWAYS,<others>) // bbJumpDest is valid
  HasJumpSwt()  => KindIs(BBJ_SWITCH)          // bbJumpSwt is valid
  HasJumpEhf()  => KindIs(BBJ_EHFINALLYRET)    // bbJumpEhf is valid

  (and presumably bbJumpOffs is only valid during a limited time during importing?) These could be used in appropriate asserts as well.

- Questions about next steps: (a) do we get rid of the BBJ_COND "fall through"? (b) do we get rid of "JumpsToNext()"? (c) do we get rid of bbFallsThrough()? (d) do we get rid of fgConnectFallThrough()? (e) do we rename/remove fgIsBetterFallThrough()?
src/coreclr/jit/fgehopt.cpp (outdated)
@@ -1101,15 +1097,15 @@ PhaseStatus Compiler::fgCloneFinally()
    {
        BasicBlock* newBlock = blockMap[block];
        // Jump kind/target should not be set yet
-       assert(newBlock->KindIs(BBJ_NONE));
+       assert(!newBlock->HasJump());
Shouldn't this be:

-   assert(!newBlock->HasJump());
+   assert(newBlock->KindIs(BBJ_ALWAYS));

? Or do you also want: assert(newBlock->KindIs(BBJ_ALWAYS) && !newBlock->HasJump());
The second is what I was going for; I'll update it.
src/coreclr/jit/fgopt.cpp (outdated)
    {
        return true;
    noway_assert(b1->KindIs(BBJ_ALWAYS));
This assert seems unnecessary. At least, make it a normal assert (not a noway_assert).
BBJ_THROW,  // SCK_ARG_EXCPN
BBJ_THROW,  // SCK_ARG_RNG_EXCPN
BBJ_THROW,  // SCK_FAIL_FAST
BBJ_ALWAYS, // SCK_NONE
This seems odd. Does SCK_NONE ever get used?
I guess not; I added an assert in to test if add->acdKind == SCK_NONE is ever true, and the SuperPMI replay was clean. A quick search of the source code doesn't yield any places where we assign SCK_NONE, so maybe I can add an assert in here that assures we don't use SCK_NONE? Then I can remove the conditional logic below for setting the jump target if newBlk is a BBJ_ALWAYS.
src/coreclr/jit/flowgraph.cpp (outdated)
if (newBlk->KindIs(BBJ_ALWAYS))
{
    assert(add->acdKind == SCK_NONE);
    newBlk->SetJumpDest(newBlk->Next());
What if newBlk is the last block of the function? Then is bbJumpDest == nullptr an indication of "fall off the end"? (previously, we'd have a BBJ_NONE and generate an int3 / breakpoint if that happened)
Good point. When emitting jumps, I added an assert to see if we ever get a BBJ_ALWAYS with a null jump target, and it never hit during SuperPMI replays. Per your note on SCK_NONE, I think we can get rid of this if statement altogether.
@AndyAyersMS @BruceForstall thank you both for the code reviews! I'll address your feedback shortly.
I tried this with the peephole optimization always enabled, and only found differences in the number of IL offsets when the number of blocks differed (usually from flowgraph optimizations like
I replayed
I'll take a look at the JIT dumps for the top improvements next to see where else the decreases are coming from (particularly for MinOpts).

I'd like to understand why there is more (or less) cloning. There's a lot of code that likes "top entry loops" so perhaps those aren't being distinguished as before? More fundamentally: are a different set/number of loops being recognized by the loop recognizer, without "fall through"? Does loop inversion happen the same amount?

I think this is the case. Looking at the

Is this different from #69041?

@kunalspathak they seem to have the same goal, but I think my approach is more aggressive, in that it checks to see if a BBJ_ALWAYS is functionally equivalent to BBJ_NONE during codegen. Without the opt I added, I saw plenty of unnecessary jumps to next after replacing

Looks like that was the problem.
LGTM. Thanks for hanging in there through a number of revisions.
Thank you for all the reviews!

There is no point to allow the BBJ_ALWAYS to be split up from (live in a different place from; have code in between; etc.) its paired BBJ_CALLFINALLY. All the codegen happens when processing the BBJ_CALLFINALLY, which uses the data from the BBJ_ALWAYS. The BBJ_ALWAYS itself generates nothing.

Yeah, the main benefit would be to allow us to get rid of all code in the JIT that has to deal with the possibility of implicit fallthrough (

Thanks for clarifying @BruceForstall. Maybe later on, we can consider replacing

It's very weird to think about non-retless CALLFINALLY as "fall through": you can't insert a block between the CALLFINALLY and ALWAYS, and control flow doesn't "fall through" to the block after the ALWAYS since the finally returns to the ALWAYS target. I'm not sure what it means.

If we could make that work, it would be great. Currently, the finally's EHFINALLYRET returns each ALWAYS block as a successor, and that ALWAYS block has its EHFINALLYRET as a predecessor. What you suggest would simplify a lot of current logic but might add additional special logic to flow graph interpretation. E.g., the EHFINALLYRET would, I suppose, yield all the continuation blocks as successors (makes sense), and each of them would have an EHFINALLYRET as predecessor, as well as any non-finally block predecessors.

The function has the comment "Can a BasicBlock be inserted after this without altering the flowgraph" and the current design definitely means retless CALLFINALLY falls in the category (well, the opposite meaning is implied I'm sure). I guess that's why it returns true. I think the representation where we store the continuation in the CALLFINALLY sounds natural. Finding the successors of the
And based on the return values, the comment should probably be something like "Would inserting after this block alter the flowgraph", since it returns true for blocks with implicit fallthrough. |
The "side table" is mostly for performance, simplicity, use by the iterators, and consistency with how switches are represented. What you say about using the regular predecessors of the finally entry makes sense. It wasn't done that way before, possibly because the previous code was written before predecessors were always available. |
I think we can indeed get rid of these, and it would nice to no longer have all that special case handling everywhere. We might need to constrain layout and/or add to codegen to handle the case where the retfinallly target block does not end up immediately after the corresponding callfinally block. I don't know if codegen is flexible enough today to do the latter (basically introducing a label that's not at a block begin, and after that, a branch to the right spot). |
I wrote a proposal to reconsider the BBJ_CALLFINALLY/BBJ_ALWAYS representation: #95355 Please add comments there about what might be required to make that work. |
Follow-up to #94239. In MinOpts scenarios, we should remove branches to the next block regardless of whether BBF_NONE_QUIRK is set, as this yields code size and TP improvements.
Collated set of improvements/regressions (lower is better) as of 12/12/2023.
Next step for #93020, per conversation on #93772. Replacing BBJ_NONE with BBJ_ALWAYS to the next block helps limit our use of implicit fall-through (though we still expect BBJ_COND to fall through when its false branch is taken; #93772 should eventually address this). I've added a small peephole optimization to skip emitting unconditional branches to the next block during codegen.