JIT: Add 3-opt implementation for improving upon RPO-based layout #103450
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
src/coreclr/jit/fgopt.cpp
improvedLayout = false;
BasicBlock* const exitBlock = blockVector[blockCount - 1];

for (unsigned i = 1; i < (blockCount - 1); i++)
I think the root of the TP cost is here -- we want to avoid having to search for possible cut points.
One approach is to just pick randomly, but I think we can do better for now. Roughly speaking, in the pass above we should find all blocks that are not just before their optimal successor and/or not just after their optimal predecessor.
We can rank these by the difference in the current vs optimal score. Then greedily pick the worst; that gives the first cut point. For the second cut point you can pick the best pred for the first cut point's current next block, or the best succ for the current pred of the first cut point's ideal successor. That is, if we have
S ~~~~ 1|2 ~~~ 3|4 ~~~ 5|6 ~~~ E
1's ideal succ is 4
reordering is
S ~~~~ 1|4 ~~~ 5|2 ~~~ 3|6 ~~~ E
So we either try and find a 5 which is the ideal pred of 2, or a 6 which is the ideal succ of 3.
Failing that we might pick some other block that is not currently followed by its ideal succ.
So one idea is to keep 3 values for each block: its min score, current score, and best score (lower is better). Order the blocks by current-min. Pick off the best as the first split, and then see if any of the next few provide a good second split.
Likely though this ends up needing a priority queue or similar as once we accept an arrangement we need to update some of the costings...
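Roughly, the selection could look like this (a toy sketch with made-up names; none of this is actual fgopt.cpp code):

```cpp
#include <queue>
#include <vector>

// Each block keeps the score of its current placement and the best
// score it could achieve next to its ideal neighbors (lower is better).
struct BlockScore
{
    unsigned blockNum;
    double   minScore;     // score beside the block's ideal neighbors
    double   currentScore; // score beside its neighbors in the current layout
};

struct MostMisplaced
{
    // Surface the block whose current placement is furthest from ideal.
    bool operator()(const BlockScore& a, const BlockScore& b) const
    {
        return (a.currentScore - a.minScore) < (b.currentScore - b.minScore);
    }
};

using CutPointQueue = std::priority_queue<BlockScore, std::vector<BlockScore>, MostMisplaced>;

// Greedily take the most-misplaced block as the first cut point; the
// caller would then probe the next few candidates for a second cut.
// Accepting an arrangement means re-scoring the affected blocks, which
// is why a priority queue (or similar) ends up being necessary.
unsigned PickFirstCutPoint(CutPointQueue& queue)
{
    const unsigned cutPoint = queue.top().blockNum;
    queue.pop();
    return cutPoint;
}
```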
Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.
/azp run runtime-coreclr outerloop, Fuzzlyn, Antigen
cc @dotnet/jit-contrib, @AndyAyersMS PTAL. Diffs are a bit broken on Linux at the moment, though other platforms show less dramatic churn than I expected -- it's possible the current cost model isn't sophisticated enough to incentivize larger bets. PerfScore diffs look like a net improvement, and TP cost is pretty cheap. Antigen and Fuzzlyn failures don't look related to layout. Thanks!
Some preliminary notes.
I find it a little confusing to use "cost" if the goal of the optimization is to maximize. Maybe consider "benefit"? Or rework things to be minimizing cost?
I also think it would be useful to build the overall score for a layout; you should be able to show at each step this value is increasing (/decreasing if you minimize).
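Something along these lines, say (just a sketch; getEdgeWeight stands in for looking up the likely weight of the flow edge connecting a pair of blocks, returning 0 if there is none):

```cpp
#include <vector>

// Sum the weight of every edge that the layout turns into fallthrough.
// An assert could then check this value never decreases across swaps.
double GetLayoutScore(const std::vector<unsigned>& blockOrder,
                      double (*getEdgeWeight)(unsigned from, unsigned to))
{
    double score = 0.0;
    for (size_t i = 0; i + 1 < blockOrder.size(); i++)
    {
        // getEdgeWeight is assumed to return 0 when no edge connects the pair
        score += getEdgeWeight(blockOrder[i], blockOrder[i + 1]);
    }
    return score;
}
```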
When we reorder it seems like there are six impacted blocks (3 have new preds, 3 have new succ), so I'm surprised we are just reconsidering 3 sets of edges. Maybe this is related to not wanting to mess with the boundaries? But even so I'd expect to see 4 sets...
I would recommend adding a doc with some simple examples worked out in pseudocode.
src/coreclr/jit/fgopt.cpp
//
/* static */ bool Compiler::ThreeOptLayout::EdgeCmp(const FlowEdge* left, const FlowEdge* right)
{
    return left->getLikelyWeight() < right->getLikelyWeight();
Probably a good idea to give this an equality tie breaker, so that EdgeCmp(a,b) = !EdgeCmp(b,a). You can use the bbIDs of sources and targets, for instance.
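E.g. (a toy sketch; Edge stands in for FlowEdge, with srcID/dstID standing in for the source and target blocks' bbIDs):

```cpp
struct Edge
{
    double   likelyWeight;
    unsigned srcID; // source block's bbID
    unsigned dstID; // target block's bbID
};

// With the ID fallbacks, edgeCmp(a, b) == !edgeCmp(b, a) for any two
// distinct edges, so equal-weight edges no longer compare as unordered.
bool edgeCmp(const Edge* left, const Edge* right)
{
    if (left->likelyWeight != right->likelyWeight)
    {
        return left->likelyWeight < right->likelyWeight;
    }

    if (left->srcID != right->srcID)
    {
        return left->srcID < right->srcID;
    }

    return left->dstID < right->dstID;
}
```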
src/coreclr/jit/fgopt.cpp
    return;
}

// Don't waste time reordering within handler regions
Maybe note that if a finally is "hot", we expect to be cloning it.
{
    assert(edge != nullptr);

    BasicBlock* const srcBlk = edge->getSourceBlock();
Would it make sense to keep track of edges that can never be handled? Perhaps a second tracked set?
Alternatively, we could keep some state on the edge itself, there is likely some padding void there already that could hold bits.
> Would it make sense to keep track of edges that can never be handled? Perhaps a second tracked set?
I suppose I could just move the set insertion/membership check to the beginning of ConsiderEdge? This is a no-diff change for me locally, and it seems like a convenient way to short-circuit all the checks.
> Alternatively, we could keep some state on the edge itself, there is likely some padding void there already that could hold bits.
That sounds worth pursuing. I'll try this out.
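E.g. something along these lines (a toy sketch; FlowEdge's actual layout, and whatever padding it has, may differ):

```cpp
#include <cstdint>

struct Edge
{
    double  likelyWeight;
    uint8_t flags; // bits tucked into what would otherwise be padding

    static const uint8_t EDGE_NEVER_CANDIDATE = 0x1;

    // Once an edge is known to be unmovable (e.g. it crosses EH regions),
    // mark it so ConsiderEdge can bail out without re-running the checks.
    bool isNeverCandidate() const { return (flags & EDGE_NEVER_CANDIDATE) != 0; }
    void markNeverCandidate() { flags |= EDGE_NEVER_CANDIDATE; }
};
```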
@AndyAyersMS thanks for the review!
Agreed that the semantics looked weird. I've reframed the partition "cost" as its "score," such that we want to maximize the score of a layout by creating fallthrough for the heaviest edges. Personally, I find the cost-minimizing framing of this problem more intuitive, but if/when the model deviates from just using edge weights, I think it will be easier to compute the benefit rather than the cost of an edge (assuming the cost is computed by subtracting the edge's score from all other flow out of the source block, which may not be the same as the source block's weight).
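To make that last parenthetical concrete (a toy sketch, not actual code):

```cpp
// An edge's "cost" would be the flow out of its source block that does
// NOT fall through. Note the subtraction starts from the source's total
// outgoing flow, which may not equal the source block's weight.
double EdgeCost(double totalFlowOutOfSource, double edgeScore)
{
    return totalFlowOutOfSource - edgeScore;
}
```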
Do you intend for this to be a debug-only utility, i.e. to improve the JitDump info reported by 3-opt? Or is there some strategic advantage to knowing the total layout score during 3-opt?
Yeah, this bit of logic hasn't aged well over the last few iterations. I've simplified it a bit, and diffs against the last revision seem to show we're capturing more reordering opportunities.
Is it alright if I split that out into a separate PR?
/azp run runtime-coreclr outerloop, Fuzzlyn, Antigen
Mostly debug, I guess. When we did this way back when, we built the global scorer first and tried to ensure that a good score looked like a good layout, that a bad score looked like a bad layout, and that there were sufficient methods with bad layouts (and a set of "bad" candidates to look at). But if you know how close the score is to a good score, you can also use this for early stopping.
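E.g. (a sketch; the upper bound might be the sum of every block's best possible fallthrough weight):

```cpp
// Stop searching once the current layout score is within some fraction
// of an upper bound on the best achievable score.
bool CloseEnough(double currentScore, double upperBound, double tolerance)
{
    return currentScore >= upperBound * (1.0 - tolerance);
}
```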
Yes, sure.
Looks good!
unsigned position = 0;

// Initialize current block order
for (BasicBlock* const block : compiler->Blocks(compiler->fgFirstBB, finalBlock))
Any reason not to fuse this with the loop just above?
Splitting the loops up allows us to calculate the exact size of blockOrder/tempOrder before they're needed. Instead of allocating the exact size needed, I can use fgBBcount for the array sizes (and maybe trim this upper bound as we search for the last hot block). It'll waste some memory in some cases, but then we only need one loop.
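As a toy sketch of the two-loop version (made-up types; the real code walks the basic block list):

```cpp
#include <cstddef>
#include <vector>

struct Block
{
    unsigned id;
    Block*   next;
};

// Walk the list once to size the array exactly, then again to fill in
// the current order. The one-loop alternative would have to allocate
// fgBBcount-sized arrays up front, wasting memory when the hot section
// is smaller.
std::vector<Block*> BuildBlockOrder(Block* firstBlock, Block* finalBlock)
{
    // Pass 1: compute the exact number of blocks to reorder.
    size_t blockCount = 0;
    for (Block* b = firstBlock; b != nullptr; b = b->next)
    {
        blockCount++;
        if (b == finalBlock)
        {
            break;
        }
    }

    std::vector<Block*> blockOrder;
    blockOrder.reserve(blockCount);

    // Pass 2: initialize the current block order.
    for (Block* b = firstBlock; b != nullptr; b = b->next)
    {
        blockOrder.push_back(b);
        if (b == finalBlock)
        {
            break;
        }
    }

    return blockOrder;
}
```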
}
#endif // DEBUG

ThreeOptLayout layoutRunner(this);
Might be able to save a tiny bit of TP by not bothering to search when there are just one or two blocks, since there is nothing to reorder.
(likewise in the above if there is just a try with one or two blocks)
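i.e. something like (a sketch; RunThreeOpt is a stand-in for the entry point):

```cpp
// With only one or two blocks there is nothing for 3-opt to reorder,
// so skip the search entirely.
void RunThreeOpt(unsigned blockCount)
{
    if (blockCount < 3)
    {
        return;
    }
    // ... search for cut points ...
}
```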
TP diffs show about 0.02% improvement compared to the last run.
@AndyAyersMS thanks for the reviews! Are you ok with this revision?
Yes
/ba-g macOS pipelines are blocked by deprecated images
Issue tracking regressions from this PR (since GitHub doesn't seem to want to do autolinks at the moment): #109613
Part of #107749. After establishing a baseline layout, where hot blocks are in relative reverse post-order and cold blocks are out-of-line, run a 3-opt pass over each try region, as well as the main method body. Profitability of each partition is currently measured with edge weights only, without considering the type ([un]conditional, forward/backward, etc.) of the branch. To minimize the search space of partition points, we maintain a priority queue of flow edges between non-contiguous blocks, such that we can attempt to create fallthrough for the next most-profitable edge.
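As a rough illustration of the queue-driven search (toy types and names, not the actual implementation):

```cpp
#include <queue>
#include <vector>

// Toy model of the candidate queue: edges between non-contiguous blocks,
// ordered so the heaviest (most profitable) edge is considered first.
struct CandidateEdge
{
    unsigned srcBlock;
    unsigned dstBlock;
    double   likelyWeight;
};

struct ByWeight
{
    bool operator()(const CandidateEdge& a, const CandidateEdge& b) const
    {
        return a.likelyWeight < b.likelyWeight; // max-heap on weight
    }
};

using CandidateQueue = std::priority_queue<CandidateEdge, std::vector<CandidateEdge>, ByWeight>;

// Pop the next most-profitable edge; the pass then attempts a reordering
// that places dstBlock immediately after srcBlock, enqueueing any new
// non-contiguous edges the move creates.
CandidateEdge NextCandidate(CandidateQueue& cutPoints)
{
    CandidateEdge best = cutPoints.top();
    cutPoints.pop();
    return best;
}
```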
To avoid disturbing EH semantics too much, we currently only reorder blocks within the same EH region. This could be relaxed somewhat to at least move region entries up to their hottest predecessors, effectively moving the entire region once we re-establish contiguity -- I plan to try this in a follow-up PR. Note that we don't consider reordering handler regions beyond the initial RPO-based layout, under the assumption that the overhead of exception handling usually minimizes the potential gains from better layout.