-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: Enable RPO-based block layout by default #102343
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
Hmm, this makes me a bit leery. It is very common to lay out loops in this way to avoid this extra branch; Roslyn does that in IL for us, which the layout algorithm then effectively undoes. I think this affects all loops that we don't enter at the top. Here is a simple example: private static int Sum(int[] arr)
{
int i = 0;
int sum = 0;
while (i < arr.Length && arr[i] != 0)
{
sum += arr[i];
i++;
}
return sum;
} Base: G_M57365_IG02: ;; offset=0x0004
xor eax, eax
mov edx, dword ptr [rcx+0x08]
xor edx, edx
jmp SHORT G_M57365_IG04
;; size=9 bbWeight=1 PerfScore 4.50
G_M57365_IG03: ;; offset=0x000D
add eax, dword ptr [rcx+4*rdx+0x10]
inc edx
;; size=6 bbWeight=2 PerfScore 6.50
G_M57365_IG04: ;; offset=0x0013
cmp dword ptr [rcx+0x08], edx
jle SHORT G_M57365_IG06
;; size=5 bbWeight=8 PerfScore 32.00
G_M57365_IG05: ;; offset=0x0018
cmp dword ptr [rcx+4*rdx+0x10], 0
jne SHORT G_M57365_IG03
;; size=7 bbWeight=4 PerfScore 16.00 Diff: G_M57365_IG02: ;; offset=0x0004
xor eax, eax
mov edx, dword ptr [rcx+0x08]
xor edx, edx
;; size=7 bbWeight=1 PerfScore 2.50
G_M57365_IG03: ;; offset=0x000B
cmp dword ptr [rcx+0x08], edx
jle SHORT G_M57365_IG06
;; size=5 bbWeight=8 PerfScore 32.00
G_M57365_IG04: ;; offset=0x0010
cmp dword ptr [rcx+4*rdx+0x10], 0
je SHORT G_M57365_IG06
;; size=7 bbWeight=4 PerfScore 16.00
G_M57365_IG05: ;; offset=0x0017
add eax, dword ptr [rcx+4*rdx+0x10]
inc edx
jmp SHORT G_M57365_IG03
;; size=8 bbWeight=2 PerfScore 10.50 |
Yeah, I'm a bit concerned about this too, considering it affected the pretty idiomatic example you gave. In the @jakobbotsch are there other collections you'd like me to specifically look at? And if you think it's worth addressing this in the layout's implementation, what should our merge strategy look like? Would you want to run the experiment for a bit longer with the loop inversion fix before enabling to see what the improvements look like? |
I reran my analysis across all |
Now that I'm thinking about it, I think we ought to try something similar to what Phoenix does after creating the RPO-based layout where we move a block's hottest predecessor to just before it, so that loop heads don't end up at the end of loop bodies. Looking around various GitHub issues like #9304, I think this is some low-hanging fruit we can address up-front. |
@EgorBot --disasm using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkRunner.Run<MyBench>(args: args);
public class MyBench
{
private int _int = 0;
private long _long = 0;
private string _location, _newValue, _comparand;
[GlobalSetup(Target = nameof(CompareExchange_object_NoMatch))]
public void Setup_CompareExchange_object_NoMatch()
{
_location = "Hello";
_newValue = "World";
_comparand = "What?";
}
[Benchmark]
public string CompareExchange_object_NoMatch()
=> Interlocked.CompareExchange(ref _location, _newValue, _comparand);
} |
|
I was just testing my bot, although, @amanasifkhalid you mentioned that |
Although, by looking at the BDN_Artifacts it seems that there is ASM difference, namely: https://www.diffchecker.com/hu5vJF0i/ (no idea where the baseline and the PR in that diff) |
@EgorBo yes, that was on Windows x64. That number also came from me looking at min execution times, so it's possible the baseline had an unusually good run? Nice bot, by the way |
I've opened #102461 to address the loop inversion issue. If we decide we want that change, should we let the layout experiment run for another week or so? I'm fine with punting this change to Preview 6. @dotnet/jit-contrib |
…ayout (#102461) Part of #93020. In #102343, we noticed the RPO-based layout sometimes makes suboptimal decisions in terms of placing a block's hottest predecessor before it -- in particular, this affects loops that aren't entered at the top. To address this, after establishing a baseline RPO layout, fgMoveBackwardJumpsToSuccessors will try to move backward unconditional jumps to right behind their targets to create fallthrough, if the predecessor block is sufficiently hot.
Updated diffs. @EgorBo since we addressed the mis-rotated loop issue with #102461, are you ok with merging this as-is? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the diffs look very nice! Looking forward to dotnet/performance results 🙂
Nice to see this enabled! Thanks for digging into the original set of diffs and fixing problems. |
@AndyAyersMS @jakobbotsch @EgorBo thank you all for your help with getting this merged! |
…ayout (dotnet#102461) Part of dotnet#93020. In dotnet#102343, we noticed the RPO-based layout sometimes makes suboptimal decisions in terms of placing a block's hottest predecessor before it -- in particular, this affects loops that aren't entered at the top. To address this, after establishing a baseline RPO layout, fgMoveBackwardJumpsToSuccessors will try to move backward unconditional jumps to right behind their targets to create fallthrough, if the predecessor block is sufficiently hot.
Part of #93020. Enables the new greedy RPO-based block layout by default. By fully switching over to the new layout algorithm, we can get rid of a lot of code that probably isn't useful anymore -- aside from the old layout, we should consider removing code that prematurely tries to maintain a certain ordering, like
fgFindInsertPoint
. I'm not going to do any of this removal just yet, just in case we want to keep the old implementation around for now.We now have about a week of useful data from the
rpolayout
experiment in the perf lab. Here's a PDF/CDF of the minimum benchmark execution times from the last 5 days, on Windows x64:Many (most?) of those datapoints are within the realm of noise. Here's a brief breakdown of the nontrivial improvements/regressions on x64, using the min/median/max benchmark execution times from the last 5 days:
Windows x64, min execution time
22.85% improved by >=2%; 14.64% regressed
9.54% improved by >=5%; 6.85% regressed
2.97% improved by >=10%; 2.97% regressed
Ubuntu x64, min execution time
27.53% improved by >=2%; 12.91% regressed
11.11% improved by >=5%; 5.69% regressed
3.68% improved by >=10%; 2.19% regressed
Windows x64, median execution time
26.62% improved by >=2%; 13.73% regressed
12.23% improved by >=5%; 6.89% regressed
4.32% improved by >=10%; 2.83% regressed
Ubuntu x64, median execution time
26.20% improved by >=2%; 12.01% regressed
11.17% improved by >=5%; 5.92% regressed
3.96% improved by >=10%; 2.31% regressed
Windows x64, max execution time
30.45% improved by >=2%; 18.67% regressed
17.01% improved by >=5%; 10.51% regressed
7.06% improved by >=10%; 5.11% regressed
Ubuntu x64, max execution time
31.38% improved by >=2%; 22.89% regressed
16.02% improved by >=5%; 11.26% regressed
6.92% improved by >=10%; 5.60% regressed
As of writing, 145 of 4,879 benchmarks regressed by 10% or more on Windows x64, when looking at their minimum execution times from the last 5 days. Block layout churn can have far-reaching consequences, so narrowing down which methods to look at when triaging regressions can be tricky. I've highlighted a few regressed benchmarks below with simple enough call graphs that the offending method is obvious; I think these examples highlight a few expected trends from the new layout algorithm:
System.Numerics.Tests.Perf_Matrix4x4.IsIdentityBenchmark
Base layout:
Diff layout:
The new layout places
BB06
afterBB05
, breaking up the fallthrough fromBB04
toBB06
. This has the benefit of removing a backward jump fromBB05
toBB06
, though in the case of this benchmark, it looks like the return pathBB03->BB04->BB06
is taken, so we're penalized by the new jump overBB05
. This benchmark is quite small, so the impact of the jump is big, regressing it by about 27%.We could tweak the RPO-based layout by moving blocks up to just after their hottest predecessor to address this (@AndyAyersMS showed me something similar he did in Phoenix), though in this case, the block weights suggest
BB05
isBB06
's hottest predecessor, so I don't think there's anything worth changing here, in terms of the block layout algorithm itself.System.Numerics.Tests.Perf_Matrix3x2.InequalityOperatorBenchmark regressed by about 19% for similar reasons.
Base layout:
Diff layout:
Based on the PGO data available, the new layout seems to be making better decisions. We could iterate on this by synthesizing likelihoods and/or repairing the profile pre-layout, and by adding a post-RPO layout heuristic that moves blocks up to their hottest predecessor.
System.Threading.Tests.Perf_Interlocked.CompareExchange_object_Match regressed by over 40%.
Base layout:
Diff layout:
This is an interesting case, where the exceptional path
BB01->BB03
is actually the more likely path, if the edge likelihoods are to be trusted. However, the JIT tends to assume throw blocks are always cold, henceBB03
's block weight of 0. The new layout does not make any such assumption about throw blocks: After generating a greedy RPO-based layout ofBB01->BB02->BB03
, the new layout moves all rarely-run blocks (i.e. anything with a weight of 0) to the end of the method, hence the finalBB01->BB03->BB02
layout. This case would be fixed by propagating weight toBB02
fromBB01
that is proportional to their edge's likelihood, such thatBB02
would no longer be considered rarely-run; thus, running profile repair before block layout would probably fix this. Though considering the perf cost of exception handling, perhaps we don't have much to gain from removing this expectation that throw blocks are cold.I should note that the old layout didn't do anything clever on purpose here. It left
BB02
afterBB01
because it still expects the false target of a conditional block to be its next block, and not because it is the more likely successor. This invariant has been removed elsewhere in the JIT, so switching over to the new layout for good would allows us to remove the last bits of cruft around this implicit fallthrough requirement.System.Threading.Tests.Perf_Interlocked.CompareExchange_object_NoMatch also regressed by over 40% for the same reason. There seem to be a few of these benchmark pairs inflating the improvement/regression counts.
System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 512) regressed by about 14%, due to layout differences in
System.Collections.BitArray:Not
:Base layout:
Diff layout:
In terms of edge likelihoods, the new layout seems to get the critical paths right, though note that the "greedy" part of the RPO only applies to conditional blocks when deciding which successor to place next; other multi-successor block kinds, like switch blocks, don't seem to be common enough to be worth extending the layout's greediness to, though this could be done as a follow-up quite easily (see #101935). I believe the hot loop
BB11<->BB12
is to blame for the regression:BB12
is reachable fromBB11
andBB10
, andBB11
is reachable only fromBB12
. When we start the RPO fromBB01
, we end up visitingBB10
, thenBB12
, and thenBB11
, hence why the new layout placesBB12
beforeBB11
. This introduces more branches: We need a backward jump fromBB11
toBB12
within the loop, and onceBB12
's condition is false, we need to jump overBB11
to get to the former's false target. If we placeBB11
beforeBB12
, thenBB11
can fall intoBB12
, andBB12
can eventually fall into its false target after the loop; we only need the single backward jump fromBB12
toBB11
.Perhaps we could re-canonicalize loops post-layout to fix these cases, though I hesitate to purposefully break the RPO. Cases like this one could be tackled by a heuristic that optimizes for some optimal layout score, as described in #93020.
System.Collections.Tests.Perf_BitArray.BitArrayCopyToByteArray(Size: 512) regressed by over 40% for seemingly the same reason, though the problematic loop shapes are all in the cold section. Take a look at
BB51<->BB52
,BB56<->BB57
, etc.Base layout:
Diff layout:
There are lots of improvements that I'm not extending the same analysis to, and I don't mean for my tone to be pessimistic. My key takeaway from this is much of the behavior we don't necessarily want in the new layout algorithm can be addressed by leveraging block weights to selectively repair various shapes -- we have Phoenix as a guide for a lot of this work. But in its current form, the new layout algorithm is certainly easier to understand, and quite a bit faster; I'm expecting TP improvements over 1% for this PR, and follow-up work (should we decide to remove the old layout entirely) will only improve this further.
cc @dotnet/jit-contrib