Widespread perf regressions due to RPO layout #102763
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
@EgorBo Thanks for opening this. I haven't spent too much time looking at the regressions yet, but based on the larger ones I've looked at, I think we need a more general/powerful implementation of #102461. Take …
The interesting ones are …
In the new layout, …
I think modifying … Thoughts on this? cc @AndyAyersMS for visibility. Thanks!
I would suggest we list all the regressions in rank order and then investigate at least the top 20 or so. We should also see what the arm64 data looks like. Based on the above and the other fixup actions, it feels like we are getting to the point where we may actually want to create an explicit cost/benefit model to assess incremental improvements. We have a layout configuration A, and some proposed alternate B. Which is better? If you want to try to build this out, we should talk. The general model is something like the following:
Since there are just 3 cut points, if you have an initial cost model it's not too hard to compute the delta to the new cost model; you (mostly) just need to recost the segment-ending blocks (if the cost model is block-size sensitive then perhaps a bit more, but I don't think we have good size estimates yet?). For the time being the "identify" step can be finding a single misplaced block and a new proposed location, and then seeing if the cost improves.
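To make the model concrete, here is a minimal sketch under the assumption that a layout is scored by the total weight of flow edges that do not fall through; the types, names, and block-index representation are placeholders for illustration, not RyuJIT's data structures.

```cpp
// Sketch of a layout cost model: lower cost means more hot edges fall through.
#include <map>
#include <utility>
#include <vector>

// Edge weights keyed by (from, to) block indices; edges absent from the map have weight 0.
using EdgeWeights = std::map<std::pair<int, int>, double>;

static double LayoutCost(const EdgeWeights& edges, const std::vector<int>& order)
{
    // Start from the total weight of every edge, then credit back each edge that becomes
    // fallthrough under this block order.
    double cost = 0.0;
    for (const auto& e : edges)
    {
        cost += e.second;
    }
    for (size_t i = 0; i + 1 < order.size(); i++)
    {
        auto it = edges.find({order[i], order[i + 1]});
        if (it != edges.end())
        {
            cost -= it->second;
        }
    }
    return cost;
}
```

Because moving one block only changes adjacency at the three cut points, the delta between layouts A and B can be computed by re-costing just the segment-ending pairs rather than re-running something like this over the whole order.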
Sorry about the delay in getting back to this; the script we typically use to collate regression data doesn't seem to work since the perflab test report URLs changed, so I hacked up my own script. Here are the top 20 regressions on Windows x64:
And on Linux x64:
I'll start digging into the overlapping benchmarks (this explains …).
I'm seeing the same theme of interleaving hot and colder blocks throughout these regressions. The RPO-based layout on its own doesn't seem to leverage edge weights heavily enough, and while …
Notice how …
If we considered moving forward jumps, we'd have the following decision tree:
For a few other short examples, just to build my case, here's the new layout of …
And for …
And for …
I'd be interested in pursuing this, though based on the above, I think it might be worthwhile trying to implement a more general post-RPO pass that does the moves I described above, since it seems simple enough.
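For illustration, here is a rough sketch of the profitability check such a post-RPO pass might make before moving a forward jump's target; the names and the exact comparison are assumptions, not the decision tree described above.

```cpp
// Illustrative only: decide whether to move the target of a forward jump up next to the jump
// source, by comparing the fallthrough weight gained against the fallthrough weight broken.
struct FlowEdgeSketch
{
    double weight; // profile weight of the edge
};

static bool ShouldMoveForwardJumpTarget(const FlowEdgeSketch& jumpToTarget,   // source -> target (the forward jump)
                                        const FlowEdgeSketch& fallIntoNext,   // source -> block currently after the source
                                        const FlowEdgeSketch& fallIntoTarget) // target's current layout predecessor -> target
{
    const double gained = jumpToTarget.weight;
    const double lost   = fallIntoNext.weight + fallIntoTarget.weight;
    return gained > lost;
}
```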
I am surprised some of these are so large. Do we have any results for AMD HW? All the reports above are for Intel.
Randomly looking at improvements, it seems most are also Intel only. But not all. Here's …
@DrewScoggins is looking into whether there are unfiled AMD64 reports out there. From what I can tell the newer AMD64 hardware is mostly indifferent.
Here's a cross-arch regression: System.Linq.Tests.Perf_Enumerable.Contains_ElementNotFound(input: IEnumerable). @amanasifkhalid perhaps dig into this one, since it is less likely to be caused by some microarchitectural issue.
@AndyAyersMS Thanks for pointing out that benchmark; this one seems to suffer from the same interleaving of hot/cold paths, too. Here's the old layout for …
And the new layout:
Both seem to do a good job of keeping the loop in …
It is hard to see how the return block placement could have that much impact on perf -- we are looking at a pretty big regression here. Can you copy out the two inner loop disasms? Also puzzled why BB11 doesn't have IBC data -- these sorts of mixed IBC/noIBC cases have caused us trouble in the past.
Sure. Old layout:
New layout:
The new layout has one fewer branch on the loop path, so it seems better?
Seems like we ought to be aligning these loops... any idea why not? Also interesting that we don't have a peephole opt for this:
mov dword ptr [rsi+0x08], ecx
mov edx, dword ptr [rsi+0x08]
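As a hypothetical illustration of the missing peephole (not the emitter's actual API or data structures): when a load reads exactly the location the preceding store just wrote, the reload could become a register-to-register move.

```cpp
// Hypothetical sketch: the pattern above,
//   mov dword ptr [rsi+0x08], ecx
//   mov edx, dword ptr [rsi+0x08]
// could be rewritten as `mov edx, ecx` when the store and load provably access the same
// address and width, with nothing in between that could alias the location.
struct MemAccessSketch
{
    int baseReg;   // base register of the address
    int offset;    // displacement
    int sizeBytes; // access width
};

static bool CanForwardStoreToLoad(const MemAccessSketch& store, const MemAccessSketch& load)
{
    return (store.baseReg == load.baseReg) && (store.offset == load.offset) &&
           (store.sizeBytes == load.sizeBytes);
}
```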
Sure. With the old layout, here's the disasm up to the loop end:
And with the new layout:
We're hitting the JCC erratum more often than I previously thought, though only one instance is in the loop body. And we're failing to align the loop correctly in both cases. Looking at the dump, the JIT is complaining this loop needs too much padding:
Is the first bit of asm above actually the new code too? They look the same. I am confused by the padding output too -- looks like the loop tops are at 0x75 and 0x6D, so to reach the next 16-byte boundaries at 0x80/0x70 we would need only 11 bytes and 3 bytes of padding. So it seems like we should pad in the new case, and in neither case would we need 15 bytes.
You're right, sorry about that -- I updated it. Looking at the dump for the old layout, the JIT is overestimating the padding needed, and thus skipping alignment here, too:
I'm wondering if the block-level peephole optimization for skipping unconditional jumps, or the fact that conditional blocks may no longer fall into their false targets, is causing this overestimation? Though …
The older padding computation seems ok? Loop top is at 0x75 so we would need 0x0B == 11 bytes to reach 0x80, and we're not willing to pad by more than 8. So we bail. The new layout, though...? Do you see a perf difference locally when you run these? I am not sure why the new version ends up being so much slower.
So it turns out the final offset of the loop beginning in the new layout is 0x6D, but when deciding whether to align or not, the JIT uses the loop's estimated offset, which in this case is 0x71 -- hence the JIT thinks we need 15 bytes to align the loop. The difference in estimated vs actual offset stems from a jump preceding the loop being predicted as 6 bytes in length, and later turning out to be only 2 bytes.
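A small worked sketch of the arithmetic in this exchange; the 8-byte padding budget comes from the comment above, and this is not the JIT's actual alignment code.

```cpp
// Padding needed to bring an offset up to the next 16-byte boundary.
static unsigned PaddingToNextBoundary(unsigned offset, unsigned alignment = 16)
{
    return (alignment - (offset % alignment)) % alignment;
}

// Old layout: loop top at 0x75 -> 11 bytes to reach 0x80, over the 8-byte budget, so no padding.
// New layout: final loop top at 0x6D -> only 3 bytes needed, but the decision is made from the
// estimated offset of 0x71, which implies 15 bytes, so the JIT bails there as well.
```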
Old layout:
New layout:
I've collated the top 20 benchmark regressions across platforms, not double-counting any repeat offenders -- the duplicate names are from GitHub's markdown viewer not rendering templated types due to the …
I opened a PR (#103450) implementing a 3-opt pass post-RPO layout, but even with a few iterations, TP impact is up to 10%. I'm concerned that too few iterations leaves too much code quality on the table, but the TP cost is significant. This has pushed me to reconsider my narrower (and cheaper) approach in #102927. I'll update with the new layouts for the biggest regressions below, but I'd like to highlight a blocker I've noticed in …
Notice …
@AndyAyersMS I haven't investigated which phases are responsible for the profile data issues just yet, though perhaps we could consider running profile repair right before layout to expedite solving this? I can collect some metrics to get an idea of how common this issue of hot blocks' most likely successors being cold is. If we can get the profile data consistent, at least in terms of the …
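A rough sketch of one way such a metric could be collected; the types, names, and thresholds here are placeholders rather than JIT code.

```cpp
// Placeholder sketch: count hot blocks whose most likely successor looks cold.
#include <cstddef>
#include <vector>

struct BlockProfileSketch
{
    double weight;           // block's profile weight
    double likelySuccWeight; // weight of its most likely successor (0 if it has none)
};

static size_t CountHotBlocksWithColdLikelySucc(const std::vector<BlockProfileSketch>& blocks,
                                               double hotThreshold, double coldThreshold)
{
    size_t count = 0;
    for (const BlockProfileSketch& b : blocks)
    {
        if ((b.weight >= hotThreshold) && (b.likelySuccWeight <= coldThreshold))
        {
            count++;
        }
    }
    return count;
}
```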
I'm surprised the 3-opt is looking so costly. It should be fairly cheap. I suppose it makes sense to try profile repair. Last we knew 30%+ of methods were profile inconsistent after inlining, and it can only get worse from there. If you share out the jitdump for the above I can perhaps start working on fixing the maintenance issues.
It's possible there's some significant oversight in my implementation... I tried to keep it simple: For now, layout cost is modeled solely with edge weights, where edges with fallthrough behavior just have a cost of zero. With a sufficient number of iterations (usually no more than 5), I was getting good results for the benchmarks I looked at above in this comment. But it was costly...
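To show what that costing looks like in practice, here is a minimal sketch of evaluating one 3-opt move under that model (swap two contiguous segments and re-cost only the boundary edges); the representation is illustrative, not the PR's implementation.

```cpp
// Illustrative 3-opt move evaluation: with cost = total edge weight minus the weight of edges
// that fall through, reordering S1 | S2 | S3 | S4 into S1 | S3 | S2 | S4 only changes the cost
// at the segment boundaries.
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using EdgeWeights = std::map<std::pair<int, int>, double>; // (from, to) -> profile weight

static double EdgeWeightOrZero(const EdgeWeights& edges, int from, int to)
{
    auto it = edges.find({from, to});
    return (it != edges.end()) ? it->second : 0.0;
}

// S1 ends at index endS1, S2 at endS2, S3 at endS3; S4 (possibly empty) is the rest.
// Returns the change in cost from the swap; negative means the swap helps.
static double ThreeOptDelta(const EdgeWeights& edges, const std::vector<int>& order,
                            size_t endS1, size_t endS2, size_t endS3)
{
    double oldFallthrough = EdgeWeightOrZero(edges, order[endS1], order[endS1 + 1]) +
                            EdgeWeightOrZero(edges, order[endS2], order[endS2 + 1]);
    double newFallthrough = EdgeWeightOrZero(edges, order[endS1], order[endS2 + 1]) +
                            EdgeWeightOrZero(edges, order[endS3], order[endS1 + 1]);

    if (endS3 + 1 < order.size()) // there is an S4 after the reordered segments
    {
        oldFallthrough += EdgeWeightOrZero(edges, order[endS3], order[endS3 + 1]);
        newFallthrough += EdgeWeightOrZero(edges, order[endS2], order[endS3 + 1]);
    }

    return oldFallthrough - newFallthrough;
}
```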
Thank you! Here it is: …
It looks like we're doing a good job of prioritizing fallthrough on the hottest paths.
I think it makes sense for small-but-non-zero weights to be treated like cold -- perhaps some kind of an absolute threshold (say, normalized weight is > 0.01). Do you know why BB55 ends up where it does?
I'm going to go ahead and try this out, but it's worth noting that if we turn hot/cold splitting on, we'll probably want to update …
It would be nice if we moved …
Edit: I tweaked …
All the other regressed …
…oldBlocks (#103492)
Based on feedback in #102763 (comment), define "cold" blocks based on whether their weights are below a certain threshold, rather than only considering blocks marked with BBF_RUN_RARELY, in fgMoveColdBlocks. I added a BasicBlock method for doing this weight check rather than localizing it to fgMoveColdBlocks, as I plan to use it elsewhere in the layout phase.
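A hedged sketch of that check; the method name, the run-rarely fallback, and the normalization against a reference weight are assumptions, not the PR's exact code.

```cpp
// Illustrative stand-in for a BasicBlock weight check: treat a block as cold if it is marked
// run-rarely, or if its weight is tiny relative to some reference weight (e.g. the method entry).
struct BasicBlockSketch
{
    double bbWeight;
    bool   isRunRarely;

    bool isCold(double referenceWeight, double threshold = 0.01) const
    {
        if (isRunRarely)
        {
            return true;
        }
        return (referenceWeight > 0.0) && ((bbWeight / referenceWeight) < threshold);
    }
};
```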
And the final layout:
The decision to move …
Sorry for all the comment spam. I've looked at the remaining top regressions, and I haven't seen anything noteworthy -- in general, #102927 seems to do a good job of maximizing fallthrough on hot paths, so long as the profile data is sensible. As such, I'm planning on getting that merged for now, and coming back to 3-opt after Preview 6.
…102927)
After establishing an RPO-based layout, we currently try to move any backward jumps up to their successors, if it is profitable to do so in spite of any fallthrough behavior lost. In #102763, we see many instances where the RPO layout fails to create fallthrough for forward jumps on the hot path, such as in cases where a block is reachable from many predecessors. This work addresses the RPO's limitations by also considering moving the targets of forward jumps (conditional and unconditional) to maximize fallthrough.
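A rough sketch of the splice itself, on a layout kept as an ordered list of block indices; this is for illustration only and is not the PR's implementation.

```cpp
// Illustrative only: move the target of a forward jump so it ends up immediately after the
// jump source, assuming the target currently appears later in the order.
#include <cstddef>
#include <vector>

static void MoveForwardJumpTarget(std::vector<int>& order, size_t sourcePos, size_t targetPos)
{
    // Assumes sourcePos < targetPos, i.e. the jump is a forward jump in the current order.
    const int target = order[targetPos];
    order.erase(order.begin() + targetPos);              // remove the target from its old position
    order.insert(order.begin() + sourcePos + 1, target); // place it right after the jump source
}
```

Whether a given move pays off is the interesting part; per the description above, the pass weighs the fallthrough gained at the jump source against the fallthrough lost at the target's old position.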
I've merged #102927 -- fingers crossed we see an improvement in the next triage... |
I'm punting this to .NET 10, as we're planning to continue to iterate on the new layout. We'll re-evaluate any remaining regressions here with each change. |
Looks like #102343 has a lot more regressions than improvements. cc @amanasifkhalid
Regressions:
Improvements:
NOTE: use the "Test report" links; it looks like "all time history" and the images are a bit out of date.