JIT: Flowgraph Modernization and Improved Block Layout in .NET 10 #107749

amanasifkhalid · 2024-09-12T18:01:10Z

Continuation of #93020. During the .NET 9 development cycle, we removed many of the JIT flowgraph implementation's implicit fall-through invariants, and introduced a new block layout strategy based on a reverse post-order traversal of the graph. For .NET 10, we'd like to push this work further in both directions, with the ultimate goals of zero dependence on lexical block ordering in the JIT's frontend, and a global cost-optimizing layout algorithm in the JIT's backend. Below is an early estimate of what each item entails:

Flowgraph Modernization

Block layout
Ideally, the items below get us to a state where block layout produces the "best" ordering it can, given the profile data it has on hand. If the layout is subpar due to missing/inconsistent profile data, we can at least eliminate the layout strategy as the culprit.

Implement 3-opt pass on top of the RPO-based layout, modeling layout cost with edge weights
- JIT: Add 3-opt implementation for improving upon RPO-based layout #103450
- JIT: Do greedy 4-opt for backward jumps in 3-opt layout #110277
Consider modeling cost of (un)conditional and forward/backward branches in layout cost for 3-opt
Consider how 3-opt's layout decisions may affect hot/cold splitting
Consider how we can achieve acceptable throughput, while running for enough iterations to achieve near-optimal layout
Continued deferred .NET 9 items

Profile Maintenance

Continue expanding profile consistency checks through the JIT's frontend. Currently, we bail after inlining.
Consider replacing optSetBlockWeights with the new profile synthesis implementation. The former frequently produces nonsensical weights for loops, as it relies on a lexical traversal of the block list to identify loops. Fixing this may improve JitOptRepeat performance.
Consider running profile synthesis right before layout.
Allow profile data to override the JIT's heuristics more explicitly. For example, if profile data suggests a BBJ_THROW block is hot, then order it as such (this particular example is not as perf-sensitive, though).
- Enforcing profile consistency checks seems to have fixed this. The BBJ_THROW example in particular was largely handled by JIT: Move profile consistency checks to after morph #111253.
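To make the profile consistency items above more concrete, here is a minimal sketch of one kind of check. It assumes that "consistent" roughly means each block's weight equals the sum of its incoming edge weights, which may not match the JIT's actual definition. The `Edge` struct, `IsProfileConsistent`, and the tolerance parameter are hypothetical names for illustration, not JIT APIs.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical, simplified flow edge; not the JIT's FlowEdge type.
struct Edge
{
    int    from;
    int    to;
    double weight; // profile-derived count for this edge
};

// Returns true if every non-entry block's weight matches the sum of its
// incoming edge weights, within a relative tolerance (an assumed criterion).
bool IsProfileConsistent(const std::vector<double>& blockWeights,
                         const std::vector<Edge>&   edges,
                         int                        entryBlock,
                         double                     tolerance = 0.01)
{
    std::vector<double> incoming(blockWeights.size(), 0.0);
    for (const Edge& e : edges)
    {
        assert(e.to >= 0 && static_cast<std::size_t>(e.to) < blockWeights.size());
        incoming[e.to] += e.weight;
    }

    for (std::size_t i = 0; i < blockWeights.size(); i++)
    {
        if (static_cast<int>(i) == entryBlock)
        {
            continue; // the entry block's weight comes from outside the method
        }

        const double diff = std::fabs(incoming[i] - blockWeights[i]);
        if (diff > tolerance * std::fmax(1.0, blockWeights[i]))
        {
            return false;
        }
    }

    return true;
}
```

A phase that rewires flow without updating weights would make a check like this fail, which is why extending the checks further into the frontend surfaces the phases that need profile repair.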

cc @dotnet/jit-contrib, @AndyAyersMS

dotnet-policy-service · 2024-09-12T18:01:36Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Part of #107749, and follow-up to #107927. When computing an RPO of the flow graph, ensuring that the entirety of a loop body is visited before any of the loop's successors has the benefit of keeping the loop body compact in the traversal. This is certainly ideal when computing an initial block layout, and may be preferable for register allocation, too. Thus, this change formalizes loop-aware RPO creation as part of the flowgraph API surface, and uses it for LSRA's block sequence.
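The loop-aware traversal described above can be sketched as a post-processing step over an ordinary RPO. This is an illustration under stated assumptions, not the JIT's implementation: block IDs are assumed to be `0..n-1`, and `Loop` and `LoopAwareRpo` are hypothetical names. When a loop header is reached, the rest of that loop's blocks are emitted (in RPO order) before anything outside the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical loop descriptor; not the JIT's loop representation.
struct Loop
{
    int              header;
    std::vector<int> blocks; // every block in the loop, including the header
};

// Reorders a plain RPO so that each loop body is contiguous.
// Assumes `rpo` is a permutation of the block IDs 0..n-1.
std::vector<int> LoopAwareRpo(const std::vector<int>& rpo, const std::vector<Loop>& loops)
{
    const std::size_t numBlocks = rpo.size();
    std::vector<bool> visited(numBlocks, false);
    std::vector<int>  result;

    std::function<void(int)> visit = [&](int block) {
        assert(block >= 0 && static_cast<std::size_t>(block) < numBlocks);
        if (visited[block])
        {
            return;
        }

        visited[block] = true;
        result.push_back(block);

        // If this block heads a loop, visit the rest of the loop body now,
        // in RPO order, before any block outside the loop.
        for (const Loop& loop : loops)
        {
            if (loop.header != block)
            {
                continue;
            }

            std::vector<bool> inLoop(numBlocks, false);
            for (int member : loop.blocks)
            {
                inLoop[member] = true;
            }

            for (int candidate : rpo)
            {
                if (inLoop[candidate])
                {
                    visit(candidate); // recursion keeps nested loops compact too
                }
            }
        }
    };

    for (int block : rpo)
    {
        visit(block);
    }

    return result;
}
```

For a graph where block 1 heads a loop containing block 2 and exits to block 3, a plain RPO of `0, 1, 3, 2` places the exit between the header and the body; the sketch reorders it to `0, 1, 2, 3`.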

…ntiguity later (#108914) Part of #107749. `Compiler::fgMoveColdBlocks` currently moves cold try blocks to the end of their innermost regions. This is problematic for our 3-opt layout plans: When identifying a candidate span of blocks to reorder, assuming that all cold blocks are at the end of the method vastly simplifies our implementation. However, if we have EH regions with their own cold sections, `fgMoveColdBlocks` will interleave hot and cold blocks. To facilitate later layout passes, we can simplify `fgMoveColdBlocks` to naively move all cold blocks to the end of the method, regardless of EH region, and rely on a "fixup" pass for making EH regions contiguous again. To start, I've tweaked `fgMoveColdBlocks` to break up try regions only. When handlers are placed in the funclet section, we don't need to do anything extra to get cold EH blocks out of the main method body's hot section. However, for jitted x86 code, we don't use the funclet model (yet), so cold handler blocks can still litter the main method body, hindering 3-opt's candidate space. I'd rather not expand on this PR's logic to rebuild handler regions if we can do something simpler, such as getting #101613 merged in, and using `fgRelocateEHRegions` to move all handlers to the end of the method under the assumption that they're cold (i.e. a pseudo-funclet region). Moving try entry blocks in `fgMoveColdBlocks` proved painful enough that I think we're better off leaving them as-is. Leaving each try region's entry in-place gives us a nice breadcrumb for reinserting the remaining blocks, and it might be beneficial to leave these entries in the candidate span of blocks for 3-opt, so we can effectively move entire try regions just by moving the entry. For try regions that are entirely cold, I can look into calling `fgRelocateEHRegions` before `fgMoveColdBlocks` on all platforms to quickly get these out of the way. 
All of this would be unnecessary if we could remove the VM's requirement of contiguous EH regions, and the codegen improvements would likely outweigh the additional VM complexity, though that's a conversation for another day.
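As a rough illustration of the "move all cold blocks to the end, but leave try entries in place" idea, here is a minimal sketch over a flat vector of blocks. The `Block` struct and `MoveColdBlocksToEnd` are hypothetical simplifications; the real `fgMoveColdBlocks` works on the JIT's linked block list, and the later pass that restores EH region contiguity is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified block; not the JIT's BasicBlock.
struct Block
{
    int  id;
    bool isCold;
    bool isTryEntry; // try region entries are left where they are
};

// Moves cold blocks to the end of the layout, regardless of EH region,
// preserving relative order. Returns the index of the first moved block.
std::size_t MoveColdBlocksToEnd(std::vector<Block>& layout)
{
    auto firstMoved = std::stable_partition(layout.begin(), layout.end(), [](const Block& b) {
        return !b.isCold || b.isTryEntry;
    });

    // Everything after the partition point should be cold.
    assert(std::all_of(firstMoved, layout.end(), [](const Block& b) { return b.isCold; }));

    return static_cast<std::size_t>(firstMoved - layout.begin());
}
```

After this step the blocks before the returned index form the candidate span for reordering, and each try entry left behind marks where its region's remaining blocks would need to be reinserted.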

Part of #107749. Follow-up to #103450. This refactors 3-opt to minimize layout cost instead of maximizing layout score. This is arguably more intuitive, and it should facilitate implementing a more sophisticated cost model. This PR also adds a mechanism for evaluating the total cost of a given layout, which means we can assert at each move that we actually improved the global layout cost.
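A minimal sketch of what "minimize layout cost and assert each move improves it" could look like. It assumes the simplest cost model, where an edge costs its weight unless its target immediately follows its source. The JIT's cost model and its incremental cost updates are likely more involved. `Edge`, `LayoutCost`, and `TrySwapSegments` are hypothetical names, and block IDs are assumed to be `0..n-1`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified flow edge; not the JIT's FlowEdge type.
struct Edge
{
    int    from;
    int    to;
    double weight;
};

// Total cost of a layout: the summed weight of edges that do not fall through.
double LayoutCost(const std::vector<int>& layout, const std::vector<Edge>& edges)
{
    std::vector<std::size_t> pos(layout.size());
    for (std::size_t i = 0; i < layout.size(); i++)
    {
        pos[layout[i]] = i;
    }

    double cost = 0.0;
    for (const Edge& e : edges)
    {
        if (pos[e.to] != pos[e.from] + 1)
        {
            cost += e.weight;
        }
    }

    return cost;
}

// One 3-opt move: cut the layout into S1=[0,i), S2=[i,j), S3=[j,k), S4=[k,n)
// and swap S2 with S3. The move is kept only if it lowers the total cost.
bool TrySwapSegments(std::vector<int>& layout, const std::vector<Edge>& edges,
                     std::size_t i, std::size_t j, std::size_t k)
{
    assert(i < j && j < k && k <= layout.size());

    const double before = LayoutCost(layout, edges);
    std::rotate(layout.begin() + i, layout.begin() + j, layout.begin() + k);
    const double after = LayoutCost(layout, edges);

    if (after < before)
    {
        return true;
    }

    // Not profitable: rotate back so S2 precedes S3 again.
    std::rotate(layout.begin() + i, layout.begin() + i + (k - j), layout.begin() + k);
    return false;
}
```

Recomputing the whole cost on every move is too slow for production use; it is shown here only because a whole-layout evaluator makes the "did this move help?" assertion easy to state.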

…lly (#109788) Part of #107749. I noticed while working on profile consistency that we make a nontrivial effort to place cloned finally regions right after their corresponding try regions to prematurely create fallthrough. Removing this had small diffs locally.

Fixes #107076. Part of #107749. Instead of relying on the "not equal" target of each test block being the next block, explicitly follow the chain of "not equal" targets.

Part of #107749. Follow-up to #103450. Greedy 3-opt (i.e. an implementation that requires each move to be profitable on its own) is not well-suited for discovering profitable moves for backward jumps, as such movement requires an unrelated move to first place the source block lexically behind the destination block. Thus, the 3-opt implementation added in #103450 incorporates a 4-opt move for backward jumps, where we partition 1) before the destination block, 2) before the source block, and 3) directly after the source block. This 4-opt implementation can be expanded to search for the best cut point between the destination and source blocks to maximize its efficacy.
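The three cut points described above can be illustrated with a small sketch. The resulting order is an assumption on my part: the description names the partition points but not the final arrangement, so this shows one plausible reading in which the source block is moved directly in front of the destination so the backward jump becomes fallthrough. `MoveSourceBeforeDestination` is a hypothetical helper operating on layout positions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Partitions the layout at three points: before the destination (dstPos),
// before the source (srcPos), and directly after the source (srcPos + 1).
// With S2 = [dstPos, srcPos) and S3 = [srcPos, srcPos + 1), this swaps S2 and
// S3 so the source block lands immediately before the destination block.
void MoveSourceBeforeDestination(std::vector<int>& layout, std::size_t dstPos, std::size_t srcPos)
{
    // A backward jump means the destination currently precedes the source.
    assert(dstPos < srcPos && srcPos < layout.size());

    std::rotate(layout.begin() + dstPos, layout.begin() + srcPos, layout.begin() + srcPos + 1);
}
```

For a layout `0, 1, 2, 3, 4` with a backward jump from block 3 to block 1, the result is `0, 3, 1, 2, 4`. A real implementation would presumably keep the move only when it reduces layout cost.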

Part of #107749. Follow-up to #103450. If 3-opt fails to create fallthrough on an edge because it isn't initially profitable, allow the edge to be considered again, in case future moves make it profitable.

Part of #107749. Now that hot/cold splitting runs after layout in the backend, where the flowgraph is expected to never change, we shouldn't need to check for the presence of a cold code section in the frontend.

Part of dotnet#107749. Removes the last "TODO-NoFallThrough" in the JIT source. During switch recognition, we should only need to modify flow edges -- changing the lexical order of blocks to create fallthrough into one of the switch's successors is unnecessary, now that we're running block layout in the backend.

…otnet#109792) Part of dotnet#107749. The next few opt phases alter flow substantially, such that we need to propagate new weight throughout the flowgraph. That will probably justify running profile synthesis after, in a later PR.

Part of dotnet#107749. Follow-up to dotnet#103450. To facilitate implementing a global variant of 3-opt alongside the greedy variant, this moves some shared components to helper methods. I want to do this as a separate PR to ensure this change is truly no-diff, and has minimal (if any) TP impact.

…0034) Part of dotnet#107749. Prerequisite for dotnet#110026. Use postorder number-based BitVecs in RBO and block layout. Use bbID-based BitVecs in fgIncorporateProfileData. This runs early enough in the JIT frontend such that I would expect bbIDs and bbNums to be 1:1, so I don't expect any TP impact from this change. Switch descriptor creation still uses bbNums as a key into a BitVec as a workaround for BB epoch invariants -- I'll try switching this over to bbID in a follow-up to evaluate the TP cost of a sparser bitset.

Part of #107749.

Part of #107749. Enables profile checks for morph and post-morph phases. For benchmarks.run_pgo, 45383 methods are consistent before inlining; after, we're down to 37215, or 82%. By the time we make it to morph, 33461 methods (~74% of the original) are consistent; after morph, we're down to 29402 (~65%). The decline isn't too dramatic for this collection, though I imagine we fare worse elsewhere.

Co-authored-by: Andy Ayers

amanasifkhalid added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 12, 2024

amanasifkhalid added this to the 10.0.0 milestone Sep 12, 2024

amanasifkhalid self-assigned this Sep 12, 2024

JulieLeeMSFT added the User Story A single user-facing feature. Can be grouped under an epic. label Sep 12, 2024

This was referenced Sep 17, 2024

JIT: Visit blocks in RPO during LSRA #107927

Merged

JIT: Add loop-aware RPO, and use as LSRA's block sequence #108086

Merged

JIT: Remove fallthrough checks in Compiler::TryLowerSwitchToBitTest #108106

Merged

amanasifkhalid mentioned this issue Sep 30, 2024

JIT: Don't run fgRenumberBlocks after switch recognition #108402

Merged

amanasifkhalid mentioned this issue Oct 11, 2024

Block layout of partition loop not ideal #108794

Open

BruceForstall mentioned this issue Oct 15, 2024

Improve JIT loop optimizations (.NET 10) #108901

Open

7 tasks

This was referenced Oct 15, 2024

JIT: Use loop-aware RPO for initial block layout #108903

Merged

JIT: Break up try regions in Compiler::fgMoveColdBlocks, and fix contiguity later #108914

Merged

JulieLeeMSFT mentioned this issue Oct 17, 2024

JIT Focus Area for .NET 10 #108988

Open

13 tasks

JulieLeeMSFT moved this to Team User Stories in .NET Core CodeGen Oct 17, 2024

JulieLeeMSFT added this to .NET Core CodeGen Oct 17, 2024

AndyAyersMS mentioned this issue Oct 25, 2024

JIT: empty array enumerator opt #109237

Merged

jakobbotsch mentioned this issue Oct 29, 2024

JIT: Make loop inversion graph based #109346

Draft

This was referenced Oct 30, 2024

JIT: Add 3-opt implementation for improving upon RPO-based layout #103450

Merged

JIT: Always aggressively compact blocks #109521

Merged

amanasifkhalid mentioned this issue Nov 12, 2024

JIT: Optimize for cost instead of score in 3-opt layout #109741

Merged

This was referenced Nov 13, 2024

JIT: Don't try to create fallthrough from try region into cloned finally #109788

Merged

JIT: Continue profile consistency checks until after finally cloning #109792

Merged

JIT: Remove switch recognition fallthrough quirk #109796

Merged

amanasifkhalid mentioned this issue Nov 19, 2024

JIT: Refactor 3-opt utilities to facilitate expansion #109982

Merged

amanasifkhalid mentioned this issue Nov 28, 2024

JIT: Remove lexical dependencies in switch recognition #110253

Merged

amanasifkhalid mentioned this issue Dec 2, 2024

JIT: Do greedy 4-opt for backward jumps in 3-opt layout #110277

Merged

amanasifkhalid mentioned this issue Dec 4, 2024

JIT: Allow flow edges to be considered more than once by 3-opt #109534

Merged

amanasifkhalid mentioned this issue Dec 5, 2024

JIT: Remove fgFirstColdBlock checks in frontend phases #110452

Merged

This was referenced Jan 3, 2025

JIT: Enable profile consistency checking up to morph #111047

Merged

JIT: Add helpers for increasing/decreasing block weights #111135

Merged

amanasifkhalid added a commit that referenced this issue Jan 7, 2025

JIT: Enable profile consistency checking up to morph (#111047)

aecae2c

Part of #107749.

This was referenced Jan 9, 2025

JIT: Move profile consistency checks to after morph #111253

Merged

JIT: Move profile consistency checks to after loop opts #111285

Merged

amanasifkhalid mentioned this issue Jan 16, 2025

JIT: Enable profile consistency checks throughout JIT frontend #111498

Open

amanasifkhalid added a commit that referenced this issue Jan 21, 2025

JIT: Move profile consistency checks to after loop opts (#111285)

ccc9c52

Part of #107749.
