Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Widespread perf regressions due to RPO layout #102763

Open
EgorBo opened this issue May 28, 2024 · 31 comments
Open

Widespread perf regressions due to RPO layout #102763

EgorBo opened this issue May 28, 2024 · 31 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@EgorBo
Copy link
Member

EgorBo commented May 28, 2024

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label May 28, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label May 28, 2024
@EgorBo EgorBo added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels May 28, 2024
@EgorBo EgorBo added this to the 9.0.0 milestone May 28, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@dotnet-policy-service dotnet-policy-service bot removed the untriaged New issue has not been triaged by the area owner label May 28, 2024
@amanasifkhalid
Copy link
Member

@EgorBo Thanks for opening this. I haven't spent too much time looking at the regressions yet, but based on the larger ones I've looked at, I think we need a more general/powerful implementation of #102461. Take GuardedDevirtualization.TwoClassVirtual.Call for instance, which regressed by over 60% on Windows x64 (GuardedDevirtualization.ThreeClassVirtual.Call regressed for similar reasons). Before reordering, the blocks look like this:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1      18 [000..00E)-> BB07(0.00595),BB02(0.994)   ( cond )                     i IBC idxlen
BB02 [0009]  1       BB01                  0.99   18 [00E..???)-> BB03(1)                 (always)                     IBC internal idxlen
BB03 [0001]  2       BB02,BB06           167.07 3007 [00E..00E)-> BB05(0.07),BB04(0.93)   ( cond )                     i IBC loophead idxlen bwd bwd-target
BB04 [0005]  1       BB03                155.37 2797 [???..???)-> BB06(1)                 (always)                     i IBC internal idxlen nullcheck bwd
BB05 [0006]  1       BB03                 11.69  211 [???..???)-> BB06(1)                 (always)                     i IBC internal hascall gcsafe idxlen bwd
BB06 [0004]  2       BB04,BB05           167.07 3007 [00E..024)-> BB03(0.994),BB07(0.00595)   ( cond )                     i IBC idxlen bwd
BB07 [0010]  2       BB01,BB06             1.00   18 [024..026)                           (return)                     i IBC
BB09 [0011]  0                             0         [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

The interesting ones are BB03-BB06, which comprise a loop. The hot path is BB03->BB04->BB06->BB03; moving BB05 out-of-line to make the hot path branchless (except for the backward jump) would be ideal. The old layout does this:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1      18 [000..00E)-> BB07(0.00595),BB02(0.994)   ( cond )                     i IBC idxlen
BB02 [0009]  1       BB01                  0.99   18 [00E..???)-> BB03(1)                 (always)                     IBC internal idxlen
BB03 [0001]  2       BB02,BB06           167.07 3007 [00E..00E)-> BB05(0.07),BB04(0.93)   ( cond )                     i IBC loophead idxlen bwd bwd-target
BB04 [0005]  1       BB03                155.37 2797 [???..???)-> BB06(1)                 (always)                     i IBC internal idxlen nullcheck bwd
BB06 [0004]  2       BB04,BB05           167.07 3007 [00E..024)-> BB03(0.994),BB07(0.00595)   ( cond )                     i IBC idxlen bwd
BB07 [0010]  2       BB01,BB06             1.00   18 [024..026)                           (return)                     i IBC
BB05 [0006]  1       BB03                 11.69  211 [???..???)-> BB06(1)                 (always)                     i IBC internal hascall gcsafe idxlen bwd
BB09 [0011]  0                             0         [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

In the new layout, fgMoveBackwardJumpsToSuccessors only considers moving backward unconditional jumps, so BB05 remains in the way of the hot path:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1      18 [000..00E)-> BB07(0.00595),BB02(0.994)   ( cond )                     i IBC idxlen
BB02 [0009]  1       BB01                  0.99   18 [00E..???)-> BB03(1)                 (always)                     IBC internal idxlen
BB03 [0001]  2       BB02,BB06           167.07 3007 [00E..00E)-> BB05(0.07),BB04(0.93)   ( cond )                     i IBC loophead idxlen bwd bwd-target
BB04 [0005]  1       BB03                155.37 2797 [???..???)-> BB06(1)                 (always)                     i IBC internal idxlen nullcheck bwd
BB05 [0006]  1       BB03                 11.69  211 [???..???)-> BB06(1)                 (always)                     i IBC internal hascall gcsafe idxlen bwd
BB06 [0004]  2       BB04,BB05           167.07 3007 [00E..024)-> BB03(0.994),BB07(0.00595)   ( cond )                     i IBC idxlen bwd
BB07 [0010]  2       BB01,BB06             1.00   18 [024..026)                           (return)                     i IBC
BB09 [0011]  0                             0         [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

I think modifying fgMoveBackwardJumpsToSuccessors to consider forward jumps would be a good start to chipping away at these regressions. However, moving BB06 up to BB04 would break up the former's fallthrough into BB07. That's ok, since that's part of the slow path, but we probably want to update the cost check in fgMoveBackwardJumpsToSuccessors to also check if moving the target block will break up fallthrough into its next block, and whether the edge weights deem it's worth it.

Thoughts on this? cc @AndyAyersMS for visibility. Thanks!

@AndyAyersMS
Copy link
Member

I would suggest we list all the regressions in rank order and then investigate at least the top 20 or so. We also should see what the arm64 data looks like.

Based on the above and the the other fixup actions, it feels like we are getting to the point where we may actually want to create the explicit cost/benefit model to assess incremental improvements. We have a layout configuration A, and some proposed alternate B. Which is better?

If you want to try to build this out, we should talk. The general model is something like the following:

  • identify 3 cut points in the linear chain of blocks (creating 4 segments overall). First segment must contain at least the entry block; last one can be empty.
  • swap the middle two segments; if that is lower cost, make that the new configuration
  • repeat until you can't find any more improvements or give up (drive to local minima)
  • (or just limit yourself to one pass or something to keep the TP under control)
  • (the someday refinement: occasionally allow reorderings that drive the cost up, to try and find a global minimum)

Since there are just 3 cut points, if you have an initial cost model it's not too hard to compute the delta to the new cost model, you (mostly) just need to recost the segment-ending blocks (if the cost model is block-size sensitive then perhaps a bit more, but I don't think we have good size estimates yet?)

For the time being the "identify" step can be finding a single misplaced block and a new proposed location, and then seeing if the cost improves.

@amanasifkhalid
Copy link
Member

I would suggest we list all the regressions in rank order and then investigate at least the top 20 or so.

Sorry about the delay in getting back to this; the script we typically use to collate regression data doesn't seem to work since the perflab test report URLs changed, so I hacked up my own script. Here are the top 20 regressions on Windows x64:

Name Base Test Ratio
System.Collections.Sort.Array_Comparison 9.19 μs 17.66 μs 1.92
System.Collections.Sort.Array_ComparerClass 9.85 μs 17.36 μs 1.76
GuardedDevirtualization.ThreeClassVirtual.Call 3.82 ns 6.56 ns 1.72
System.Linq.Tests.Perf_Enumerable.SelectToList 125.92 ns 209.71 ns 1.67
System.Collections.Sort.LinqQuery 17.06 μs 28.35 μs 1.66
GuardedDevirtualization.TwoClassVirtual.Call 1.75 ns 2.86 ns 1.63
LinqBenchmarks.Count00ForX 91.54 ms 147.83 ms 1.61
System.Collections.Sort.LinqOrderByExtension 17.45 μs 28.14 μs 1.61
System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8 53.46 ns 85.59 ns 1.6
System.Collections.Sort.LinqOrderByExtension 20.34 μs 32.52 μs 1.6
System.Memory.Span.Reverse 32.25 ns 50.97 ns 1.58
System.Tests.Perf_Random.NextBytes_unseeded 170.90 ns 244.84 ns 1.43
System.Tests.Perf_Random.NextBytes_span_unseeded 171.98 ns 244.27 ns 1.42
System.Memory.Span.Reverse 3.21 ns 4.54 ns 1.41
System.Collections.Sort.LinqOrderByExtension 16.82 μs 23.40 μs 1.39
System.Text.Perf_Ascii.ToLower_Chars 27.93 ns 38.72 ns 1.39
System.Collections.Perf_Frozen.ToFrozenDictionary 32.28 μs 44.49 μs 1.38
Benchstone.BenchI.CSieve.Test 4.25 ms 5.87 ms 1.38
System.Text.Perf_Ascii.Equals_Bytes 4.55 ns 6.23 ns 1.37
System.Text.Perf_Ascii.ToUpper_Chars 27.64 ns 37.66 ns 1.36

And on Linux x64:

Name Base Test Ratio
System.IO.Hashing.Tests.Crc32_AppendPerf.Append 335.26 ns 1.07 μs 3.2
GuardedDevirtualization.ThreeClassVirtual.Call 1.07 ns 2.90 ns 2.72
GuardedDevirtualization.ThreeClassVirtual.Call 1.01 ns 2.73 ns 2.71
GuardedDevirtualization.ThreeClassVirtual.Call 1.07 ns 2.91 ns 2.71
GuardedDevirtualization.ThreeClassVirtual.Call 1.01 ns 2.71 ns 2.69
GuardedDevirtualization.ThreeClassVirtual.Call 1.01 ns 2.71 ns 2.69
GuardedDevirtualization.ThreeClassVirtual.Call 1.07 ns 2.87 ns 2.68
System.IO.Hashing.Tests.Crc32_AppendPerf.Append 8.21 ns 21.65 ns 2.64
System.IO.Hashing.Tests.Crc32_AppendPerf.Append 15.58 ns 38.16 ns 2.45
GuardedDevirtualization.TwoClassVirtual.Call 1.46 ns 3.22 ns 2.21
System.Tests.Perf_String.TrimEnd 1.69 ns 3.18 ns 1.88
GuardedDevirtualization.TwoClassVirtual.Call 1.23 ns 2.30 ns 1.86
Benchstone.BenchF.NewtR.Test 55.69 ms 102.90 ms 1.85
LinqBenchmarks.Count00ForX 58.01 ms 103.57 ms 1.79
System.Collections.Sort.Array_ComparerClass 8.31 μs 14.71 μs 1.77
System.Collections.Sort.Array_Comparison 7.32 μs 12.29 μs 1.68
System.Tests.Perf_String.TrimStart_CharArr 2.52 ns 4.12 ns 1.64
System.Collections.Sort.Array_ComparerClass 7.15 μs 11.61 μs 1.62
System.Collections.IterateFor.IList 287.08 ns 454.48 ns 1.58
System.Text.Perf_Ascii.ToUpper_Bytes 11.49 ns 17.77 ns 1.55

I'll start digging into the overlapping benchmarks (this explains GuardedDevirtualization.TwoClassVirtual.Call and GuardedDevirtualization.ThreeClassVirtual.Call already, and I suspect the theme of hot paths being polluted by a relatively cold (but not "cold" cold) jump will continue to come up).

@amanasifkhalid
Copy link
Member

I'm seeing the same theme of interleaving hot and colder blocks throughout these regressions. The RPO-based layout on its own doesn't seem to leverage edge weights heavily enough, and while fgMoveBackwardJumpsToSuccessors does consider edge weights, its backward jump limitation is blocking lots of potentially profitable moves. System.Memory.Span<Int32>.Reverse illustrates this succinctly. Here's what the new layout does:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100 [000..00A)-> BB05(0.2),BB02(0.8)     ( cond )                     i IBC
BB02 [0001]  1       BB01                  0.80  80 [00A..00B)-> BB04(0.2),BB03(0.8)     ( cond )                     i IBC hascall
BB03 [0018]  1       BB02                  0.64  64 [00A..00B)-> BB04(1)                 (always)                     i IBC hascall gcsafe
BB04 [0019]  2       BB02,BB03             0.80  80 [00A..00B)-> BB06(1)                 (always)                     i IBC hascall gcsafe
BB05 [0003]  1       BB01                  0.20  20 [???..???)-> BB06(1)                 (always)                     i IBC internal hascall
BB06 [0002]  2       BB04,BB05             1    100 [01D..01E)                           (return)                     i IBC tail-succ
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Notice how BB06 can be reached from BB04 and BB05, and the former is hotter, but doesn't fall into BB06. If fgMoveBackwardJumpsToSuccessors considered forward jumps, it would move BB06 up. For a more involved example, let's look at System.Text.Perf_Ascii.Equals_Bytes. Here's the new layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100 [000..001)-> BB03(0.99),BB02(0.00994)  ( cond )                     i IBC
BB03 [0004]  1       BB01                  0.99  99 [000..001)-> BB04(1)                 (always)                     i IBC idxlen nullcheck
BB02 [0003]  1       BB01                  0.01   1 [000..001)-> BB04(1)                 (always)                     i IBC
BB04 [0005]  2       BB02,BB03             1    100 [000..001)-> BB06(0.99),BB05(0.00994)  ( cond )                     i IBC
BB06 [0009]  1       BB04                  0.99  99 [000..001)-> BB07(1)                 (always)                     i IBC idxlen nullcheck
BB05 [0008]  1       BB04                  0.01   1 [000..001)-> BB07(1)                 (always)                     i IBC
BB07 [0010]  2       BB05,BB06             1    100 [000..000)-> BB09(0),BB08(1)         ( cond )                     i IBC
BB08 [0012]  1       BB07                  1    100 [000..000)-> BB10(1)                 (always)                     i IBC internal hascall gcsafe
BB10 [0014]  2       BB08,BB09             1    100 [01C..01C)                           (return)                     i IBC
BB09 [0013]  1       BB07                  0      0 [000..000)-> BB10(1)                 (always)                     i IBC rare internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

If we considered moving forward jumps, we'd have the following decision tree:

  • BB03->BB04 is hotter than BB02->BB04, so move BB04 up.
  • BB04->BB06 is hot, and BB04 is the only pred of BB06, but BB02 is now between them, so move BB06 up; we're basically bubbling BB02 down the block list. Now we have this:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100 [000..001)-> BB03(0.99),BB02(0.00994)  ( cond )                     i IBC
BB03 [0004]  1       BB01                  0.99  99 [000..001)-> BB04(1)                 (always)                     i IBC idxlen nullcheck
BB04 [0005]  2       BB02,BB03             1    100 [000..001)-> BB06(0.99),BB05(0.00994)  ( cond )                     i IBC
BB06 [0009]  1       BB04                  0.99  99 [000..001)-> BB07(1)                 (always)                     i IBC idxlen nullcheck
BB02 [0003]  1       BB01                  0.01   1 [000..001)-> BB04(1)                 (always)                     i IBC
BB05 [0008]  1       BB04                  0.01   1 [000..001)-> BB07(1)                 (always)                     i IBC
BB07 [0010]  2       BB05,BB06             1    100 [000..000)-> BB09(0),BB08(1)         ( cond )                     i IBC
BB08 [0012]  1       BB07                  1    100 [000..000)-> BB10(1)                 (always)                     i IBC internal hascall gcsafe
BB10 [0014]  2       BB08,BB09             1    100 [01C..01C)                           (return)                     i IBC
BB09 [0013]  1       BB07                  0      0 [000..000)-> BB10(1)                 (always)                     i IBC rare internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
  • BB06->BB07 is hotter than BB05->BB07, so move BB07 up. We've bubbled BB02 and BB05 down.
  • BB07->BB08 is hot, and BB07 is the only pred of BB08, so move BB08 up.
  • BB08->BB10is hot, andBB10's other pred doesn't fall into it anyway, so move BB10` up. Here's what we have:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100 [000..001)-> BB03(0.99),BB02(0.00994)  ( cond )                     i IBC
BB03 [0004]  1       BB01                  0.99  99 [000..001)-> BB04(1)                 (always)                     i IBC idxlen nullcheck
BB04 [0005]  2       BB02,BB03             1    100 [000..001)-> BB06(0.99),BB05(0.00994)  ( cond )                     i IBC
BB06 [0009]  1       BB04                  0.99  99 [000..001)-> BB07(1)                 (always)                     i IBC idxlen nullcheck
BB07 [0010]  2       BB05,BB06             1    100 [000..000)-> BB09(0),BB08(1)         ( cond )                     i IBC
BB08 [0012]  1       BB07                  1    100 [000..000)-> BB10(1)                 (always)                     i IBC internal hascall gcsafe
BB10 [0014]  2       BB08,BB09             1    100 [01C..01C)                           (return)                     i IBC
BB02 [0003]  1       BB01                  0.01   1 [000..001)-> BB04(1)                 (always)                     i IBC
BB05 [0008]  1       BB04                  0.01   1 [000..001)-> BB07(1)                 (always)                     i IBC
BB09 [0013]  1       BB07                  0      0 [000..000)-> BB10(1)                 (always)                     i IBC rare internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
  • BB10 has no successors.
  • BB02 is a backward jump, but BB03->BB04 is hotter, so don't move anything.
  • Ditto BB05.
  • BB09 is rarely-run, so we don't care about it.

For a few other short examples, just to build my case, here's the new layout of System.Text.Encodings.Web.Tests.Perf_Encoders.EncodeUtf8:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100 [000..001)-> BB03(0.99),BB02(0.00994)  ( cond )                     i IBC
BB03 [0004]  1       BB01                  0.50  50 [000..001)-> BB04(1)                 (always)                     i IBC idxlen nullcheck
BB02 [0003]  1       BB01                  0.00   0 [000..001)-> BB04(1)                 (always)                     i IBC
BB04 [0005]  2       BB02,BB03             1    100 [000..000)-> BB06(1),BB05(0)         ( cond )                     i IBC
BB06 [0009]  1       BB04                  0.50  50 [000..000)-> BB07(1)                 (always)                     i IBC internal idxlen nullcheck
BB07 [0013]  2       BB05,BB06             1    100 [027..027)                           (return)                     i IBC hascall gcsafe
BB05 [0008]  1       BB04                  0      0 [000..000)-> BB07(1)                 (always)                     i IBC rare internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

And for System.String.TrimWhiteSpaceHelper from System.Tests.Perf_String.TrimEnd (in particular, we could consider creating fallthrough for BB29->BB23->BB27->BB28):

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    2261 [000..010)-> BB12(0.092),BB02(0.908) ( cond )                     i IBC idxlen
BB02 [0001]  1       BB01                  0.91 2053 [010..014)-> BB11(0),BB03(1)         ( cond )                     i IBC idxlen
BB03 [0246]  2       BB02,BB09             1.82 4120 [014..015)-> BB07(0.52),BB08(0.48)   ( cond )                     i IBC idxlen bwd
BB07 [0041]  1       BB03                  0.95 2142 [014..015)-> BB14(0.498),BB09(0.502) ( cond )                     i IBC bwd
BB08 [0055]  1       BB03                  1.57 3560 [014..022)-> BB14(0.498),BB09(0.502) ( cond )                     i IBC bwd
BB09 [0003]  2       BB07,BB08             1.82 4120 [022..02F)-> BB03(1),BB10(0)         ( cond )                     i IBC hascall idxlen bwd
BB14 [0011]  2       BB07,BB08             0.91 2053 [???..???)-> BB12(1)                 (always)                     i IBC internal hascall
BB12 [0005]  3       BB01,BB11,BB14        1.00 2261 [02F..034)-> BB23(0.000442),BB13(1)  ( cond )                     i IBC
BB13 [0006]  1       BB12                  1.00 2260 [034..03F)-> BB21(1)                 (always)                     i IBC idxlen
BB21 [0009]  2       BB13,BB20             1.01 2282 [051..055)-> BB15(1),BB22(0)         ( cond )                     i IBC bwd bwd-src
BB15 [0007]  1       BB21                  1.01 2282 [03F..040)-> BB18(0.52),BB19(0.48)   ( cond )                     i IBC loophead idxlen bwd bwd-target
BB18 [0169]  1       BB15                  0.52 1187 [03F..040)-> BB29(0.99),BB20(0.00964)  ( cond )                     i IBC bwd
BB19 [0183]  1       BB15                  0.87 1972 [03F..04D)-> BB29(0.99),BB20(0.00964)  ( cond )                     i IBC bwd
BB29 [0013]  2       BB18,BB19             1.00 2260 [???..???)-> BB23(1)                 (always)                     i IBC internal hascall
BB20 [0008]  2       BB18,BB19             0.01   22 [04D..051)-> BB21(1)                 (always)                     i IBC hascall bwd
BB23 [0010]  3       BB12,BB22,BB29        1.00 2261 [055..056)-> BB27(0.884),BB24(0.116) ( cond )                     i IBC hascall idxlen
BB27 [0243]  1       BB23                  0.88 1999 [055..056)-> BB28(1)                 (always)                     i IBC
BB24 [0240]  1       BB23                  0.12  262 [055..056)-> BB26(0.0461),BB25(0.954)  ( cond )                     i IBC
BB25 [0241]  1       BB24                  0.11  250 [055..056)-> BB28(1)                 (always)                     i IBC hascall gcsafe
BB26 [0242]  1       BB24                  0.01   12 [055..056)-> BB28(1)                 (always)                     i IBC
BB28 [0244]  3       BB25,BB26,BB27        1.00 2261 [055..05E)                           (return)                     i IBC
BB10 [0247]  1       BB09                  0       0 [???..???)-> BB11(1)                 (always)                     IBC rare internal
BB11 [0012]  2       BB02,BB10             0       0 [???..???)-> BB12(1)                 (always)                     i IBC rare internal hascall
BB22 [0014]  1       BB21                  0       0 [???..???)-> BB23(1)                 (always)                     i IBC rare internal hascall
BB30 [0248]  0                             0         [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

And for System.Tests.Perf_Random.NextBytes_unseeded (an interesting case in that the exceptional path is the hot path based on edge weights, even though the throw block has a weight of 0 due to RyuJIT's flowgraph semantics -- I guess if our post-RPO changes suggest a "rarely-run" block should be moved up because it isn't that rare, maybe we should unmark it as not rare?):

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100 [000..003)-> BB03(0.2),BB02(0.8)     ( cond )                     i IBC
BB03 [0002]  1       BB01                  1    100 [00A..017)                           (return)                     i IBC jmp hascall hist
BB02 [0001]  1       BB01                  0      0 [003..00A)                           (throw )                     i IBC rare hascall gcsafe
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Based on the above and the the other fixup actions, it feels like we are getting to the point where we may actually want to create the explicit cost/benefit model to assess incremental improvements.

I'd be interested in pursuing this, though based on the above, I think it might be worthwhile trying to implement a more general post-RPO pass that does the moves I described above, since it seems simple enough.

@AndyAyersMS
Copy link
Member

I am surprised some of these are so large. Do we have any results for AMD HW? All the reports above are for Intel.

@AndyAyersMS
Copy link
Member

ADX data for the three class virtual test. Strongly suggests that this may be JCC errata kicking in.

image

Of course this is a two-way street, some of the improvements may be like this too.

@AndyAyersMS
Copy link
Member

Randomly looking at improvements, it seems most are also intel only. But not all. Here's Benchstone.BenchI.BubbleSort2.Test

image

@DrewScoggins is looking into whether there are unfiled AMD64 reports out there. From what I can tell the newer AMD64 hardware is mostly indifferent.

@AndyAyersMS
Copy link
Member

Here's a cross-arch regression: System.Linq.Tests.Perf_Enumerable.Contains_ElementNotFound(input: IEnumerable)

image

@amanasifkhalid perhaps dig into this one since it is less likely to be caused by some microarchitectural issue.

@amanasifkhalid
Copy link
Member

@AndyAyersMS Thanks for pointing out that benchmark; this one seems to suffer from the same interleaving of hot/cold paths, too. Here's the old layout for System.Linq.Enumerable:Contains:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight       IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1      21340 [000..003)-> BB02(0),BB03(1)         ( cond )                     i IBC
BB03 [0002]  1       BB01                  1      21340 [00A..00D)-> BB45(0),BB04(1)         ( cond )                     i IBC
BB04 [0003]  1       BB03                  1      21340 [00D..00D)-> BB06(0),BB05(1)         ( cond )                     i IBC
BB05 [0032]  1       BB04                  1      21340 [???..???)-> BB07(1)                 (always)                     i IBC internal hascall gcsafe
BB07 [0031]  2       BB05,BB06             1      21340 [00D..014)-> BB08(1)                 (always)                     i IBC internal
BB08 [0004]  1  0    BB07                  1      21340 [014..016)-> BB22(0.01),BB10(0.99)   ( cond ) T0      try {       i IBC keep
BB10 [0085]  1  0    BB08                  1      21340 [???..???)-> BB22(0.01),BB11(0.99)   ( cond ) T0                  IBC internal
BB11 [0072]  1  0    BB10                  1            [???..???)-> BB17(1)                 (always) T0                  internal
BB12 [0005]  1  0    BB19                 97.88 2088852 [016..017)-> BB34(0),BB14(1)         ( cond ) T0                  i IBC loophead bwd bwd-target
BB14 [0046]  1  0    BB12                 97.88 2088852 [016..02B)-> BB63(0),BB17(1)         ( cond ) T0                  i IBC idxlen bwd
BB17 [0007]  2  0    BB11,BB14            98.87 2109979 [02F..030)-> BB20(0.00994),BB19(0.99)  ( cond ) T0                  i IBC bwd bwd-src
BB19 [0059]  1  0    BB17                 97.89 2088998 [02F..030)-> BB12(1)                 (always) T0                  i IBC bwd
BB27 [0078]  2  0    BB25,BB26             0.99   21100 [016..02B)-> BB63(0),BB28(1)         ( cond ) T0                  i IBC bwd
BB28 [0079]  2  0    BB22,BB27             1.00   21313 [02F..02F)-> BB32(0),BB29(1)         ( cond ) T0                  i IBC bwd bwd-src
BB29 [0080]  1  0    BB28                  1.00   21313 [02F..030)-> BB31(0.00994),BB30(0.99)  ( cond ) T0                  i IBC bwd
BB30 [0081]  1  0    BB29                  0.99   21101 [016..030)-> BB34(0),BB25(1)         ( cond ) T0                  i IBC bwd
BB25 [0076]  1  0    BB30                  0.99   21100 [016..017)-> BB27(1)                 (always) T0                  i IBC idxlen bwd
BB20 [0060]  1  0    BB17                  0.98   20982 [02F..030)-> BB41(1)                 (always) T0                  i IBC bwd
BB62 [0092]  0  0                          0            [???..???)                           (throw ) T0                  i rare keep internal
BB22 [0073]  2  0    BB08,BB10             0            [???..???)-> BB28(1)                 (always) T0                  rare internal
BB26 [0077]  1  0    BB32                  0          0 [???..???)-> BB27(1)                 (always) T0                  i IBC rare internal hascall gcsafe bwd
BB32 [0083]  1  0    BB28                  0          0 [02F..037)-> BB26(0.99),BB41(0.01)   ( cond ) T0                  i IBC rare internal hascall bwd
BB34 [0090]  2  0    BB12,BB30             0          0 [016..017)                           (throw ) T0                  i IBC rare gcsafe bwd
BB63 [0093]  2  0    BB14,BB27             0          0 [02B..02F)-> BB43(1)                 (always) T0                  i IBC rare bwd
BB31 [0082]  1  0    BB29                  0.01     212 [02F..030)-> BB41(1)                 (always) T0      }           i IBC bwd
BB41 [0063]  3       BB20,BB31,BB32        1.00   21340 [039..03C)-> BB42(0),BB53(1)         ( cond )                     i IBC keep cfb
BB53 [0021]  3       BB41,BB42,BB50        1.00   21340 [077..079)                           (return)                     i IBC
BB42 [0066]  1       BB41                  0          0 [???..???)-> BB53(1)                 (always)                     i IBC rare internal hascall gcsafe
BB43 [0027]  1       BB63                  0          0 [???..???)-> BB55(1)                 (callf )                     i IBC rare internal
BB44 [0028]  1       BB58                  0          0 [???..???)-> BB54(1)                 (callfr)                     i IBC rare internal xentry
BB45 [0012]  1       BB03                  0          0 [043..04A)-> BB46(1)                 (always)                     i IBC rare hascall gcsafe
BB46 [0013]  1  1    BB45                  0          0 [04A..04C)-> BB48(1)                 (always) T1      try {       i IBC rare keep
BB47 [0014]  1  1    BB48                  0          0 [04C..05F)-> BB49(0.1),BB48(0.9)     ( cond ) T1                  i IBC rare loophead hascall gcsafe bwd bwd-target
BB48 [0016]  2  1    BB46,BB47             0          0 [063..06B)-> BB47(0.9),BB50(0.1)     ( cond ) T1                  i IBC rare hascall bwd bwd-src
BB49 [0015]  1  1    BB47                  0          0 [05F..063)-> BB51(1)                 (always) T1      }           i IBC rare bwd
BB50 [0069]  1       BB48                  0          0 [06D..077)-> BB53(1)                 (always)                     i IBC rare keep gcsafe cfb
BB51 [0023]  1       BB49                  0          0 [???..???)-> BB59(1)                 (callf )                     i IBC rare internal
BB52 [0024]  1       BB61                  0          0 [???..???)-> BB54(1)                 (callfr)                     i IBC rare internal xentry
BB54 [0022]  2       BB44,BB52             0          0 [079..07B)                           (return)                     i IBC rare
BB02 [0001]  1       BB01                  0          0 [003..00A)                           (throw )                     i IBC rare hascall gcsafe
BB06 [0033]  1       BB04                  0          0 [???..???)-> BB07(1)                 (always)                     i IBC rare internal hascall gcsafe
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ funclets follow
BB55 [0009]  2     0 BB43                  0.00       0 [039..03C)-> BB58(0.000997),BB56(0.999)    ( cond )    H0 F finally {   i IBC keep xentry flet
BB56 [0010]  1     0 BB55                  0.00       0 [03C..???)-> BB58(1),BB57(0)         ( cond )    H0               i IBC xentry
BB57 [0042]  1     0 BB56                  0          0 [???..???)-> BB58(1)                 (always)    H0               i IBC rare internal hascall xentry gcsafe
BB58 [0040]  3     0 BB55,BB56,BB57        0.00       0 [042..043)-> BB44(0.5)               (finret)    H0   }           i IBC xentry
BB59 [0018]  2     1 BB51                  0.00       0 [06D..070)-> BB61(0.48),BB60(0.52)   ( cond )    H1 F finally {   i IBC keep xentry flet
BB60 [0019]  1     1 BB59                  0.00       0 [070..076)-> BB61(1)                 (always)    H1               i IBC hascall xentry gcsafe
BB61 [0020]  2     1 BB59,BB60             0.00       0 [076..077)-> BB52(0.5)               (finret)    H1   }           i IBC xentry
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

And the new layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight       IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1      22444 [000..003)-> BB03(1),BB02(0)         ( cond )                     i IBC
BB03 [0002]  1       BB01                  1      22444 [00A..00D)-> BB45(0),BB04(1)         ( cond )                     i IBC
BB04 [0003]  1       BB03                  1      22444 [00D..00D)-> BB06(0),BB05(1)         ( cond )                     i IBC
BB05 [0032]  1       BB04                  1      22444 [???..???)-> BB07(1)                 (always)                     i IBC internal hascall gcsafe
BB07 [0031]  2       BB05,BB06             1      22444 [00D..014)-> BB08(1)                 (always)                     i IBC internal
BB08 [0004]  1  0    BB07                  1      22444 [014..016)-> BB22(0.01),BB10(0.99)   ( cond ) T0      try {       i IBC keep
BB10 [0085]  1  0    BB08                  1      22444 [???..???)-> BB22(0.01),BB11(0.99)   ( cond ) T0                  IBC internal
BB11 [0072]  1  0    BB10                  1            [???..???)-> BB17(1)                 (always) T0                  internal
BB17 [0007]  2  0    BB11,BB14            99.16 2225627 [02F..030)-> BB20(0.00991),BB19(0.99)  ( cond ) T0                  i IBC bwd bwd-src
BB19 [0059]  1  0    BB17                 98.18 2203568 [016..030)-> BB14(1),BB34(0)         ( cond ) T0                  i IBC bwd
BB14 [0046]  1  0    BB19                 98.17 2203407 [016..02B)-> BB17(1),BB37(0)         ( cond ) T0                  i IBC idxlen bwd
BB20 [0060]  1  0    BB17                  0.98   22059 [02F..030)-> BB41(1)                 (always) T0                  i IBC bwd
BB28 [0079]  2  0    BB22,BB27             1.00   22481 [02F..02F)-> BB32(0),BB29(1)         ( cond ) T0                  i IBC bwd bwd-src
BB29 [0080]  1  0    BB28                  1.00   22481 [02F..030)-> BB31(0.00991),BB30(0.99)  ( cond ) T0                  i IBC bwd
BB30 [0081]  1  0    BB29                  0.99   22258 [016..030)-> BB25(1),BB34(0)         ( cond ) T0                  i IBC bwd
BB25 [0076]  1  0    BB30                  0.99   22257 [016..017)-> BB27(1)                 (always) T0                  i IBC idxlen bwd
BB31 [0082]  1  0    BB29                  0.01     223 [02F..030)-> BB41(1)                 (always) T0                  i IBC bwd
BB27 [0078]  2  0    BB25,BB26             0.99   22257 [016..02B)-> BB28(1),BB37(0)         ( cond ) T0                  i IBC bwd
BB22 [0073]  2  0    BB08,BB10             0            [???..???)-> BB28(1)                 (always) T0                  rare internal
BB34 [0090]  2  0    BB19,BB30             0          0 [016..017)                           (throw ) T0                  i IBC rare gcsafe bwd
BB32 [0083]  1  0    BB28                  0          0 [02F..037)-> BB26(0.99),BB41(0.00998)  ( cond ) T0                  i IBC rare internal hascall bwd
BB26 [0077]  1  0    BB32                  0          0 [???..???)-> BB27(1)                 (always) T0                  i IBC rare internal hascall gcsafe bwd
BB37 [0091]  2  0    BB14,BB27             0          0 [02B..02F)-> BB43(1)                 (always) T0                  i IBC rare bwd
BB62 [0092]  0  0                          0            [???..???)                           (throw ) T0      }           i rare keep internal
BB41 [0063]  3       BB20,BB31,BB32        1.00   22444 [039..03C)-> BB53(1),BB42(0)         ( cond )                     i IBC keep cfb
BB46 [0013]  2  1    BB45,BB47             0          0 [04A..06B)-> BB47(0.9),BB50(0.1)     ( cond ) T1      try {       i IBC rare keep bwd
BB47 [0014]  1  1    BB46                  0          0 [04C..05F)-> BB46(0.9),BB49(0.1)     ( cond ) T1                  i IBC rare loophead hascall gcsafe bwd bwd-target
BB49 [0015]  1  1    BB47                  0          0 [05F..063)-> BB51(1)                 (always) T1      }           i IBC rare bwd
BB53 [0021]  3       BB41,BB42,BB50        1.00   22444 [077..079)                           (return)                     i IBC
BB06 [0033]  1       BB04                  0          0 [???..???)-> BB07(1)                 (always)                     i IBC rare internal hascall gcsafe
BB43 [0027]  1       BB37                  0          0 [???..???)-> BB55(1)                 (callf )                     i IBC rare internal
BB44 [0028]  1       BB58                  0          0 [???..???)-> BB54(1)                 (callfr)                     i IBC rare internal xentry
BB42 [0066]  1       BB41                  0          0 [???..???)-> BB53(1)                 (always)                     i IBC rare internal hascall gcsafe
BB45 [0012]  1       BB03                  0          0 [043..04A)-> BB46(1)                 (always)                     i IBC rare hascall gcsafe
BB51 [0023]  1       BB49                  0          0 [???..???)-> BB59(1)                 (callf )                     i IBC rare internal
BB52 [0024]  1       BB61                  0          0 [???..???)-> BB54(1)                 (callfr)                     i IBC rare internal xentry
BB54 [0022]  2       BB44,BB52             0          0 [079..07B)                           (return)                     i IBC rare
BB50 [0069]  1       BB46                  0          0 [06D..077)-> BB53(1)                 (always)                     i IBC rare keep gcsafe cfb
BB02 [0001]  1       BB01                  0          0 [003..00A)                           (throw )                     i IBC rare hascall gcsafe
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ funclets follow
BB55 [0009]  2     0 BB43                  0.00       0 [039..03C)-> BB58(0.00178),BB56(0.998)   ( cond )    H0 F finally {   i IBC keep xentry flet
BB56 [0010]  1     0 BB55                  0.00       0 [03C..???)-> BB58(1),BB57(0)         ( cond )    H0               i IBC xentry
BB57 [0042]  1     0 BB56                  0          0 [???..???)-> BB58(1)                 (always)    H0               i IBC rare internal hascall xentry gcsafe
BB58 [0040]  3     0 BB55,BB56,BB57        0.00       0 [042..043)-> BB44(0.5)               (finret)    H0   }           i IBC xentry
BB59 [0018]  2     1 BB51                  0.00       0 [06D..070)-> BB61(0.48),BB60(0.52)   ( cond )    H1 F finally {   i IBC keep xentry flet
BB60 [0019]  1     1 BB59                  0.00       0 [070..076)-> BB61(1)                 (always)    H1               i IBC hascall xentry gcsafe
BB61 [0020]  2     1 BB59,BB60             0.00       0 [076..077)-> BB52(0.5)               (finret)    H1   }           i IBC xentry
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Both seem to do a good job of keeping the loop in T0 compact, though notice the new layout doesn't create any fallthrough into BB53, the hot return block. I'm guessing the RPO layout didn't move it up to BB41, it's hottest pred, because of the interleaving of try regions. Enabling movement of forward branches post-RPO layout would fix this, and push the cold T1 region further down the method.

@AndyAyersMS
Copy link
Member

It is hard to see how the return block placement could have that much impact on perf -- we are looking at a pretty big regression here. Can you copy out the two inner loop disasms?

Also puzzled why BB11 doesn't have IBC data -- these sorts of mixed IBC/noIBC cases have caused us trouble in the past.

@amanasifkhalid
Copy link
Member

Can you copy out the two inner loop disasms?

Sure. Old layout:

G_M23587_IG05:  ;; offset=0x0075
       mov      r8d, dword ptr [rsi+0x08]
       cmp      r8d, dword ptr [rsi+0x0C]
       jae      G_M23587_IG16
       mov      r14, gword ptr [rsi+0x10]
       cmp      r8d, dword ptr [r14+0x08]
       jae      SHORT G_M23587_IG12
       mov      eax, r8d
       mov      r15d, dword ptr [r14+4*rax+0x10]
       cmp      r15d, ebx
       je       G_M23587_IG17
						;; size=41 bbWeight=100.08 PerfScore 1551.23
G_M23587_IG06:  ;; offset=0x009E
       mov      edx, dword ptr [rsi+0x08]
       inc      edx
       cmp      edx, dword ptr [rsi+0x0C]
       jae      SHORT G_M23587_IG11
						;; size=10 bbWeight=101.07 PerfScore 631.69
G_M23587_IG07:  ;; offset=0x00A8
       mov      dword ptr [rsi+0x08], edx
       jmp      SHORT G_M23587_IG05
						;; size=5 bbWeight=100.07 PerfScore 300.21

New layout:

G_M23587_IG05:  ;; offset=0x006D
       mov      ecx, dword ptr [rsi+0x08]
       inc      ecx
       cmp      ecx, dword ptr [rsi+0x0C]
       jae      SHORT G_M23587_IG08
						;; size=10 bbWeight=97.07 PerfScore 606.67
G_M23587_IG06:  ;; offset=0x0077
       mov      dword ptr [rsi+0x08], ecx
       mov      edx, dword ptr [rsi+0x08]
       cmp      edx, dword ptr [rsi+0x0C]
       jae      SHORT G_M23587_IG15
						;; size=11 bbWeight=96.10 PerfScore 672.67
G_M23587_IG07:  ;; offset=0x0082
       mov      r8, gword ptr [rsi+0x10]
       cmp      edx, dword ptr [r8+0x08]
       jae      G_M23587_IG18
       mov      eax, edx
       mov      r14d, dword ptr [r8+4*rax+0x10]
       cmp      r14d, ebx
       jne      SHORT G_M23587_IG05
       jmp      G_M23587_IG17
						;; size=31 bbWeight=96.08 PerfScore 1104.88

The new layout has one fewer branch on the loop path, so it seems better?

Also puzzled why BB11 doesn't have IBC data -- these sorts of mixed IBC/noIBC cases have caused us trouble in the past.

BB11 and BB22 are fast/slow loop preheaders. When creating these in Compiler::optCloneLoop, it looks like we're setting the preheaders' weights without propagating BBF_PROF_WEIGHT from their preds. I can open a fix for this.

@AndyAyersMS
Copy link
Member

Seems like we ought to be aligning these loops... any idea why not?
Can you add the alignment boundaries?

Also interesting that we don't have a peephole opt for this:

       mov      dword ptr [rsi+0x08], ecx
       mov      edx, dword ptr [rsi+0x08]

@amanasifkhalid
Copy link
Member

amanasifkhalid commented May 30, 2024

Can you add the alignment boundaries?

Sure. With the old layout, here's the disasm up to the loop end:

G_M23587_IG01:  ;; offset=0x0000
       push     rbp
       push     r15
       push     r14
       push     rdi
       push     rsi
       push     rbx
       sub      rsp, 72
       lea      rbp, [rsp+0x70]
       mov      qword ptr [rbp-0x50], rsp
       mov      ebx, edx
       mov      rsi, r8
						;; size=26 bbWeight=1 PerfScore 8.25
G_M23587_IG02:  ;; offset=0x001A
       test     rcx, rcx
       je       G_M23587_IG35
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (je: 3 ; jcc erratum) 32B boundary ...............................
       test     rsi, rsi
       jne      G_M23587_IG25
       mov      r11, 0x7FFE3ED73A80      ; System.Linq.Tests.LinqTestData+IEnumerableWrapper`1[int]
       cmp      qword ptr [rcx], r11
       jne      G_M23587_IG36
       mov      rcx, gword ptr [rcx+0x08]
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mov: 3) 32B boundary ...............................
       mov      r11, 0x7FFE3DCF0AF8      ; code for System.Collections.Generic.IEnumerable`1[int]:GetEnumerator():System.Collections.Generic.IEnumerator`1[int]:this
       call     [r11]System.Collections.Generic.IEnumerable`1[int]:GetEnumerator():System.Collections.Generic.IEnumerator`1[int]:this
       mov      rsi, rax
						;; size=57 bbWeight=1 PerfScore 12.25
G_M23587_IG03:  ;; offset=0x0053
       mov      gword ptr [rbp-0x38], rsi
						;; size=4 bbWeight=1 PerfScore 1.00
G_M23587_IG04:  ;; offset=0x0057
       test     rsi, rsi
       je       G_M23587_IG13
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (je: 0 ; jcc erratum) 32B boundary ...............................
       mov      rdi, 0x7FFE3F0A8D68      ; System.SZGenericArrayEnumerator`1[int]
       cmp      qword ptr [rsi], rdi
       jne      G_M23587_IG13
       jmp      SHORT G_M23587_IG06
       align    [0 bytes for IG05]
						;; size=30 bbWeight=1 PerfScore 7.50
G_M23587_IG05:  ;; offset=0x0075
       mov      r8d, dword ptr [rsi+0x08]
       cmp      r8d, dword ptr [rsi+0x0C]
       jae      G_M23587_IG16
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (jae: 3 ; jcc erratum) 32B boundary ...............................
       mov      r14, gword ptr [rsi+0x10]
       cmp      r8d, dword ptr [r14+0x08]
       jae      SHORT G_M23587_IG12
       mov      eax, r8d
       mov      r15d, dword ptr [r14+4*rax+0x10]
       cmp      r15d, ebx
       je       G_M23587_IG17
						;; size=41 bbWeight=101.42 PerfScore 1571.97
G_M23587_IG06:  ;; offset=0x009E
       mov      edx, dword ptr [rsi+0x08]
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mov: 1) 32B boundary ...............................
       inc      edx
       cmp      edx, dword ptr [rsi+0x0C]
       jae      SHORT G_M23587_IG11
						;; size=10 bbWeight=102.41 PerfScore 640.05
G_M23587_IG07:  ;; offset=0x00A8
       mov      dword ptr [rsi+0x08], edx
       jmp      SHORT G_M23587_IG05
						;; size=5 bbWeight=101.39 PerfScore 304.16

And with the new layout:

G_M23587_IG01:  ;; offset=0x0000
       push     rbp
       push     r14
       push     rdi
       push     rsi
       push     rbx
       sub      rsp, 64
       lea      rbp, [rsp+0x60]
       mov      qword ptr [rbp-0x40], rsp
       mov      ebx, edx
       mov      rsi, r8
						;; size=24 bbWeight=1 PerfScore 7.25
G_M23587_IG02:  ;; offset=0x0018
       test     rcx, rcx
       je       G_M23587_IG33
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (je: 1 ; jcc erratum) 32B boundary ...............................
       test     rsi, rsi
       jne      G_M23587_IG27
       mov      r11, 0x7FFE51746400      ; System.Linq.Tests.LinqTestData+IEnumerableWrapper`1[int]
       cmp      qword ptr [rcx], r11
       jne      G_M23587_IG23
       mov      rcx, gword ptr [rcx+0x08]
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mov: 1) 32B boundary ...............................
       mov      r11, 0x7FFE506E0AF0      ; code for System.Collections.Generic.IEnumerable`1[int]:GetEnumerator():System.Collections.Generic.IEnumerator`1[int]:this
       call     [r11]System.Collections.Generic.IEnumerable`1[int]:GetEnumerator():System.Collections.Generic.IEnumerator`1[int]:this
       mov      rsi, rax
						;; size=57 bbWeight=1 PerfScore 12.25
G_M23587_IG03:  ;; offset=0x0051
       mov      gword ptr [rbp-0x30], rsi
						;; size=4 bbWeight=1 PerfScore 1.00
G_M23587_IG04:  ;; offset=0x0055
       test     rsi, rsi
       je       G_M23587_IG14
       mov      rdi, 0x7FFE51A8B620      ; System.SZGenericArrayEnumerator`1[int]
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mov: 8) 32B boundary ...............................
       cmp      qword ptr [rsi], rdi
       jne      SHORT G_M23587_IG14
       align    [0 bytes for IG05]
						;; size=24 bbWeight=1 PerfScore 5.50
G_M23587_IG05:  ;; offset=0x006D
       mov      ecx, dword ptr [rsi+0x08]
       inc      ecx
       cmp      ecx, dword ptr [rsi+0x0C]
       jae      SHORT G_M23587_IG08
						;; size=10 bbWeight=98.43 PerfScore 615.19
G_M23587_IG06:  ;; offset=0x0077
       mov      dword ptr [rsi+0x08], ecx
       mov      edx, dword ptr [rsi+0x08]
       cmp      edx, dword ptr [rsi+0x0C]
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (cmp: 0 ; jcc erratum) 32B boundary ...............................
       jae      SHORT G_M23587_IG15
						;; size=11 bbWeight=97.47 PerfScore 682.27
G_M23587_IG07:  ;; offset=0x0082
       mov      r8, gword ptr [rsi+0x10]
       cmp      edx, dword ptr [r8+0x08]
       jae      G_M23587_IG18
       mov      eax, edx
       mov      r14d, dword ptr [r8+4*rax+0x10]
       cmp      r14d, ebx
       jne      SHORT G_M23587_IG05
       jmp      G_M23587_IG17
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (jmp: 1 ; jcc erratum) 32B boundary ...............................

We're hitting the JCC erratum more often than I previously thought, though only one instance is in the loop body. And we're failing to align the loop correctly in both cases. Looking at the dump, the JIT is complaining this loop needs too much padding:

*************** In emitLoopAlignAdjustments()
compJitAlignLoopAdaptive       = true
compJitAlignLoopBoundary       = 32
compJitAlignLoopMinBlockWeight = 3
compJitAlignLoopForJcc         = false
compJitAlignLoopMaxCodeSize    = 96
compJitAlignPaddingLimit       = 15
  Adjusting 'align' instruction in IG04 that is targeted for IG05 
*************** In getLoopSize() for G_M23587_IG05
   G_M23587_IG05 has 10 bytes.
   G_M23587_IG06 has 11 bytes.
   G_M23587_IG07 has 31 bytes. -- Found the back edge.
loopSize of G_M23587_IG05 = 52 bytes.
;; Skip alignment: 'Loop at G_M23587_IG05 PaddingNeeded= 15, MaxPadding= 8, LoopSize= 52, AlignmentBoundary= 16B.'
;; Calculated padding to add 0 bytes to align G_M23587_IG05 at 16B boundary.
Adjusted alignment for G_M23587_IG05 from 15 to 0.

@AndyAyersMS
Copy link
Member

Is the first bit of asm above actually the new code too? They look the same.

I am confused by the padding output too -- looks like the loop tops are at x75 and x6d, so to reach the next 16 byte boundaries at x80/x70 we would need only 11 bytes and 3 bytes of padding. So seems like we should pad in the new case, and in neither case would we need 15 bytes.

@amanasifkhalid
Copy link
Member

Is the first bit of asm above actually the new code too? They look the same.

You're right, sorry about that -- I updated it.

Looking at the dump for the old layout, the JIT is overestimating the padding needed, and thus skipping alignment here, too:

*************** In emitLoopAlignAdjustments()
compJitAlignLoopAdaptive       = true
compJitAlignLoopBoundary       = 32
compJitAlignLoopMinBlockWeight = 3
compJitAlignLoopForJcc         = false
compJitAlignLoopMaxCodeSize    = 96
compJitAlignPaddingLimit       = 15
  Adjusting 'align' instruction in IG04 that is targeted for IG05 
*************** In getLoopSize() for G_M23587_IG05
   G_M23587_IG05 has 41 bytes.
   G_M23587_IG06 has 10 bytes.
   G_M23587_IG07 has 5 bytes. -- Found the back edge.
loopSize of G_M23587_IG05 = 56 bytes.
;; Skip alignment: 'Loop at G_M23587_IG05 PaddingNeeded= 11, MaxPadding= 8, LoopSize= 56, AlignmentBoundary= 16B.'
;; Calculated padding to add 0 bytes to align G_M23587_IG05 at 16B boundary.

I'm wondering if the block-level peephole optimization for skipping unconditional jumps, or the fact that conditional blocks may no longer fall into their false targets, is causing this overestimation? Though emitLoopAlignAdjustments happens late enough that those block-level changes shouldn't be a problem...

@AndyAyersMS
Copy link
Member

The older padding computation seems ok? Loop top is at 0x75 so we would need 0x0B == 11 bytes to reach 0x80, and we're not willing to pad by more than 8. So we bail.

The new layout, though...?

Do you see a perf difference locally when you run these? I am not sure why the new version ends up being so much slower.

@amanasifkhalid
Copy link
Member

amanasifkhalid commented May 31, 2024

So it turns out the final offset of the loop beginning in the new layout is 0x6D bytes, but when deciding whether to align or not, the JIT uses the loop's estimated offset, which in this case is 0x71 -- hence why the JIT thinks we need 15 bytes to align the loop. The difference in estimated vs actual offset stems from a jump preceding the loop being predicted as 6 bytes in length, and later turning out to be only 2 bytes.

Do you see a perf difference locally when you run these? I am not sure why the new version ends up being so much slower.

I see a perf diff when the benchmark input is an ICollection, but not when it's an IEnumerable like in the graph you shared: I reran it a few times, and the initial slowdown I saw with the new layout is gone. They look about the same on my win-x64 machine (which has an AMD CPU, by the way, so I'm assuming there's no JCC erratum mitigation affecting perf):

Old layout:

Method input Mean Error StdDev Median Min Max Gen0 Allocated
Contains_ElementNotFound ICollection 9.221 ns 0.0877 ns 0.0821 ns 9.228 ns 9.078 ns 9.331 ns - -
Contains_ElementNotFound IEnumerable 210.609 ns 0.5856 ns 0.5191 ns 210.582 ns 210.030 ns 211.789 ns 0.0017 32 B

New layout:

Method input Mean Error StdDev Median Min Max Gen0 Allocated
Contains_ElementNotFound ICollection 9.475 ns 0.0592 ns 0.0554 ns 9.486 ns 9.338 ns 9.531 ns - -
Contains_ElementNotFound IEnumerable 208.610 ns 0.8711 ns 0.7274 ns 208.437 ns 208.041 ns 210.256 ns 0.0017 32 B

@amanasifkhalid
Copy link
Member

amanasifkhalid commented Jun 14, 2024

I've collated the top 20 benchmark regressions across platforms, not double-counting any repeat offenders -- the duplicate names are from GitHub's markdown viewer not rendering templated types due to the <> syntax. Here they are:

Name Base Test Ratio
System.IO.Hashing.Tests.Crc32_AppendPerf:Append 335.26 ns 1.07 μs 3.2
GuardedDevirtualization.TwoClassVirtual:Call 1.90 ns 5.70 ns 3.0
GuardedDevirtualization.ThreeClassVirtual:Call 1.07 ns 2.90 ns 2.72
GuardedDevirtualization.TwoClassInterface:Call 1.45 ns 2.97 ns 2.05
GuardedDevirtualization.ThreeClassInterface:Call 2.12 ns 4.22 ns 1.99
System.Collections.Sort:Array_Comparison 9.19 μs 17.66 μs 1.92
System.Text.Json.Document.Tests.Perf_EnumerateArray:EnumerateArray 707.17 ns 1.34 μs 1.9
System.Tests.Perf_String:TrimEnd 1.69 ns 3.18 ns 1.88
Benchstone.BenchF.NewtR:Test 55.69 ms 102.90 ms 1.85
LinqBenchmarks:Count00ForX 58.01 ms 103.57 ms 1.79
System.Collections.Sort:Array_ComparerClass 8.31 μs 14.71 μs 1.77
System.Collections.Sort:Array_ComparerClass 9.85 μs 17.36 μs 1.76
System.Collections.Sort:Array_Comparison 7.32 μs 12.29 μs 1.68
System.Linq.Tests.Perf_Enumerable:SelectToList 125.92 ns 209.71 ns 1.67
System.Collections.Sort:LinqQuery 17.06 μs 28.35 μs 1.66
System.Tests.Perf_String:TrimStart_CharArr 2.52 ns 4.12 ns 1.64
System.Collections.Sort:Array_ComparerClass 7.15 μs 11.61 μs 1.62
System.Collections.Sort:LinqOrderByExtension 17.45 μs 28.14 μs 1.61
System.Text.Encodings.Web.Tests.Perf_Encoders:EncodeUtf8 53.46 ns 85.59 ns 1.6
System.Collections.Sort:LinqOrderByExtension 20.34 μs 32.52 μs 1.6

I opened a PR (#103450) implementing a 3-opt pass post-RPO layout, but even with a few iterations, TP impact is up to 10%. I'm concerned that too few iterations leaves too much code quality on the table, but the TP cost is significant. This has pushed me to reconsider my narrower (and cheaper) approach in #102927. I'll update with the new layouts for the biggest regressions below, but I'd like to highlight a blocker I've noticed in System.IO.Hashing.Tests.Crc32_AppendPerf:Append. Here is the initial RPO layout of System.IO.Hashing.NonCryptographicHashAlgorithm:Append:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100 [000..003)-> BB03(1),BB02(0)         ( cond )                     i IBC
BB03 [0008]  1       BB01                  1    100 [000..01A)-> BB19(0),BB04(1)         ( cond )                     i IBC idxlen nullcheck
BB04 [0004]  1       BB03                  1    100 [015..016)-> BB17(0),BB05(1)         ( cond )                     i IBC
BB05 [0012]  1       BB04                  1    100 [015..016)-> BB09(0.999),BB06(0.000981)    ( cond )                     i IBC
BB09 [0033]  1       BB05                  1.00 100 [015..016)-> BB11(1)                 (always)                     i IBC
BB06 [0030]  1       BB05                  0.00   0 [015..016)-> BB07(1)                 (always)                     i IBC
BB07 [0031]  2       BB06,BB07             0.00   0 [015..016)-> BB07(0),BB08(1)         ( cond )                     i IBC loophead hascall gcsafe bwd bwd-target bwd-src
BB08 [0032]  1       BB07                  0.00   0 [015..016)-> BB12(1)                 (always)                     i IBC hascall gcsafe
BB12 [0035]  2       BB08,BB11             1.00 100 [015..016)-> BB10(0),BB13(1)         ( cond )                     i IBC bwd bwd-src
BB13 [0036]  1       BB12                  1.00 100 [015..016)-> BB15(0),BB14(1)         ( cond )                     i IBC hascall gcsafe
BB14 [0037]  1       BB13                  1.00 100 [015..016)-> BB16(1)                 (always)                     i IBC
BB15 [0038]  1       BB13                  0      0 [015..016)-> BB16(1)                 (always)                     i IBC rare hascall gcsafe
BB16 [0039]  2       BB14,BB15             1    100 [015..016)-> BB18(1)                 (always)                     i IBC
BB10 [0034]  1       BB12                  0      0 [015..015)-> BB11(1)                 (always)                     i IBC rare loophead hascall gcsafe bwd bwd-target
BB11 [0041]  2       BB09,BB10             0      0 [015..016)-> BB12(1)                 (always)                     i IBC rare bwd
BB17 [0013]  1       BB04                  0      0 [015..016)-> BB18(1)                 (always)                     i IBC rare hascall gcsafe
BB18 [0014]  2       BB16,BB17             1    100 [015..???)-> BB20(1)                 (always)                     i IBC internal
BB19 [0005]  1       BB03                  0      0 [???..???)-> BB20(1)                 (always)                     i IBC rare internal hascall gcsafe
BB20 [0003]  2       BB18,BB19             1    100 [01A..01B)                           (return)                     i IBC internal
BB02 [0001]  1       BB01                  0      0 [003..00E)                           (throw )                     i IBC rare hascall gcsafe newobj
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

BB09->BB11 is the first non-fallthrough instance we have the opportunity to fix up. However, the profile data indicates BB11 is rarely-run, which does not make sense, considering BB09 is hot and always jumps to it. Our hands are tied here though, as fgMoveColdBlocks will undo any fallthrough we create in these cases. Here's the final layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100 [000..003)-> BB03(1),BB02(0)         ( cond )                     i IBC
BB03 [0008]  1       BB01                  1    100 [000..01A)-> BB19(0),BB04(1)         ( cond )                     i IBC idxlen nullcheck
BB04 [0004]  1       BB03                  1    100 [015..016)-> BB17(0),BB05(1)         ( cond )                     i IBC
BB05 [0012]  1       BB04                  1    100 [015..016)-> BB09(0.999),BB06(0.000981)    ( cond )                     i IBC
BB09 [0033]  1       BB05                  1.00 100 [015..016)-> BB11(1)                 (always)                     i IBC
BB06 [0030]  1       BB05                  0.00   0 [015..016)-> BB07(1)                 (always)                     i IBC
BB07 [0031]  2       BB06,BB07             0.00   0 [015..016)-> BB07(0),BB08(1)         ( cond )                     i IBC loophead hascall gcsafe bwd bwd-target bwd-src
BB08 [0032]  1       BB07                  0.00   0 [015..016)-> BB12(1)                 (always)                     i IBC hascall gcsafe
BB12 [0035]  2       BB08,BB11             1.00 100 [015..016)-> BB10(0),BB13(1)         ( cond )                     i IBC bwd bwd-src
BB13 [0036]  1       BB12                  1.00 100 [015..016)-> BB15(0),BB14(1)         ( cond )                     i IBC hascall gcsafe
BB14 [0037]  1       BB13                  1.00 100 [015..016)-> BB16(1)                 (always)                     i IBC
BB16 [0039]  2       BB14,BB15             1    100 [015..016)-> BB18(1)                 (always)                     i IBC
BB18 [0014]  2       BB16,BB17             1    100 [015..???)-> BB20(1)                 (always)                     i IBC internal
BB20 [0003]  2       BB18,BB19             1    100 [01A..01B)                           (return)                     i IBC internal
BB15 [0038]  1       BB13                  0      0 [015..016)-> BB16(1)                 (always)                     i IBC rare hascall gcsafe
BB10 [0034]  1       BB12                  0      0 [015..015)-> BB11(1)                 (always)                     i IBC rare loophead hascall gcsafe bwd bwd-target
BB11 [0041]  2       BB09,BB10             0      0 [015..016)-> BB12(1)                 (always)                     i IBC rare bwd
BB17 [0013]  1       BB04                  0      0 [015..016)-> BB18(1)                 (always)                     i IBC rare hascall gcsafe
BB19 [0005]  1       BB03                  0      0 [???..???)-> BB20(1)                 (always)                     i IBC rare internal hascall gcsafe
BB02 [0001]  1       BB01                  0      0 [003..00E)                           (throw )                     i IBC rare hascall gcsafe newobj
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Notice BB06, BB07, and BB08 also have weights of 0, but aren't flagged as rarely-run, and thus aren't moved, but fixing BB11's weight would fix this issue too: creating fallthrough from BB09 to BB11 would enable us to also move BB12-BB20 up, thus pushing the cold blocks out of the way. I suspect this poor interleaving explains the large benchmark regression. k-opt isn't any less resilient to this, as it relies on profile data to compute layout costs. So if the profile data is misleading, we probably won't get a good layout either way.

@AndyAyersMS I haven't investigated which phases are responsible for the profile data issues just yet, though perhaps we could consider running profile repair right before layout to expedite solving this? I can collect some metrics to get an idea of how common this issue of hot blocks' most likely successors being cold is. If we can get the profile data consistent at least in terms of the BBF_RUN_RARELY flag usage, I suspect #102927 will fix many of these regressions.

@AndyAyersMS
Copy link
Member

I'm surprised the 3-opt is looking so costly. It should be fairly cheap.

I suppose it makes sense to try profile repair. Last we knew 30%+ of methods were profile inconsistent after inlining, and it can only get worse from there.

If you share out the jitdump for the above I can perhaps start working on fixing the maintenance issues.

@amanasifkhalid
Copy link
Member

I'm surprised the 3-opt is looking so costly. It should be fairly cheap.

It's possible there's some significant oversight in my implementation... I tried to keep it simple: For now, layout cost is modeled solely with edge weights, where edges with fallthrough behavior just have a cost of zero. With a sufficient number of iterations (usually no more than 5), I was getting good results for the benchmarks I looked at above in this comment. But it was costly...

If you share out the jitdump for the above I can perhaps start working on fixing the maintenance issues.

Thank you! Here it is

@amanasifkhalid
Copy link
Member

System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateArray is another interesting example, though not necessarily because of profile inconsistency issues. Here's the layout, with changes from #102927:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight        IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1       64216 [000..001)-> BB03(1),BB02(0)         ( cond )                     i IBC nullcheck
BB03 [0015]  1       BB01                  1       64216 [000..001)-> BB06(1),BB18(0)         ( cond )                     i IBC
BB06 [0024]  1       BB03                  1       64216 [000..001)-> BB09(1),BB07(0)         ( cond )                     i IBC
BB09 [0032]  1       BB06                  1       64216 [000..001)-> BB10(1),BB23(0)         ( cond )                     i IBC idxlen
BB10 [0037]  1       BB09                  1       64216 [000..001)-> BB12(0.2),BB11(0.8)     ( cond )                     i IBC idxlen nullcheck
BB11 [0042]  1       BB10                  0.80    51373 [000..001)-> BB12(1)                 (always)                     i IBC hascall gcsafe
BB12 [0043]  2       BB10,BB11             1       64216 [000..001)-> BB13(1)                 (always)                     i IBC idxlen nullcheck
BB13 [0038]  2       BB08,BB12             1.00    64216 [000..001)-> BB14(1),BB28(0)         ( cond )                     i IBC
BB14 [0056]  1       BB13                  1.00    64216 [000..001)-> BB17(1),BB16(0)         ( cond )                     i IBC
BB17 [0011]  1       BB14                  1       64216 [000..001)-> BB19(1),BB18(0)         ( cond )                     i IBC nullcheck
BB19 [0068]  1       BB17                  1       64216 [000..001)-> BB22(1),BB20(0)         ( cond )                     i IBC
BB22 [0076]  1       BB19                  1       64216 [000..001)-> BB24(1),BB23(0)         ( cond )                     i IBC idxlen
BB24 [0081]  1       BB22                  1       64216 [000..001)-> BB26(0.2),BB25(0.8)     ( cond )                     i IBC idxlen nullcheck
BB25 [0086]  1       BB24                  0.80    51373 [000..001)-> BB26(1)                 (always)                     i IBC hascall gcsafe
BB26 [0087]  2       BB24,BB25             1       64216 [000..001)-> BB27(1)                 (always)                     i IBC idxlen nullcheck
BB27 [0082]  2       BB21,BB26             1.00    64216 [000..001)-> BB29(1),BB28(0)         ( cond )                     i IBC
BB29 [0100]  1       BB27                  1.00    64216 [000..001)-> BB31(0.00345),BB30(0.997)   ( cond )                     i IBC
BB30 [0061]  1       BB29                  1.00    63994 [000..001)-> BB32(1)                 (always)                     i IBC
BB32 [0065]  2       BB30,BB31             1       64216 [000..014)-> BB33(1)                 (always)                     i IBC nullcheck
BB31 [0062]  1       BB29                  0.00      222 [000..001)-> BB32(1)                 (always)                     i IBC
BB33 [0001]  2  0    BB32,BB50           297.92 19131096 [014..01F)-> BB35(1),BB34(0)         ( cond ) T0      try {       i IBC keep loophead bwd
BB35 [0114]  1  0    BB33                297.92 19131096 [01E..01F)-> BB37(0.999),BB36(0.000993)    ( cond ) T0                  i IBC bwd
BB37 [0116]  1  0    BB35                297.62 19112096 [01E..01F)-> BB38(1),BB52(0)         ( cond ) T0                  i IBC nullcheck bwd
BB38 [0127]  1  0    BB37                297.62 19112096 [01E..01F)-> BB41(1),BB39(0)         ( cond ) T0                  i IBC bwd
BB41 [0135]  1  0    BB38                297.62 19112096 [01E..01F)-> BB42(1),BB53(0)         ( cond ) T0                  i IBC idxlen bwd
BB42 [0140]  1  0    BB41                297.62 19112096 [01E..01F)-> BB44(0.2),BB43(0.8)     ( cond ) T0                  i IBC idxlen nullcheck bwd
BB43 [0145]  1  0    BB42                238.10 15289676 [01E..01F)-> BB44(1)                 (always) T0                  i IBC hascall gcsafe bwd
BB44 [0146]  2  0    BB42,BB43           297.62 19112096 [01E..01F)-> BB45(1)                 (always) T0                  i IBC idxlen nullcheck bwd
BB45 [0141]  2  0    BB40,BB44           297.62 19112096 [01E..01F)-> BB46(1),BB54(0)         ( cond ) T0                  i IBC bwd
BB46 [0159]  1  0    BB45                297.62 19112096 [01E..01F)-> BB48(0.00345),BB47(0.997)   ( cond ) T0                  i IBC bwd
BB47 [0120]  1  0    BB46                296.59 19046097 [01E..01F)-> BB49(1)                 (always) T0                  i IBC bwd
BB49 [0124]  2  0    BB47,BB48           297.62 19112096 [01E..01F)-> BB50(1)                 (always) T0                  i IBC nullcheck bwd
BB50 [0117]  2  0    BB36,BB49           297.92 19131096 [01E..01F)-> BB33(0.997),BB55(0.00336)   ( cond ) T0                  i IBC bwd
BB48 [0121]  1  0    BB46                  1.03    65998 [01E..01F)-> BB49(1)                 (always) T0                  i IBC bwd
BB36 [0115]  1  0    BB35                  0.30    19000 [01E..01F)-> BB50(1)                 (always) T0                  i IBC bwd
BB39 [0132]  1  0    BB38                  0           0 [01E..01F)-> BB40(0.2),BB53(0.8)     ( cond ) T0                  i IBC rare bwd
BB53 [0139]  2  0    BB39,BB41             0           0 [01E..01F)                           (throw ) T0                  i IBC rare hascall gcsafe bwd
BB40 [0134]  1  0    BB39                  0           0 [01E..01F)-> BB45(1)                 (always) T0                  i IBC rare bwd
BB54 [0158]  1  0    BB45                  0           0 [01E..01F)                           (throw ) T0                  i IBC rare hascall gcsafe bwd
BB52 [0126]  1  0    BB37                  0           0 [01E..01F)                           (throw ) T0                  i IBC rare hascall gcsafe bwd
BB34 [0113]  1  0    BB33                  0           0 [01E..01F)-> BB55(1)                 (always) T0      }           i IBC rare bwd
BB55 [0166]  2       BB34,BB50             1.00    64216 [029..038)                           (return)                     i IBC keep cfb cfe
BB07 [0029]  1       BB06                  0           0 [000..001)-> BB08(0.2),BB23(0.8)     ( cond )                     i IBC rare
BB08 [0031]  1       BB07                  0           0 [000..001)-> BB13(1)                 (always)                     i IBC rare
BB20 [0073]  1       BB19                  0           0 [000..001)-> BB21(0.2),BB23(0.8)     ( cond )                     i IBC rare
BB23 [0080]  4       BB07,BB09,BB20,BB22   0           0 [000..001)                           (throw )                     i IBC rare hascall gcsafe
BB21 [0075]  1       BB20                  0           0 [000..001)-> BB27(1)                 (always)                     i IBC rare
BB16 [0010]  1       BB14                  0           0 [000..001)                           (throw )                     i IBC rare hascall gcsafe
BB28 [0099]  2       BB13,BB27             0           0 [000..001)                           (throw )                     i IBC rare hascall gcsafe
BB18 [0067]  2       BB03,BB17             0           0 [000..001)                           (throw )                     i IBC rare hascall gcsafe
BB02 [0014]  1       BB01                  0           0 [000..001)                           (throw )                     i IBC rare hascall gcsafe newobj
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ funclets follow
BB56 [0005]  1     0                       0           0 [029..037)                           (falret)    H0 F fault { }   i IBC rare keep xentry flet
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

It looks like we're doing a good job of prioritizing fallthrough on the hottest paths. BB31 is an odd case: It isn't marked as rarely-run because of its non-zero IBC, but its scaled weight is close to zero. In such cases, should BB31 be considered cold by fgMoveColdBlocks, and moved to the end of the method to enable fallthrough on the hot path? Should fgMoveColdBlocks check if a block's weight is sufficiently close to zero rather than the rarely-run flag, or should we keep its definition of "cold" consistent with the rest of the JIT?

@AndyAyersMS
Copy link
Member

I think it make sense for small but non-zero to be treated like cold.. perhaps some kind of an absolute threshold (say normalized weight is > 0.01).

Do you know why BB55 ends up where it does?

@amanasifkhalid
Copy link
Member

Do you know why BB55 ends up where it does?

BB55 is reachable from blocks in T0, but it isn't in the try region, so we place it after. fgMoveColdBlocks moves cold blocks to the end of their respective regions to keep EH regions contiguous -- this has the sometimes undesirable effect of pushing non-cold blocks after try regions further away from their predecessors, though I'm not sure there's a way around this due to EH constraints.

@amanasifkhalid
Copy link
Member

I think it make sense for small but non-zero to be treated like cold.. perhaps some kind of an absolute threshold (say normalized weight is > 0.01).

I'm going to go ahead and try this out, but it's worth noting that if we turn hot/cold splitting on, we'll probably want to update fgDetermineFirstColdBlock to use the same definition of "cold," or otherwise we'll be inhibiting splitting with this change.

@amanasifkhalid
Copy link
Member

amanasifkhalid commented Jun 14, 2024

Benchstone.BenchF.NewtR:Test is another weird case. Here's the layout with #102927's changes:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0018]  1                             0.60   1 [???..???)-> BB11(0.00595),BB02(0.994)   ( cond )                     IBC internal
BB02 [0020]  1       BB01                  0.59   1 [01D..???)-> BB03(1)                 (always)                     IBC internal
BB03 [0001]  2       BB02,BB09            99.41 167 [01D..03A)-> BB05(1)                 (always)                     i IBC loophead bwd bwd-target
BB05 [0009]  2       BB03,BB08            98.65 166 [01D..01E)-> BB09(1),BB06(0.000499)  ( cond )                     i IBC loophead bwd bwd-target
BB09 [0022]  4       BB05,BB06,BB07,BB08  99.41 167 [01D..06D)-> BB03(0.994),BB11(0.00595)   ( cond )                     i IBC bwd
BB06 [0010]  1       BB05                  0.05   0 [01D..01E)-> BB09(0),BB07(1)         ( cond )                     i IBC hascall gcsafe bwd
BB07 [0011]  1       BB06                  0.05   0 [01D..01E)-> BB09(0),BB08(1)         ( cond )                     i IBC bwd
BB08 [0012]  1       BB07                 99.45 167 [01D..01E)-> BB05(0.992),BB09(0.00812)   ( cond )                     i IBC bwd
BB11 [0023]  2       BB01,BB09             0.60   1 [06D..090)                           (return)                     i IBC gcsafe newobj
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

It would be nice if we moved BB11 up to after BB09, but the loop body looks good. I don't know where BB08's weight is coming from, since it's only reachable from BB07 -- looks like another profile consistency issue? Here's the JIT dump: gist.

Edit: I tweaked fgMoveHotBlocks to consider moving up the unlikely edge of BBJ_COND blocks if the likely edge is a backedge. Updated layout looks a bit better:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0018]  1                             0.60   1 [???..???)-> BB11(0.00595),BB02(0.994)   ( cond )                     IBC internal
BB02 [0020]  1       BB01                  0.59   1 [01D..???)-> BB03(1)                 (always)                     IBC internal
BB03 [0001]  2       BB02,BB09            99.41 167 [01D..03A)-> BB05(1)                 (always)                     i IBC loophead bwd bwd-target
BB05 [0009]  2       BB03,BB08            99.08 167 [01D..01E)-> BB09(1),BB06(0.0005)    ( cond )                     i IBC loophead bwd bwd-target
BB09 [0022]  4       BB05,BB06,BB07,BB08  99.41 167 [01D..06D)-> BB03(0.994),BB11(0.00595)   ( cond )                     i IBC bwd
BB11 [0023]  2       BB01,BB09             0.60   1 [06D..090)                           (return)                     i IBC gcsafe newobj
BB06 [0010]  1       BB05                  0.05   0 [01D..01E)-> BB09(0),BB07(1)         ( cond )                     i IBC hascall gcsafe bwd
BB07 [0011]  1       BB06                  0.05   0 [01D..01E)-> BB09(0),BB08(1)         ( cond )                     i IBC bwd
BB08 [0012]  1       BB07                 99.45 167 [01D..01E)-> BB05(0.996),BB09(0.00378)   ( cond )                     i IBC bwd
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

@amanasifkhalid
Copy link
Member

amanasifkhalid commented Jun 14, 2024

GuardedDevirtualization.TwoClassInterface:Call has a similar layout shape, and the updated fgMoveHotBlocks seems to handle it better:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0009]  1                             0.60   10 [???..???)-> BB16(0.00595),BB02(0.994)   ( cond )                     IBC internal idxlen
BB02 [0010]  1       BB01                  0.59   10 [00E..???)-> BB10(0.01),BB05(0.99)   ( cond )                     IBC internal
BB05 [0012]  1       BB02                  0.59   10 [???..???)-> BB06(1)                 (always)                     IBC internal idxlen
BB06 [0001]  2       BB05,BB09            98.41 1654 [00E..00E)-> BB08(0),BB07(1)         ( cond )                     i IBC loophead idxlen bwd bwd-target
BB07 [0006]  1       BB06                 98.41 1654 [???..???)-> BB09(1)                 (always)                     i IBC internal idxlen nullcheck bwd
BB09 [0005]  2       BB07,BB08            98.41 1654 [00E..024)-> BB06(0.994),BB16(0.00595)   ( cond )                     i IBC idxlen bwd
BB16 [0021]  3       BB01,BB09,BB14        0.60   10 [024..026)                           (return)                     i IBC
BB10 [0013]  1       BB02                  0.01    0 [???..???)-> BB11(1)                 (always)                     IBC internal idxlen
BB11 [0014]  2       BB10,BB14             0.99   17 [00E..00E)-> BB13(0),BB12(1)         ( cond )                     i IBC loophead idxlen bwd bwd-target
BB12 [0015]  1       BB11                  0.99   17 [???..???)-> BB14(1)                 (always)                     i IBC internal idxlen nullcheck bwd
BB14 [0017]  2       BB12,BB13             0.99   17 [00E..024)-> BB11(0.994),BB16(0.00595)   ( cond )                     i IBC idxlen bwd
BB08 [0007]  1       BB06                  0       0 [???..???)-> BB09(1)                 (always)                     i IBC rare internal hascall gcsafe idxlen bwd
BB13 [0016]  1       BB11                  0       0 [???..???)-> BB14(1)                 (always)                     i IBC rare internal hascall gcsafe idxlen bwd
BB19 [0022]  0                             0         [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

All the other regressed GuardedDevirtualization benchmarks (TwoClassVirtual, ThreeClassVirtual, ThreeClassInterface) have the same shape that can now fall out of the hot loop into the return block. Those feature heavily in the top regressions on x64 and arm64, and the new layout looks optimal, so I'd be surprised if these regressions aren't resolved.

amanasifkhalid added a commit that referenced this issue Jun 15, 2024
…oldBlocks (#103492)

Based on feedback in #102763 (comment), define "cold" blocks based on whether their weights are below a certain threshold, rather than only considering blocks marked with BBF_RUN_RARELY, in fgMoveColdBlocks. I added a BasicBlock method for doing this weight check rather than localizing it to fgMoveColdBlocks, as I plan to use it elsewhere in the layout phase.
@amanasifkhalid
Copy link
Member

amanasifkhalid commented Jun 15, 2024

LinqBenchmarks:Count00ForX highlights a case where our loop misrotation fix makes the wrong decision. Here's the initial RPO layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0011]  1                             1      98 [???..???)-> BB06(1)                 (always)                     i IBC internal
BB06 [0005]  4       BB01,BB03,BB04,BB05 100    9778 [02B..02C)-> BB14(0.00016),BB07(1)   ( cond )                     i IBC bwd bwd-src osr-entry
BB07 [0020]  1       BB06                 99.98 9776 [02B..02C)-> BB10(0.161),BB08(0.839) ( cond )                     i IBC bwd
BB08 [0021]  1       BB07                 83.93 8207 [02B..02C)-> BB11(1)                 (always)                     i IBC idxlen bwd
BB10 [0026]  1       BB07                 16.07 1571 [02B..02C)-> BB11(1)                 (always)                     i IBC bwd
BB11 [0023]  2       BB08,BB10           100    9778 [02B..034)-> BB04(0.983),BB12(0.0171)  ( cond )                     i IBC bwd bwd-src
BB04 [0003]  1       BB11                 98.29 9611 [015..027)-> BB06(0.933),BB05(0.0666)  ( cond )                     i IBC loophead bwd bwd-target
BB05 [0004]  1       BB04                  6.55  640 [027..02B)-> BB06(1)                 (always)                     i IBC bwd
BB12 [0006]  1       BB11                  1.71  167 [034..050)-> BB02(0.994),BB13(0.00595)   ( cond )                     i IBC bwd
BB02 [0001]  1       BB12                  1.71  167 [00C..013)-> BB03(1)                 (always)                     i IBC loophead nullcheck bwd bwd-target
BB03 [0002]  1       BB02                  1.71  167 [???..???)-> BB06(1)                 (always)                     i IBC keep internal bwd
BB13 [0010]  1       BB12                  0.01    1 [050..059)                           (return)                     i IBC
BB14 [0025]  1       BB06                  0       0 [02B..02C)                           (throw )                     i IBC rare hascall gcsafe bwd
BB15 [0030]  0                             0         [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

And the final layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0011]  1                             1      98 [???..???)-> BB06(1)                 (always)                     i IBC internal
BB05 [0004]  1       BB04                  6.55  640 [027..02B)-> BB06(1)                 (always)                     i IBC bwd
BB06 [0005]  4       BB01,BB03,BB04,BB05 100    9778 [02B..02C)-> BB14(0.00016),BB07(1)   ( cond )                     i IBC bwd bwd-src osr-entry
BB07 [0020]  1       BB06                 99.98 9776 [02B..02C)-> BB10(0.161),BB08(0.839) ( cond )                     i IBC bwd
BB08 [0021]  1       BB07                 83.93 8207 [02B..02C)-> BB11(1)                 (always)                     i IBC idxlen bwd
BB11 [0023]  2       BB08,BB10           100    9778 [02B..034)-> BB04(0.983),BB12(0.0171)  ( cond )                     i IBC bwd bwd-src
BB04 [0003]  1       BB11                 98.29 9611 [015..027)-> BB06(0.933),BB05(0.0666)  ( cond )                     i IBC loophead bwd bwd-target
BB10 [0026]  1       BB07                 16.07 1571 [02B..02C)-> BB11(1)                 (always)                     i IBC bwd
BB12 [0006]  1       BB11                  1.71  167 [034..050)-> BB02(0.994),BB13(0.00595)   ( cond )                     i IBC bwd
BB02 [0001]  1       BB12                  1.71  167 [00C..013)-> BB03(1)                 (always)                     i IBC loophead nullcheck bwd bwd-target
BB03 [0002]  1       BB02                  1.71  167 [???..???)-> BB06(1)                 (always)                     i IBC keep internal bwd
BB13 [0010]  1       BB12                  0.01    1 [050..059)                           (return)                     i IBC
BB14 [0025]  1       BB06                  0       0 [02B..02C)                           (throw )                     i IBC rare hascall gcsafe bwd
BB15 [0030]  0                             0         [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

The decision to move BB05 ends up breaking fallthrough from BB01 into BB06 without introducing any new fallthrough. We could fix this in fgMoveHotBlocks by not moving backward jumps if the previous block falls into the block with the jump, and breaking this fallthrough won't introduce new fallthrough into the next block. But this starts to get a bit pattern-matchy, so perhaps we accept this for now until we have a more general (i.e. 3-opt) solution.

@amanasifkhalid
Copy link
Member

Sorry for all the comment spam. I've looked at the remaining top regressions, and I haven't seen anything noteworthy -- in general, #102927 seems to do a good job of maximizing fallthrough on hot paths, so long as the profile data is sensible. As such, I'm planning on getting that merged for now, and coming back to 3-opt after Preview 6.

amanasifkhalid added a commit that referenced this issue Jun 17, 2024
…102927)

After establishing an RPO-based layout, we currently try to move any backward jumps up to their successors, if it is profitable to do so in spite of any fallthrough behavior lost. In #102763, we see many instances where the RPO layout fails to create fallthrough for forward jumps on the hot path, such as in cases where a block is reachable from many predecessors. This work addresses the RPO's limitations by also considering moving the targets of forward jumps (conditional and unconditional) to maximize fallthrough.
@amanasifkhalid
Copy link
Member

I've merged #102927 -- fingers crossed we see an improvement in the next triage...

@amanasifkhalid
Copy link
Member

I'm punting this to .NET 10, as we're planning to continue to iterate on the new layout. We'll re-evaluate any remaining regressions here with each change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

No branches or pull requests

3 participants