Widespread perf regressions due to RPO layout #102763
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
@EgorBo Thanks for opening this. I haven't spent too much time looking at the regressions yet, but based on the larger ones I've looked at, I think we need a more general/powerful implementation of #102461. Take …
The interesting ones are …
In the new layout, …
I think modifying … Thoughts on this? cc @AndyAyersMS for visibility. Thanks!
I would suggest we list all the regressions in rank order and then investigate at least the top 20 or so. We should also see what the arm64 data looks like. Based on the above and the other fixup actions, it feels like we are getting to the point where we may actually want to create an explicit cost/benefit model to assess incremental improvements. We have a layout configuration A, and some proposed alternate B. Which is better? If you want to try to build this out, we should talk. The general model is something like the following:
Since there are just 3 cut points, if you have an initial cost model it's not too hard to compute the delta to the new cost model; you (mostly) just need to recost the segment-ending blocks (if the cost model is block-size sensitive then perhaps a bit more, but I don't think we have good size estimates yet?). For the time being the "identify" step can be finding a single misplaced block and a new proposed location, and then seeing if the cost improves.
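To make the model concrete, here is a minimal sketch under the assumption that a layout is scored by the total weight of flow edges that do not fall through; the types, names, and block-index representation are placeholders for illustration, not RyuJIT's data structures.

```cpp
// Sketch of a layout cost model: lower cost means more hot edges fall through.
#include <map>
#include <utility>
#include <vector>

// Edge weights keyed by (from, to) block indices; edges absent from the map have weight 0.
using EdgeWeights = std::map<std::pair<int, int>, double>;

static double LayoutCost(const EdgeWeights& edges, const std::vector<int>& order)
{
    // Start from the total weight of every edge, then credit back each edge that becomes
    // fallthrough under this block order.
    double cost = 0.0;
    for (const auto& e : edges)
    {
        cost += e.second;
    }
    for (size_t i = 0; i + 1 < order.size(); i++)
    {
        auto it = edges.find({order[i], order[i + 1]});
        if (it != edges.end())
        {
            cost -= it->second;
        }
    }
    return cost;
}
```

Because moving one block only changes adjacency at the three cut points, the delta between layouts A and B can be computed by re-costing just the segment-ending pairs rather than re-running something like this over the whole order.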
Sorry about the delay in getting back to this; the script we typically use to collate regression data doesn't seem to work since the perflab test report URLs changed, so I hacked up my own script. Here are the top 20 regressions on Windows x64:
And on Linux x64:
I'll start digging into the overlapping benchmarks (this explains …).
I'm seeing the same theme of interleaving hot and colder blocks throughout these regressions. The RPO-based layout on its own doesn't seem to leverage edge weights heavily enough, and while …
Notice how …
If we considered moving forward jumps, we'd have the following decision tree:
For a few other short examples, just to build my case, here's the new layout of …
And for …
And for …
I'd be interested in pursuing this, though based on the above, I think it might be worthwhile trying to implement a more general post-RPO pass that does the moves I described above, since it seems simple enough.
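For illustration, here is a rough sketch of the profitability check such a post-RPO pass might make before moving a forward jump's target; the names and the exact comparison are assumptions, not the decision tree described above.

```cpp
// Illustrative only: decide whether to move the target of a forward jump up next to the jump
// source, by comparing the fallthrough weight gained against the fallthrough weight broken.
struct FlowEdgeSketch
{
    double weight; // profile weight of the edge
};

static bool ShouldMoveForwardJumpTarget(const FlowEdgeSketch& jumpToTarget,   // source -> target (the forward jump)
                                        const FlowEdgeSketch& fallIntoNext,   // source -> block currently after the source
                                        const FlowEdgeSketch& fallIntoTarget) // target's current layout predecessor -> target
{
    const double gained = jumpToTarget.weight;
    const double lost   = fallIntoNext.weight + fallIntoTarget.weight;
    return gained > lost;
}
```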
I am surprised some of these are so large. Do we have any results for AMD HW? All the reports above are for Intel.
Randomly looking at improvements, it seems most are also Intel only. But not all. Here's …
@DrewScoggins is looking into whether there are unfiled AMD64 reports out there. From what I can tell the newer AMD64 hardware is mostly indifferent.
Here's a cross-arch regression: System.Linq.Tests.Perf_Enumerable.Contains_ElementNotFound(input: IEnumerable). @amanasifkhalid perhaps dig into this one, since it is less likely to be caused by some microarchitectural issue.
@AndyAyersMS Thanks for pointing out that benchmark; this one seems to suffer from the same interleaving of hot/cold paths, too. Here's the old layout for …
And the new layout:
Both seem to do a good job of keeping the loop in …
It is hard to see how the return block placement could have that much impact on perf -- we are looking at a pretty big regression here. Can you copy out the two inner loop disasms? Also puzzled why BB11 doesn't have IBC data -- these sorts of mixed IBC/noIBC cases have caused us trouble in the past.
Sure. Old layout:
New layout:
The new layout has one fewer branch on the loop path, so it seems better?
Seems like we ought to be aligning these loops... any idea why not? Also interesting that we don't have a peephole opt for this:
mov dword ptr [rsi+0x08], ecx
mov edx, dword ptr [rsi+0x08]
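As a hypothetical illustration of the missing peephole (not the emitter's actual API or data structures): when a load reads exactly the location the preceding store just wrote, the reload could become a register-to-register move.

```cpp
// Hypothetical sketch: the pattern above,
//   mov dword ptr [rsi+0x08], ecx
//   mov edx, dword ptr [rsi+0x08]
// could be rewritten as `mov edx, ecx` when the store and load provably access the same
// address and width, with nothing in between that could alias the location.
struct MemAccessSketch
{
    int baseReg;   // base register of the address
    int offset;    // displacement
    int sizeBytes; // access width
};

static bool CanForwardStoreToLoad(const MemAccessSketch& store, const MemAccessSketch& load)
{
    return (store.baseReg == load.baseReg) && (store.offset == load.offset) &&
           (store.sizeBytes == load.sizeBytes);
}
```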
Sure. With the old layout, here's the disasm up to the loop end:
And with the new layout:
We're hitting the JCC erratum more often than I previously thought, though only one instance is in the loop body. And we're failing to align the loop correctly in both cases. Looking at the dump, the JIT is complaining this loop needs too much padding:
Is the first bit of asm above actually the new code too? They look the same. I am confused by the padding output too -- looks like the loop tops are at 0x75 and 0x6D, so to reach the next 16-byte boundaries at 0x80/0x70 we would need only 11 bytes and 3 bytes of padding. So it seems like we should pad in the new case, and in neither case would we need 15 bytes.
You're right, sorry about that -- I updated it. Looking at the dump for the old layout, the JIT is overestimating the padding needed, and thus skipping alignment here, too:
I'm wondering if the block-level peephole optimization for skipping unconditional jumps, or the fact that conditional blocks may no longer fall into their false targets, is causing this overestimation? Though …
The older padding computation seems ok? Loop top is at 0x75 so we would need 0x0B == 11 bytes to reach 0x80, and we're not willing to pad by more than 8. So we bail. The new layout, though...? Do you see a perf difference locally when you run these? I am not sure why the new version ends up being so much slower.
So it turns out the final offset of the loop beginning in the new layout is 0x6D, but when deciding whether to align or not, the JIT uses the loop's estimated offset, which in this case is 0x71 -- hence the JIT thinks we need 15 bytes to align the loop. The difference in estimated vs actual offset stems from a jump preceding the loop being predicted as 6 bytes in length, and later turning out to be only 2 bytes.
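A small worked sketch of the arithmetic in this exchange; the 8-byte padding budget comes from the comment above, and this is not the JIT's actual alignment code.

```cpp
// Padding needed to bring an offset up to the next 16-byte boundary.
static unsigned PaddingToNextBoundary(unsigned offset, unsigned alignment = 16)
{
    return (alignment - (offset % alignment)) % alignment;
}

// Old layout: loop top at 0x75 -> 11 bytes to reach 0x80, over the 8-byte budget, so no padding.
// New layout: final loop top at 0x6D -> only 3 bytes needed, but the decision is made from the
// estimated offset of 0x71, which implies 15 bytes, so the JIT bails there as well.
```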
Old layout:
New layout:
I've collated the top 20 benchmark regressions across platforms, not double-counting any repeat offenders -- the duplicate names are from GitHub's markdown viewer not rendering templated types due to the …
I opened a PR (#103450) implementing a 3-opt pass post-RPO layout, but even with a few iterations, TP impact is up to 10%. I'm concerned that too few iterations leaves too much code quality on the table, but the TP cost is significant. This has pushed me to reconsider my narrower (and cheaper) approach in #102927. I'll update with the new layouts for the biggest regressions below, but I'd like to highlight a blocker I've noticed in …
Notice …
@AndyAyersMS I haven't investigated which phases are responsible for the profile data issues just yet, though perhaps we could consider running profile repair right before layout to expedite solving this? I can collect some metrics to get an idea of how common this issue of hot blocks' most likely successors being cold is. If we can get the profile data consistent, at least in terms of the …
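A rough sketch of one way such a metric could be collected; the types, names, and thresholds here are placeholders rather than JIT code.

```cpp
// Placeholder sketch: count hot blocks whose most likely successor looks cold.
#include <cstddef>
#include <vector>

struct BlockProfileSketch
{
    double weight;           // block's profile weight
    double likelySuccWeight; // weight of its most likely successor (0 if it has none)
};

static size_t CountHotBlocksWithColdLikelySucc(const std::vector<BlockProfileSketch>& blocks,
                                               double hotThreshold, double coldThreshold)
{
    size_t count = 0;
    for (const BlockProfileSketch& b : blocks)
    {
        if ((b.weight >= hotThreshold) && (b.likelySuccWeight <= coldThreshold))
        {
            count++;
        }
    }
    return count;
}
```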
I'm surprised the 3-opt is looking so costly. It should be fairly cheap. I suppose it makes sense to try profile repair. Last we knew 30%+ of methods were profile inconsistent after inlining, and it can only get worse from there. If you share out the jitdump for the above I can perhaps start working on fixing the maintenance issues.
It's possible there's some significant oversight in my implementation... I tried to keep it simple: For now, layout cost is modeled solely with edge weights, where edges with fallthrough behavior just have a cost of zero. With a sufficient number of iterations (usually no more than 5), I was getting good results for the benchmarks I looked at above in this comment. But it was costly...
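To show what that costing looks like in practice, here is a minimal sketch of evaluating one 3-opt move under that model (swap two contiguous segments and re-cost only the boundary edges); the representation is illustrative, not the PR's implementation.

```cpp
// Illustrative 3-opt move evaluation: with cost = total edge weight minus the weight of edges
// that fall through, reordering S1 | S2 | S3 | S4 into S1 | S3 | S2 | S4 only changes the cost
// at the segment boundaries.
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using EdgeWeights = std::map<std::pair<int, int>, double>; // (from, to) -> profile weight

static double EdgeWeightOrZero(const EdgeWeights& edges, int from, int to)
{
    auto it = edges.find({from, to});
    return (it != edges.end()) ? it->second : 0.0;
}

// S1 ends at index endS1, S2 at endS2, S3 at endS3; S4 (possibly empty) is the rest.
// Returns the change in cost from the swap; negative means the swap helps.
static double ThreeOptDelta(const EdgeWeights& edges, const std::vector<int>& order,
                            size_t endS1, size_t endS2, size_t endS3)
{
    double oldFallthrough = EdgeWeightOrZero(edges, order[endS1], order[endS1 + 1]) +
                            EdgeWeightOrZero(edges, order[endS2], order[endS2 + 1]);
    double newFallthrough = EdgeWeightOrZero(edges, order[endS1], order[endS2 + 1]) +
                            EdgeWeightOrZero(edges, order[endS3], order[endS1 + 1]);

    if (endS3 + 1 < order.size()) // there is an S4 after the reordered segments
    {
        oldFallthrough += EdgeWeightOrZero(edges, order[endS3], order[endS3 + 1]);
        newFallthrough += EdgeWeightOrZero(edges, order[endS2], order[endS3 + 1]);
    }

    return oldFallthrough - newFallthrough;
}
```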
Thank you! Here it is: …
It looks like we're doing a good job of prioritizing fallthrough on the hottest paths.
I think it makes sense for small-but-non-zero weights to be treated like cold -- perhaps some kind of an absolute threshold (say, normalized weight is > 0.01). Do you know why BB55 ends up where it does?
I'm going to go ahead and try this out, but it's worth noting that if we turn hot/cold splitting on, we'll probably want to update …
It would be nice if we moved …
Edit: I tweaked …
All the other regressed …
…oldBlocks (#103492)
Based on feedback in #102763 (comment), define "cold" blocks based on whether their weights are below a certain threshold, rather than only considering blocks marked with BBF_RUN_RARELY, in fgMoveColdBlocks. I added a BasicBlock method for doing this weight check rather than localizing it to fgMoveColdBlocks, as I plan to use it elsewhere in the layout phase.
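A hedged sketch of that check; the method name, the run-rarely fallback, and the normalization against a reference weight are assumptions, not the PR's exact code.

```cpp
// Illustrative stand-in for a BasicBlock weight check: treat a block as cold if it is marked
// run-rarely, or if its weight is tiny relative to some reference weight (e.g. the method entry).
struct BasicBlockSketch
{
    double bbWeight;
    bool   isRunRarely;

    bool isCold(double referenceWeight, double threshold = 0.01) const
    {
        if (isRunRarely)
        {
            return true;
        }
        return (referenceWeight > 0.0) && ((bbWeight / referenceWeight) < threshold);
    }
};
```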
And the final layout:
The decision to move …
Sorry for all the comment spam. I've looked at the remaining top regressions, and I haven't seen anything noteworthy -- in general, #102927 seems to do a good job of maximizing fallthrough on hot paths, so long as the profile data is sensible. As such, I'm planning on getting that merged for now, and coming back to 3-opt after Preview 6.
…102927)
After establishing an RPO-based layout, we currently try to move any backward jumps up to their successors, if it is profitable to do so in spite of any fallthrough behavior lost. In #102763, we see many instances where the RPO layout fails to create fallthrough for forward jumps on the hot path, such as in cases where a block is reachable from many predecessors. This work addresses the RPO's limitations by also considering moving the targets of forward jumps (conditional and unconditional) to maximize fallthrough.
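A rough sketch of the splice itself, on a layout kept as an ordered list of block indices; this is for illustration only and is not the PR's implementation.

```cpp
// Illustrative only: move the target of a forward jump so it ends up immediately after the
// jump source, assuming the target currently appears later in the order.
#include <cstddef>
#include <vector>

static void MoveForwardJumpTarget(std::vector<int>& order, size_t sourcePos, size_t targetPos)
{
    // Assumes sourcePos < targetPos, i.e. the jump is a forward jump in the current order.
    const int target = order[targetPos];
    order.erase(order.begin() + targetPos);              // remove the target from its old position
    order.insert(order.begin() + sourcePos + 1, target); // place it right after the jump source
}
```

Whether a given move pays off is the interesting part; per the description above, the pass weighs the fallthrough gained at the jump source against the fallthrough lost at the target's old position.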
I've merged #102927 -- fingers crossed we see an improvement in the next triage... |
I'm punting this to .NET 10, as we're planning to continue to iterate on the new layout. We'll re-evaluate any remaining regressions here with each change. |
Looks like #102343 has a lot more regressions than improvements. cc @amanasifkhalid
Regressions:
Improvements:
NOTE: use the "Test report" links; it looks like "all time history" and the images are a bit out of date.