QuickJitForLoops causing regressions #80210

alexcovington · 2023-01-04T22:51:16Z

Description

I am noticing that some of the microbenchmarks perform worse, or take longer before TieredCompilation can fully optimize them. Specifically, this appears to be caused by the DOTNET_TC_QuickJitForLoops configuration setting.

For example, if I run some of the microbenchmarks using .NET 7.0 with default settings, I get worse performance than if I run with DOTNET_TC_QuickJitForLoops=0.

Using .\Microbenchmarks.exe --filter 'FractalPerf.Launch.Test':

// * Summary *

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.25267.1000)
.NET SDK=7.0.101
  [Host]     : .NET 7.0.1 (7.0.122.56804), X64 RyuJIT AVX2
  Job-BGBUKP : .NET 7.0.1 (7.0.122.56804), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1

|      Method |     Mean |   Error |  StdDev |   Median |      Min |      Max | Allocated |
|------------ |---------:|--------:|--------:|---------:|---------:|---------:|----------:|
| FractalPerf | 253.7 ms | 0.24 ms | 0.21 ms | 253.7 ms | 253.3 ms | 254.1 ms |     744 B |

Using .\Microbenchmarks.exe --filter 'FractalPerf.Launch.Test' --envVars DOTNET_TC_QuickJitForLoops:0:

// * Summary *

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.25267.1000)
.NET SDK=7.0.101
  [Host]     : .NET 7.0.1 (7.0.122.56804), X64 RyuJIT AVX2
  Job-ZQLSHR : .NET 7.0.1 (7.0.122.56804), X64 RyuJIT AVX2

EnvironmentVariables=DOTNET_TC_QuickJitForLoops=0  PowerPlanMode=00000000-0000-0000-0000-000000000000  IterationTime=250.0000 ms
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1

|      Method |     Mean |   Error |  StdDev |   Median |      Min |      Max | Allocated |
|------------ |---------:|--------:|--------:|---------:|---------:|---------:|----------:|
| FractalPerf | 129.8 ms | 0.10 ms | 0.09 ms | 129.8 ms | 129.6 ms | 129.9 ms |     444 B |
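
To put the two summaries side by side, here is a minimal sketch that computes the slowdown from the reported means. The two numbers are copied from the tables above; the function and variable names are only illustrative.

```python
# Means copied from the two BenchmarkDotNet summaries above (milliseconds).
mean_default_ms = 253.7    # .NET 7.0.1, default settings
mean_qjfl_off_ms = 129.8   # .NET 7.0.1, DOTNET_TC_QuickJitForLoops=0


def slowdown(slow_ms: float, fast_ms: float) -> float:
    """Return how many times longer the slow run takes than the fast run."""
    return slow_ms / fast_ms


factor = slowdown(mean_default_ms, mean_qjfl_off_ms)
extra_ms = mean_default_ms - mean_qjfl_off_ms

print(f"default settings take {factor:.2f}x as long per invocation")
print(f"extra time per invocation: {extra_ms:.1f} ms")
```

On these numbers, the default configuration takes close to twice as long per invocation on this machine.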

Configuration

All benchmarks were run on various x64 systems (AMD Ryzen and Intel).

Baseline .NET version used is .NET 6.0.12.

Comparison .NET version used is .NET 7.0.1.

I ran two comparisons:

The first comparison was .NET 6.0 vs. .NET 7.0 (DOTNET_TC_QuickJitForLoops=1).
The second comparison was .NET 6.0 vs. .NET 7.0 (DOTNET_TC_QuickJitForLoops=0).

Regression?

Yes, this is a regression going from .NET 6.0.12 -> .NET 7.0.1.

Data

I've noticed these microbenchmarks are affected by this:

Benchstone.BenchI.Array2.Test
Benchstone.BenchI.NDhrystone.Test
FractalPerf.Launch.Test
System.Collections.IndexerSetReverse<Int32>.IList(Size: 512)

6.0 vs 7.0 (Base, QuickJitForLoops=1)

## Benchstone.BenchI.Array2.Test

| Result | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Slower |  0.56 |        -240 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Slower |  0.59 |        -240 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## Benchstone.BenchI.NDhrystone.Test

| Result | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Slower |  0.87 |        -240 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Slower |  0.85 |       -2232 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## FractalPerf.Launch.Test

| Result | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Slower |  0.72 |        -784 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Slower |  0.52 |        -464 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## System.Collections.IndexerSetReverse<Int32>.IList(Size: 512)

| Result | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Slower |  0.88 |          +0 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Slower |  0.88 |          +0 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |

6.0 vs 7.0 (Diff, QuickJitForLoops=0)

## Benchstone.BenchI.Array2.Test

| Result | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Same   |  0.99 |        -240 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Same   |  1.01 |        -240 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## Benchstone.BenchI.NDhrystone.Test

| Result | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Same   |  1.00 |        -240 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Same   |  0.93 |       -2232 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## FractalPerf.Launch.Test

| Result | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Same   |  0.99 |        -784 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Same   |  1.02 |        -464 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## System.Collections.IndexerSetReverse<Int32>.IList(Size: 512)

| Result | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Same   |  1.00 |          +0 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Same   |  1.00 |          +0 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |

dotnet-issue-labeler · 2023-01-04T22:51:19Z

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost · 2023-01-05T08:49:24Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

I am noticing that some of the microbenchmarks perform worse, or take longer before TieredCompilation can fully optimize them. Specifically, this appears to be caused by the DOTNET_TC_QuickJitForLoops configuration setting.

For example, if I run some of the microbenchmarks using .NET 7.0 with default settings, I get worse performance than if I run with DOTNET_TC_QuickJitForLoops=0.

Using .\Microbenchmarks.exe --filter 'FractalPerf.Launch.Test':

// * Summary *

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.25267.1000)
AMD Eng Sample: 100-000000589-50_Y, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.101
  [Host]     : .NET 7.0.1 (7.0.122.56804), X64 RyuJIT AVX2
  Job-BGBUKP : .NET 7.0.1 (7.0.122.56804), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1

|      Method |     Mean |   Error |  StdDev |   Median |      Min |      Max | Allocated |
|------------ |---------:|--------:|--------:|---------:|---------:|---------:|----------:|
| FractalPerf | 253.7 ms | 0.24 ms | 0.21 ms | 253.7 ms | 253.3 ms | 254.1 ms |     744 B |

Using .\Microbenchmarks.exe --filter 'FractalPerf.Launch.Test' --envVars DOTNET_TC_QuickJitForLoops:0:

// * Summary *

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.25267.1000)
AMD Eng Sample: 100-000000589-50_Y, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.101
  [Host]     : .NET 7.0.1 (7.0.122.56804), X64 RyuJIT AVX2
  Job-ZQLSHR : .NET 7.0.1 (7.0.122.56804), X64 RyuJIT AVX2

EnvironmentVariables=DOTNET_TC_QuickJitForLoops=0  PowerPlanMode=00000000-0000-0000-0000-000000000000  IterationTime=250.0000 ms
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1

|      Method |     Mean |   Error |  StdDev |   Median |      Min |      Max | Allocated |
|------------ |---------:|--------:|--------:|---------:|---------:|---------:|----------:|
| FractalPerf | 129.8 ms | 0.10 ms | 0.09 ms | 129.8 ms | 129.6 ms | 129.9 ms |     444 B |

Configuration

All benchmarks were run on various x64 systems (AMD Zen 3, AMD Zen 4, and Intel Rocket Lake).

Baseline .NET version used is .NET 6.0.12.

Comparison .NET version used is .NET 7.0.1.

I ran two comparisons:

The first comparison was .NET 6.0 vs. .NET 7.0 (DOTNET_TC_QuickJitForLoops=1).
The second comparison was .NET 6.0 vs. .NET 7.0 (DOTNET_TC_QuickJitForLoops=0).

Regression?

Yes, this is a regression going from .NET 6.0.12 -> .NET 7.0.1.

Data

I've noticed these microbenchmarks are affected by this:

Benchstone.BenchI.Array2.Test
Benchstone.BenchI.NDhrystone.Test
FractalPerf.Launch.Test
System.Collections.IndexerSetReverse<Int32>.IList(Size: 512)

6.0 vs 7.0 (Base, QuickJitForLoops=1)

## Benchstone.BenchI.Array2.Test

| Result |         Base |         Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | ------------:| ------------:| -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Slower | 378068100.00 | 679266400.00 |  0.56 |        -240 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Slower | 382520250.00 | 643846900.00 |  0.59 |        -144 | Windows 11       | X64 | AMD Ryzen 9 5900X                     |         |
| Slower | 356147150.00 | 600143300.00 |  0.59 |        -240 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## Benchstone.BenchI.NDhrystone.Test

| Result |         Base |         Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | ------------:| ------------:| -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Slower | 287615500.00 | 329207650.00 |  0.87 |        -240 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Slower | 295240800.00 | 361841300.00 |  0.82 |       -1936 | Windows 11       | X64 | AMD Ryzen 9 5900X                     |         |
| Slower | 266774400.00 | 314499600.00 |  0.85 |       -2232 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## FractalPerf.Launch.Test

| Result |         Base |         Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | ------------:| ------------:| -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Slower | 177198500.00 | 245312800.00 |  0.72 |        -784 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Slower | 135577150.00 | 254808900.00 |  0.53 |        -416 | Windows 11       | X64 | AMD Ryzen 9 5900X                     |         |
| Slower | 130623500.00 | 252697300.00 |  0.52 |        -464 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## System.Collections.IndexerSetReverse<Int32>.IList(Size: 512)

| Result |    Base |    Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | -------:| -------:| -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Slower | 1039.31 | 1180.33 |  0.88 |          +0 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Slower | 1033.24 | 1180.34 |  0.88 |          +0 | Windows 11       | X64 | AMD Ryzen 9 5900X                     |         |
| Slower | 1034.39 | 1179.53 |  0.88 |          +0 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |
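
The Ratio column in these tables appears to be Base divided by Diff, so values below 1.00 mean the Diff run (.NET 7.0) was slower. The sketch below checks that reading against the Array2 rows. The numbers are copied from the table above; the unit of Base and Diff is not stated in the thread (presumably nanoseconds), and two-decimal rounding is an assumption.

```python
# (processor, Base, Diff, reported Ratio) copied from the
# Benchstone.BenchI.Array2.Test table for QuickJitForLoops=1.
rows = [
    ("i9-11900K", 378068100.00, 679266400.00, 0.56),
    ("Ryzen 9 5900X", 382520250.00, 643846900.00, 0.59),
    ("Ryzen 9 7900X", 356147150.00, 600143300.00, 0.59),
]


def ratio(base: float, diff: float) -> float:
    """Base/Diff: below 1.0 means the Diff run took longer than the Base run."""
    return base / diff


for cpu, base, diff, reported in rows:
    computed = ratio(base, diff)
    # Assumption: the comparison tool rounds the ratio to two decimals.
    print(f"{cpu}: Base/Diff = {computed:.2f} (table reports {reported:.2f})")
```

The same check can be applied to the other tables by swapping in their Base and Diff values.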

6.0 vs 7.0 (Diff, QuickJitForLoops=0)

## Benchstone.BenchI.Array2.Test

| Result |         Base |         Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | ------------:| ------------:| -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Same   | 378068100.00 | 380040700.00 |  0.99 |        -240 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Same   | 382520250.00 | 378924350.00 |  1.01 |        -144 | Windows 11       | X64 | AMD Ryzen 9 5900X                     |         |
| Same   | 356147150.00 | 353812900.00 |  1.01 |        -240 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## Benchstone.BenchI.NDhrystone.Test

| Result |         Base |         Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | ------------:| ------------:| -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Same   | 287615500.00 | 287273800.00 |  1.00 |        -240 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Slower | 295240800.00 | 354430000.00 |  0.83 |       -1936 | Windows 11       | X64 | AMD Ryzen 9 5900X                     |         |
| Same   | 266774400.00 | 285916500.00 |  0.93 |       -2232 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## FractalPerf.Launch.Test

| Result |         Base |         Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | ------------:| ------------:| -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Same   | 177198500.00 | 178737900.00 |  0.99 |        -784 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Same   | 135577150.00 | 135551375.00 |  1.00 |        -416 | Windows 11       | X64 | AMD Ryzen 9 5900X                     |         |
| Same   | 130623500.00 | 128415250.00 |  1.02 |        -464 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |


## System.Collections.IndexerSetReverse<Int32>.IList(Size: 512)

| Result |    Base |    Diff | Ratio | Alloc Delta | Operating System | Bit | Processor Name                        | Modality|
| ------ | -------:| -------:| -----:| -----------:| ---------------- | --- | ------------------------------------- | --------:|
| Same   | 1039.31 | 1035.28 |  1.00 |          +0 | Windows 11       | X64 | 11th Gen Intel Core i9-11900K 3.50GHz |         |
| Same   | 1033.24 | 1033.49 |  1.00 |          +0 | Windows 11       | X64 | AMD Ryzen 9 5900X                     |         |
| Same   | 1034.39 | 1033.07 |  1.00 |          +0 | Windows 11       | X64 | AMD Ryzen 9 7900X                     |         |

Author:	alexcovington
Assignees:	-
Labels:	`tenet-performance`, `area-CodeGen-coreclr`, `untriaged`
Milestone:	-

jakobbotsch · 2023-01-05T08:55:36Z

In general it was expected that certain microbenchmarks would be regressions due to various interactions between on-stack replacement and how BenchmarkDotNet works. @AndyAyersMS analyzed most of them in #33658 and #67594.

AndyAyersMS · 2023-01-07T00:41:00Z

> In general it was expected that certain microbenchmarks would be regressions

Right, mostly benchmarks whose invocation times are on the order of 100 ms or more and that spend most of their time in a single method. When this happens, BDN does not run many invocations per iteration, so the code being tested does not have a chance to tier up. Such code inevitably also contains loops, so BDN ends up measuring the performance of the OSR version of the method.

There are a variety of reasons why the OSR code may be less efficient than the tiered-up method code. I have ideas on how to mitigate some of these (see the linked issues above), but nothing is committed or scheduled.
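
The tiering argument above can be illustrated with a rough back-of-the-envelope model. This is only a sketch: the 30-call threshold is an assumed runtime default that is not stated in this thread, the function names are hypothetical, and BenchmarkDotNet's real logic for choosing the invocation count (and the runtime's background tier-up delay) is more involved than the simple division used here.

```python
import math

# Assumption (not from this thread): a method becomes eligible for tier 1
# after roughly this many calls.
ASSUMED_CALL_COUNT_THRESHOLD = 30


def invocations_per_iteration(iteration_time_ms: float, invocation_time_ms: float) -> int:
    """Rough model: how many invocations fit in one iteration (at least one)."""
    return max(1, math.floor(iteration_time_ms / invocation_time_ms))


def warmup_calls(iteration_time_ms: float, invocation_time_ms: float, warmup_count: int) -> int:
    """Calls made during warmup under the rough model above."""
    return warmup_count * invocations_per_iteration(iteration_time_ms, invocation_time_ms)


# IterationTime=250 ms and WarmupCount=1 come from the summaries in this issue.
# The 1 ms benchmark is a hypothetical contrast case.
cases = [("FractalPerf (default settings)", 253.7), ("hypothetical 1 ms benchmark", 1.0)]
for name, per_call_ms in cases:
    calls = warmup_calls(250.0, per_call_ms, warmup_count=1)
    verdict = "may tier up" if calls >= ASSUMED_CALL_COUNT_THRESHOLD else "stays in OSR code"
    print(f"{name}: {calls} warmup call(s), {verdict} before measurement")
```

Under these assumptions a long-running benchmark is called only once during warmup, far below the threshold, while a short one is called hundreds of times.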

alexcovington added the tenet-performance Performance related issue label Jan 4, 2023

ghost added the untriaged New issue has not been triaged by the area owner label Jan 4, 2023

jakobbotsch added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 5, 2023

JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Jan 9, 2023

JulieLeeMSFT assigned AndyAyersMS Jan 9, 2023

JulieLeeMSFT added this to the Future milestone Jan 9, 2023

AndyAyersMS mentioned this issue Mar 30, 2023 in On Stack Replacement Next Steps #33658 (open, 72 tasks)
