Unoptimized code is used for benchmark #1466

ebfortin · 2020-06-07T02:56:26Z

I opened an issue in dotnet/runtime (37216) about what I initially thought was a poor register allocation algorithm for hardware intrinsics. It evolved into the question of why BDN does not optimize the code, which is why I have opened an issue here.

In a nutshell, I have implemented a double-double type and I want to optimize the "naive" port I did from a C library by using some SIMD instructions. So I compare the naive code against the SIMD code. Looking at the disassembly, I found out the code is not optimized at all. Ever. Except when I force COMPlus_TieredCompilation=0. That is the only time the code gets optimized. See the linked issue in dotnet/runtime for details.

My questions are then:

What are the conditions for the code to be optimized?
How does BDN force the runtime to optimize the code?
Is there any way with BDN to get stats (Diagnoser) of how code is optimized by the runtime and when?

alexcovington · 2020-08-07T23:03:30Z

+1 on this. I'm looking to do similar optimizations and this information would be useful.

adamsitnik · 2020-08-10T09:03:02Z

Hi @ebfortin

How does BDN force the runtime to optimize the code?

It does not. To make a long story short, the warmup or pilot phase should invoke the microbenchmark enough times to make the JIT promote it from Tier 0 to Tier 1. This doc explains how BDN works: https://benchmarkdotnet.org/articles/guides/how-it-works.html This discussion is also a great source of knowledge: dotnet/runtime#13069

What are the conditions for the code to be optimized?

This is more of a JIT question. As far as I remember, the method needs to be invoked more than 30 times (or contain a loop) and there needs to be a time window of 100 ms in which no new method was JITted. Also, the process can't be 100% busy. Then the Tiered JIT thread kicks in, recompiles the method and promotes it from Tier 0 to Tier 1.

Is there any way with BDN to get stats (Diagnoser) of how code is optimized by the runtime and when?

Not in BDN itself, but it is possible when BDN is used together with PerfView:

Profile the code using the BDN plugin called EtwProfiler.
Open the produced trace file with PerfView, go to the Advanced Group and open the JITStats window.

The TC in the table corresponds to Tiered Compilation. The first column tells us when it happened (in milliseconds, relative to the beginning of the trace file).

If a given method is not in the table, it means it was not promoted to Tier 1.
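As a rough sketch of the profiling step, assuming the BenchmarkDotNet.Diagnostics.Windows NuGet package is referenced, the profiler can be enabled with an attribute. The benchmark class and its body are placeholders made up for illustration, not the code from this issue. EtwProfiler is Windows-only and, as far as I know, needs to run from an elevated prompt.

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnostics.Windows.Configs; // assumption: BenchmarkDotNet.Diagnostics.Windows package is installed
using BenchmarkDotNet.Running;

// [EtwProfiler] asks BDN to record an ETW trace (.etl file) while the benchmark runs.
// That trace file is what gets opened in PerfView afterwards.
[EtwProfiler]
public class PlaceholderBenchmarks // hypothetical class, for illustration only
{
    private double a = 1.5, b = 2.5;

    [Benchmark]
    public double SomeComputation() => a * b + a; // placeholder body
}

public static class Program
{
    public static void Main(string[] args) => BenchmarkRunner.Run<PlaceholderBenchmarks>();
}
```

BDN should print the path of the trace file at the end of the run; the PerfView steps are then done by hand.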

@ebfortin Could you please share a minimum repro case so I could try to repro it? You must be hitting some kind of edge case

adamsitnik · 2020-10-22T14:08:50Z

I believe that I've answered the question. The repro which I've asked for in August was not provided, so I am closing the issue. Please feel free to provide the minimum repro case and reopen the issue.

ebfortin · 2020-10-23T22:11:19Z

Sorry. With COVID and all I completely forgot to answer your request.

I zipped a solution for you to test, and here are the results. You can see that unless I disable QuickJIT, the code doesn't seem to get optimized. Also, disregard the fact that my intrinsic code is slower than the naive code. I stopped playing with intrinsics until I get an answer on why they don't get used correctly by the runtime.

Benchmark SIMD Optimization Test.zip

Method	Job	EnvironmentVariables	Mean	Error	StdDev	Max	Code Size
SomeComputation	Job-WHKPVA	COMPlus_EnableHWIntrinsic=0	14.833 ns	0.3453 ns	0.3694 ns	15.700 ns	945 B
SomeComputation	Job-MMXSAO	COMPlus_EnableHWIntrinsic=1	16.269 ns	0.1854 ns	0.1548 ns	16.600 ns	554 B
SomeComputation	Job-THVAFE	COMPlus_TC_QuickJit=0,COMPlus_EnableHWIntrinsic=0	2.636 ns	0.0561 ns	0.0497 ns	2.700 ns	104 B
SomeComputation	Job-YOPMAU	COMPlus_TC_QuickJit=0,COMPlus_EnableHWIntrinsic=1	3.387 ns	0.0795 ns	0.0743 ns	3.500 ns	140 B
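
For anyone who wants to reproduce a comparison like the one in the table without the zip, here is a minimal sketch of how such jobs might be declared. It is not the config from the attached solution: the class names and the benchmark body are hypothetical, and only the environment variable names are taken from the table. It assumes a BDN version that has `WithEnvironmentVariables` and `AddJob`; older releases may spell these differently.

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

// Hypothetical config: one job per environment-variable combination from the table.
public class TieringConfig : ManualConfig
{
    public TieringConfig()
    {
        foreach (var hw in new[] { "0", "1" })
        {
            // Default tiering (QuickJit left at the runtime default).
            AddJob(Job.Default.WithEnvironmentVariables(
                new EnvironmentVariable("COMPlus_EnableHWIntrinsic", hw)));

            // QuickJit disabled, so methods are compiled with full optimizations up front.
            AddJob(Job.Default.WithEnvironmentVariables(
                new EnvironmentVariable("COMPlus_TC_QuickJit", "0"),
                new EnvironmentVariable("COMPlus_EnableHWIntrinsic", hw)));
        }
    }
}

[Config(typeof(TieringConfig))]
[DisassemblyDiagnoser] // assumption: used here to inspect the generated code per job
public class PlaceholderBenchmarks // hypothetical class, for illustration only
{
    private double a = 1.5, b = 2.5;

    [Benchmark]
    public double SomeComputation() => a * b + a; // placeholder body
}

public static class Program
{
    public static void Main(string[] args) => BenchmarkRunner.Run<PlaceholderBenchmarks>();
}
```

Comparing the disassembly output between the jobs is one way to check whether the measured code was the optimized version.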

adamsitnik closed this as completed Oct 22, 2020

adamsitnik reopened this Oct 24, 2020

adamsitnik mentioned this issue Feb 15, 2022

BenchmarkDotNet not generating fully optimized x86 assembly compared to other disassemblers #1924

Closed

AndreyAkinshin mentioned this issue Apr 15, 2022

MinIterationTime, WarmupCount, and tiered JIT #1993

Open

YegorStepanov mentioned this issue Nov 15, 2022

Notify about TieredCompilation #2190

Open
