Unoptimized code is used for benchmark #1466

Open
ebfortin opened this issue Jun 7, 2020 · 4 comments

Comments

@ebfortin

ebfortin commented Jun 7, 2020

I opened an issue in dotnet/runtime (37216) about what I initially thought was a poor register allocation algorithm for hardware intrinsics. It evolved into the question of why BDN does not optimize the code, which is why I opened an issue here.

In a nutshell, I have implemented a double-double type and I want to optimize the "naive" port I made from a C library with some SIMD instructions. So I compare the naive code against the SIMD code. Looking at the disassembly, I found that the code is never optimized, except when I force COMPlus_TieredCompilation=0. That is the only case in which the code gets optimized. See the linked issue in dotnet/runtime for details.

My questions are then:

  1. What are the conditions for the code to be optimized?
  2. How does BDN force the runtime to optimize the code?
  3. Is there any way with BDN to get stats (Diagnoser) of how code is optimized by the runtime and when?

@alexcovington

+1 on this. I'm looking to do similar optimizations and this information would be useful.

@adamsitnik
Member

adamsitnik commented Aug 10, 2020

Hi @ebfortin

How does BDN force the runtime to optimize the code?

It does not. To make a long story short, the warmup or pilot phase should invoke the microbenchmark enough times to make the JIT promote it from Tier 0 to Tier 1. This doc explains how BDN works: https://benchmarkdotnet.org/articles/guides/how-it-works.html This discussion is also a great source of knowledge: dotnet/runtime#13069

What are the conditions for the code to be optimized?

This is more of a JIT question. As far as I remember, a method needs to be invoked more than 30 times (or contain a loop), and there needs to be a time window of 100 ms in which no new method was JITted. Also, the process can't be 100% busy. Then the Tiered JIT thread kicks in, recompiles the method and promotes it from Tier 0 to Tier 1.
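
The conditions listed above can be restated as a small predicate. This is only a toy model in Python for illustration, not how the runtime is implemented: the 30-call threshold and the 100 ms quiet window come from the recollection above, the function and parameter names are hypothetical, and the "process can't be 100% busy" condition is not modeled.

```python
# Toy model of the promotion conditions as recalled in the comment above.
# All names here are hypothetical; the real runtime logic is more involved.
CALL_THRESHOLD = 30      # "invoked more than 30 times"
QUIET_WINDOW_MS = 100    # window in which no new method was JITted


def eligible_for_tier1(call_count, contains_loop, ms_since_last_new_jit):
    """Return True if a method would be a promotion candidate in this toy model."""
    hot = call_count > CALL_THRESHOLD or contains_loop
    quiet = ms_since_last_new_jit >= QUIET_WINDOW_MS
    return hot and quiet


if __name__ == "__main__":
    # A method called 31 times, with 150 ms of JIT inactivity, qualifies.
    print(eligible_for_tier1(31, False, 150))
    # The same method does not qualify while new methods are still being JITted.
    print(eligible_for_tier1(31, False, 20))
```

In this model a benchmark whose warmup keeps triggering new JIT work would never reach the quiet window, which is one way a method could stay at Tier 0.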

Is there any way with BDN to get stats (Diagnoser) of how code is optimized by the runtime and when?

Not in BDN itself, but it is possible when using BDN together with PerfView:

  • profile the code using the BDN plugin called EtwProfiler
  • open the produced trace file with PerfView, go to the Advanced Group and open the JITStats window

[image]

  • the TC in the table corresponds to Tiered Compilation. The first column tells us when it happened (in milliseconds, relative to the beginning of the trace file)

[image]

If a given method is not in the table, it means it was not promoted to Tier 1.

@ebfortin Could you please share a minimal repro case so I can try to reproduce it? You must be hitting some kind of edge case.

@adamsitnik
Member

I believe that I've answered the question. The repro that I asked for in August was not provided, so I am closing the issue. Please feel free to provide a minimal repro case and reopen the issue.

@ebfortin
Author

Sorry. With COVID and all I completely forgot to answer your request.

I zipped a solution for you to test, but here are the results. You can see that unless I disable QuickJIT, the code doesn't seem to get optimized. Also, disregard the fact that my intrinsic code performs worse than the naive code. I stopped playing with intrinsics until I get an answer on why they don't get used correctly by the runtime.

Benchmark SIMD Optimization Test.zip

| Method          | Job        | EnvironmentVariables                              | Mean      | Error     | StdDev    | Max       | Code Size |
|-----------------|------------|---------------------------------------------------|-----------|-----------|-----------|-----------|-----------|
| SomeComputation | Job-WHKPVA | COMPlus_EnableHWIntrinsic=0                       | 14.833 ns | 0.3453 ns | 0.3694 ns | 15.700 ns | 945 B     |
| SomeComputation | Job-MMXSAO | COMPlus_EnableHWIntrinsic=1                       | 16.269 ns | 0.1854 ns | 0.1548 ns | 16.600 ns | 554 B     |
| SomeComputation | Job-THVAFE | COMPlus_TC_QuickJit=0,COMPlus_EnableHWIntrinsic=0 | 2.636 ns  | 0.0561 ns | 0.0497 ns | 2.700 ns  | 104 B     |
| SomeComputation | Job-YOPMAU | COMPlus_TC_QuickJit=0,COMPlus_EnableHWIntrinsic=1 | 3.387 ns  | 0.0795 ns | 0.0743 ns | 3.500 ns  | 140 B     |
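
To put the gap in numbers, the mean times reported above can be compared directly. The snippet below is a quick Python calculation using only the Mean values from these results; the dictionary keys and the function name are made up for illustration.

```python
# Mean times in nanoseconds, copied from the benchmark results above.
mean_ns = {
    ("default", "HWIntrinsic=0"): 14.833,
    ("default", "HWIntrinsic=1"): 16.269,
    ("QuickJit=0", "HWIntrinsic=0"): 2.636,
    ("QuickJit=0", "HWIntrinsic=1"): 3.387,
}


def quickjit_off_speedup(hw_setting):
    """Ratio of the default-job mean to the COMPlus_TC_QuickJit=0 mean."""
    return mean_ns[("default", hw_setting)] / mean_ns[("QuickJit=0", hw_setting)]


if __name__ == "__main__":
    for hw in ("HWIntrinsic=0", "HWIntrinsic=1"):
        print(f"{hw}: {quickjit_off_speedup(hw):.1f}x faster with QuickJit disabled")
```

With these numbers the ratio is roughly 5.6x without hardware intrinsics and roughly 4.8x with them, and the code size shrinks accordingly (945 B to 104 B and 554 B to 140 B).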
