Tiered jitting and BenchmarkDotNet #13069

Open
AndreyAkinshin opened this issue Jul 13, 2019 · 6 comments

@AndreyAkinshin
Member

.NET Core 3.0 has tiered jitting enabled by default, which matters for benchmarking: it may spoil benchmark results if there are not enough warmup iterations. It seems to be less of an issue since .NET Core 3.0 preview 4 (after dotnet/coreclr#23599 was merged). I didn't observe any noticeable tiered jitting effects with .NET Core 3.0 preview 6: all of my benchmarks produce pretty stable results. However, the internal logic of tiered jitting may change in future versions of .NET Core, so I would like to discuss how we can prepare for the upcoming changes. Currently, I have the following questions:

  • Is it possible to get the current values of the external configuration knobs like TieredCompilation or TC_QuickJit from the runtime? It would be nice to see whether tiered compilation is enabled or disabled in the environment section of the BenchmarkDotNet output. Of course, we can use the knowledge of the current environment variable values and the corresponding defaults for each version of .NET Core, but I'm looking for a more reliable way that does not depend on the specific .NET Core version.
  • Is it possible to get the current values of the internal configuration knobs like TC_CallCountThreshold or TC_CallCountingDelayMs? It would be great to use these values in the internal BenchmarkDotNet heuristics (e.g., we can always try to invoke the benchmarked method at least TC_CallCountThreshold times).
  • Is it a good idea to disable tiered jitting in BenchmarkDotNet by default?
  • Is it possible to switch the tiered jitting behavior at runtime? (Relevant for the InProcess BenchmarkDotNet mode)
  • Is it possible to force the runtime to use a specific jit tier for a specific method? (Can be useful for performance comparison of different tiers).
  • Does it make sense to make any other adjustments for the tiered jitting on the BenchmarkDotNet side?
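For reference, the environment-variable fallback mentioned in the first question could look roughly like the sketch below. The class and method names are hypothetical, and the COMPlus_ prefix (plus the DOTNET_ prefix accepted by newer runtimes) is an assumption about how these knobs are spelled; it does not model which prefix wins when both are set. It only sees environment variables, so it misses settings from runtimeconfig.json or the project file and knows nothing about per-version defaults, which is exactly why a runtime API would be more reliable.

```csharp
using System;

public static class TieredCompilationInfo
{
    // Assumption: .NET Core 3.0 reads these knobs with the COMPlus_ prefix;
    // newer runtimes also accept DOTNET_.
    private static readonly string[] Prefixes = { "COMPlus_", "DOTNET_" };

    // Returns the raw value of a knob such as "TieredCompilation" or "TC_QuickJit",
    // or null when no environment variable sets it (the runtime default applies).
    public static string ReadKnob(string name)
    {
        foreach (string prefix in Prefixes)
        {
            string value = Environment.GetEnvironmentVariable(prefix + name);
            if (!string.IsNullOrEmpty(value))
                return value;
        }
        return null;
    }

    // Formats one line suitable for an environment summary.
    public static string Describe(string name)
    {
        string value = ReadKnob(name);
        return value == null ? name + "=<runtime default>" : name + "=" + value;
    }
}
```

Calling `TieredCompilationInfo.Describe("TieredCompilation")` in the benchmark process would then yield a line for the environment section, with `<runtime default>` standing in whenever the variable is not set.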

/cc @kouvel @noahfalk @adamsitnik

Some relevant discussions: dotnet/coreclr#23599 https://github.com/dotnet/coreclr/issues/19751 https://github.com/dotnet/coreclr/issues/22998 dotnet/core#2257 dotnet/BenchmarkDotNet#1125 dotnet/coreclr#24576

@EgorBo
Member

EgorBo commented Jul 13, 2019

Can I add a question to this list? Can R2R code be faster than IL-to-tier1? I mean R2R has no time constraints so are there any R2R specific optimizations? (or maybe some are planned - e.g. full escape analysis?)

Regarding BDN, I'd personally just override both TC_CallCountThreshold and TC_CallCountingDelayMs with small hard-coded values and/or mark benchmarks with [MethodImpl(MethodImplOptions.AggressiveOptimization)].
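A minimal sketch of the attribute approach, with a hypothetical benchmark class: AggressiveOptimization asks the JIT to fully optimize the method the first time it is compiled, so the method itself does not go through tiering. Methods it calls that are not inlined are unaffected and can still start at a lower tier.

```csharp
using System.Runtime.CompilerServices;

// Hypothetical benchmark class; only the attribute usage matters here.
public class SumBenchmarks
{
    private readonly int[] _data = { 3, 1, 4, 1, 5, 9, 2, 6 };

    // Compiled with full optimizations on first use, bypassing tiering
    // for this method only.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public int Sum()
    {
        int sum = 0;
        for (int i = 0; i < _data.Length; i++)
            sum += _data[i];
        return sum;
    }
}
```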

@yahorsi

yahorsi commented Jul 13, 2019

  • Ideally, we need an explicit way to find out how the benchmarked method(s) were jitted, so that this information can at least be reported and the exact benchmark details are known.

@kouvel kouvel self-assigned this Jul 19, 2019
@kouvel
Member

kouvel commented Jul 23, 2019

Tiered compilation is likely to become more dynamic in the future, so some things that are possible now may not make sense later. For example:

  • The call count threshold may not be a constant; it may vary based on how much the method is likely to benefit from a rejit
  • The optimizing JIT used with tiering may diverge from the JIT used when tiering is disabled; it may choose to spend more CPU time to generate better code, since jitting will be happening in the background
  • Higher tiers may depend on information gathered from lower tiers, and the code generated at a particular tier may vary based on information gathered at run-time

For tests that have short invocation duration and involve a small amount of code, tiering typically happens during the pilot or warmup phases, provided that the total time spent in those phases is long enough to expire the startup delay (TC_CallCountingDelayMs, currently 100 ms but practically 100-200+ ms) and for background jitting.

For tests that have long invocation duration and involve a large amount of code / many methods, some of those methods may not get tiered up by the time of the measurement phase. Most of the time would typically be going into loops and those methods would be ok, but each invocation may call some methods only once and they may not reach the threshold. This may be insignificant for perf, as most of the time would be spent in optimized code. In some cases it may make a difference, like if measuring GC effects of a test. In some cases like that it may be appropriate in the current state to disable tiered compilation or to tweak tiered compilation to tier up more quickly using environment variables. It may be beneficial to add a mode that is configurable at a project level to tier up aggressively for these kinds of cases, and based on changes in the future that mode could be tweaked to do something reasonable for that purpose.

I don't recommend disabling tiered compilation or tweaking it by default from the BDN side. In some cases it may result in perf data that is not representative to some degree (some perf differences due to change in JIT timing), perhaps more so in the future.
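For the opt-in cases described above, here is a rough sketch of how a harness that launches the benchmark in a child process could pass the knobs through the environment. The helper names are hypothetical, the COMPlus_ spelling is assumed for .NET Core 3.0, and TC_CallCountingDelayMs is an unsupported internal knob whose meaning may change, so this is something to enable per project rather than by default.

```csharp
using System.Diagnostics;

// Hypothetical helpers for a harness that runs benchmarks out of process.
public static class TieringKnobs
{
    // Creates start info whose environment can be customized before launch.
    public static ProcessStartInfo Create(string fileName, string arguments)
    {
        return new ProcessStartInfo(fileName, arguments) { UseShellExecute = false };
    }

    // Turns tiered compilation off for the child process.
    public static ProcessStartInfo DisableTiering(ProcessStartInfo startInfo)
    {
        startInfo.Environment["COMPlus_TieredCompilation"] = "0";
        return startInfo;
    }

    // Assumption: a delay of 0 lets call counting begin without the startup delay.
    public static ProcessStartInfo TierUpSooner(ProcessStartInfo startInfo)
    {
        startInfo.Environment["COMPlus_TC_CallCountingDelayMs"] = "0";
        return startInfo;
    }
}
```

The returned start info can be handed to `Process.Start`; the variables only affect the child process, so the host keeps its own tiering behavior.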

> Is it possible to get the current values of the external configuration knobs like TieredCompilation or TC_QuickJit from the runtime?

No, will consider adding an API post-3.0. There is an event with the info that could be gotten out-of-proc but it's fired early and won't be seen in-proc.

> Is it possible to get the current values of the internal configuration knobs like TC_CallCountThreshold or TC_CallCountingDelayMs?

No, these may change or may be replaced in the future, and they are internal flags that are not supported. May be beneficial for some cases to add a "tier up aggressively" config option, but I don't recommend using that by default.

> Is it possible to switch the tiered jitting behavior in runtime? (Relevant for the InProcess BenchmarkDotNet mode)

No

> Is it possible to force the runtime to use a specific jit tier for a specific method? (Can be useful for performance comparison of different tiers).

Other than attributing with NoOptimization or AggressiveOptimization, which currently yield the same effect but may not in the future, no

> Does it make sense to make any other adjustments for the tiered jitting on the BenchmarkDotNet side?

The pilot phase runs early before warmup and perf could be very different in some cases during the pilot phase and after warmup when tiering is enabled. If piloting completes before the main parts of the benchmark are tiered up, the invocation count that is determined could be much lower compared with tiering disabled. Some benchmarks may perform differently at different invocation counts. Perhaps the warmup phase could also adjust invocation counts, and overhead could be measured after warmup (if overhead measurement is dependent on the invocation count)?
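To make the suggestion concrete, here is a toy sketch of choosing the invocation count once before warmup and again after it. It is not BenchmarkDotNet's actual pilot algorithm; the class, the doubling strategy, and the `minIterationTime` parameter are all assumptions for illustration.

```csharp
using System;
using System.Diagnostics;

// Hypothetical pilot helper, not BenchmarkDotNet's implementation.
public static class InvocationCountPilot
{
    // Doubles the invocation count until one iteration lasts at least minIterationTime.
    public static long Pilot(Action benchmark, TimeSpan minIterationTime, long start = 1)
    {
        long count = Math.Max(1, start);
        while (Measure(benchmark, count) < minIterationTime && count < long.MaxValue / 2)
            count *= 2;
        return count;
    }

    // Times one iteration consisting of `invocations` calls.
    public static TimeSpan Measure(Action benchmark, long invocations)
    {
        var stopwatch = Stopwatch.StartNew();
        for (long i = 0; i < invocations; i++)
            benchmark();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

A harness could call `Pilot` before warmup, run the warmup iterations, and then call `Pilot` again with the earlier result as `start`. If tiering made the code faster in the meantime, the second call raises the count so that each measured iteration still reaches the target duration.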

There are some events that can be gotten in-proc with EventListener to tell when tiering is paused and when background jitting is happening, but the events are difficult to use currently for that purpose due to initially missed events and because the events are mainly informational and are not strictly ordered. If it's really necessary, could consider adding APIs that could be polled.
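A minimal in-proc listener along those lines might look like the sketch below. The provider name, the Compilation keyword value, and the "TieredCompilation" event-name prefix are assumptions based on the runtime's event documentation, and the class name is hypothetical. As noted above, events fired before the listener is created are missed and ordering is not guaranteed, so the collected names are informational only.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics.Tracing;

// Hypothetical listener that records tiered-compilation event names in-proc.
public sealed class TieringEventListener : EventListener
{
    // Assumptions: runtime provider name and Compilation keyword value.
    private const string RuntimeProvider = "Microsoft-Windows-DotNETRuntime";
    private const EventKeywords CompilationKeyword = (EventKeywords)0x1000000000;

    // Names of the tiering-related events seen so far, in arrival order.
    public ConcurrentQueue<string> Seen { get; } = new ConcurrentQueue<string>();

    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == RuntimeProvider)
            EnableEvents(source, EventLevel.Informational, CompilationKeyword);
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        string name = eventData.EventName;
        if (name != null && name.StartsWith("TieredCompilation", StringComparison.Ordinal))
            Seen.Enqueue(name);
    }
}
```

Creating the listener as early as possible in the benchmark process and inspecting `Seen` after warmup gives a rough indication of whether tiering was paused or background jitting ran, but it cannot prove that tiering has finished.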

@AndreyAkinshin
Member Author

@kouvel thank you very much for such a detailed answer! Let me know if you come up with any ideas of additional BenchmarkDotNet features that may help to improve accuracy. By the way, we know the exact version of .NET Core during benchmarking, so we can introduce some heuristics for specific versions of .NET Core based on the knowledge of its internals.

> The pilot phase runs early before warmup and perf could be very different in some cases during the pilot phase and after warmup when tiering is enabled. If piloting completes before the main parts of the benchmark are tiered up, the invocation count that is determined could be much lower compared with tiering disabled. Some benchmarks may perform differently at different invocation counts. Perhaps the warmup phase could also adjust invocation counts, and overhead could be measured after warmup (if overhead measurement is dependent on the invocation count)?

It's a very good idea! It may also help to resolve some problems that are not related to tiered jitting: sometimes we choose a bad number of invocations during the pilot stage because of heavy assembly loading on the first pilot iteration. I created a separate issue for that: dotnet/BenchmarkDotNet#1210

@adamsitnik
Member

Today I've hit a problem related to Tiered JIT and BenchmarkDotNet that most probably affects the stability of the benchmark results in some edge cases.

The benchmark was executed more than 30 times, and for longer than 100 ms there was no new method compilation. However, the "hot" methods did not get promoted to Tier 1.

This is most probably because the Tiered JIT runs on a background thread, and that thread did not get a chance to "kick in" and promote things to Tier 1.

@kouvel is there any way of forcing the Tiered JIT to run at a given moment?

@kouvel
Member

kouvel commented Oct 17, 2019

Call counting starts after there has been a period of 100-200 ms during which no new methods are called. Methods that are called 30 times after that point (still with no new methods being called, which would initiate the delay again) would get tiered up in the background. If the total pilot+warmup duration is not long enough to reach the point when no new methods are called for the delay duration (with extra time for call counting and jitting after the delay expires), then it's definitely possible that the necessary things would not get tiered up in time.
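The rule described here can be captured in a small toy model, which may help when reasoning about why a benchmark was not promoted. This is a simplification under stated assumptions, not the runtime's implementation: the class is hypothetical, the delay is a fixed 100 ms, the threshold is a fixed 30 calls, and the time needed for background jitting is ignored.

```csharp
using System.Collections.Generic;

// Toy model of the call-counting rule; not the runtime's actual logic.
public sealed class TieringModel
{
    private readonly int _delayMs;
    private readonly int _threshold;
    private readonly Dictionary<string, int> _counts = new Dictionary<string, int>();
    private readonly HashSet<string> _promoted = new HashSet<string>();
    private long _lastNewMethodMs;

    public TieringModel(int delayMs = 100, int threshold = 30)
    {
        _delayMs = delayMs;
        _threshold = threshold;
    }

    public bool IsPromoted(string method) => _promoted.Contains(method);

    // Records one call to `method` at time `nowMs` (monotonic milliseconds).
    public void Call(string method, long nowMs)
    {
        if (!_counts.ContainsKey(method))
        {
            // First call of a new method: the quiet-period delay starts over.
            _counts[method] = 0;
            _lastNewMethodMs = nowMs;
            return;
        }
        if (nowMs - _lastNewMethodMs < _delayMs)
            return; // still inside the delay, so the call is not counted
        if (++_counts[method] >= _threshold)
            _promoted.Add(method);
    }
}
```

In this model, a method called many times within the first 100 ms is never promoted, and a single first call to some other method restarts the delay for everything, which matches the scenario where a short pilot+warmup phase ends before the hot methods are tiered up.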

> is there any way of forcing the Tiered JIT to run at a given moment?

There isn't a way to do that. If that is necessary, you'd probably be better off disabling tiering. For benchmarks, though, I think it would make sense to have a project-configurable option to tier up aggressively such that the timing factor can be mostly eliminated, hopefully without affecting the generated code too much.
