[TieredCompilation] Cold methods with hot loops may run slower with tiering #11006

kouvel · 2018-08-29T23:57:43Z

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

internal static class Program
{
    private const int HistoryCount = 8;
    private const int InnerIterationCount = 256;
    private static readonly TimeSpan s_ts500ms = TimeSpan.FromMilliseconds(500);

    private static void Main()
    {
        var sw = new Stopwatch();
        var history = new Queue<double>(HistoryCount);
        var list = new List<int>(InnerIterationCount);
        for (int outerIteration = -1; outerIteration < HistoryCount; ++outerIteration)
        {
            var duration = s_ts500ms;
            int iterations = 0;
            TimeSpan elapsed;
            sw.Restart();
            do
            {
                // ---
                list.Clear();
                for (int innerIteration = 0; innerIteration < InnerIterationCount; ++innerIteration)
                    list.Add(innerIteration);
                // ---
                ++iterations;
            } while ((iterations & 0xf) != 0 || (elapsed = sw.Elapsed) < duration);

            if (outerIteration < 0)
                continue;

            var iterationsPerMs = iterations / elapsed.TotalMilliseconds;
            if (history.Count >= HistoryCount)
                history.Dequeue();
            history.Enqueue(iterationsPerMs);
            Console.WriteLine($"{iterationsPerMs,10:0.00} {history.Average(),10:0.00}");
        }
    }
}

Average iterations per ms with tiering disabled: 2775.05
Tiering enabled: 2045.84

A comparison of PerfView profiles shows that some inlining is not happening:

Name                                                                               	Inc %	     Inc	Exc %	   Exc
 test!Program.Main()                                                               	 97.7	   4,485	 29.9	 1,375
+ system.private.corelib!System.Collections.Generic.List`1[System.Int32].Add(Int32)	 66.9	   3,072	 66.8	 3,069

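The JITStats summary shows that the only JIT trigger for Main is FG (foreground), which, when tiering is enabled, means tier 0 (minopts), and tier 0 does not do inlining. There is no TC trigger to indicate tier 1 for Main. Because Main is called only once, it is never promoted by call counting, so the hot loop keeps running tier 0 code.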

A workaround is to move the iteration code into a separate method:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

internal static class Program
{
    private const int HistoryCount = 8;
    private const int InnerIterationCount = 256;
    private static readonly TimeSpan s_ts500ms = TimeSpan.FromMilliseconds(500);

    private static void Main()
    {
        var sw = new Stopwatch();
        var history = new Queue<double>(HistoryCount);
        var list = new List<int>(InnerIterationCount);
        for (int outerIteration = -1; outerIteration < HistoryCount; ++outerIteration)
        {
            var duration = s_ts500ms;
            int iterations = 0;
            TimeSpan elapsed;
            sw.Restart();
            do
            {
                // ---
                RunIteration(list);
                // ---
                ++iterations;
            } while ((iterations & 0xf) != 0 || (elapsed = sw.Elapsed) < duration);

            if (outerIteration < 0)
                continue;

            var iterationsPerMs = iterations / elapsed.TotalMilliseconds;
            if (history.Count >= HistoryCount)
                history.Dequeue();
            history.Enqueue(iterationsPerMs);
            Console.WriteLine($"{iterationsPerMs,10:0.00} {history.Average(),10:0.00}");
        }
    }

    private static void RunIteration(List<int> list)
    {
        list.Clear();
        for (int innerIteration = 0; innerIteration < InnerIterationCount; ++innerIteration)
            list.Add(innerIteration);
    }
}

Average iterations per ms with tiering disabled: 2775.55
Tiering enabled: 2728.70

The PerfView profile now shows most of the time spent is exclusively in RunIteration as expected:

Name                                                                                	Inc %	     Inc	Exc %	   Exc	    First	      Last
 test!Program.Main()                                                                	 98.0	   4,490	  0.5	    22	1,567.832	 6,074.464
+ test!Program.RunIteration(class System.Collections.Generic.List`1)                	 97.0	   4,443	 93.5	 4,283	1,568.696	 6,074.464
|+ system.private.corelib!System.Collections.Generic.List`1[System.Int32].Add(Int32)	  3.5	     159	  3.5	   159	1,569.678	 1,792.351

List.Add is still showing up, and that must be when RunIteration was at tier 0, as the JITStats summary shows:

Start (msec)	JitTime msec	IL Size	Native Size	Method Name	Trigger
1,568.151	0.1	30	74	Program.RunIteration(class System.Collections.Generic.List`1)	FG
1,791.821	0.6	30	78	Program.RunIteration(class System.Collections.Generic.List`1)	TC

The last sample in List.Add in the profile was at 1,792.351. The tier 1 JIT for RunIteration was initiated at 1,791.821 and would have completed at around 1,792.421.

Other workarounds:

- For benchmarks where each iteration of the benchmark is very short (a few milliseconds or less), use something like BenchmarkDotNet, where tiering would occur during the piloting or warmup phases and would not affect the measured phase. If each iteration of the benchmark takes longer, the number of warmup iterations may be increased to allow enough time for tiering to occur before measurement begins.
- Disable tier 0 JIT (in environment COMPlus_TieredCompilation_DisableTier0Jit=1 or in project file <DisableTier0Jit>true</DisableTier0Jit>). In this mode, methods that don't have pregenerated code would be optimized initially. It may be useful as a global workaround for a suite of benchmarks where there may be several instances of cold methods with hot loops. For apps, it would avoid the worst-case situations where a cold method jitted at tier 0 contains a hot loop that runs for a long time. It would still be possible to be running a long-running hot loop in a cold method that has not yet been jitted at tier 1, but it would be running optimized pregenerated code, so the perf may be reasonable and the issue may not be as severe.
- Attribute methods expected to contain hot code with MethodImplOptions.AggressiveOptimization. In the first example above, that would be:
  ```
  [MethodImpl(MethodImplOptions.AggressiveOptimization)]
  private static void Main()
  {
      ...
  }
  ```
- Turn off tiered compilation (in environment COMPlus_TieredCompilation=0 or in project file <TieredCompilation>false</TieredCompilation>) for such types of benchmarks.
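
As a rough illustration of the first workaround, a BenchmarkDotNet version of the repro might look like the sketch below. It assumes the BenchmarkDotNet NuGet package is referenced. The class name ListAddBenchmark, the method name FillList, and the warmup count of 20 are hypothetical choices for illustration, not values taken from this issue.

```csharp
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Hypothetical benchmark class; BenchmarkDotNet requires it to be public and non-sealed.
[WarmupCount(20)] // assumed value: raise it if a single iteration is long
public class ListAddBenchmark
{
    private const int InnerIterationCount = 256;
    private readonly List<int> _list = new List<int>(InnerIterationCount);

    [Benchmark]
    public int FillList()
    {
        // Same work as the loop body in the repro above.
        _list.Clear();
        for (int innerIteration = 0; innerIteration < InnerIterationCount; ++innerIteration)
            _list.Add(innerIteration);

        // Return a value so the work is observably used.
        return _list.Count;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<ListAddBenchmark>();
}
```

The benchmark method is invoked many times during the pilot and warmup phases, so it has a chance to be rejitted at tier 1 before the measured phase starts. Whether 20 warmup iterations are enough depends on how long one iteration takes.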

Considerations:

- Consider optimizing loops at tier 0, or methods containing loops. Data needs to be collected on how this would affect startup performance.
- Longer-term: A proper fix would probably involve at least some portions of what on-stack replacement (OSR) involves.

fiigii · 2018-08-30T23:08:21Z

OSR looks like the best solution in general if the engineering cost is acceptable 😄

Related to https://github.com/dotnet/corefx/issues/32235 Workaround (and probably a long-term fix) for https://github.com/dotnet/coreclr/issues/19751 - For a method flagged with AggressiveOptimization, tiering would use a foreground tier 1 JIT on first call to the method, skipping the tier 0 JIT and call counting

Part of fix for https://github.com/dotnet/corefx/issues/32235 Workaround for https://github.com/dotnet/coreclr/issues/19751 - For a method flagged with AggressiveOptimization, tiering would use a foreground tier 1 JIT on first call to the method, skipping the tier 0 JIT and call counting

Part of fix for https://github.com/dotnet/corefx/issues/32235 Workaround for https://github.com/dotnet/coreclr/issues/19751 - Added and set CORJIT_FLAG_AGGRESSIVE_OPT to indicate that a method is flagged with AggressiveOptimization - For a method flagged with AggressiveOptimization, tiering uses a foreground tier 1 JIT on first call to the method, skipping the tier 0 JIT and call counting - When tiering is disabled, a method flagged with AggressiveOptimization does not use r2r-pregenerated code - R2r crossgen does not generate code for a method flagged with AggressiveOptimization

…20009) Add MethodImplOptions.AggressiveOptimization and use it for tiering Part of fix for https://github.com/dotnet/corefx/issues/32235 Workaround for https://github.com/dotnet/coreclr/issues/19751 - Added and set CORJIT_FLAG_AGGRESSIVE_OPT to indicate that a method is flagged with AggressiveOptimization - For a method flagged with AggressiveOptimization, tiering uses a foreground tier 1 JIT on first call to the method, skipping the tier 0 JIT and call counting - When tiering is disabled, a method flagged with AggressiveOptimization does not use r2r-pregenerated code - R2r crossgen does not generate code for a method flagged with AggressiveOptimization

vancem · 2018-12-18T19:36:22Z

@AndyAyersMS

First, I would like to highlight that the mitigation mentioned above of using an attribute has been implemented. In particular, if you add the AggressiveInlining flag to the 'RunIteration' method like so:

    using System.Runtime.CompilerServices;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static void RunIteration(List<int> list)

The original repro will be fixed. Thus any particular instance of the issue can be mitigated in a straightforward way. You just have to realize that you need to do it.

Second, I should mention that this situation (where a perf-critical method is only run once) is pretty rare in real code. Much more typically, perf-critical methods are called repeatedly. The sad exception to this is microbenchmarks: they are typically designed to be a hot loop, and if the important code is in the loop (rather than in what is called from the loop), you will get this issue. Having benchmarks add the AggressiveInlining switch would mitigate this.

It has been suggested above that On Stack Replacement (OSR) is the solution to this bug (since it would allow methods that have not returned to be updated). This of course would solve the problem, but figuring out how to map local state is non-trivial and likely error-prone, which is why the cost-benefit of actually fixing this bug that way is low.

However, I would like to suggest two other possible mitigations. They are not as comprehensive as OSR, but they are much easier, and likely to get us most of the way.

1. Heuristically detect benchmark methods and go directly to fully optimized code for those methods. For example, in BenchmarkDotNet you attribute benchmarks with the [Benchmark] attribute. Thus you can look for that attribute. There are not that many benchmark systems out there and detecting them all is not hard. This would probably be sufficient.
2. The most likely benchmarks that would be 'bad' are simple cases (one loop that calls one or very few APIs). We could detect code that looks like this and again skip tiering for it.
3. While on-stack replacement is a hard problem in general, it is likely that simple cases (e.g. when the 'dumb' code has only a single local variable) are straightforward, and are the most common things in benchmarks. Just implement OSR for this case.

My main point here is that we have a way that users can fix things for sure, so the only issue is that they have to discover that, and that may not happen. We DON'T have to do a perfect job trying to detect the rest, and it is OK to be very heuristic and it is OK if the heuristics look 'ugly' because they are a 'best effort' kind of a thing.

I personally think we should do (1) for V3.0.

kouvel · 2018-12-18T21:41:16Z

Updated workaround info above to include the attribute. I think you meant AggressiveOptimization instead of AggressiveInlining.

RE (1):

- Microbenchmarks that use BenchmarkDotNet typically don't run into the issue because tiering typically happens during the pilot or warmup phases. If a single iteration takes long enough that it moves to the measurement phase before the benchmark is tiered, then it would run into the same issue.
- Looking for one of several specific attributes on every method that is called for the first time might be a bit too much work to do at run-time. Maybe the compiler could inject the AggressiveOptimization based on other attributes.
- Maybe it could be done by assembly. If the compiler sees a method in an assembly using one of the Benchmark-like attributes, it could mark all methods in that assembly with AggressiveOptimization, in order to detect dependency methods that would otherwise be missed.
- The perf of a method attributed with AggressiveOptimization may eventually diverge from an identical method that is not attributed. If the intention is to measure the perf of a library method that is unattributed and typically called by unattributed methods, the perf result may not be representative.

kouvel · 2018-12-18T21:43:58Z

RE (3), OSR-like strategies would also be useful for profile-based speculative optimizations along with solving this issue. So it could be considered that fixing this issue with such a strategy is just a side-effect.

…ames
- Tier 0 JIT is being called quick JIT in config options, renamed DisableTier0Jit to StartupTierQuickJit
- Disabled quick JIT by default, the current plan is to do that for preview 4
- Concerns were that code produced by quick JIT may be slow, may allocate more, may use more stack space, and may be much larger than optimized code, and there may be many cases where these things lead to regressions when the span of time between startup and steady-state is important
- The thought was that with quick JIT disabled, tiering overhead from call counting and background jitting with optimizations would be less, and perf during any point in time would be closer to 2.x releases
- This mostly loses the startup perf gains from tiering. It may also be slightly slower compared with tiering off due to some overhead. When quick JIT is disabled for the startup tier, made a change to disable tiered compilation for methods in modules that are not R2R'ed since they will not be tiered currently anyway. The overhead and regression in R2R'ed modules will be looked into separately to see if it can be reduced.
- Renamed tier 0 / tier 1 to StartupTier, OptimizedTier
- Added config option ForceQuickJit, which uses quick JIT instead of the normal JIT. Off by default. Disables tiering.
- Added config option QuickJitForLoops, which determines whether quick JIT, when enabled, may be used for methods that contain loops. Off by default, so StartupTierQuickJit=1 or ForceQuickJit=1 would still not use quick JIT for methods that contain loops by default.

Fixes https://github.com/dotnet/coreclr/issues/22998
Fixes https://github.com/dotnet/coreclr/issues/19751

…option (#23599) Disable tier 0 JIT (quick JIT) by default, rename config option
- Tier 0 JIT is being called quick JIT in config options, renamed DisableTier0Jit to StartupTierQuickJit
- Disabled quick JIT by default, the current plan is to do that for preview 4
- Concerns were that code produced by quick JIT may be slow, may allocate more, may use more stack space, and may be much larger than optimized code, and there may be many cases where these things lead to regressions when the span of time between startup and steady-state is important
- The thought was that with quick JIT disabled, tiering overhead from call counting and background jitting with optimizations would be less, and perf during any point in time would be closer to 2.x releases
- This mostly loses the startup perf gains from tiering. It may also be slightly slower compared with tiering off due to some overhead. When quick JIT is disabled for the startup tier, made a change to disable tiered compilation for methods in modules that are not R2R'ed since they will not be tiered currently anyway. The overhead and regression in R2R'ed modules will be looked into separately to see if it can be reduced.

Fixes https://github.com/dotnet/coreclr/issues/22998
Fixes https://github.com/dotnet/coreclr/issues/19751

…y default (#24252) When QuickJit is enabled, disable it for methods that contain loops by default

Fixes https://github.com/dotnet/coreclr/issues/19751 by default when QuickJit is enabled
- Added config variable TC_QuickJitForLoops. When disabled (the default), the JIT identifies loops and explicit tail calls and switches to tier 1 JIT.
- This would prevent the possibility of spending too long in QuickJit code, but may regress startup time a bit when QuickJit is enabled
- Removed TC_StartupTier_OptimizeCode, as now that there is TC_QuickJit, I didn't see a good use for it
- Removed references to "StartupTier" in config variables because we had previously decided not to call it that.
- When QuickJit is disabled, avoid creating native code slots for methods in non-R2R'ed modules, as tiering would be disabled for those anyway
- Marked TC_QuickJit config var as external
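
For readers who want to try these switches from a project file, a minimal sketch follows. The MSBuild property names are not mentioned in this thread; they are assumed here to be the ones documented for .NET Core 3.0 and later, so check them against the SDK version in use.

```xml
<!-- Sketch only: property names assumed from the .NET Core 3.0+ documentation -->
<PropertyGroup>
  <!-- Allow quick JIT (tier 0) for methods without pregenerated code -->
  <TieredCompilationQuickJit>true</TieredCompilationQuickJit>
  <!-- Leave false to keep methods that contain loops out of quick JIT,
       which is the behavior this issue asked for -->
  <TieredCompilationQuickJitForLoops>false</TieredCompilationQuickJitForLoops>
</PropertyGroup>
```

The environment-variable form follows the COMPlus_ prefix convention used earlier in this issue, for example COMPlus_TC_QuickJitForLoops=0.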

kouvel closed this as completed in dotnet/coreclr#23599 Apr 3, 2019

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the 3.0 milestone Jan 31, 2020

EgorBo mentioned this issue Feb 25, 2020

TC_QuickJitForLoops support #32784

Closed

ghost locked as resolved and limited conversation to collaborators Dec 15, 2020
