don't execute long operations more than once per iteration #760

adamsitnik · 2018-05-22T18:46:02Z

Background: because of the default unroll factor = 16, every iteration was invoking benchmark at least 16 times.

So if the benchmark took 1s, then:

jitting took 16x1s
warmup took 6 (MinIterationCount for EngineWarmupStage) x 16 x 1s = 96 s
every target iteration was taking 16s

Note: In the default mode, this was described in our docs and was possible to change with custom config. We just did not have the time to fix it.

This PR introduces following changes:

EngineFactory is supposed to create an engine which is "ready to run", which means it's supposed to call GlobalSetup. (hence the method name change Create => CreateReadyToRun)
Engine implements IDisposable, which calls GlobalCleanup
When the strategy requires us to jit the code first (default mode) and to run the pilot (no explicit value for invocationCount), the engine factory does jitting and also checks how long does it take to run the benchmark once (unrol factor=invocation count=1).
a. If it took longer than IterationTime (0.5s by default), the engine is created for job with one invocation per iteration
b. otherwise, old rules applies
Strategies which don't require jitting don't follow this path (by design)
I had to duplicate both the C# and IL we generate to have possibility to choose the mode and keep the Engine immutable. It does not look as bad as I expected it to look ;)
Because the changes applies mostly to EngineFactory, I was able to write some nice unit tests (not integration tests)

public class Sleeep
{
    [Benchmark] public void Sleep1s() => Thread.Sleep(TimeSpan.FromSeconds(1));
}

Before: Total time: 00:06:16 (376.01 sec)
After: Total time: 00:00:27 (27.88 sec)

This fixes #736, makes #730 easy and contributes to #550.

@AndreyAkinshin please take a look ;)
@KrzysztofCwalina this fixes the issue you have faced in ML.NET

…run (hence the method name change), #736

tests/BenchmarkDotNet.IntegrationTests/CoreRtTests.cs

@@ -17,6 +17,9 @@ public class CoreRtTests : BenchmarkTestExecutor
        [Fact]
        public void CoreRtIsSupported()
        {
+            if (IntPtr.Size == sizeof(int)) // CoreRT does not support 32bit yet


AndreyAkinshin · 2018-05-22T19:14:25Z

Currently, I have the following output for a 50ms method:

IdleJit -1: 1 op, 109000.00 ns, 109.0000 us/op
Jit -1: 1 op, 53255800.00 ns, 53.2558 ms/op
IdleJit -1: 16 op, 212200.00 ns, 13.2625 us/op
Jit -1: 16 op, 848930200.00 ns, 53.0581 ms/op
Pilot  1: 16 op, 854972000.00 ns, 53.4358 ms/op

IdleWarmup  1: 16 op, 600.00 ns, 37.5000 ns/op
IdleWarmup  2: 16 op, 500.00 ns, 31.2500 ns/op
IdleWarmup  3: 16 op, 600.00 ns, 37.5000 ns/op
IdleWarmup  4: 16 op, 600.00 ns, 37.5000 ns/op
IdleWarmup  5: 16 op, 500.00 ns, 31.2500 ns/op
IdleWarmup  6: 16 op, 500.00 ns, 31.2500 ns/op

And the following output for a 5 sec method:

IdleJit -1: 1 op, 363300.00 ns, 363.3000 us/op
Jit -1: 1 op, 5004330100.00 ns, 5.0043 s/op
IdleWarmup  1: 1 op, 200.00 ns, 200.0000 ns/op
IdleWarmup  2: 1 op, 200.00 ns, 200.0000 ns/op
IdleWarmup  3: 1 op, 200.00 ns, 200.0000 ns/op
IdleWarmup  4: 1 op, 300.00 ns, 300.0000 ns/op
IdleWarmup  5: 1 op, 300.00 ns, 300.0000 ns/op
IdleWarmup  6: 1 op, 300.00 ns, 300.0000 ns/op

IdleTarget  1: 1 op, 200.00 ns, 200.0000 ns/op
IdleTarget  2: 1 op, 300.00 ns, 300.0000 ns/op
IdleTarget  3: 1 op, 300.00 ns, 300.0000 ns/op
IdleTarget  4: 1 op, 200.00 ns, 200.0000 ns/op

It works which is great! Awesome PR! However, I suggest to improve the output:

IterationIndex = -1 confuses me. Could we start it with 1?
Maybe it makes sense to rename IdleJit and Jit to IdleJitting and Jitting?
It would be great to align all output lines and form "pseudocolumns" for iteration name and index. In this case, the output will be much readable.

AndreyAkinshin · 2018-05-22T19:19:18Z

I also tried to run a 50ms method with the following config:

private class Config : ManualConfig
{
    public Config()
    {
        Add(Job.Default.WithUnrollFactor(4));
    }
}

The output:

IdleJit -1: 1 op, 114500.00 ns, 114.5000 us/op
Jit -1: 1 op, 53745100.00 ns, 53.7451 ms/op
IdleJit -1: 4 op, 131000.00 ns, 32.7500 us/op
Jit -1: 4 op, 216355400.00 ns, 54.0889 ms/op
Pilot  1: 4 op, 212807600.00 ns, 53.2019 ms/op
Pilot  2: 8 op, 428189900.00 ns, 53.5237 ms/op
Pilot  3: 16 op, 861407400.00 ns, 53.8380 ms/op

IdleWarmup  1: 16 op, 400.00 ns, 25.0000 ns/op
IdleWarmup  2: 16 op, 500.00 ns, 31.2500 ns/op
IdleWarmup  3: 16 op, 400.00 ns, 25.0000 ns/op

If we specifically ask for UnrollFactor=4, this value should be used. The new auto strategy for UnrollFactor should be applied only for unspecified value. Also it would be nice to have a test for such case.

adamsitnik · 2018-05-22T20:04:00Z

@AndreyAkinshin very good points! I totally missed the case with unrollFactor set in explicit way.

I have pushed a commit with fixes.

With the aligning it definitely looks better:

IdleJitting  1: 1 op, 361510.84 ns, 361.5108 us/op
MainJitting  1: 1 op, 1000483249.90 ns, 1.0005 s/op

Was:

IdleJitting -1: 1 op, 374340.49 ns, 374.3405 us/op
Jitting -1: 1 op, 1000731574.78 ns, 1.0007 s/op

AndreyAkinshin · 2018-05-23T08:34:21Z

IdleJitting  1: 1 op, 234600.00 ns, 234.6000 us/op
MainJitting  1: 1 op, 51181500.00 ns, 51.1815 ms/op

IdleJitting  1: 16 op, 240400.00 ns, 15.0250 us/op
MainJitting  1: 16 op, 812670500.00 ns, 50.7919 ms/op

Pilot  1: 16 op, 813271600.00 ns, 50.8295 ms/op

IdleWarmup  1: 16 op, 1600.00 ns, 100.0000 ns/op
IdleWarmup  2: 16 op, 400.00 ns, 25.0000 ns/op
IdleWarmup  3: 16 op, 400.00 ns, 25.0000 ns/op
IdleWarmup  4: 16 op, 200.00 ns, 12.5000 ns/op
IdleWarmup  5: 16 op, 300.00 ns, 18.7500 ns/op
IdleWarmup  6: 16 op, 300.00 ns, 18.7500 ns/op

Now only the Pilot stage is unaligned.
@adamsitnik, could you align it as well?

AndreyAkinshin · 2018-05-23T08:37:38Z

Also MainJitting and IdleWarmup have the different amount of symbols, so it would be nice to align them as well.

adamsitnik · 2018-05-23T10:08:39Z

I aligned it, now it looks like this:

// Benchmark Process Environment Information:
// Runtime=.NET CoreRT 1.0.26523.01, 64bit AOT
// GC=Concurrent Workstation
// Job: CoreRT(Runtime=CoreRT)

IdleJitting  1: 1 op, 570.21 ns, 570.2064 ns/op
MainJitting  1: 1 op, 39059.14 ns, 39.0591 us/op

IdleJitting  1: 16 op, 570.21 ns, 35.6379 ns/op
MainJitting  1: 16 op, 34212.38 ns, 2.1383 us/op

Pilot        1: 16 op, 30791.14 ns, 1.9244 us/op
Pilot        2: 32 op, 38488.93 ns, 1.2028 us/op
Pilot        3: 64 op, 44476.10 ns, 694.9390 ns/op
Pilot        4: 128 op, 51603.68 ns, 403.1537 ns/op
Pilot        5: 256 op, 81539.51 ns, 318.5137 ns/op
Pilot        6: 512 op, 116892.31 ns, 228.3053 ns/op
Pilot        7: 1024 op, 202423.26 ns, 197.6790 ns/op
Pilot        8: 2048 op, 416250.65 ns, 203.2474 ns/op
Pilot        9: 4096 op, 925159.84 ns, 225.8691 ns/op
Pilot       10: 8192 op, 1741695.37 ns, 212.6093 ns/op
Pilot       11: 16384 op, 3537845.45 ns, 215.9329 ns/op
Pilot       12: 32768 op, 11446322.77 ns, 349.3140 ns/op
Pilot       13: 65536 op, 15517881.39 ns, 236.7841 ns/op
Pilot       14: 131072 op, 26722436.65 ns, 203.8760 ns/op
Pilot       15: 262144 op, 50601254.11 ns, 193.0285 ns/op
Pilot       16: 524288 op, 100716692.39 ns, 192.1018 ns/op
Pilot       17: 1048576 op, 197987912.77 ns, 188.8160 ns/op
Pilot       18: 2097152 op, 426441667.03 ns, 203.3432 ns/op
Pilot       19: 4194304 op, 829293326.13 ns, 197.7189 ns/op

IdleWarmup   1: 4194304 op, 16682242.80 ns, 3.9774 ns/op
IdleWarmup   2: 4194304 op, 9546395.13 ns, 2.2760 ns/op
IdleWarmup   3: 4194304 op, 8673124.06 ns, 2.0678 ns/op
IdleWarmup   4: 4194304 op, 8665141.17 ns, 2.0659 ns/op
IdleWarmup   5: 4194304 op, 8839054.12 ns, 2.1074 ns/op
IdleWarmup   6: 4194304 op, 9296644.73 ns, 2.2165 ns/op
IdleWarmup   7: 4194304 op, 9432068.75 ns, 2.2488 ns/op
IdleWarmup   8: 4194304 op, 8727578.77 ns, 2.0808 ns/op
IdleWarmup   9: 4194304 op, 8693936.60 ns, 2.0728 ns/op
IdleWarmup  10: 4194304 op, 8712468.30 ns, 2.0772 ns/op

IdleTarget   1: 4194304 op, 9002133.14 ns, 2.1463 ns/op
IdleTarget   2: 4194304 op, 8681106.95 ns, 2.0697 ns/op
IdleTarget   3: 4194304 op, 8636915.96 ns, 2.0592 ns/op
IdleTarget   4: 4194304 op, 8652596.63 ns, 2.0629 ns/op
IdleTarget   5: 4194304 op, 8727008.57 ns, 2.0807 ns/op
IdleTarget   6: 4194304 op, 8618669.35 ns, 2.0549 ns/op
IdleTarget   7: 4194304 op, 8693651.49 ns, 2.0727 ns/op
IdleTarget   8: 4194304 op, 8786310.03 ns, 2.0948 ns/op
IdleTarget   9: 4194304 op, 8670843.24 ns, 2.0673 ns/op
IdleTarget  10: 4194304 op, 8831641.44 ns, 2.1056 ns/op
IdleTarget  11: 4194304 op, 9313465.82 ns, 2.2205 ns/op
IdleTarget  12: 4194304 op, 8664856.07 ns, 2.0659 ns/op
IdleTarget  13: 4194304 op, 8687949.43 ns, 2.0714 ns/op
IdleTarget  14: 4194304 op, 8621520.39 ns, 2.0555 ns/op
IdleTarget  15: 4194304 op, 9069987.70 ns, 2.1625 ns/op

MainWarmup   1: 4194304 op, 812825480.93 ns, 193.7927 ns/op
MainWarmup   2: 4194304 op, 857425598.05 ns, 204.4262 ns/op
MainWarmup   3: 4194304 op, 793740388.46 ns, 189.2425 ns/op
MainWarmup   4: 4194304 op, 811990698.79 ns, 193.5937 ns/op
MainWarmup   5: 4194304 op, 756858299.72 ns, 180.4491 ns/op
MainWarmup   6: 4194304 op, 754029505.90 ns, 179.7746 ns/op

// BeforeMainRun
MainTarget   1: 4194304 op, 761732138.71 ns, 181.6111 ns/op
MainTarget   2: 4194304 op, 752033498.48 ns, 179.2988 ns/op
MainTarget   3: 4194304 op, 770667272.61 ns, 183.7414 ns/op
MainTarget   4: 4194304 op, 769454443.65 ns, 183.4522 ns/op
MainTarget   5: 4194304 op, 756888805.77 ns, 180.4564 ns/op
MainTarget   6: 4194304 op, 757020523.44 ns, 180.4878 ns/op
MainTarget   7: 4194304 op, 759448462.18 ns, 181.0666 ns/op
MainTarget   8: 4194304 op, 765128002.78 ns, 182.4207 ns/op
MainTarget   9: 4194304 op, 751105487.61 ns, 179.0775 ns/op
MainTarget  10: 4194304 op, 756828363.89 ns, 180.4419 ns/op
MainTarget  11: 4194304 op, 766402984.23 ns, 182.7247 ns/op
MainTarget  12: 4194304 op, 751867568.43 ns, 179.2592 ns/op
MainTarget  13: 4194304 op, 755829362.32 ns, 180.2038 ns/op
MainTarget  14: 4194304 op, 753383462.08 ns, 179.6206 ns/op
MainTarget  15: 4194304 op, 755330716.85 ns, 180.0849 ns/op

// AfterMainRun
Result       1: 4194304 op, 752954894.97 ns, 179.5184 ns/op
Result       2: 4194304 op, 743256254.74 ns, 177.2061 ns/op
Result       3: 4194304 op, 761890028.86 ns, 181.6487 ns/op
Result       4: 4194304 op, 760677199.90 ns, 181.3596 ns/op
Result       5: 4194304 op, 748111562.02 ns, 178.3637 ns/op
Result       6: 4194304 op, 748243279.69 ns, 178.3951 ns/op
Result       7: 4194304 op, 750671218.43 ns, 178.9740 ns/op
Result       8: 4194304 op, 756350759.03 ns, 180.3281 ns/op
Result       9: 4194304 op, 742328243.86 ns, 176.9848 ns/op
Result      10: 4194304 op, 748051120.14 ns, 178.3493 ns/op
Result      11: 4194304 op, 757625740.48 ns, 180.6321 ns/op
Result      12: 4194304 op, 743090324.68 ns, 177.1665 ns/op
Result      13: 4194304 op, 747052118.57 ns, 178.1111 ns/op
Result      14: 4194304 op, 744606218.33 ns, 177.5280 ns/op
Result      15: 4194304 op, 746553473.10 ns, 177.9922 ns/op
GC:  0 0 0 0 0

adamsitnik · 2018-05-24T11:17:25Z

@AndreyAkinshin Tonight I have been running benchmark game benchmarks with this change and I have two more ideas for improvement:

If the benchmark takes a lot of time to execute and we call it 1 per iteration and the user did not ask to measure overhead in an explicit way, don't measure the overhead. Why? If the benchmark takes more than 500 ms (500 * 1000000 ns) and the avg overhead is sth around 250 ns. then the overhead is 0,000025% of the final result.
There is a lot of benchmarks, which should not be executed once, but 16 times time per iteration. Proposal: when Pilot condition is met for 16 per iteration, set the invocation count to 16 and don't run pilot.

An example:

IdleJitting  1: 1 op, 806842.02 ns, 806.8420 us/op
MainJitting  1: 1 op, 491133861.08 ns, 491.1339 ms/op

IdleJitting  1: 16 op, 2210975.22 ns, 138.1860 us/op
MainJitting  1: 16 op, 7509629645.26 ns, 469.3519 ms/op // we now know that the pilot is not needed

Pilot        1: 16 op, 7430266611.39 ns, 464.3917 ms/op // could have been avoided

// the next is IdleWarmup

@AndreyAkinshin What do you think?

AndreyAkinshin · 2018-05-24T13:32:02Z

@adamsitnik, I like these ideas! However, we should discuss details and write tests for all corner cases.

P.S. The explicit unroll factor still doesn't work for me. Is it already fixed? Could you tell me the name of the corresponded test?

…ilot if jitting gives the answer, #736

adamsitnik · 2018-05-24T17:31:31Z

@AndreyAkinshin ok, I just pushed a commit with the changes. I have also cleaned up the code, I hope it's easier to read and maintain now.

For:

public class Tests
{
    [Benchmark] public void Sleep1s() => Thread.Sleep(TimeSpan.FromSeconds(1));
}

The output is:

IdleJitting  1: 1 op, 353813.06 ns, 353.8131 us/op
MainJitting  1: 1 op, 1000552244.87 ns, 1.0006 s/op

MainWarmup   1: 1 op, 1000413114.52 ns, 1.0004 s/op
MainWarmup   2: 1 op, 1000174768.25 ns, 1.0002 s/op
MainWarmup   3: 1 op, 1000517747.39 ns, 1.0005 s/op
MainWarmup   4: 1 op, 1000135138.91 ns, 1.0001 s/op
MainWarmup   5: 1 op, 1000249750.39 ns, 1.0002 s/op
MainWarmup   6: 1 op, 1000160513.09 ns, 1.0002 s/op

// BeforeMainRun
MainTarget   1: 1 op, 1000562508.59 ns, 1.0006 s/op
MainTarget   2: 1 op, 1000154525.93 ns, 1.0002 s/op
MainTarget   3: 1 op, 1000243478.12 ns, 1.0002 s/op
MainTarget   4: 1 op, 1000163079.02 ns, 1.0002 s/op
MainTarget   5: 1 op, 1000204418.99 ns, 1.0002 s/op
MainTarget   6: 1 op, 1000112330.66 ns, 1.0001 s/op
MainTarget   7: 1 op, 1000160227.99 ns, 1.0002 s/op
MainTarget   8: 1 op, 1000543976.88 ns, 1.0005 s/op
MainTarget   9: 1 op, 1000763506.34 ns, 1.0008 s/op
MainTarget  10: 1 op, 1000800284.65 ns, 1.0008 s/op
MainTarget  11: 1 op, 1000441909.94 ns, 1.0004 s/op
MainTarget  12: 1 op, 1000334711.14 ns, 1.0003 s/op
MainTarget  13: 1 op, 1000443620.56 ns, 1.0004 s/op
MainTarget  14: 1 op, 1000493513.62 ns, 1.0005 s/op
MainTarget  15: 1 op, 1001026941.68 ns, 1.0010 s/op

// AfterMainRun
Result       1: 1 op, 1000562508.59 ns, 1.0006 s/op
Result       2: 1 op, 1000154525.93 ns, 1.0002 s/op
Result       3: 1 op, 1000243478.12 ns, 1.0002 s/op
Result       4: 1 op, 1000163079.02 ns, 1.0002 s/op
Result       5: 1 op, 1000204418.99 ns, 1.0002 s/op
Result       6: 1 op, 1000112330.66 ns, 1.0001 s/op
Result       7: 1 op, 1000160227.99 ns, 1.0002 s/op
Result       8: 1 op, 1000543976.88 ns, 1.0005 s/op
Result       9: 1 op, 1000763506.34 ns, 1.0008 s/op
Result      10: 1 op, 1000800284.65 ns, 1.0008 s/op
Result      11: 1 op, 1000441909.94 ns, 1.0004 s/op
Result      12: 1 op, 1000334711.14 ns, 1.0003 s/op
Result      13: 1 op, 1000443620.56 ns, 1.0004 s/op
Result      14: 1 op, 1000493513.62 ns, 1.0005 s/op
Result      15: 1 op, 1001026941.68 ns, 1.0010 s/op

For

public class Tests
{
    [Benchmark] public void Sleep100ms() => Thread.Sleep(TimeSpan.FromMilliseconds(100));
}

The output is:

IdleJitting  1: 1 op, 348966.30 ns, 348.9663 us/op
MainJitting  1: 1 op, 101025744.25 ns, 101.0257 ms/op

IdleJitting  2: 16 op, 1038345.81 ns, 64.8966 us/op
MainJitting  2: 16 op, 1609747335.85 ns, 100.6092 ms/op

MainWarmup   1: 16 op, 1608774848.88 ns, 100.5484 ms/op
MainWarmup   2: 16 op, 1607150330.92 ns, 100.4469 ms/op
MainWarmup   3: 16 op, 1605935791.34 ns, 100.3710 ms/op
MainWarmup   4: 16 op, 1606883759.44 ns, 100.4302 ms/op
MainWarmup   5: 16 op, 1606966439.36 ns, 100.4354 ms/op
MainWarmup   6: 16 op, 1610310984.85 ns, 100.6444 ms/op
MainWarmup   7: 16 op, 1608051542.09 ns, 100.5032 ms/op
MainWarmup   8: 16 op, 1606959311.78 ns, 100.4350 ms/op
MainWarmup   9: 16 op, 1605103290.03 ns, 100.3190 ms/op
MainWarmup  10: 16 op, 1608219752.98 ns, 100.5137 ms/op
MainWarmup  11: 16 op, 1609136929.93 ns, 100.5711 ms/op
MainWarmup  12: 16 op, 1607567436.88 ns, 100.4730 ms/op

// BeforeMainRun
MainTarget   1: 16 op, 1605767865.56 ns, 100.3605 ms/op
MainTarget   2: 16 op, 1605735648.90 ns, 100.3585 ms/op
MainTarget   3: 16 op, 1607222747.13 ns, 100.4514 ms/op
MainTarget   4: 16 op, 1605089890.18 ns, 100.3181 ms/op
MainTarget   5: 16 op, 1606944771.52 ns, 100.4340 ms/op
MainTarget   6: 16 op, 1606836717.41 ns, 100.4273 ms/op
MainTarget   7: 16 op, 1605859668.79 ns, 100.3662 ms/op
MainTarget   8: 16 op, 1608290458.57 ns, 100.5182 ms/op
MainTarget   9: 16 op, 1607716830.95 ns, 100.4823 ms/op
MainTarget  10: 16 op, 1607361307.28 ns, 100.4601 ms/op
MainTarget  11: 16 op, 1606241992.16 ns, 100.3901 ms/op
MainTarget  12: 16 op, 1607062234.03 ns, 100.4414 ms/op
MainTarget  13: 16 op, 1607646410.47 ns, 100.4779 ms/op
MainTarget  14: 16 op, 1605199084.70 ns, 100.3249 ms/op
MainTarget  15: 16 op, 1606747480.12 ns, 100.4217 ms/op

// AfterMainRun
Result       1: 16 op, 1605767865.56 ns, 100.3605 ms/op
Result       2: 16 op, 1605735648.90 ns, 100.3585 ms/op
Result       3: 16 op, 1607222747.13 ns, 100.4514 ms/op
Result       4: 16 op, 1605089890.18 ns, 100.3181 ms/op
Result       5: 16 op, 1606944771.52 ns, 100.4340 ms/op
Result       6: 16 op, 1606836717.41 ns, 100.4273 ms/op
Result       7: 16 op, 1605859668.79 ns, 100.3662 ms/op
Result       8: 16 op, 1608290458.57 ns, 100.5182 ms/op
Result       9: 16 op, 1607716830.95 ns, 100.4823 ms/op
Result      10: 16 op, 1607361307.28 ns, 100.4601 ms/op
Result      11: 16 op, 1606241992.16 ns, 100.3901 ms/op
Result      12: 16 op, 1607062234.03 ns, 100.4414 ms/op
Result      13: 16 op, 1607646410.47 ns, 100.4779 ms/op
Result      14: 16 op, 1605199084.70 ns, 100.3249 ms/op
Result      15: 16 op, 1606747480.12 ns, 100.4217 ms/op
GC:  0 0 0 0 0

For

public class Tests
{
    [Benchmark] public void Sleep10ms() => Thread.Sleep(TimeSpan.FromMilliseconds(10));
}

The output is:

IdleJitting  1: 1 op, 380612.76 ns, 380.6128 us/op
MainJitting  1: 1 op, 10690514.22 ns, 10.6905 ms/op

IdleJitting  2: 16 op, 371774.56 ns, 23.2359 us/op
MainJitting  2: 16 op, 171752147.25 ns, 10.7345 ms/op

Pilot        1: 16 op, 169767258.86 ns, 10.6105 ms/op
Pilot        2: 32 op, 337642287.87 ns, 10.5513 ms/op
Pilot        3: 64 op, 676014725.01 ns, 10.5627 ms/op

IdleWarmup   1: 64 op, 5416.96 ns, 84.6400 ns/op
IdleWarmup   2: 64 op, 855.31 ns, 13.3642 ns/op
IdleWarmup   3: 64 op, 1140.41 ns, 17.8189 ns/op
IdleWarmup   4: 64 op, 855.31 ns, 13.3642 ns/op
IdleWarmup   5: 64 op, 855.31 ns, 13.3642 ns/op
IdleWarmup   6: 64 op, 855.31 ns, 13.3642 ns/op

IdleTarget   1: 64 op, 285.10 ns, 4.4547 ns/op
IdleTarget   2: 64 op, 855.31 ns, 13.3642 ns/op
IdleTarget   3: 64 op, 855.31 ns, 13.3642 ns/op
IdleTarget   4: 64 op, 855.31 ns, 13.3642 ns/op
IdleTarget   5: 64 op, 570.21 ns, 8.9095 ns/op
IdleTarget   6: 64 op, 570.21 ns, 8.9095 ns/op
IdleTarget   7: 64 op, 570.21 ns, 8.9095 ns/op
IdleTarget   8: 64 op, 570.21 ns, 8.9095 ns/op
IdleTarget   9: 64 op, 285.10 ns, 4.4547 ns/op
IdleTarget  10: 64 op, 285.10 ns, 4.4547 ns/op
IdleTarget  11: 64 op, 570.21 ns, 8.9095 ns/op
IdleTarget  12: 64 op, 285.10 ns, 4.4547 ns/op
IdleTarget  13: 64 op, 570.21 ns, 8.9095 ns/op
IdleTarget  14: 64 op, 285.10 ns, 4.4547 ns/op
IdleTarget  15: 64 op, 285.10 ns, 4.4547 ns/op
IdleTarget  16: 64 op, 570.21 ns, 8.9095 ns/op
IdleTarget  17: 64 op, 285.10 ns, 4.4547 ns/op
IdleTarget  18: 64 op, 285.10 ns, 4.4547 ns/op
IdleTarget  19: 64 op, 285.10 ns, 4.4547 ns/op
IdleTarget  20: 64 op, 285.10 ns, 4.4547 ns/op

MainWarmup   1: 64 op, 669696553.27 ns, 10.4640 ms/op
MainWarmup   2: 64 op, 673253785.74 ns, 10.5196 ms/op
MainWarmup   3: 64 op, 670058919.42 ns, 10.4697 ms/op
MainWarmup   4: 64 op, 671713943.43 ns, 10.4955 ms/op
MainWarmup   5: 64 op, 671377806.77 ns, 10.4903 ms/op
MainWarmup   6: 64 op, 673961696.96 ns, 10.5307 ms/op

// BeforeMainRun
MainTarget   1: 64 op, 670711520.62 ns, 10.4799 ms/op
MainTarget   2: 64 op, 671468184.48 ns, 10.4917 ms/op
MainTarget   3: 64 op, 673055924.13 ns, 10.5165 ms/op
MainTarget   4: 64 op, 671519217.95 ns, 10.4925 ms/op
MainTarget   5: 64 op, 668730623.67 ns, 10.4489 ms/op
MainTarget   6: 64 op, 670441527.90 ns, 10.4756 ms/op
MainTarget   7: 64 op, 669674030.12 ns, 10.4637 ms/op
MainTarget   8: 64 op, 672770250.74 ns, 10.5120 ms/op
MainTarget   9: 64 op, 671235540.28 ns, 10.4881 ms/op
MainTarget  10: 64 op, 671995055.17 ns, 10.4999 ms/op
MainTarget  11: 64 op, 676752572.06 ns, 10.5743 ms/op
MainTarget  12: 64 op, 674429266.18 ns, 10.5380 ms/op
MainTarget  13: 64 op, 668602612.34 ns, 10.4469 ms/op
MainTarget  14: 64 op, 671500971.35 ns, 10.4922 ms/op
MainTarget  15: 64 op, 673650364.28 ns, 10.5258 ms/op

// AfterMainRun
Result       1: 64 op, 670711050.20 ns, 10.4799 ms/op
Result       2: 64 op, 671467714.06 ns, 10.4917 ms/op
Result       3: 64 op, 673055453.71 ns, 10.5165 ms/op
Result       4: 64 op, 671518747.53 ns, 10.4925 ms/op
Result       5: 64 op, 668730153.25 ns, 10.4489 ms/op
Result       6: 64 op, 670441057.48 ns, 10.4756 ms/op
Result       7: 64 op, 669673559.70 ns, 10.4636 ms/op
Result       8: 64 op, 672769780.32 ns, 10.5120 ms/op
Result       9: 64 op, 671235069.86 ns, 10.4880 ms/op
Result      10: 64 op, 671994584.75 ns, 10.4999 ms/op
Result      11: 64 op, 676752101.64 ns, 10.5743 ms/op
Result      12: 64 op, 674428795.76 ns, 10.5379 ms/op
Result      13: 64 op, 668602141.92 ns, 10.4469 ms/op
Result      14: 64 op, 671500500.93 ns, 10.4922 ms/op
Result      15: 64 op, 673649893.86 ns, 10.5258 ms/op
GC:  0 0 0 0 0

For

public class Tests
{
    [Benchmark] public void Empty() { }
}

The output is:

IdleJitting  1: 1 op, 886100.71 ns, 886.1007 us/op
MainJitting  1: 1 op, 180470.32 ns, 180.4703 us/op

IdleJitting  2: 16 op, 375195.79 ns, 23.4497 us/op
MainJitting  2: 16 op, 350391.82 ns, 21.8995 us/op

Pilot        1: 16 op, 3706.34 ns, 231.6463 ns/op
Pilot        2: 32 op, 570.21 ns, 17.8189 ns/op
Pilot        3: 64 op, 285.10 ns, 4.4547 ns/op
Pilot        4: 128 op, 570.21 ns, 4.4547 ns/op
Pilot        5: 256 op, 570.21 ns, 2.2274 ns/op
Pilot        6: 512 op, 855.31 ns, 1.6705 ns/op
Pilot        7: 1024 op, 1710.62 ns, 1.6705 ns/op
Pilot        8: 2048 op, 3421.24 ns, 1.6705 ns/op
Pilot        9: 4096 op, 7127.58 ns, 1.7401 ns/op
Pilot       10: 8192 op, 13684.95 ns, 1.6705 ns/op
Pilot       11: 16384 op, 27084.80 ns, 1.6531 ns/op
Pilot       12: 32768 op, 51318.57 ns, 1.5661 ns/op
Pilot       13: 65536 op, 107483.90 ns, 1.6401 ns/op
Pilot       14: 131072 op, 214682.70 ns, 1.6379 ns/op
Pilot       15: 262144 op, 416535.76 ns, 1.5890 ns/op
Pilot       16: 524288 op, 866143.48 ns, 1.6520 ns/op
Pilot       17: 1048576 op, 1726584.90 ns, 1.6466 ns/op
Pilot       18: 2097152 op, 3454595.32 ns, 1.6473 ns/op
Pilot       19: 4194304 op, 6902918.37 ns, 1.6458 ns/op
Pilot       20: 8388608 op, 14281103.76 ns, 1.7024 ns/op
Pilot       21: 16777216 op, 27958358.97 ns, 1.6664 ns/op
Pilot       22: 33554432 op, 73826900.17 ns, 2.2002 ns/op
Pilot       23: 67108864 op, 109193950.57 ns, 1.6271 ns/op
Pilot       24: 134217728 op, 211685410.30 ns, 1.5772 ns/op
Pilot       25: 268435456 op, 421330051.99 ns, 1.5696 ns/op
Pilot       26: 536870912 op, 995081970.02 ns, 1.8535 ns/op

IdleWarmup   1: 536870912 op, 843111707.42 ns, 1.5704 ns/op
IdleWarmup   2: 536870912 op, 843451550.42 ns, 1.5711 ns/op
IdleWarmup   3: 536870912 op, 869994942.27 ns, 1.6205 ns/op
IdleWarmup   4: 536870912 op, 884145468.77 ns, 1.6468 ns/op
IdleWarmup   5: 536870912 op, 862387533.92 ns, 1.6063 ns/op
IdleWarmup   6: 536870912 op, 856060809.09 ns, 1.5945 ns/op
IdleWarmup   7: 536870912 op, 864332792.97 ns, 1.6099 ns/op
IdleWarmup   8: 536870912 op, 868011764.50 ns, 1.6168 ns/op
IdleWarmup   9: 536870912 op, 860173422.57 ns, 1.6022 ns/op

IdleTarget   1: 536870912 op, 868420887.57 ns, 1.6176 ns/op
IdleTarget   2: 536870912 op, 873615752.75 ns, 1.6272 ns/op
IdleTarget   3: 536870912 op, 868868499.58 ns, 1.6184 ns/op
IdleTarget   4: 536870912 op, 857457244.50 ns, 1.5971 ns/op
IdleTarget   5: 536870912 op, 865744909.05 ns, 1.6126 ns/op
IdleTarget   6: 536870912 op, 863927661.34 ns, 1.6092 ns/op
IdleTarget   7: 536870912 op, 856742775.91 ns, 1.5958 ns/op
IdleTarget   8: 536870912 op, 863842130.38 ns, 1.6090 ns/op
IdleTarget   9: 536870912 op, 855891457.80 ns, 1.5942 ns/op
IdleTarget  10: 536870912 op, 864548045.87 ns, 1.6103 ns/op
IdleTarget  11: 536870912 op, 858048263.41 ns, 1.5982 ns/op
IdleTarget  12: 536870912 op, 866360161.73 ns, 1.6137 ns/op
IdleTarget  13: 536870912 op, 859913693.56 ns, 1.6017 ns/op
IdleTarget  14: 536870912 op, 873673913.80 ns, 1.6273 ns/op
IdleTarget  15: 536870912 op, 875270776.75 ns, 1.6303 ns/op

MainWarmup   1: 536870912 op, 863368288.88 ns, 1.6081 ns/op
MainWarmup   2: 536870912 op, 868480759.24 ns, 1.6177 ns/op
MainWarmup   3: 536870912 op, 866211337.87 ns, 1.6134 ns/op
MainWarmup   4: 536870912 op, 901341467.52 ns, 1.6789 ns/op
MainWarmup   5: 536870912 op, 860632723.80 ns, 1.6031 ns/op
MainWarmup   6: 536870912 op, 881499996.29 ns, 1.6419 ns/op

// BeforeMainRun
MainTarget   1: 536870912 op, 857459240.22 ns, 1.5971 ns/op
MainTarget   2: 536870912 op, 861146764.85 ns, 1.6040 ns/op
MainTarget   3: 536870912 op, 866314260.12 ns, 1.6136 ns/op
MainTarget   4: 536870912 op, 876332786.13 ns, 1.6323 ns/op
MainTarget   5: 536870912 op, 863502002.28 ns, 1.6084 ns/op
MainTarget   6: 536870912 op, 857892311.96 ns, 1.5979 ns/op
MainTarget   7: 536870912 op, 861463229.39 ns, 1.6046 ns/op
MainTarget   8: 536870912 op, 856467081.13 ns, 1.5953 ns/op
MainTarget   9: 536870912 op, 855617758.74 ns, 1.5937 ns/op
MainTarget  10: 536870912 op, 865237710.48 ns, 1.6116 ns/op
MainTarget  11: 536870912 op, 855895449.24 ns, 1.5942 ns/op
MainTarget  12: 536870912 op, 878788950.08 ns, 1.6369 ns/op
MainTarget  13: 536870912 op, 869387387.38 ns, 1.6194 ns/op
MainTarget  14: 536870912 op, 865371138.78 ns, 1.6119 ns/op
MainTarget  15: 536870912 op, 853708422.69 ns, 1.5902 ns/op

// AfterMainRun
Result       1: 536870912 op, 0.00 ns, 0.0000 ns/op
Result       2: 536870912 op, 0.00 ns, 0.0000 ns/op
Result       3: 536870912 op, 1492515.19 ns, 0.0028 ns/op
Result       4: 536870912 op, 11511041.19 ns, 0.0214 ns/op
Result       5: 536870912 op, 0.00 ns, 0.0000 ns/op
Result       6: 536870912 op, 0.00 ns, 0.0000 ns/op
Result       7: 536870912 op, 0.00 ns, 0.0000 ns/op
Result       8: 536870912 op, 0.00 ns, 0.0000 ns/op
Result       9: 536870912 op, 0.00 ns, 0.0000 ns/op
Result      10: 536870912 op, 415965.55 ns, 0.0008 ns/op
Result      11: 536870912 op, 0.00 ns, 0.0000 ns/op
Result      12: 536870912 op, 13967205.15 ns, 0.0260 ns/op
Result      13: 536870912 op, 4565642.44 ns, 0.0085 ns/op
Result      14: 536870912 op, 549393.84 ns, 0.0010 ns/op
Result      15: 536870912 op, 0.00 ns, 0.0000 ns/op
GC:  0 0 0 0 0

// AfterAll

adamsitnik · 2018-05-24T17:34:00Z

The explicit unroll factor still doesn't work for me. Is it already fixed? Could you tell me the name of the corresponded test?

The test is ForJobsWithExplicitUnrollFactorTheGlobalSetupIsCalledAndMultiActionCodeGetsJitted

The corresponding line of code: if engineParameters.HasUnrollFactor

AndreyAkinshin · 2018-05-24T18:46:37Z

DontRunThePilotIfThePilotRequirementIsMetDuringWarmup is failed on Linux.

adamsitnik · 2018-05-25T14:49:33Z

@AndreyAkinshin fixed.

var unrollFactor = Job.Default.ResolveValue(RunMode.UnrollFactorCharacteristic, DefaultResolver);
var mediumTime =  TimeSpan.FromMilliseconds(IterationTime.TotalMilliseconds / unrollFactor);

void MediumMultiple(long _)
{
	timesBenchmarkCalled += unrollFactor;

	for (int i = 0; i < unrollFactor; i++) // the real unroll factor obviously does not use loop ;)
		Thread.Sleep(mediumTime);
}

Executing MediumMultiple was for some reason taking less than IterationTime on Linux... I bet that Thread.Sleep is implemented in a different way and it was too close. I have increased the value, don't have time for proper investigation now ;(

adamsitnik · 2018-05-25T16:59:37Z

I have just run some ML.NET benchmarks. Without this change: 24 minutes. With: 2 minutes ;)

adamsitnik · 2018-05-27T18:39:55Z

@AndreyAkinshin I am going to merge it now to avoid conflicts with my incoming PR. If something is missing I will add it.

AndreyAkinshin · 2018-05-28T07:31:31Z

@adamsitnik, everything is OK.

…, fixes dotnet#736 * don't execute long operations more than once per iteration, dotnet#736 * generate the right C# code for dotnet#736 * EngineParameters.Resolver was always null or ignored ;), dotnet#736 * don't forget to JIT idle, dotnet#736 * do the math right for unroll factor for JIT, dotnet#736 * generate the right IL code, dotnet#736 * Setup and Cleanup are jitted together with benchmark, dotnet#736 * engine factory is now supposed to create an engine which is ready to run (hence the method name change), dotnet#736 * addressing PR feedback, dotnet#736 * bring back the calls to DummyActions, dotnet#736 * align iteration mode * don't measure the overhead for time consuming benchmarks, don't run pilot if jitting gives the answer, dotnet#736 * fix the Linux build

abelbraaksma · 2019-12-20T18:59:56Z

@adamsitnik, @AndreyAkinshin, this PR may have caused nano benchmarks to fail. It appears that the calculation mentioned in the OP above doesn't hold in practice. The case referenced in my issue, #1338, has 7ns per operation (when run correctly, when run once, it is 300-600ns per op) and 120ms as IterationTime. That's an order of magnitude out of reach of the threshold set by IterationTime.

adamsitnik added 8 commits May 22, 2018 18:04

don't execute long operations more than once per iteration, #736

cc5834e

generate the right C# code for #736

d7b06b0

EngineParameters.Resolver was always null or ignored ;), #736

6902ad1

don't forget to JIT idle, #736

343a603

do the math right for unroll factor for JIT, #736

c1d8ac6

generate the right IL code, #736

38d0606

Setup and Cleanup are jitted together with benchmark, #736

9ce830c

engine factory is now supposed to create an engine which is ready to …

773c47f

…run (hence the method name change), #736

adamsitnik added enhancement Area:Engine labels May 22, 2018

adamsitnik requested a review from AndreyAkinshin May 22, 2018 18:46

AndreyAkinshin reviewed May 22, 2018

View reviewed changes

tests/BenchmarkDotNet.IntegrationTests/CoreRtTests.cs Outdated

@@ -17,6 +17,9 @@ public class CoreRtTests : BenchmarkTestExecutor

[Fact]

public void CoreRtIsSupported()

{

if (IntPtr.Size == sizeof(int)) // CoreRT does not support 32bit yet

This comment was marked as spam.

Sign in to view

addressing PR feedback, #736

d67609f

bring back the calls to DummyActions, #736

e34fc7a

align iteration mode

8ae23e8

don't measure the overhead for time consuming benchmarks, don't run p…

916444c

…ilot if jitting gives the answer, #736

fix the Linux build

8fb4635

adamsitnik merged commit a9664f4 into master May 27, 2018

adamsitnik deleted the unrollFactor1 branch May 28, 2018 10:45

adamsitnik mentioned this pull request Jun 1, 2018

have two main actions: with unroll and without, for no unroll icrease the step by 1 in pilot (not *2) #771

Merged

adamsitnik added this to the v0.11.0 milestone Jul 18, 2018

abelbraaksma mentioned this pull request Dec 20, 2019

Issue with "run one iteration for long operations", it's flaky and considers many nano-operations as "long", leading to one op per cycle #1338

Closed

abelbraaksma mentioned this pull request Dec 20, 2019

Trivial methods can cause BDN to run only 1 iteration, skip pilot phase, and ignore most attributes #1337

Closed

abelbraaksma mentioned this pull request Feb 10, 2020

Op/invocation fail to increase due to [IterationCleanup] #1351

Closed

abelbraaksma mentioned this pull request Oct 2, 2020

Sometimes Hardware Counters per Op are reported as NaNs #1546

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

don't execute long operations more than once per iteration #760

don't execute long operations more than once per iteration #760

adamsitnik commented May 22, 2018

This comment was marked as spam.

AndreyAkinshin commented May 22, 2018

AndreyAkinshin commented May 22, 2018

adamsitnik commented May 22, 2018

AndreyAkinshin commented May 23, 2018

AndreyAkinshin commented May 23, 2018

adamsitnik commented May 23, 2018 •

edited

Loading

adamsitnik commented May 24, 2018 •

edited

Loading

AndreyAkinshin commented May 24, 2018

adamsitnik commented May 24, 2018

adamsitnik commented May 24, 2018

AndreyAkinshin commented May 24, 2018

adamsitnik commented May 25, 2018

adamsitnik commented May 25, 2018

adamsitnik commented May 27, 2018

AndreyAkinshin commented May 28, 2018

abelbraaksma commented Dec 20, 2019 •

edited

Loading

don't execute long operations more than once per iteration #760

don't execute long operations more than once per iteration #760

Conversation

adamsitnik commented May 22, 2018

This comment was marked as spam.

AndreyAkinshin commented May 22, 2018

AndreyAkinshin commented May 22, 2018

adamsitnik commented May 22, 2018

AndreyAkinshin commented May 23, 2018

AndreyAkinshin commented May 23, 2018

adamsitnik commented May 23, 2018 • edited Loading

adamsitnik commented May 24, 2018 • edited Loading

AndreyAkinshin commented May 24, 2018

adamsitnik commented May 24, 2018

adamsitnik commented May 24, 2018

AndreyAkinshin commented May 24, 2018

adamsitnik commented May 25, 2018

adamsitnik commented May 25, 2018

adamsitnik commented May 27, 2018

AndreyAkinshin commented May 28, 2018

abelbraaksma commented Dec 20, 2019 • edited Loading

adamsitnik commented May 23, 2018 •

edited

Loading

adamsitnik commented May 24, 2018 •

edited

Loading

abelbraaksma commented Dec 20, 2019 •

edited

Loading