Port CoreClr benchmarks to BenchmarkDotNet #36

Merged: 107 commits, Jun 21, 2018

adamsitnik
Member

  1. I did not move the files; I kept the old ones, copied them to the benchmarks folder, and then ported the copies to BDN.
  2. I fixed a few BDN issues along the way.
  3. I have not ported all of them yet (it's WIP).
  4. I ported all of them "as is" and kept the old Ids (but I changed the DisplayNames from "Test" to something meaningful).

return fullPath;
}

internal static int GetFileLength(string filePath) => (int) new FileInfo(filePath).Length;
Member

(int)

Does it need to be int? Can we keep long instead? The API user can cast the result when needed. Thoughts?

Member Author

sure thing, I added the cast here because it was hardcoded to an int previously
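The suggestion above could look roughly like the sketch below. It assumes a hypothetical container class named FileLengthSketch and a hypothetical helper GetFileLengthAsInt; only GetFileLength comes from the diff. FileInfo.Length is already a long, so the cast can move to the callers that really need an int.

```csharp
using System.IO;

// Hypothetical container class; the real helper lives elsewhere in the PR.
internal static class FileLengthSketch
{
    // Returns the size in bytes as long, exactly as FileInfo.Length reports it.
    internal static long GetFileLength(string filePath) => new FileInfo(filePath).Length;

    // Hypothetical helper for callers that need an int:
    // checked(...) throws OverflowException instead of silently truncating.
    internal static int GetFileLengthAsInt(string filePath) => checked((int)GetFileLength(filePath));
}
```

With this shape the narrowing conversion is explicit at the call site, and files larger than int.MaxValue bytes fail loudly instead of reporting a wrong size.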

var baseJob = Job.ShortRun; // let's use the Short Run for better first user experience ;)
var baseJob = Job.Default
.WithWarmupCount(1) // 1 warmup is enough for our purpose
.WithMaxTargetIterationCount(20); // we don't want to run more than 20 iterations
Member

20

Can we override this at runtime? Maybe it should be a command line option with default.

Member Author

Not as of today, but I can add it as a command-line option.

@@ -58,6 +61,10 @@ private static IEnumerable<Job> GetJobs(Options options, Job baseJob)

if (options.RunClr)
yield return baseJob.With(Runtime.Clr);
if (options.RunLegacyJitX64)
yield return baseJob.With(Runtime.Clr).With(Jit.LegacyJit).With(Platform.X64);
Member

LegacyJit

Wasn't legacy removed just recently from coreclr repo? Is it still needed?

Member Author

Yes, but this is the Legacy JIT for .NET Framework (not Core), so you can easily compare both JITs (surprisingly, the old one can be better in some benchmarks).


namespace Benchstone.BenchF
{
public class DMath
Member

jorive commented May 29, 2018

Can't the classes be static? Just wondering. #Resolved

Member Author

adamsitnik commented May 30, 2018

Yes, BDN derives a type from the benchmark class to add all the boilerplate, so the class can be neither sealed nor static.

However, you can still benchmark a static type if you want to; you just need a non-static type that holds the benchmark and calls into it:

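using BenchmarkDotNet.Attributes;

static class Sth
{
    // members of a static class must be static (and accessible to the benchmark class)
    public static void Method() { }
}

public class SthBenchmarks
{
    // the non-static wrapper is what BDN derives from and executes
    [Benchmark] public void Method() => Sth.Method();
}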

#Resolved

Member

jorive commented Jun 14, 2018

@adamsitnik BenchView handles any metric you throw at it. It is a matter of passing the right units when you upload. Currently, we pass all results in milliseconds, and I would prefer that benchmarks run for longer than nanoseconds.

Member

jorive commented Jun 14, 2018

@adamsitnik As long as we are avoiding duplication and not losing coverage of a particular scenario, I am all for trimming redundancy.

@adamsitnik
Member Author

@jorive To avoid any scaling issues, I reverted the changes in PerfLabTests, so it's 1:1

adamsitnik changed the title from "[WIP] Port CoreClr benchmarks to BenchmarkDotNet" to "Port CoreClr benchmarks to BenchmarkDotNet" on Jun 20, 2018
@adamsitnik
Member Author

nit: @"8649423196\t:3000"

I removed the workaround and added a proper implementation of \t and \r\n to BDN (dotnet/BenchmarkDotNet@5716c14).

@adamsitnik
Member Author

@jorive Ok, I believe that the port is COMPLETE ;) I have removed all previous comments to keep the issue cleaner. I am going to post a summary in a few minutes.

Member Author

adamsitnik commented Jun 20, 2018

Summary

Time

BenchmarkDotNet needs 42 minutes to run all 304 CoreCLR benchmarks; xunit-performance needed 38 minutes (on my box, of course). Initially BDN needed 4 hours. I had to improve a few things to get it this close, and I started preparing for this port a long time ago. More details here: dotnet/BenchmarkDotNet#550

Quality of the results

The quality of the results has improved: the majority of the benchmarks have a better (narrower) distribution, and some of them have a much better one. Two benchmarks out of 304 have a worse distribution (LinqBenchmarks.Where00ForX and Log10SingleBenchmark).

Unstable benchmarks

We have at least two multimodal benchmarks: BinaryTrees_5 (#39) and SpectralNorm_3 (#41)

BenchmarkDotNet prints some nice, user-friendly histograms and gives a very clear warning about the problem.

-------------------- Histogram --------------------
[0.942 ms ; 1.237 ms) | @@@
[1.237 ms ; 1.614 ms) | @@@@@@@@@@
[1.614 ms ; 1.892 ms) | @
[1.892 ms ; 2.269 ms) | @@@@@@@
[2.269 ms ; 2.563 ms) | @@@
[2.563 ms ; 2.940 ms) | @@@@@@@@@
[2.940 ms ; 3.354 ms) | @@@@
[3.354 ms ; 3.731 ms) | @@@
---------------------------------------------------

Result differences

xunit-performance runs all the benchmarks in the same process, whereas BDN spawns a new process per benchmark. This affects the results, but the difference is very small for most of the benchmarks.

A very good example of the difference is the Order00ManualX benchmark.

dotnet run -c Release -f netcoreapp2.1 -- --method Order00ManualX

| Method | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Allocated |
|---|---|---|---|---|---|---|---|---|
| Order00ManualX | 194.5 ms | 0.8240 ms | 0.6881 ms | 194.4 ms | 193.5 ms | 196.2 ms | 2000.0000 | 15.26 MB |

Running only the LINQ benchmarks within a single process with xunit-performance gives very similar results:

dotnet PerformanceHarness.dll DotNetBenchmark-Linq.dll --perf:collect stopwatch

| DotNetBenchmark-Linq.dll | Metric | Unit | Iterations | Average | STDEV.S | Min | Max |
|---|---|---|---|---|---|---|---|
| LinqBenchmarks.Order00ManualX | Duration | msec | 52 | 193.099 | 1.307 | 191.853 | 197.534 |

But when the same benchmark was executed together with many other benchmarks in the same process, xunit-performance reported over 50% more time:

| DotNetBenchmark-Linq.dll | Metric | Unit | Iterations | Average | STDEV.S | Min | Max |
|---|---|---|---|---|---|---|---|
| LinqBenchmarks.Order00ManualX | Duration | msec | 34 | 301.511 | 4.907 | 295.667 | 321.288 |
| LinqBenchmarks.Order00ManualX | Allocation Size on Benchmark Execution Thread | bytes | 34 | 1.600E+007 | 0.000 | 1.600E+007 | 1.600E+007 |

Huge differences: loop-alignment-dependent benchmarks

A few of the benchmarks depend very heavily on loop alignment:

dotnet run -c Release -f netcoreapp2.1 -- --join --class IniArray --testAlignment

| Method | EnvironmentVariables | Mean | Error | StdDev | Median | Min | Max | Allocated |
|---|---|---|---|---|---|---|---|---|
| IniArray | COMPlus_JitAlignLoops=0 | 95.98 ms | 4.170 ms | 4.802 ms | 93.86 ms | 91.87 ms | 107.16 ms | 56 B |
| IniArray | COMPlus_JitAlignLoops=1 | 59.62 ms | 1.421 ms | 1.579 ms | 60.27 ms | 57.55 ms | 63.30 ms | 56 B |

dotnet run -c Release -f netcoreapp2.1 -- --join --method Log10DoubleBenchmark --testAlignment

| Method | EnvironmentVariables | Mean | Error | StdDev | Median | Min | Max | Allocated |
|---|---|---|---|---|---|---|---|---|
| Log10DoubleBenchmark | COMPlus_JitAlignLoops=0 | 34.42 us | 1.9390 us | 2.2329 us | 33.12 us | 31.88 us | 39.21 us | 0 B |
| Log10DoubleBenchmark | COMPlus_JitAlignLoops=1 | 33.22 us | 0.4531 us | 0.4017 us | 33.31 us | 31.85 us | 33.49 us | 0 B |

dotnet run -c Release -f netcoreapp2.1 -- --join --method SinhSingleBenchmark --testAlignment

| Method | EnvironmentVariables | Mean | Error | StdDev | Median | Min | Max | Allocated |
|---|---|---|---|---|---|---|---|---|
| SinhSingleBenchmark | COMPlus_JitAlignLoops=0 | 55.45 us | 3.3886 us | 3.7664 us | 53.67 us | 51.47 us | 63.23 us | 0 B |
| SinhSingleBenchmark | COMPlus_JitAlignLoops=1 | 53.81 us | 0.8462 us | 0.7066 us | 53.97 us | 51.51 us | 54.24 us | 0 B |

Huge differences: GC-dependent benchmarks

A single execution of BenchBitOps allocates more than 12 GB of managed memory! The time BDN reports is 1/2 or 1/3 of what xunit-performance was reporting (depending on how many other benchmarks were executed before it in the same process).

dotnet run -c Release -f netcoreapp2.1 -- --join --methods BenchBitOps --testAlignment

| Method | EnvironmentVariables | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BenchBitOps | COMPlus_JitAlignLoops=0 | 671.7 ms | 5.663 ms | 5.020 ms | 671.1 ms | 664.9 ms | 679.8 ms | 4166000.0000 | 4166000.0000 | 4166000.0000 | 12.21 GB |
| BenchBitOps | COMPlus_JitAlignLoops=1 | 667.9 ms | 3.713 ms | 2.899 ms | 667.1 ms | 664.6 ms | 674.9 ms | 4166000.0000 | 4166000.0000 | 4166000.0000 | 12.21 GB |

dotnet PerformanceHarness.dll DotNetBenchmark-Bytemark.dll

| DotNetBenchmark-Bytemark.dll | Metric | Unit | Iterations | Average | STDEV.S | Min | Max |
|---|---|---|---|---|---|---|---|
| ByteMark.BenchBitOps | Duration | msec | 8 | 1395.028 | 1.747 | 1392.748 | 1397.281 |

dotnet PerformanceHarness.dll --perf:collect stopwatch+gcapi

| DotNetBenchmark-Bytemark.dll | Metric | Unit | Iterations | Average | STDEV.S | Min | Max |
|---|---|---|---|---|---|---|---|
| ByteMark.BenchBitOps | Duration | msec | 6 | 1961.970 | 78.708 | 1924.026 | 2122.209 |
| ByteMark.BenchBitOps | Allocation Size on Benchmark Execution Thread | bytes | 6 | 1.311E+010 | 0.000 | 1.311E+010 | 1.311E+010 |

Removed benchmarks

I have not ported the serializer benchmarks because the benchmarks repo already contains a lot of them and there was no need to duplicate them.

Ids

All the ids except those of the removed benchmarks have been preserved, so they can still be used in BenchView to keep track of performance.

Bugs found and fixed

When I was comparing the results, I found a few huge differences which turned out to be bugs in the existing benchmarks.

Bug: empty loops

The following benchmarks were optimized by the JIT into empty loops (the loop body was constant):

| DotNetBenchmark-PerfLab.dll | Metric | Unit | Iterations | Average | STDEV.S | Min | Max |
|---|---|---|---|---|---|---|---|
| PerfLabTests.CastingPerf.CheckArrayIsInterfaceYes | Duration | msec | 100 | 0.067 | 0.011 | 0.040 | 0.090 |
| PerfLabTests.CastingPerf.CheckIsInstAnyIsInterfaceNo | Duration | msec | 100 | 0.042 | 0.005 | 0.040 | 0.092 |
| PerfLabTests.CastingPerf.CheckIsInstAnyIsInterfaceYes | Duration | msec | 100 | 0.041 | 6.735E-004 | 0.040 | 0.045 |
| PerfLabTests.CastingPerf.CheckObjIsInterfaceNo | Duration | msec | 100 | 0.040 | 0.006 | 0.038 | 0.072 |
| PerfLabTests.CastingPerf.CheckObjIsInterfaceYes | Duration | msec | 100 | 0.043 | 0.020 | 0.040 | 0.237 |
| PerfLabTests.LowLevelPerf.GenericClassGenericStaticField | Duration | msec | 100 | 0.162 | 0.035 | 0.066 | 0.228 |
| PerfLabTests.LowLevelPerf.GenericClassWithIntGenericInstanceField | Duration | msec | 100 | 0.080 | 0.005 | 0.070 | 0.094 |
| PerfLabTests.LowLevelPerf.ObjectStringIsString | Duration | msec | 100 | 0.057 | 0.004 | 0.051 | 0.074 |

I have fixed them; they now measure the right thing.
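To illustrate the class of bug, here is a minimal sketch that is not taken from the PR. IFoo, Foo, and EmptyLoopSketch are hypothetical names, and the fix shown (consuming the result of every check) is only one common way to keep the measured work alive; the actual change in the ported benchmarks may differ.

```csharp
public interface IFoo { }
public class Foo : IFoo { }

// Hypothetical illustration of a loop whose body can be optimized away.
public static class EmptyLoopSketch
{
    private static readonly object s_obj = new Foo();

    // Problematic shape: the result of the type check is never observed,
    // so the JIT is free to drop the check and leave an empty loop.
    public static void ResultDiscarded(int iterations)
    {
        for (int i = 0; i < iterations; i++)
        {
            bool unused = s_obj is IFoo;
        }
    }

    // One possible fix: fold every result into a value returned to the caller,
    // which makes the check observable.
    public static int ResultConsumed(int iterations)
    {
        int hits = 0;
        for (int i = 0; i < iterations; i++)
        {
            if (s_obj is IFoo) hits++;
        }
        return hits;
    }
}
```

Returning the value does not guarantee that no optimization happens, but it removes the most obvious reason for the loop body to disappear. A BenchmarkDotNet benchmark can similarly return its result so that the harness consumes it.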

Bug: growing multicast delegate

The delegate (md) was growing with every inner iteration of the MulticastDelegateCombineInvoke benchmark, which is why xunit-performance was producing such widely spread results.

public void MulticastDelegateCombineInvoke()
{
    MultiDelegate md = null; // THIS was growing in every iteration
    Object obj = new Object();

    foreach (var iteration in Benchmark.Iterations)
    {
        MultiDelegate md1 = new MultiDelegate(this.Invocable2);
        // more code removed for brevity

        using (iteration.StartMeasurement())
        {
            for (int i = 0; i < Benchmark.InnerIterationCount; i++)
            {
                md = (MultiDelegate)Delegate.Combine(md1, md);
                md = (MultiDelegate)Delegate.Combine(md2, md);
                md = (MultiDelegate)Delegate.Combine(md3, md);
                md = (MultiDelegate)Delegate.Combine(md4, md);
                md = (MultiDelegate)Delegate.Combine(md5, md);
                md = (MultiDelegate)Delegate.Combine(md6, md);
                md = (MultiDelegate)Delegate.Combine(md7, md);
                md = (MultiDelegate)Delegate.Combine(md8, md);
                md = (MultiDelegate)Delegate.Combine(md9, md);
                md = (MultiDelegate)Delegate.Combine(md10, md);
            }
        }
    }

    md(obj, 100, 100);
}
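The effect can be reproduced in isolation with the sketch below. MulticastSketch, Invocable, Growing, and Bounded are hypothetical names, and resetting the accumulator in each iteration is just one possible fix, not necessarily the one applied in the port. Both methods assume iterations >= 1.

```csharp
using System;

// Hypothetical, self-contained illustration of the growing invocation list.
public static class MulticastSketch
{
    public delegate void MultiDelegate(object obj, int a, int b);

    private static void Invocable(object obj, int a, int b) { }

    // Original shape: the accumulator outlives each iteration,
    // so every Combine call works on a longer invocation list than the previous one.
    public static int Growing(int iterations)
    {
        MultiDelegate md = null;
        MultiDelegate md1 = Invocable;
        for (int i = 0; i < iterations; i++)
        {
            md = (MultiDelegate)Delegate.Combine(md1, md);
        }
        return md.GetInvocationList().Length;
    }

    // One possible fix: start from an empty delegate in every iteration,
    // so each iteration performs the same amount of work.
    public static int Bounded(int iterations)
    {
        MultiDelegate md = null;
        MultiDelegate md1 = Invocable;
        for (int i = 0; i < iterations; i++)
        {
            md = null;
            md = (MultiDelegate)Delegate.Combine(md1, md);
        }
        return md.GetInvocationList().Length;
    }
}
```

With 5 iterations, Growing ends with 5 entries in the invocation list while Bounded ends with 1, so the cost per iteration stays constant only in the second variant.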

Bugs found and not fixed

https://github.com/dotnet/coreclr/issues/18560 - GC.GetAllocatedBytesForCurrentThread always returns 0 when processor affinity is set

Member

jorive left a comment

LGTM

adamsitnik merged commit afb4b7c into dotnet:master on Jun 21, 2018
adamsitnik deleted the coreClrBdn branch on October 17, 2018 15:05