Port CoreClr benchmarks to BenchmarkDotNet #36

adamsitnik · 2018-05-29T16:27:05Z

I did not move the files; I kept the old ones, copied them to the benchmarks folder, and then ported the copies to BDN.
I fixed a few BDN issues along the way.
I have not ported all of them yet (it's WIP).
Everything ported so far was ported "as is" and kept the old Ids (but I changed the DisplayNames from "Test" to something meaningful).

jorive · 2018-05-29T18:00:39Z

src/benchmarks/BenchmarksGame/Inputs/InputFileHelper.cs

+            return fullPath;
+        }
+
+        internal static int GetFileLength(string filePath) => (int) new FileInfo(filePath).Length;


(int)

Does it need to be int? Can we keep long instead? The API user can cast the result when needed. Thoughts?

sure thing, I added the cast here because it was hardcoded to an int previously
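The suggestion above (return long and let the caller cast) could look roughly like the following. This is only a sketch: the class name and the temp-file setup are made up for illustration and are not the code from the PR.

```csharp
using System;
using System.IO;

public static class InputFileLengthDemo
{
    // Returns the length as long, exactly as FileInfo.Length reports it.
    public static long GetFileLength(string filePath) => new FileInfo(filePath).Length;

    public static void Main()
    {
        // Hypothetical input: a temporary 1 KB file created just for this demo.
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, new byte[1024]);
            long length = GetFileLength(path);

            // The caller narrows only where an int is really required.
            // checked() throws OverflowException instead of silently wrapping.
            int bufferSize = checked((int)length);
            Console.WriteLine(bufferSize); // prints 1024
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Keeping the helper as long avoids truncating inputs larger than int.MaxValue bytes, and the cast stays visible at the call site that needs it.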

jorive · 2018-05-29T18:02:19Z

src/benchmarks/Program.cs

-            var baseJob = Job.ShortRun; // let's use the Short Run for better first user experience ;)
+            var baseJob = Job.Default
+                .WithWarmupCount(1) // 1 warmup is enough for our purpose
+                .WithMaxTargetIterationCount(20);  // we don't want to run more that 20 iterations


20

Can we override this at runtime? Maybe it should be a command line option with default.

Not as of today, but I can add it as a command line option.

jorive · 2018-05-29T18:03:07Z

src/benchmarks/Program.cs

@@ -58,6 +61,10 @@ private static IEnumerable<Job> GetJobs(Options options, Job baseJob)

            if (options.RunClr)
                yield return baseJob.With(Runtime.Clr);
+            if (options.RunLegacyJitX64)
+                yield return baseJob.With(Runtime.Clr).With(Jit.LegacyJit).With(Platform.X64);


LegacyJit

Wasn't legacy removed just recently from coreclr repo? Is it still needed?

yes, but this is for Legacy Jit for .NET (not Core), so you can easily compare both Jits (surprisingly the old one can be better in some benchmarks)

jorive · 2018-05-29T18:08:03Z

src/benchmarks/Benchstones/BenchF/DMath.cs

+
+namespace Benchstone.BenchF
+{
+public class DMath


Classes cannot be static? Just wondering. #Resolved

Yes, BDN derives a type from the type with benchmarks to add all the boilerplate, so the class can be neither sealed nor static.

However, if you want, you can still benchmark a static type; you just need a non-static type with the benchmark to do that:

using BenchmarkDotNet.Attributes;

static class Sth { public static void Method() { } }

public class SthBenchmarks
{
    [Benchmark]
    public void Method() => Sth.Method();
}

#Resolved

jorive · 2018-06-14T17:45:21Z

@adamsitnik BenchView handles any metric you throw at it. It is a matter of passing the right units when you upload. Currently, we pass all results in milliseconds, and I would prefer that benchmarks run for longer than nanoseconds.

jorive · 2018-06-14T17:47:16Z

@adamsitnik As long as we are avoiding duplication and not losing coverage of a particular scenario, I am all for trimming redundancy.

adamsitnik · 2018-06-15T10:34:55Z

@jorive To avoid any scaling issues, I reverted the changes in PerfLabTests, so it's 1:1

adamsitnik · 2018-06-20T16:40:43Z

nit: @"8649423196\t:3000"

I removed the workaround and added a proper implementation of \t and \r\n handling to BDN: dotnet/BenchmarkDotNet@5716c14

adamsitnik · 2018-06-20T16:44:05Z

@jorive Ok, I believe that the port is COMPLETE ;) I have removed all previous comments to keep the issue cleaner. I am going to post a summary in a few minutes.

adamsitnik · 2018-06-20T17:20:12Z

Summary

Time

BenchmarkDotNet needs 42 minutes to run all 304 CoreCLR benchmarks; xunit-performance needed 38 minutes (on my box, of course). Initially BDN needed 4 hours, so I had to improve a few things to get it this close. I started the preparations for this port a long time ago; more details are in dotnet/BenchmarkDotNet#550

Quality of the results

The quality of the results has improved: the majority of the benchmarks have a better (narrower) distribution, and some have a much better one. Two benchmarks out of 304 have a worse distribution (LinqBenchmarks.Where00ForX and Log10SingleBenchmark).

Unstable benchmarks

We have at least two multimodal benchmarks: BinaryTrees_5 (#39) and SpectralNorm_3 (#41)

BenchmarkDotNet prints some nice, user-friendly histograms and gives a very clear warning about the problem.

-------------------- Histogram --------------------
[0.942 ms ; 1.237 ms) | @@@
[1.237 ms ; 1.614 ms) | @@@@@@@@@@
[1.614 ms ; 1.892 ms) | @
[1.892 ms ; 2.269 ms) | @@@@@@@
[2.269 ms ; 2.563 ms) | @@@
[2.563 ms ; 2.940 ms) | @@@@@@@@@
[2.940 ms ; 3.354 ms) | @@@@
[3.354 ms ; 3.731 ms) | @@@
---------------------------------------------------

Result differences

xunit-performance runs all the benchmarks in the same process, whereas BDN spawns a new process per benchmark. This affects the results, but the difference is very small for most of the benchmarks.

A very good example of the difference is the Order00ManualX benchmark.

dotnet run -c Release -f netcoreapp2.1 -- --method Order00ManualX

Method	Mean	Error	StdDev	Median	Min	Max	Gen 0	Allocated
Order00ManualX	194.5 ms	0.8240 ms	0.6881 ms	194.4 ms	193.5 ms	196.2 ms	2000.0000	15.26 MB

Running only the LINQ benchmarks within a single process gives very similar results:

dotnet PerformanceHarness.dll DotNetBenchmark-Linq.dll --perf:collect stopwatch

DotNetBenchmark-Linq.dll	Metric	Unit	Iterations	Average	STDEV.S	Min	Max
LinqBenchmarks.Order00ManualX	Duration	msec	52	193.099	1.307	191.853	197.534

But when it was executed together with many other benchmarks in the same process, xunit-performance reported over 50% more time:

DotNetBenchmark-Linq.dll	Metric	Unit	Iterations	Average	STDEV.S	Min	Max
LinqBenchmarks.Order00ManualX	Duration	msec	34	301.511	4.907	295.667	321.288
LinqBenchmarks.Order00ManualX	Allocation Size on Benchmark Execution Thread	bytes	34	1.600E+007	0.000	1.600E+007	1.600E+007

Huge differences: Loop Alignment dependent benchmarks

A few of the benchmarks are very heavily dependent on loop alignment:

dotnet run -c Release -f netcoreapp2.1 -- --join --class IniArray --testAlignment

Method	EnvironmentVariables	Mean	Error	StdDev	Median	Min	Max	Allocated
IniArray	COMPlus_JitAlignLoops=0	95.98 ms	4.170 ms	4.802 ms	93.86 ms	91.87 ms	107.16 ms	56 B
IniArray	COMPlus_JitAlignLoops=1	59.62 ms	1.421 ms	1.579 ms	60.27 ms	57.55 ms	63.30 ms	56 B

dotnet run -c Release -f netcoreapp2.1 -- --join --method Log10DoubleBenchmark --testAlignment

Method	EnvironmentVariables	Mean	Error	StdDev	Median	Min	Max	Allocated
Log10DoubleBenchmark	COMPlus_JitAlignLoops=0	34.42 us	1.9390 us	2.2329 us	33.12 us	31.88 us	39.21 us	0 B
Log10DoubleBenchmark	COMPlus_JitAlignLoops=1	33.22 us	0.4531 us	0.4017 us	33.31 us	31.85 us	33.49 us	0 B

dotnet run -c Release -f netcoreapp2.1 -- --join --method SinhSingleBenchmark --testAlignment

Method	EnvironmentVariables	Mean	Error	StdDev	Median	Min	Max	Allocated
SinhSingleBenchmark	COMPlus_JitAlignLoops=0	55.45 us	3.3886 us	3.7664 us	53.67 us	51.47 us	63.23 us	0 B
SinhSingleBenchmark	COMPlus_JitAlignLoops=1	53.81 us	0.8462 us	0.7066 us	53.97 us	51.51 us	54.24 us	0 B
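Since each benchmark runs in its own process, the two rows per benchmark above amount to launching the same code with a different value of COMPlus_JitAlignLoops in the child process environment. The sketch below shows that idea with plain System.Diagnostics only; it is not BDN's implementation, and "Benchmarks.dll" is a hypothetical placeholder for the benchmark assembly.

```csharp
using System;
using System.Diagnostics;

public static class AlignmentRunSketch
{
    // Builds the start info for one child process with loop alignment on or off.
    public static ProcessStartInfo BuildStartInfo(string benchmarkDll, bool alignLoops)
    {
        var startInfo = new ProcessStartInfo("dotnet", benchmarkDll)
        {
            UseShellExecute = false // needed so the environment below is applied
        };
        startInfo.Environment["COMPlus_JitAlignLoops"] = alignLoops ? "1" : "0";
        return startInfo;
    }

    public static void Main()
    {
        // Only prints the configuration; the processes are not started here.
        foreach (bool alignLoops in new[] { false, true })
        {
            ProcessStartInfo startInfo = BuildStartInfo("Benchmarks.dll", alignLoops);
            Console.WriteLine($"COMPlus_JitAlignLoops={startInfo.Environment["COMPlus_JitAlignLoops"]}");
        }
    }
}
```

Passing the returned ProcessStartInfo to Process.Start would run one variant; comparing the two runs is what isolates the alignment effect from everything else.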

Huge differences: GC dependent benchmarks:

A single execution of BenchBitOps allocates more than 12 GB of managed memory! For this benchmark BDN reports 1/2 or 1/3 of the duration that xunit-performance was reporting (depending on how many other benchmarks were executed before it in the same process).

dotnet run -c Release -f netcoreapp2.1 -- --join --methods BenchBitOps --testAlignment

Method	EnvironmentVariables	Mean	Error	StdDev	Median	Min	Max	Gen 0	Gen 1	Gen 2	Allocated
BenchBitOps	COMPlus_JitAlignLoops=0	671.7 ms	5.663 ms	5.020 ms	671.1 ms	664.9 ms	679.8 ms	4166000.0000	4166000.0000	4166000.0000	12.21 GB
BenchBitOps	COMPlus_JitAlignLoops=1	667.9 ms	3.713 ms	2.899 ms	667.1 ms	664.6 ms	674.9 ms	4166000.0000	4166000.0000	4166000.0000	12.21 GB

dotnet PerformanceHarness.dll DotNetBenchmark-Bytemark.dll

DotNetBenchmark-Bytemark.dll	Metric	Unit	Iterations	Average	STDEV.S	Min	Max
ByteMark.BenchBitOps	Duration	msec	8	1395.028	1.747	1392.748	1397.281

dotnet PerformanceHarness.dll --perf:collect stopwatch+gcapi

DotNetBenchmark-Bytemark.dll	Metric	Unit	Iterations	Average	STDEV.S	Min	Max
ByteMark.BenchBitOps	Duration	msec	6	1961.970	78.708	1924.026	2122.209
ByteMark.BenchBitOps	Allocation Size on Benchmark Execution Thread	bytes	6	1.311E+010	0.000	1.311E+010	1.311E+010
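As a quick check of the "1/2 or 1/3" figure, the ratios can be computed directly from the mean durations in the three BenchBitOps tables. The class and constant names below are made up for this sketch; only the numbers come from the tables.

```csharp
using System;
using System.Globalization;

public static class BenchBitOpsRatios
{
    // Mean durations in milliseconds, copied from the tables above.
    public const double BdnMeanMs = 671.7;            // BDN, COMPlus_JitAlignLoops=0
    public const double XunitDefaultMs = 1395.028;    // xunit-performance, no --perf:collect argument
    public const double XunitGcApiMs = 1961.970;      // xunit-performance, stopwatch+gcapi

    public static double Ratio(double numerator, double denominator) => numerator / denominator;

    public static void Main()
    {
        // Invariant culture keeps the decimal separator stable across machines.
        Console.WriteLine(Ratio(BdnMeanMs, XunitDefaultMs).ToString("F2", CultureInfo.InvariantCulture)); // prints 0.48
        Console.WriteLine(Ratio(BdnMeanMs, XunitGcApiMs).ToString("F2", CultureInfo.InvariantCulture));   // prints 0.34
    }
}
```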

Removed benchmarks

I have not ported the serializer benchmarks because the benchmarks repo already contains a lot of them and there was no need to duplicate them.

Ids

All the ids, except those of the removed benchmarks, have been preserved, so they can still be used in BenchView to keep track of the performance.

Bugs found and fixed

When I was comparing the results, I found a few huge differences which turned out to be bugs in the existing benchmarks.

Bug: empty loops

The following benchmarks were optimized by the JIT into empty loops (the loop body was constant):

DotNetBenchmark-PerfLab.dll	Metric	Unit	Iterations	Average	STDEV.S	Min	Max
PerfLabTests.CastingPerf.CheckArrayIsInterfaceYes	Duration	msec	100	0.067	0.011	0.040	0.090
PerfLabTests.CastingPerf.CheckIsInstAnyIsInterfaceNo	Duration	msec	100	0.042	0.005	0.040	0.092
PerfLabTests.CastingPerf.CheckIsInstAnyIsInterfaceYes	Duration	msec	100	0.041	6.735E-004	0.040	0.045
PerfLabTests.CastingPerf.CheckObjIsInterfaceNo	Duration	msec	100	0.040	0.006	0.038	0.072
PerfLabTests.CastingPerf.CheckObjIsInterfaceYes	Duration	msec	100	0.043	0.020	0.040	0.237
PerfLabTests.LowLevelPerf.GenericClassGenericStaticField	Duration	msec	100	0.162	0.035	0.066	0.228
PerfLabTests.LowLevelPerf.GenericClassWithIntGenericInstanceField	Duration	msec	100	0.080	0.005	0.070	0.094
PerfLabTests.LowLevelPerf.ObjectStringIsString	Duration	msec	100	0.057	0.004	0.051	0.074

I have fixed them, so now they measure the right thing.
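To illustrate the kind of problem (not the actual fix from this PR, which is not shown here), the sketch below contrasts a loop whose result is discarded with one whose result escapes the method. The class, field, and method names are hypothetical. With BenchmarkDotNet, returning the value from the [Benchmark] method serves the same purpose, because the harness consumes the returned result.

```csharp
using System;
using System.Collections.Generic;

public static class CastingLoops
{
    // Non-readonly static field, so the JIT cannot assume the object's type up front.
    public static object Target = new List<int>();

    // Problematic shape: the result of the type check is never used,
    // so the JIT is free to drop the check and leave an empty loop.
    public static void CheckDiscarded(int iterations)
    {
        for (int i = 0; i < iterations; i++)
        {
            bool unused = Target is IEnumerable<int>;
        }
    }

    // Safer shape: every check feeds a value that is returned to the caller,
    // so the loop body can no longer be treated as dead code.
    public static int CheckConsumed(int iterations)
    {
        int hits = 0;
        for (int i = 0; i < iterations; i++)
        {
            if (Target is IEnumerable<int>) hits++;
        }
        return hits;
    }

    public static void Main()
    {
        CheckDiscarded(1000);
        Console.WriteLine(CheckConsumed(1000)); // prints 1000
    }
}
```

A measured time close to that of an empty loop, as in the table above, is usually the first hint that a benchmark has the discarded shape.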

Bug: growing multicast delegate

The delegate (md) was growing with every inner iteration of the MulticastDelegateCombineInvoke benchmark, which is why xunit-performance was producing very widely spread results.

public void MulticastDelegateCombineInvoke()
{
    MultiDelegate md = null; // THIS was growing in every iteration
    Object obj = new Object();

    foreach (var iteration in Benchmark.Iterations)
    {
        MultiDelegate md1 = new MultiDelegate(this.Invocable2);
        // more code removed for brevity

        using (iteration.StartMeasurement())
        {
            for (int i = 0; i < Benchmark.InnerIterationCount; i++)
            {
                md = (MultiDelegate)Delegate.Combine(md1, md);
                md = (MultiDelegate)Delegate.Combine(md2, md);
                md = (MultiDelegate)Delegate.Combine(md3, md);
                md = (MultiDelegate)Delegate.Combine(md4, md);
                md = (MultiDelegate)Delegate.Combine(md5, md);
                md = (MultiDelegate)Delegate.Combine(md6, md);
                md = (MultiDelegate)Delegate.Combine(md7, md);
                md = (MultiDelegate)Delegate.Combine(md8, md);
                md = (MultiDelegate)Delegate.Combine(md9, md);
                md = (MultiDelegate)Delegate.Combine(md10, md);
            }
        }
    }

    md(obj, 100, 100);
}
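The growth is easy to reproduce outside of any benchmark harness. The sketch below is a simplified stand-in (one handler instead of ten, hypothetical names): it counts the invocation list after repeated Delegate.Combine calls, with and without resetting md. Resetting inside the loop is only one possible way to keep the work per iteration constant; it is not necessarily the fix applied in this PR.

```csharp
using System;

public static class DelegateGrowthDemo
{
    public delegate void MultiDelegate(object sender, int x, int y);

    private static void Handler(object sender, int x, int y) { }

    // Mirrors the buggy shape: md lives outside the loop and keeps accumulating targets.
    // Expects innerIterations >= 1.
    public static int CombineWithoutReset(int innerIterations)
    {
        MultiDelegate md = null;
        var md1 = new MultiDelegate(Handler);
        for (int i = 0; i < innerIterations; i++)
        {
            md = (MultiDelegate)Delegate.Combine(md1, md);
        }
        return md.GetInvocationList().Length;
    }

    // One possible fix: start from an empty delegate in every inner iteration.
    // Expects innerIterations >= 1.
    public static int CombineWithReset(int innerIterations)
    {
        MultiDelegate md = null;
        var md1 = new MultiDelegate(Handler);
        for (int i = 0; i < innerIterations; i++)
        {
            md = null;
            md = (MultiDelegate)Delegate.Combine(md1, md);
        }
        return md.GetInvocationList().Length;
    }

    public static void Main()
    {
        Console.WriteLine(CombineWithoutReset(1000)); // prints 1000
        Console.WriteLine(CombineWithReset(1000));    // prints 1
    }
}
```

Because each Combine call has to build a new invocation list containing all previous targets, the cost of one call grows with the number of calls already made, so later iterations are slower than earlier ones.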

Bugs found and not fixed

https://github.com/dotnet/coreclr/issues/18560 - GC.GetAllocatedBytesForCurrentThread always returns 0 when processor affinity is set

jorive

LGTM

adamsitnik added 22 commits May 24, 2018 01:08

copy some coreclr benchmarks

9476fed

port some some benchmarks

5071775

copy some coreclr benchmarks 2

f4b81c3

port some some benchmarks 2

cf5951f

rename the folder with CoreClr benchmarks to BenchmarksGame

df07e67

fix SpectralNorm_1 benchmark

46070ae

copy knucleotide benchmarks

702be05

get knucleotide working

1761381

copy RegexRedux

54ff968

make RegexRedux work

7b1d96e

copy reverse-complement

2e40808

make reverse-complement work

2e4b9b6

copy all .txt files to bin

052fe9f

update to the latest version

ea468a3

read the file length, dont hardcode it (it was buggy)

201ba6b

use Description to give benchmarks some nice display name, but preserve the Id which is exported for BenchView purpose

141f2f8

remove the Rider files from repo, ignore them

79a21f8

dont run more than 20 iterations

ffc9fb2

add possibility to run benchmarks for Legacy Jits

2c5b539

copy all benchstones

0ec180c

port Benchstones to BenchmarkDotNet

5eff7f1

change the default iteration time from 0,5s to 0,25s to run benchmarks faster

475793f

jorive reviewed May 29, 2018

adamsitnik added 4 commits May 30, 2018 09:48

update to the version which supports jagged arrays

78bd131

copy Burgers

be2104e

port Burgers

8581484

copy DefaultEqualityComparerPerf

79675a2

adamsitnik added 3 commits June 14, 2018 14:12

don't use argument, keep old benchmark full name

52e6d8e

move the arguments to fields to match existing benchmark id

c3a4ed0

add missing category

5be5f96

adamsitnik added 3 commits June 14, 2018 20:14

use Params from BDN to express things that were arguments used to setup benchmark, it allows to preserve ID

ed48860

should be a part of previous commit

66cc17f

don't make PerfLabTests nano-benchmarks, it would blow up the scaling in BenchView

ccba0e1

adamsitnik added 14 commits June 15, 2018 20:17

make sure ALL ids are the same

e175d9e

ResultsValidator - small helper utility to print a diff, I will remove it after we port all the benchmarks

503e8d8

remove the workaround for handling whitespaced in benchmarks Ids, proper fix has been applied to BDN

a18473e

allow the users to specify outlier removal mode from console args

0a0cd71

add median to the default columns

afadbd8

allow the users to test COMPlus_JitAlignLoop 0 vs 1 and affinity set vs no affinity and the permutation of it

56bec57

make sure JIT does not optimize the benchmarks to empty loops!

d5c3bf0

update to latest BDN which gives us some nice attributes

84e0822

apply some non-default settings to make the results from multimodal benchmarks more stable

e38e951

add comments to benchmarks whic hare very dependent on loop alignment

cd0fa8f

fix ObjectGetTypeNoBoxing, disable ObjectGetType which is getting optimized to an empty loop

101bc24

change the order of exported columns in CSV to make it more easy to copy paste it to an excel with some formulas

d0e0250

rename console attributes, don't introduce new standards ;) + don't write about default value, command line parser does it out of the box

21da2d6

add docs for test alignment and test affinity + fix a bug in arguments handling

14013d3

adamsitnik changed the title ~~[WIP] Port CoreClr benchmarks to BenchmarkDotNet~~ Port CoreClr benchmarks to BenchmarkDotNet Jun 20, 2018

jorive approved these changes Jun 21, 2018

adamsitnik merged commit afb4b7c into dotnet:master Jun 21, 2018

adamsitnik deleted the coreClrBdn branch October 17, 2018 15:05
