
Problems with default UnrollFactor in V0.11.0 #837

Closed
Tornhoof opened this issue Jul 23, 2018 · 12 comments · Fixed by #1573

@Tornhoof
Contributor

Tornhoof commented Jul 23, 2018

Description
I just upgraded SpanJson to v0.11 of BenchmarkDotNet, and the new UnrollFactor logic does not work for codegen-heavy benchmarks.
Actual Behaviour

// Benchmark: ModelBenchmark.DeserializeAccessTokenWithJilSerializer: DefaultJob
// *** Execute ***
// Launch: 1 / 1
// BeforeAnythingElse
// Benchmark Process Environment Information:
// Runtime=.NET Core 2.1.1 (CoreCLR 4.6.26606.02, CoreFX 4.6.26606.05), 64bit RyuJIT
// GC=Concurrent Workstation
// Job: DefaultJob
OverheadJitting  1: 1 op, 327424.34 ns, 327.4243 us/op
WorkloadJitting  1: 1 op, 2934831805.27 ns, 2.9348 s/op
WorkloadWarmup   1: 1 op, 16384.02 ns, 16.3840 us/op
WorkloadWarmup   2: 1 op, 10496.01 ns, 10.4960 us/op
WorkloadWarmup   3: 1 op, 9728.01 ns, 9.7280 us/op
WorkloadWarmup   4: 1 op, 9216.01 ns, 9.2160 us/op
WorkloadWarmup   5: 1 op, 8960.01 ns, 8.9600 us/op
WorkloadWarmup   6: 1 op, 23040.02 ns, 23.0400 us/op
WorkloadWarmup   7: 1 op, 7936.01 ns, 7.9360 us/op
WorkloadWarmup   8: 1 op, 8704.01 ns, 8.7040 us/op
WorkloadWarmup   9: 1 op, 9472.01 ns, 9.4720 us/op
WorkloadWarmup  10: 1 op, 8448.01 ns, 8.4480 us/op
// BeforeActualRun
WorkloadActual   1: 1 op, 10240.01 ns, 10.2400 us/op
WorkloadActual   2: 1 op, 25088.03 ns, 25.0880 us/op

The WorkloadActual lines then repeat around 90 times, with quite a large variation in timings.
Expected Behaviour
I get a larger number of ops with more consistent timing; the 1 op benchmark is off by a very large margin.

// **************************
// Benchmark: ModelBenchmark.DeserializeAccessTokenWithJilSerializer: Job-GXPFMQ(UnrollFactor=2)
// *** Execute ***
// Launch: 1 / 1
// BeforeAnythingElse
// Benchmark Process Environment Information:
// Runtime=.NET Core 2.1.1 (CoreCLR 4.6.26606.02, CoreFX 4.6.26606.05), 64bit RyuJIT
// GC=Concurrent Workstation
// Job: Job-CZJWNW(UnrollFactor=2)
OverheadJitting  1: 2 op, 352000.36 ns, 176.0002 us/op
WorkloadJitting  1: 2 op, 2947211977.95 ns, 1.4736 s/op
WorkloadPilot    1: 4 op, 13568.01 ns, 3.3920 us/op
WorkloadPilot    2: 8 op, 14848.02 ns, 1.8560 us/op
WorkloadPilot    3: 16 op, 15872.02 ns, 992.0010 ns/op
WorkloadPilot    4: 32 op, 25344.03 ns, 792.0008 ns/op
WorkloadPilot    5: 64 op, 44288.05 ns, 692.0007 ns/op
WorkloadPilot    6: 128 op, 84480.09 ns, 660.0007 ns/op
WorkloadPilot    7: 256 op, 152576.16 ns, 596.0006 ns/op
WorkloadPilot    8: 512 op, 300032.31 ns, 586.0006 ns/op
WorkloadPilot    9: 1024 op, 574208.59 ns, 560.7506 ns/op
WorkloadPilot   10: 2048 op, 1207041.24 ns, 589.3756 ns/op
WorkloadPilot   11: 4096 op, 2299650.35 ns, 561.4381 ns/op
WorkloadPilot   12: 8192 op, 4912389.03 ns, 599.6569 ns/op
WorkloadPilot   13: 16384 op, 9578249.81 ns, 584.6100 ns/op
WorkloadPilot   14: 32768 op, 18317842.76 ns, 559.0162 ns/op
WorkloadPilot   15: 65536 op, 36165925.03 ns, 551.8482 ns/op
WorkloadPilot   16: 131072 op, 72334922.07 ns, 551.8717 ns/op
WorkloadPilot   17: 262144 op, 143300498.74 ns, 546.6480 ns/op
WorkloadPilot   18: 524288 op, 285817892.68 ns, 545.1544 ns/op
WorkloadPilot   19: 1048576 op, 572262474.00 ns, 545.7520 ns/op
OverheadWarmup   1: 1048576 op, 3304963.38 ns, 3.1519 ns/op
OverheadWarmup   2: 1048576 op, 3340035.42 ns, 3.1853 ns/op
OverheadWarmup   3: 1048576 op, 3255811.33 ns, 3.1050 ns/op
OverheadWarmup   4: 1048576 op, 3256067.33 ns, 3.1052 ns/op
OverheadWarmup   5: 1048576 op, 3302403.38 ns, 3.1494 ns/op
OverheadWarmup   6: 1048576 op, 3288835.37 ns, 3.1365 ns/op
OverheadActual   1: 1048576 op, 3316483.40 ns, 3.1628 ns/op
OverheadActual   2: 1048576 op, 3276291.35 ns, 3.1245 ns/op
OverheadActual   3: 1048576 op, 3233283.31 ns, 3.0835 ns/op
OverheadActual   4: 1048576 op, 3339267.42 ns, 3.1846 ns/op
OverheadActual   5: 1048576 op, 3338755.42 ns, 3.1841 ns/op
OverheadActual   6: 1048576 op, 3284739.36 ns, 3.1326 ns/op
OverheadActual   7: 1048576 op, 3252483.33 ns, 3.1018 ns/op
OverheadActual   8: 1048576 op, 3229443.31 ns, 3.0798 ns/op
OverheadActual   9: 1048576 op, 3227139.30 ns, 3.0776 ns/op
OverheadActual  10: 1048576 op, 3257347.34 ns, 3.1064 ns/op
OverheadActual  11: 1048576 op, 3340035.42 ns, 3.1853 ns/op
OverheadActual  12: 1048576 op, 3311619.39 ns, 3.1582 ns/op
OverheadActual  13: 1048576 op, 3286019.36 ns, 3.1338 ns/op
OverheadActual  14: 1048576 op, 3352835.43 ns, 3.1975 ns/op
OverheadActual  15: 1048576 op, 3325187.40 ns, 3.1711 ns/op
WorkloadWarmup   1: 1048576 op, 571615561.33 ns, 545.1351 ns/op
WorkloadWarmup   2: 1048576 op, 571237192.95 ns, 544.7742 ns/op
WorkloadWarmup   3: 1048576 op, 572946506.70 ns, 546.4044 ns/op
WorkloadWarmup   4: 1048576 op, 571277128.99 ns, 544.8123 ns/op
WorkloadWarmup   5: 1048576 op, 571070280.78 ns, 544.6151 ns/op
WorkloadWarmup   6: 1048576 op, 571714377.44 ns, 545.2293 ns/op
WorkloadWarmup   7: 1048576 op, 571312713.02 ns, 544.8463 ns/op
// BeforeActualRun
WorkloadActual   1: 1048576 op, 571627337.35 ns, 545.1463 ns/op
WorkloadActual   2: 1048576 op, 571613257.33 ns, 545.1329 ns/op

I think that's because the WorkloadJitting is taking that long (yes, it's codegen-heavy); as soon as I change the UnrollFactor to 2, it works as expected (at least as far as I've run the benchmarks).
The benchmark is that codegen-heavy because I initialize ~480 serializers statically in that class (the whole benchmark class is one large T4 template, see https://github.com/Tornhoof/SpanJson/blob/master/SpanJson.Benchmarks/ModelBenchmark.cs)

Repro

  1. Check out the branch BenchmarkUpdates on https://github.com/Tornhoof/SpanJson
  2. cd SpanJson.Benchmarks
  3. dotnet run -c Release and select 2
@AndreyAkinshin
Member

@Tornhoof, thank you very much for the report.

You correctly described the problem: BenchmarkDotNet ignores the Pilot stage (where we should choose the best invocation count) because WorkloadJitting takes too much time. We did this "optimization" for heavy macrobenchmarks, but obviously it doesn't work for your case.

I think we should always run the Pilot stage if the jitting (cold start) took a reasonable amount of time (e.g., < 20 sec). It should automatically fix the invocation count and solve your problem. Currently, we compare the WorkloadJitting time with engineParameters.IterationTime, which is 500 ms by default. It probably makes sense to introduce an additional TimeInterval option for the WorkloadJitting threshold value.
We should discuss it with @adamsitnik, but he will be available only next week.

As a temporary workaround, you can manually set the invocation count or increase the IterationTime value.
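
For anyone hitting this before a fix lands, the workaround above can be expressed with BenchmarkDotNet's job API. The following is a sketch against the v0.11-era API; method and namespace names have shifted between versions (e.g. `Add` vs. `AddJob`, the Horology types), so treat the exact identifiers as approximate:

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Horology;
using BenchmarkDotNet.Jobs;

// Workaround sketch: steer the engine away from the single-invocation
// path, either by pinning the invocation counts (so the Pilot stage is
// not needed) or by raising IterationTime above the one-off jitting cost.
public class WorkaroundConfig : ManualConfig
{
    public WorkaroundConfig()
    {
        // Option A: pin the counts. Note that InvocationCount must be
        // a multiple of UnrollFactor.
        Add(Job.Default
            .WithUnrollFactor(2)
            .WithInvocationCount(1024));

        // Option B: raise IterationTime beyond the WorkloadJitting time.
        // Add(Job.Default.WithIterationTime(TimeInterval.FromSeconds(1)));
    }
}
```

Applying the `[InvocationCount(count, unrollFactor)]` attribute to the benchmark class should achieve roughly the same as Option A without a custom config.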

@Tornhoof
Contributor Author

Thank you for the workaround.

@adamsitnik
Member

@Tornhoof big thanks for the bug report and a clear repro case!

I will try to find a solution and fix it over the next weekend.

@AndreyAkinshin
Member

@adamsitnik have you made any progress on this issue?
I have another idea about how to solve the problem without additional settings. Let's check that we do enough operations during the WorkloadWarmup stage. If it turns out that the number of operations is not enough and our WorkloadJitting heuristics produced a wrong hypothesis about the warmed operation duration, we should restart iterations from the Pilot stage.
What do you think?
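
That check could look roughly like the following. This is a hypothetical sketch of the proposed safeguard, not BenchmarkDotNet's actual engine code; the type and method names are invented for illustration, and the restart threshold is arbitrary:

```csharp
using System;

// Hypothetical sketch: after WorkloadWarmup, verify that the
// jitting-based hypothesis scheduled enough operations per iteration.
// If the warmed-up per-op time implies the engine could run orders of
// magnitude more invocations within the target iteration time than it
// actually scheduled, the hypothesis was wrong and the engine should
// restart from the Pilot stage.
public static class PilotRestartHeuristic
{
    public static bool ShouldRestartPilot(
        TimeSpan warmupIterationTime,   // measured WorkloadWarmup iteration time
        long scheduledInvocations,      // invocation count derived from jitting
        TimeSpan targetIterationTime)   // e.g. the default 500 ms IterationTime
    {
        // Warmed-up cost of a single operation, in nanoseconds.
        double perOpNs =
            warmupIterationTime.TotalMilliseconds * 1e6 / scheduledInvocations;

        // How many invocations would fill the target iteration time?
        double desiredInvocations =
            targetIterationTime.TotalMilliseconds * 1e6 / perOpNs;

        // Arbitrary factor: restart if we are off by 100x or more.
        return desiredInvocations >= scheduledInvocations * 100;
    }
}
```

In the log above this fires naturally: the warmed benchmark runs in ~10 µs per op, so a 500 ms iteration could hold tens of thousands of invocations instead of the single one the jitting heuristic chose.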

@adamsitnik
Member

@AndreyAkinshin Great idea!

So far I have observed that this issue happens only for benchmarks that rely heavily on Reflection.ILEmit and my only idea was to check if the benchmark relies on this assembly and if it does, run it one more time. But I think that your solution is much more universal and clean!

@AndreyAkinshin
Member

I have observed that this issue happens only for benchmarks that rely heavily on Reflection.ILEmit

In fact, there are too many factors that can slow down the first iteration (from heavy assembly loading to custom user logic).

@adamsitnik
Member

In fact, there are too many factors that can slow down the first iteration (from heavy assembly loading to custom user logic).

Totally agree! Would you like to implement it?

@AndreyAkinshin
Member

Would you like to implement it?

Yes.

Turnerj added a commit to TurnerSoftware/InfinityCrawler that referenced this issue Jan 3, 2019
Need to perform a warmup so the workload jitting doesn't take too long
dotnet/BenchmarkDotNet#837
AArnott added a commit to AArnott/vs-streamjsonrpc that referenced this issue May 18, 2019
ycrumeyrolle added a commit to uruk-project/Jwt that referenced this issue Jul 14, 2019
ycrumeyrolle added a commit to uruk-project/Jwt that referenced this issue Jul 14, 2019
@kevingosse

Hello, any update on this?

I just ran into this issue (and given the list of mentions, I guess this is widespread). Any way I could help?

adamsitnik added a commit that referenced this issue Oct 24, 2020
… magic number based on observations of the reported bug)

we call it one more time to see if Jitting itself has not dominated the first invocation
if it did, it should NOT be a single invocation engine (fixes #837, fixes #1337 and fixes #1338)
@adamsitnik
Member

Please excuse me for the delay. I've sent a fix: #1573

AndreyAkinshin pushed a commit that referenced this issue Oct 26, 2020
…n lasts longer than IterationTime (#1573)

* add the test

* if the Jitting took more than IterationTime but still less than 1s (a magic number based on observations of the reported bug)

we call it one more time to see if Jitting itself has not dominated the first invocation
if it did, it should NOT be a single invocation engine (fixes #837, fixes #1337 and fixes #1338)
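
The decision described in the commit message can be sketched as follows. This is a hypothetical illustration of the #1573 logic, not the actual engine code; the names are invented, and the 1 s bound is the "magic number" mentioned in the commit message:

```csharp
using System;

// Hypothetical sketch of the #1573 decision: when the first (jitted)
// invocation is slow but not extremely slow, run one more invocation to
// separate JIT cost from steady-state cost before committing to a
// single-invocation engine.
public static class SingleInvocationDecision
{
    // "Magic number based on observations of the reported bug".
    static readonly TimeSpan MagicUpperBound = TimeSpan.FromSeconds(1);

    public static bool UseSingleInvocationEngine(
        TimeSpan firstInvocation,          // WorkloadJitting time
        TimeSpan iterationTime,            // e.g. the default 500 ms
        Func<TimeSpan> runOneMoreInvocation)
    {
        if (firstInvocation <= iterationTime)
            return false; // fast benchmark: use the normal Pilot-driven engine

        if (firstInvocation >= MagicUpperBound)
            return true;  // genuinely heavy workload: one invocation per iteration

        // Ambiguous zone: the slowness might just be jitting. Measure again.
        TimeSpan second = runOneMoreInvocation();
        return second > iterationTime; // still slow => truly heavy
    }
}
```

This handles the original report (a 2.9 s first invocation that drops to ~10 µs once jitted) only when the jitting cost falls under the bound, which is why the problem later resurfaced for PowerShell's much slower cold start.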
@adamsitnik adamsitnik added this to the v0.13.0 milestone Nov 9, 2020
MarkPflug pushed a commit to MarkPflug/Benchmarks that referenced this issue Jan 12, 2021
* Update Cursively benchmark code.
- VisitPartialFieldContents is required for correctness
- Use the original byte array, since we have one, instead of forcing a stream around it
- Set iteration time to 1 second, to work around dotnet/BenchmarkDotNet#837 while we wait for a version of this package that includes dotnet/BenchmarkDotNet#1573

* Change the order of this check

Whether or not this is the optimal order is almost certain to depend on the data: data sets that would almost never use the pool would likely hurt more by doing it this way, especially if this were a real-world application since this would likely push a line out of the CPU cache for no reason.

Then again, a real-world application probably wouldn't use the pool conditionally like this, so there's not much of a reason NOT to either.
@adamsitnik
Member

I thought I had solved this problem in #1573, but today I ran into it again while benchmarking PowerShell (PowerShell/PowerShell#15242)

We should most probably implement the approach described by @AndreyAkinshin:

I have another idea about how to solve the problem without additional settings. Let's check that we do enough operations during the WorkloadWarmup stage. If it turns out that the number of operations is not enough and our WorkloadJitting heuristics produced a wrong hypothesis about the warmed operation duration, we should restart iterations from the Pilot stage.

Having said that, I am going to reopen the issue.

cc @daxian-dbw @adityapatwardhan

@adamsitnik
Member

It was most probably fixed with #1949

@adamsitnik adamsitnik modified the milestones: v0.13.x, v0.13.2 Oct 5, 2022