
[Discussion] Changes on Engine, diagnosers and result reporting? #297

Closed
ig-sinicyn opened this issue Nov 8, 2016 · 10 comments

Comments
@ig-sinicyn
Contributor

ig-sinicyn commented Nov 8, 2016

DISCLAIMER: this issue is only to start a discussion, I have no good solution for it yet, sorry.

There are two major issues with the current engine implementation.

  1. The engine should report measurements by itself, as here. If this is missed, things go wrong; see [BUG] NRE in OutliersAnalyser #296.

  2. Diagnoser support is implemented on top of result reporting. This hurts: there are additional allocations, and there is no proper synchronization, so the diagnoser can receive a signal after the benchmark has actually started.

To be honest, I have no good idea how to fix it. As an initial proposal:

  • Add diagnoser run as a separate engine stage.

  • During the run-with-diagnoser stage:

    • Notify the diagnoser using a system synchronization primitive (e.g. ManualResetEvent); we do not want to allocate anything during the run.
    • Wait for confirmation from the host; we do not want to run the iteration until the diagnoser has received the notification.
    • Run the iteration.
    • Notify the diagnoser that the run was completed and wait for confirmation.
  • During normal stages:

    • Do not write anything to the console / output.
    • Result reporting should be the responsibility of the caller; the engine should only collect measurements into a preallocated list.

I think there's no need to use any framework for IPC, as it would incur additional allocations and add more latency. Not to mention there's no such lightweight framework out of the box; only sockets, pipes, and memory-mapped files are available for .NET Core.
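
To make the handshake concrete, here is a minimal sketch of the notify/wait protocol described in the list above. The names (DiagnoserHandshake, SignalAndWait, RunIteration) are illustrative and not part of the BenchmarkDotNet API; ManualResetEventSlim covers the in-process case, while a separate-process setup would need named EventWaitHandles or some other cross-process primitive.

    using System;
    using System.Threading;

    // Illustrative sketch only; these names are not part of the BenchmarkDotNet API.
    // Both events are allocated once, before the measured run, so the handshake itself
    // allocates nothing while iterations are being timed.
    public sealed class DiagnoserHandshake
    {
        private readonly ManualResetEventSlim _benchmarkSignal = new ManualResetEventSlim(false);
        private readonly ManualResetEventSlim _hostAck = new ManualResetEventSlim(false);

        // Benchmark side: signal the host/diagnoser and block until it confirms.
        public void SignalAndWait()
        {
            _hostAck.Reset();
            _benchmarkSignal.Set();
            _hostAck.Wait();
        }

        // Host side: block until the benchmark signals, let the diagnoser attach or collect, then confirm.
        public void WaitAndAcknowledge(Action diagnoserCallback)
        {
            _benchmarkSignal.Wait();
            _benchmarkSignal.Reset();
            diagnoserCallback();
            _hostAck.Set();
        }
    }

    // Run-with-diagnoser stage on the benchmark side, matching the steps above:
    //   handshake.SignalAndWait();   // "iteration is about to start", wait for host confirmation
    //   RunIteration();              // the measured work
    //   handshake.SignalAndWait();   // "iteration completed", wait for host confirmation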

@AndreyAkinshin AndreyAkinshin added this to the v0.10.x milestone Nov 8, 2016
@adamsitnik
Member

I am not sure if we have any problem today.

  • Currently, if there is any Diagnoser attached, we perform a separate run for diagnostics only (code). Non-diagnostic output is ignored, so it does not matter whether we print the results. We don't care about them because they might be affected by the diagnosers' side effects.
  • Reading "signals" (code) is done in a synchronous way. It means that the child process is blocked until the parent process finishes the reading, so it's impossible for the diagnoser to attach to the process too late or too early.

@adamsitnik
Member

Proof: with the recent diagnoser changes I was able to get 100% accurate memory diagnostics #284

@ig-sinicyn
Contributor Author

@adamsitnik

Non-diagnostic output is ignored, so it does not matter whether we print the results

Ok, what about the case when there are no diagnosers? Can we delay result reporting until all runs are completed? Or, at least, introduce a more obvious contract for it?

Reading "signals" (code) is done in synchronous way

AFAIR standard input/output is buffered, so there can be multiple race combinations; the simplest one:

Host                                |   Benchmark process
                                    |
                                    |             writes Signals.AfterSetup
receives Signals.AfterSetup         | 
                                    |             runs benchmark
diagnoser.AfterSetup called         |

Races like this are not an issue for common benchmarks, as those can easily take a minute or two, but I'd prefer something more robust for perftests, as they're typically limited to a 1-2 sec timeout.
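
For illustration, here is a minimal sketch of the kind of two-way handshake that would close this race; the signal strings and method names are made up for this sketch and are not the actual BenchmarkDotNet code. The key point is that the child must block on a reply from the host, not merely write a line to stdout, before it starts the iteration.

    using System;

    // Illustrative only. A one-way write is not enough: Console.Out is buffered, so the
    // write returns before the host has processed the signal, which is exactly the race
    // shown in the timeline above. Blocking on a reply from the host closes it.
    static class BenchmarkSideSignals
    {
        private const string AfterSetup = "// AfterSetup";
        private const string Acknowledge = "// Ack";

        // Child (benchmark) side: announce the stage and wait for the host's confirmation.
        public static void SignalAfterSetupAndWait()
        {
            Console.Out.WriteLine(AfterSetup);
            Console.Out.Flush();                  // push the signal through the buffered stream

            var reply = Console.In.ReadLine();    // blocks until the host answers on stdin
            if (reply != Acknowledge)
                throw new InvalidOperationException($"Expected '{Acknowledge}', got '{reply}'.");
        }

        // Host side (not shown): on reading AfterSetup from the child's stdout, call
        // diagnoser.AfterSetup(...) and only then write Acknowledge to the child's stdin.
    }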

@adamsitnik
Member

Ok, what about the case when there are no diagnosers? Can we delay result reporting until all runs are completed? Or, at least, introduce a more obvious contract for it?

We could do that, but so far @AndreyAkinshin has wanted to have continuous updates.

AFAIR standard input/output is buffered, so there can be multiple race combinations; the simplest one

I was not aware of that. Thanks!

@ig-sinicyn I assume that you want to merge your perf test runner changes and the current Engine does not fit in well? Have you thought about creating a new class that would implement IEngine and reuse the code from Engine that you want (running tests, not displaying results to the console, etc.)?
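
A rough sketch of that composition idea, written against a simplified stand-in (the real BenchmarkDotNet.Engines.IEngine has a larger and different surface, and Measurement here is just a placeholder): reuse the stock engine's measurement loop, but hand the results back to the caller instead of printing them.

    using System.Collections.Generic;

    // Simplified stand-in types; not the real BenchmarkDotNet API.
    public struct Measurement { public double Nanoseconds; }

    public interface ISimplifiedEngine
    {
        IReadOnlyList<Measurement> Run();
    }

    public sealed class QuietEngine : ISimplifiedEngine
    {
        private readonly ISimplifiedEngine _inner;     // e.g. a wrapper around the stock Engine
        private readonly List<Measurement> _results;   // preallocated, reused between runs

        public QuietEngine(ISimplifiedEngine inner, int expectedMeasurements)
        {
            _inner = inner;
            _results = new List<Measurement>(expectedMeasurements);
        }

        public IReadOnlyList<Measurement> Run()
        {
            _results.Clear();
            _results.AddRange(_inner.Run());   // reuse the measurement loop
            return _results;                   // no console output; the caller decides how to report
        }
    }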

@ig-sinicyn
Contributor Author

We could do that, but so far @AndreyAkinshin has wanted to have continuous updates.

Ok, great!:)

I assume that you want to merge your perf test runner changes and the current Engine does not fit in well?

Actually no, at least for now. I'm not sure most of our changes are required at all with bench.net v0.10.
Will do some comparison tests later.

Our current design is as follows: each benchmark method performs a few thousand iterations by itself:

            [CompetitionBaseline]
            public bool Test00IsDefined()
            {
                var a = false;
                for (var i = 0; i < Count; i++)
                    a = EnumHelper.IsDefined(F.C | F.D);
                return a;
            }

            [CompetitionBenchmark(28.00, 34.00)]
            public bool Test02EnumIsDefined()
            {
                var a = false;
                for (var i = 0; i < Count; i++)
                    a = Enum.IsDefined(typeof(F), F.C | F.D);
                return a;
            }

(Yep, the EnumHelper.IsDefined() method is ~30x faster than Enum.IsDefined().)

The engine measures time for each call (as if the invokeCount parameter here is 1), and there are at least 300 measured runs.

This produces precise results, and the entire test takes ~2.5 sec (no perf optimizations yet, it will be faster):


BenchmarkDotNet=v0.9.9-develop
OS=Microsoft Windows NT 10.0.14393.0
Processor=Intel(R) Core(TM) i7-3537U CPU 2.00GHz, ProcessorCount=4
Frequency=2435869 Hz, Resolution=410.5311 ns, Timer=TSC
Host Runtime=Clr 4.0.30319.42000, Arch=64-bit  [RyuJIT]
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1586.0
Job Runtime(s):
    Clr 4.0.30319.42000, Arch=64-bit  [RyuJIT]

Job=CompetitionDefaultJobAnyCpu  AnalyzeLaunchVariance=False  EvaluateOverhead=False  
RemoveOutliers=True  Platform=AnyCpu  Force=False  
Toolchain=InProcessToolchain  InvocationCount=1  LaunchCount=1  
RunStrategy=Throughput  TargetCount=300  UnrollFactor=1  
WarmupCount=100  

              Method |        Mean |     StdDev | Lnml(min) | Lnml(max) |      Median |         Min |         Max | Scaled | Scaled-StdDev |
-------------------- |------------ |----------- |---------- |---------- |------------ |------------ |------------ |------- |-------------- |
     Test00IsDefined |  12.1898 us |  0.5738 us |      1.00 |      1.00 |  12.3159 us |  10.6738 us |  13.9581 us |   1.00 |          0.00 |
 Test02EnumIsDefined | 387.1198 us | 19.6775 us |     31.75 |     31.75 | 384.2571 us | 338.2776 us | 442.5525 us |  31.83 |          2.21 |


============= IsDefinedCase ==============
// ? CodeJam.EnumHelperPerfTests+IsDefinedCase, CodeJam-Tests.Performance

---- Run 1, total runs (expected): 1 -----
// ? #1.1  02.551s, Informational@Analyser: CompetitionAnnotateAnalyser: All competition limits are ok.
// !<-- ------ xml_annotation_begin ------
<CompetitionBenchmarks>
    <Competition Target="CodeJam.EnumHelperPerfTests+IsDefinedCase, CodeJam-Tests.Performance">
        <Candidate Target="Test00IsDefined" Baseline="true" />
        <Candidate Target="Test02EnumIsDefined" MinRatio="28.00" MaxRatio="34.00" />
    </Competition>
</CompetitionBenchmarks>
// !--> ------- xml_annotation_end -------

It uses our own in-process toolchain, so we have no issues with replacing the engine so far.

Our code is very close to the current implementation in bench.net, so it looks like we can drop our custom engine with the release of v0.10, but there's still an issue with diagnosers.
Our toolchain captures console output into a preallocated buffer, and I do not want to use it for diagnoser notifications as it would skew the timings.
We could make custom diagnoser notifications using a Mutex or even events (after all, the hacks are needed for in-process benchmarks only), but I'd prefer not to reinvent the wheel and to use code from bench.net while it's possible.
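
For the in-process case specifically, one option would be to skip stdout entirely and call the host back directly. A minimal sketch, assuming a hypothetical IHostSignals callback interface that an in-process toolchain could hand to the engine (none of these names exist in BenchmarkDotNet):

    using System;

    // Hypothetical; not part of BenchmarkDotNet. Since an in-process benchmark runs
    // inside the host process, the "signal" can be a direct, synchronous call into the
    // host instead of a line written to the captured console stream, so the preallocated
    // output buffer (and the timings) are never touched.
    public interface IHostSignals
    {
        void BeforeMainRun();   // host attaches the diagnoser here and returns when it is ready
        void AfterMainRun();    // host detaches the diagnoser / collects its results here
    }

    public sealed class InProcessDiagnoserRun
    {
        private readonly IHostSignals _host;
        private readonly Action _iteration;

        public InProcessDiagnoserRun(IHostSignals host, Action iteration)
        {
            _host = host;
            _iteration = iteration;
        }

        public void Run()
        {
            _host.BeforeMainRun();   // synchronous: the iteration cannot start before the diagnoser is attached
            _iteration();            // the measured work
            _host.AfterMainRun();    // synchronous: the diagnoser sees the completed run immediately
        }
    }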

@ig-sinicyn
Contributor Author

UPD: Yeah, I've checked all our perftests, and the current engine implementation from BenchmarkDotNet works fine for us. I've just switched to it as the default option. Cheers!

The only issue we have for now is diagnoser support for in-process benchmarks. We definitely need something better than reading stdout :)

@adamsitnik
Member

@ig-sinicyn that's great!

I have finished the universal memory diagnoser, but due to a dependency on netcoreapp1.1, which is still in beta, we need to wait until the end of November for it to reach RTM, so we don't need to have two separate NuGet packages again like we did with dnxcore50.

@AndreyAkinshin IMHO we could then merge Igor's changes to a separate branch of BDN, and I could work on making them play together with the recent diagnoser changes without passing the results through standard input/output ;)

So we could have something like:
0.10.1 - built-in, accurate, and cross-platform memory diagnoser
0.10.2 - in-process benchmark runner

What do you guys think?

@AndreyAkinshin
Member

LGTM

@ig-sinicyn
Contributor Author

@adamsitnik ok!:)

@adamsitnik adamsitnik removed this from the v0.10.x milestone Sep 2, 2017
@adamsitnik
Member

InProcessToolchain was merged a long time ago, so I am closing this one.
