
[WIP] Add benchmark scaffolding #29

Status: Closed. Wants to merge 44 commits from the bench branch into master.

Conversation

@GretaCB (Contributor) commented Oct 11, 2016

Per #25

Per chat with @springmeyer:

  • Can we write a test that proves the memory optimization?
  • Node-module-ify @springmeyer's libnew lib for tracking memory allocations (a rough sketch of the idea follows below)
  • Document how to profile (using Activity Monitor). The skel isn't doing heavy operations, but it might be possible to profile during a batch bench run.

cc @mapsam @springmeyer
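For context on the second bullet, here is a minimal, hypothetical sketch of how an allocation-tracking helper could count heap allocations by overriding the global operator new. This only illustrates the idea; it is not @springmeyer's actual libnew code:

// Hypothetical sketch: count heap allocations by overriding global operator new.
// Not the actual libnew implementation, just the general idea.
#include <atomic>
#include <cstdio>
#include <cstdlib>
#include <new>
#include <string>
#include <vector>

static std::atomic<std::size_t> allocation_count{0};

void* operator new(std::size_t size) {
    allocation_count.fetch_add(1, std::memory_order_relaxed);
    if (void* p = std::malloc(size)) return p;
    throw std::bad_alloc();
}

void operator delete(void* p) noexcept { std::free(p); }

int main() {
    std::size_t before = allocation_count.load();
    std::vector<std::string> names;
    for (int i = 0; i < 1000; ++i) {
        // vector growth and heap-backed string buffers both show up in the counter
        names.push_back("benchmark-item-" + std::to_string(i));
    }
    std::printf("heap allocations during loop: %zu\n", allocation_count.load() - before);
    return 0;
}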

@codecov-io commented Oct 11, 2016
Codecov Report

Merging #29 into master will decrease coverage by 0.43%.
The diff coverage is 98.16%.


@@            Coverage Diff             @@
##           master      #29      +/-   ##
==========================================
- Coverage   98.82%   98.38%   -0.44%     
==========================================
  Files           2        2              
  Lines          85      186     +101     
==========================================
+ Hits           84      183      +99     
- Misses          1        3       +2
Impacted Files         Coverage Δ
src/hello_world.hpp    0% <ø> (ø) ⬆️
src/hello_world.cpp    98.91% <98.16%> (-1.09%) ⬇️

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 59d0092...4970302.

var fs = require('fs');
var path = require('path');
var argv = require('minimist')(process.argv.slice(2));
var tape = require('tape');
@springmeyer (Contributor) commented Oct 12, 2016

@GretaCB I would propose not depending on tape for the benchmark, since it is not critical and does not offer anything too useful here. For the assert below you could use var assert = require('assert').

@GretaCB (Contributor, Author) replied

👍 Awesome, thank you! Just committed the change.

@springmeyer (Contributor) commented

@GretaCB cf913b8 adds a very expensive usage of std::map (because it invokes lots of memory allocation internally in the map, searching of the map, and string comparisons). Now with:

~/projects/node-cpp-skel[bench]$ time node test/bench/bench-batch.js  --iterations 10 --concurrency 10

real    0m3.491s
user    0m11.704s
sys 0m0.535s

I get my CPU usage spiking to > 500%. Running node test/bench/bench-batch.js --iterations 100 --concurrency 10 to keep it going long enough to easily attach in Activity Monitor gives a callstack that is 98% idle in the main event loop (as expected if the threads are doing all the work), with 99.9-100% of the threads reporting busy doing work (:tada:):

[Screenshot: Activity Monitor callstack showing an idle main loop and busy worker threads, 2016-10-18]
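For readers without the diff handy, here is an illustrative sketch (not the actual cf913b8 change) of the kind of std::map workload that is expensive for the reasons described: per-node allocation, repeated tree searches, and string key comparisons.

// Illustrative only: a std::map workload that stresses allocation, searching,
// and string comparisons, similar in spirit to the expensive work described above.
#include <cstdio>
#include <map>
#include <string>

int main() {
    std::map<std::string, int> counts;
    for (int round = 0; round < 100; ++round) {
        for (int i = 0; i < 10000; ++i) {
            std::string key = "key-" + std::to_string(i);
            auto it = counts.find(key);            // O(log n) string comparisons per lookup
            if (it == counts.end()) {
                counts.emplace(std::move(key), 1); // allocates a new tree node
            } else {
                it->second += 1;
            }
        }
    }
    std::printf("distinct keys: %zu\n", counts.size());
    return 0;
}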

@GretaCB (Contributor, Author) commented Oct 21, 2016

Currently working on adding a couple more benchmark scenarios before merging.

  • Add a benchmark to demonstrate the cost of interacting with libuv and the threadpool, i.e. when not to use async functions: the case where the function's work is cheaper than the overhead of dispatching it to the threadpool (a rough sketch follows after this list).
  • Add a bench scenario where the code running inside the threadpool locks a mutex. This certainly happens in node-mbgl and node-mapnik. In that situation all threads are full with work, yet the work is not CPU intensive and it is really slow (assuming lock contention is happening). We can create lock contention by having each thread attempt to acquire a global lock; perf will be horrible.
  • Bump up coverage
  • Update API docs and benchmark docs
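As a rough illustration of the first bullet, here is a sketch of how dispatch overhead can dwarf trivial work. It uses std::async rather than libuv's threadpool, so the absolute numbers are only indicative; the point is that a thread round-trip costs far more than work this cheap.

// Sketch only: dispatch overhead vs. trivial work, using std::async instead of libuv.
#include <chrono>
#include <cstdio>
#include <future>

static int trivial_work(int x) { return x + 1; }

int main() {
    using Clock = std::chrono::steady_clock;
    const int iterations = 1000;
    volatile int sink = 0;

    // Run the trivial work inline.
    auto t0 = Clock::now();
    for (int i = 0; i < iterations; ++i) sink = trivial_work(i);
    auto inline_us = std::chrono::duration_cast<std::chrono::microseconds>(Clock::now() - t0).count();

    // Dispatch the same trivial work to another thread, one task at a time.
    auto t1 = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        sink = std::async(std::launch::async, trivial_work, i).get();
    }
    auto async_us = std::chrono::duration_cast<std::chrono::microseconds>(Clock::now() - t1).count();

    std::printf("inline: %lld us, dispatched: %lld us\n",
                static_cast<long long>(inline_us), static_cast<long long>(async_us));
    return 0;
}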

@springmeyer (Contributor) commented

per chat with @GretaCB - next I'm going to take a look at profiling and tuning a few things in the PR. In particular I'll look at the impl of the mutex lock and make sure there is enough work being done in the async function that locks the global such that we are properly demonstrating (e.g. providing an example you can profile) the kind of thread contention programmers should avoid.

@springmeyer (Contributor) commented Nov 4, 2016

> In particular I'll look at the impl of the mutex lock and make sure there is enough work being done in the async function that locks the global such that we are properly demonstrating (e.g. providing an example you can profile) the kind of thread contention programmers should avoid.

Done in fb1fca8. Now the contentiousThreads demo is properly awful. It can be tested like:

node test/bench/bench-batch.js --iterations 50 --concurrency 10 --mode contentiousThreads
Benchmark speed: 15 runs/s (runs:50 ms:3245 )
Benchmark iterations: 50 concurrency: 10 mode: contentiousThreads

If you bump up --iterations to 500 and profile in Activity Monitor.app you'll see the main loop is idle. This is expected because it is only dispatching work to the threads. The threads, however, are all "majority busy" in psynch_mutexwait (waiting for a locked mutex), since more time is spent waiting than doing the expensive work. This is because one thread grabs the lock and does its work while all the others wait, then another thread grabs the released lock and does its work while the rest wait again. This is all too common and is the reason you don't want to use mutex locks. This is the profiling output of this non-ideal situation:

[Screenshot: Activity Monitor callstack showing threads mostly waiting in psynch_mutexwait, 2016-11-03]

When locks are unavoidable in real-world applications, we would hope that the % of time spent in psynch_mutexwait would be very small rather than very big. The real-world optimization would be to either rewrite the code to avoid needing locks or at least to rewrite the code to hold onto a lock for less time (scope the lock more).
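To make the "scope the lock more" advice concrete, here is a hypothetical sketch (not the PR's contentiousThreads code) contrasting a lock held across the expensive work with a lock scoped to just the shared-state update.

// Hypothetical sketch: coarse vs. scoped locking around expensive work.
#include <cstdio>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

std::mutex global_mutex;
std::map<std::string, int> shared_results;

// CPU-heavy work that touches no shared state.
static int expensive_work(int seed) {
    int v = seed;
    for (int i = 0; i < 2000000; ++i) v = (v * 31 + i) % 1000003;
    return v;
}

// Bad: the lock is held for the whole expensive call, so threads run one at a time
// and the profiler shows them waiting in psynch_mutexwait.
void contended(int id) {
    std::lock_guard<std::mutex> lock(global_mutex);
    shared_results[std::to_string(id)] = expensive_work(id);
}

// Better: do the expensive work unlocked, then lock only to publish the result.
void scoped(int id) {
    int result = expensive_work(id);
    std::lock_guard<std::mutex> lock(global_mutex);
    shared_results[std::to_string(id)] = result;
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 10; ++i) threads.emplace_back(scoped, i); // swap in contended to see the stall
    for (auto& t : threads) t.join();
    std::printf("results: %zu\n", shared_results.size());
    return 0;
}

With the scoped variant the threads overlap the expensive work and only serialize the brief map insert; with the contended variant the lock serializes everything, which is the pattern the bench mode above is meant to expose.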

@springmeyer (Contributor) commented Feb 13, 2017

Just looked back at this branch. It's got some great stuff that I think we should merge soon and keep iterating on. My one reservation before merging: I'm slightly uncomfortable with how we are mixing best-practice/simple/hello-world style code with new code demonstrating both advanced and non-ideal scenarios. I think we should split things apart before merging such that:

  • src/hello_world.hpp and src/hello_world.cpp contain only simple hello-world style code: only binding a single standalone function (this is what you would copy if you just want to start writing a module fast and use the skel as a base)
  • src/async_<name>.hpp, src/async_<name>.cpp, and src/async_<name>.readme.md demonstrate common async scenarios in code and have a readme explaining what to care about in each of them (good and bad). (This is what you'd profile, and poke at within the skel, to learn about the nuances of node async performance.)

@GretaCB mentioned this pull request Jun 14, 2017
@springmeyer (Contributor) commented Aug 2, 2017

@GretaCB I just returned here to reflect on next steps. I feel like a good approach would be to split this work into 2 phases:

phase 1

Land a first PR that:

  • Adds a ./bench folder
  • Adds two benchmark scripts to cover the two performance-focused functions currently in master:
    • ./bench/hello_async.bench.js
    • ./bench/hello_object_async.test.js
  • These scripts could be based on a simplified version of the test/bench/bench-batch.js currently in this PR
  • Very simple docs added to the README on how to run the benchmark, plus a comment that it is not real-world yet since the code does not do much, but that the idea is for developers using the skel to adapt it and run it to monitor the performance of the code they add.

^^ this gets the key structure in place to make it easy and fast to start benchmarks for any module that uses skel. This is a great first step.

phase 2

We could revisit adding examples of performance scenarios. However, I feel like this is a really advanced topic best suited outside of node-cpp-skel. The skel is complex enough currently without diving deep on performance. But given performance is critical, it would be great to cover it and benefit from the skel structure. So, here is an idea. Instead of building into the skel directly, we could:

  • Fork skel, and remove lots of unneeded code and docs (like the sync examples)
  • Add back key async performance scenarios
  • Add docs for this
  • Then link to this fork as an example of a skel implementation
  • The fork would be a separate "performance deep dive example using node-cpp-skel"

@springmeyer (Contributor) commented

@GretaCB now that Phase 1 is done in #61, how about closing this ticket? Phase 2 remains, but I feel like my idea is not concrete enough to warrant a ticket. I'm feeling really good with what we have and don't see a major need to ticket more work here. Rather, we'll apply node-cpp-skel, learn about perf issues, and then, at that time, have ideas of things to build back or add to the docs.

@springmeyer deleted the bench branch on July 14, 2018.