
Add profiling/performance docs #30

Closed · GretaCB opened this issue Oct 11, 2016 · 12 comments
Comments

GretaCB commented Oct 11, 2016

After some basic benchmarks are set up per #25, add documentation on how to profile. The batch bench test can be useful for this.

@mikemorris

There's now a built-in profiling tool in Node.js (v4.4.0 or later) with support for Node/C++ addons that may be useful to document here in addition to tips on using the profiling tools in Instruments.app.

GretaCB commented Oct 12, 2016

Awesome suggestion @mikemorris, thank you!

I took a look at Node's built-in profiler and have a couple of general thoughts/observations to consider for the docs:

  • Does it only profile the main thread? Or does it aggregate work across all threads?
  • Node's built-in profiler didn't seem to provide any more granular, thread-specific info, whereas Activity Monitor does.
  • Node's built-in profiler seems usable across operating systems, whereas Activity Monitor is OS X only.

While running through some profiling in node-cpp-skel, it also became clear that the skel would benefit from more benchmark variety in order to demonstrate some of the potential profiling scenarios. Currently, the skel has a single-iteration bench and a batch bench that lets you set concurrency (to mock threads) and the number of iterations. But the logic in shout() is pretty light; it doesn't do any heavy work that could provide valuable profiling info.

So! Here are some ideas for beefing up the skel a bit to allow for more heavy-duty and informative profiling examples:

  • Add an async function in the skel that sleeps in the threadpool. This could mock the scenario where threads are busy but aren't doing much work. For example, include a new argument in shout() that allows you to specify how long the shout function should run (how long to shout for).
  • Add an async function in the skel that is super CPU intensive and takes a while, for example Fibonacci or some other heavy-duty algorithm. This could mock the scenario where threads are busy and doing a lot of work. Threads would be working like crazy, and the main loop would be relatively idle.
  • Use the wave() sync function to demonstrate idle threads.
  • Add a benchmark to demonstrate the cost of interacting with libuv and the threadpool. Demonstrate when not to use async functions, in the case where the function's work is faster than libuv's ability to interact with the threadpool.
  • Add a bench scenario where the code running inside the threadpool locks a mutex. This certainly happens in node-mbgl and node-mapnik. In this situation all threads are full of work, the work is not CPU intensive, and everything is really slow (assuming lock contention is happening). We can create lock contention by having each thread attempt to acquire a global lock. Perf will be horrible.

The idea is to have profiling examples documented for each of these scenarios. These docs would give a clear picture of what performance looks like when using a profiler.
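
To make that last scenario concrete, here is a rough sketch of what a lock-contention worker could look like. It assumes a Nan-style addon like node-cpp-skel; the ContentiousWorker/contend names are made up for illustration:

```cpp
#include <nan.h>
#include <chrono>
#include <mutex>
#include <thread>

// A single global mutex that every worker fights over.
static std::mutex global_lock;

class ContentiousWorker : public Nan::AsyncWorker {
 public:
  explicit ContentiousWorker(Nan::Callback* callback)
      : Nan::AsyncWorker(callback) {}

  // Runs in the libuv threadpool. All workers serialize on global_lock,
  // so the threads are "busy" but spend most of their time waiting
  // rather than burning CPU.
  void Execute() override {
    std::lock_guard<std::mutex> guard(global_lock);
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
  }
};

NAN_METHOD(contend) {
  auto* callback = new Nan::Callback(info[0].As<v8::Function>());
  Nan::AsyncQueueWorker(new ContentiousWorker(callback));
}
```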

GretaCB commented Oct 12, 2016

Capturing some knowledge from chat with @springmeyer...

To properly bench multithreading:

  • Can utilize Node's UV_THREADPOOL_SIZE env var to actually enable threads to do work in a bench script. You can see this number of threads reflected when profiling with Activity Monitor.
  • Activity Monitor will display a few different kinds of threads:
    • main thread (event loop)
    • worker threads (libuv) will include worker (in node) in the call stack. These are usually unnamed, e.g. Thread_2206161 (some of these might not actually be running your code)
    • V8 WorkerThread: we don't really need to care about these right now. They don't actually run your code.

What is libuv?

  • A library that provides a threadpool and event loop, using the threading implementation native to the given operating system (for example, Unix uses pthreads)
  • Standalone, outside of Node, but used within Node
  • Portable interface to multithreading
  • Created separately from Node so that it's portable and others can use it outside of Node for its event loop and async file I/O
  • Written in pure C, not C++
  • It provides everything cross-platform, so it contains a lot of #ifdefs to handle different operating systems
  • Before libuv was available, developers had to write threads based on what was provided by the operating system (e.g. pthreads). libuv came along as a cross-platform, portable interface to multithreading.
  • The C++11 standard introduced a thread library similar in spirit to libuv (std::thread), but there are some differences. See below.
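
As a tiny illustration of the threadpool plus event loop combination described above, this is roughly what queueing work onto libuv's threadpool looks like when using libuv standalone (a sketch, not node-cpp-skel code; inside a Node addon you would normally go through Nan's async helpers instead):

```cpp
#include <uv.h>
#include <cstdio>

// Runs on a libuv worker thread.
static void do_work(uv_work_t* req) {
  std::printf("working on a threadpool thread\n");
}

// Runs back on the event loop thread once do_work has finished.
static void after_work(uv_work_t* req, int status) {
  std::printf("done (status %d)\n", status);
}

int main() {
  uv_work_t req;
  uv_queue_work(uv_default_loop(), &req, do_work, after_work);
  return uv_run(uv_default_loop(), UV_RUN_DEFAULT);
}
```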

What is std::thread?

  • As of C++11, part of the C++ standard library
  • An interface to the operating system's built-in threads (it doesn't include an event loop or a threadpool)
  • Allows you to create threads, interact with them, and stop them
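
A minimal std::thread sketch for comparison (note there is no event loop or threadpool here; that is the part libuv adds on top):

```cpp
#include <iostream>
#include <thread>

int main() {
  // Spawn a thread running a small lambda, then wait for it to finish.
  std::thread worker([] {
    std::cout << "hello from a worker thread\n";
  });
  worker.join();
  return 0;
}
```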

@mikemorris

Does it only profile the main thread? Or does it aggregate work across all threads?

I believe it aggregates time spent across all threads in a single process. Multiple processes started by a single script each end up in separate log files named with the process ID.

Node's built-in profiler didn't seem to provide any more granular, thread-specific info, whereas Activity Monitor does.

I'm not sure if there's a way to get more fine-grained detail here, not seeing any mention of threads in the V8 Profiler docs.

Node's built-in profiler seems usable across operating systems, whereas Activity Monitor is OS X only.

Yep, I was specifically using this earlier this week to profile a Node.js/C++ app running on a Linux EC2.

Your ideas for example functions to demonstrate different profiling scenarios sound 💯

GretaCB commented Oct 15, 2016

Per chat with @mapsam and @mikemorris in the #node-cpp channel today, here's a quick recap of us attempting to fully understand UV_THREADPOOL_SIZE and concurrency, and to articulate their concrete effects on threads using benchmarks.

Two scenarios where threads are "waiting" for work:

  1. There’s less than maximum load on the process (Figure 1)
  2. A bottleneck in the main thread that’s preventing work from being sent to the workers (Figure 3)

Utilize UV_THREADPOOL_SIZE to help avoid these two scenarios (Figure 2)

[screenshot: diagram of the scenarios above, by @mapsam]

@mapsam noticed that even though his machine has 8 possible threads (relative to its number of cores), performance flatlined at 4 threads when benchmarking vt-shaver:
[screenshot: vt-shaver benchmark results flatlining at 4 threads]

Confirmed that the worker (in node) sits in uv_cond_wait when the concurrency is lower (like --concurrency 3), whereas with --concurrency 6 the same worker is digging into AsyncShave:

[screenshot: call stacks at --concurrency 3 vs --concurrency 6]

This launched us into a discussion about why that is, and we discovered that the threads were waiting for work 50% of the time (in uv_cond_wait, which uses pthreads from the operating system).

Based on the two scenarios listed above, it seems like the findings apply to case #1. But I'm still curious about the flatlined graph and why performance maxed out at 4 threads.

@springmeyer @mikemorris any thoughts?


Awesome post on threads in Node/C++ that is essentially what would be great to have for node-cpp-skel performance docs.

@mikemorris

Confirmed that the worker (in node) sits in uv_cond_wait when the concurrency is lower (like --concurrency 3)

Okay, this makes more sense. I think uv_cond_wait is an async task that's waiting for an available libuv worker thread, so you would see this when the libuv threadpool is the bottleneck (figure 1).

If performance flatlines at 4 threads and the CPU is maxed out at 400% (on a dual-core, hyperthreading machine where 4 virtual cores would be available), then that sounds like figure 2.

If the CPU is under-utilized still when more threads are available, then that could be figure 3, indicating a bottleneck in the main loop somewhere, preventing work from being passed on to the threadpool.

GretaCB commented Oct 18, 2016

Capturing chat with @springmeyer:

  • Add a benchmark to demonstrate the cost of interacting with libuv and the threadpool. Demonstrate when not to use async functions, in the case where the function's work is faster than libuv's ability to interact with the threadpool (adding this to the checklist in the comment above).

Possible benchmark idea in node-cpp-skel

  1. Expose a sync C++ function that allocates a really large chunk of memory (or a few) by calling new
  2. Expose an async function that does the same thing
  3. Benchmark which is faster (likely the sync function)
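
A rough sketch of what those two exposed functions might look like in a Nan-based addon (the allocateSync/allocateAsync names are hypothetical and the allocation size is arbitrary; the point is that the async version pays for a round trip through the libuv threadpool even though the work itself is cheap):

```cpp
#include <nan.h>
#include <memory>

static void allocate_something() {
  // Allocate a large chunk of memory and immediately release it.
  std::unique_ptr<char[]> block(new char[50 * 1024 * 1024]);
  block[0] = 1;  // touch it so the allocation isn't optimized away
}

// Sync version: runs on the main thread, no libuv involved.
NAN_METHOD(allocateSync) {
  allocate_something();
  info.GetReturnValue().Set(Nan::True());
}

// Async version: the same work, dispatched to the libuv threadpool.
class AllocateWorker : public Nan::AsyncWorker {
 public:
  explicit AllocateWorker(Nan::Callback* callback) : Nan::AsyncWorker(callback) {}
  void Execute() override { allocate_something(); }
};

NAN_METHOD(allocateAsync) {
  Nan::AsyncQueueWorker(
      new AllocateWorker(new Nan::Callback(info[0].As<v8::Function>())));
}
```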

Real-world situation in api-gl

We want to:

  1. pass a gl buffer representing pixels to node-mapnik
  2. and then use node-mapnik to encode those as PNG

Step 1 is so fast that we do it sync rather than async (no libuv needed) to avoid the potential slowdown of interacting with the threadpool.

GretaCB commented Oct 20, 2016

Per chat with @springmeyer:

The sleep scenario might not actually mimic anything you'd see in the real world (since few apps actually sleep). So here is another possible bench scenario:

  • Add a bench scenario where the code running inside the threadpool locks a mutex. This certainly happens in node-mbgl and node-mapnik. In this situation all threads are full of work, the work is not CPU intensive, and everything is really slow (assuming lock contention is happening). We can create lock contention by having each thread attempt to acquire a global lock. Perf will be horrible.

mapsam commented Oct 24, 2016

Thanks for capturing all of the conversation @GretaCB - great to follow along!

@springmeyer

@GretaCB okay if we close this? My recap would be:

In retrospect (and per #29 (comment)), I think the "scenarios" above are a bit too advanced to put in node-cpp-skel directly. They are rather things we could learn about and document as they come up in real-world scenarios as we use node-cpp-skel.

@springmeyer

@GretaCB Seeing anything more needed before closing? Looks like mapbox/cpp#43 finished moving some of the great graphics from here into formal docs 💯

GretaCB commented Nov 27, 2017

Thanks for the follow-up @springmeyer. Yup, good to close. This ticket will be here for future reference, if needed.

GretaCB closed this as completed Nov 27, 2017