
Add profiling/performance docs #30

Closed · GretaCB opened this issue Oct 11, 2016 · 12 comments
Comments

GretaCB commented Oct 11, 2016

After some basic benchmarks are set up per #25, add documentation on how to profile. The batch bench test can be useful for this.

@mikemorris

There's now a built-in profiling tool in Node.js (v4.4.0 or later) with support for Node/C++ addons that may be useful to document here in addition to tips on using the profiling tools in Instruments.app.

GretaCB commented Oct 12, 2016

Awesome suggestion @mikemorris, thank you!

I took a look at Node's built-in profiler and have a couple of general thoughts/observations to consider for the docs:

  • Does it only profile the main thread? Or does it aggregate work across all threads?
  • Node's built-in profiler didn't seem to provide any more granular, thread-specific info, whereas Activity Monitor does.
  • Node's built-in profiler seems usable across operating systems, whereas Activity Monitor is OS X only.

While running through some profiling in node-cpp-skel, it also became clear that the skel would benefit from more benchmark variety in order to demonstrate some of the potential profiling scenarios. Currently, the skel has a single-iteration bench and a batch bench that lets you set concurrency (to mock threads) and the number of iterations. But the logic in shout() is pretty light; it doesn't do any heavy work that could provide valuable profiling info.

So! Here are some ideas for beefing up the skel a bit to allow for more heavy-duty and informative profiling examples:

  • Add an async function in the skel that sleeps in the threadpool. This could mock the scenario where threads are busy but aren't doing much work. For example, include a new argument in shout() that allows you to specify how long the shout function should run (how long to shout for).
  • Add an async function in the skel that is super CPU intensive and takes a while, for example Fibonacci or some other heavy-duty algorithm. This could mock the scenario where threads are busy and doing a lot of work. Threads would be working like crazy, and the main loop would be relatively idle.
  • Use the wave() sync function to demonstrate idle threads.
  • Add a benchmark to demonstrate the cost of interacting with libuv and the threadpool. Demonstrate when not to use async functions, in the case where the function's work is faster than libuv's ability to interact with the threadpool.
  • Add a bench scenario where the code running inside the threadpool locks a mutex. This certainly happens in node-mbgl and node-mapnik. In this situation all threads are full of work, the work is not CPU intensive, and everything is really slow (assuming lock contention is happening). We can create lock contention by having each thread attempt to acquire a global lock. Perf will be horrible.

The idea is to have profiling examples documented for each of these scenarios. These docs would give a clear picture of what performance looks like when using a profiler.
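
To make that last scenario concrete, here is a rough sketch of what a lock-contention worker could look like. It assumes a Nan-style addon like node-cpp-skel; the ContentiousWorker/contend names are made up for illustration:

```cpp
#include <nan.h>
#include <chrono>
#include <mutex>
#include <thread>

// A single global mutex that every worker fights over.
static std::mutex global_lock;

class ContentiousWorker : public Nan::AsyncWorker {
 public:
  explicit ContentiousWorker(Nan::Callback* callback)
      : Nan::AsyncWorker(callback) {}

  // Runs in the libuv threadpool. All workers serialize on global_lock,
  // so the threads are "busy" but spend most of their time waiting
  // rather than burning CPU.
  void Execute() override {
    std::lock_guard<std::mutex> guard(global_lock);
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
  }
};

NAN_METHOD(contend) {
  auto* callback = new Nan::Callback(info[0].As<v8::Function>());
  Nan::AsyncQueueWorker(new ContentiousWorker(callback));
}
```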

GretaCB commented Oct 12, 2016

Capturing some knowledge from chat with @springmeyer...

To properly bench multithreading:

  • Can utilize Node's UV_THREADPOOL_SIZE env var to actually enable threads to do work in a bench script. You can see this number of threads reflected when profiling with Activity Monitor.
  • Activity Monitor will display a few different kinds of threads:
    • main thread (event loop)
    • worker threads (libuv) will include worker (in node) in the call stack. These are usually unnamed, e.g. Thread_2206161 (some of these might not actually be running your code)
    • V8 WorkerThread: we don't really need to care about these right now. They don't actually run your code.

What is libuv?

  • A library that provides a threadpool and event loop, using the threading implementation native to the given operating system (for example, Unix uses pthreads)
  • Standalone, outside of Node, but used within Node
  • Portable interface to multithreading
  • Created separately from Node so that it's portable and others can use it outside of Node for its event loop and async file I/O
  • Written in pure C, not C++
  • It provides everything cross-platform, so it contains a lot of #ifdefs to handle different operating systems
  • Before libuv was available, developers had to write threads based on what was provided by the operating system (e.g. pthreads). libuv came along as a cross-platform, portable interface to multithreading.
  • The C++11 standard introduced a thread library similar in spirit to libuv (std::thread), but there are some differences. See below.
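
As a tiny illustration of the threadpool plus event loop combination described above, this is roughly what queueing work onto libuv's threadpool looks like when using libuv standalone (a sketch, not node-cpp-skel code; inside a Node addon you would normally go through Nan's async helpers instead):

```cpp
#include <uv.h>
#include <cstdio>

// Runs on a libuv worker thread.
static void do_work(uv_work_t* req) {
  std::printf("working on a threadpool thread\n");
}

// Runs back on the event loop thread once do_work has finished.
static void after_work(uv_work_t* req, int status) {
  std::printf("done (status %d)\n", status);
}

int main() {
  uv_work_t req;
  uv_queue_work(uv_default_loop(), &req, do_work, after_work);
  return uv_run(uv_default_loop(), UV_RUN_DEFAULT);
}
```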

What is std::thread?

  • As of C++11, part of the C++ standard library
  • An interface to the operating system's built-in threads (it doesn't include an event loop or a threadpool)
  • Allows you to create threads, interact with them, and stop them
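
A minimal std::thread sketch for comparison (note there is no event loop or threadpool here; that is the part libuv adds on top):

```cpp
#include <iostream>
#include <thread>

int main() {
  // Spawn a thread running a small lambda, then wait for it to finish.
  std::thread worker([] {
    std::cout << "hello from a worker thread\n";
  });
  worker.join();
  return 0;
}
```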

@mikemorris

Does it only profile the main thread? Or does it aggregate work across all threads?

I believe it aggregates time spent across all threads in a single process. Multiple processes started by a single script each end up in separate log files named with the process ID.

Node's built-in profiler didn't seem to provide any more granular, thread-specific info, whereas Activity Monitor does.

I'm not sure if there's a way to get more fine-grained detail here, not seeing any mention of threads in the V8 Profiler docs.

Node's built-in profiler seems usable across operating systems, whereas Activity Monitor is OS X only.

Yep, I was specifically using this earlier this week to profile a Node.js/C++ app running on a Linux EC2.

Your ideas for example functions to demonstrate different profiling scenarios sound 💯

GretaCB commented Oct 15, 2016

Per chat with @mapsam and @mikemorris in the #node-cpp channel today, here's a quick recap of us attempting to fully understand UV_THREADPOOL_SIZE and concurrency, and to articulate their concrete effects on threads using benchmarks.

Two scenarios where threads are "waiting" for work:

  1. There’s less than maximum load on the process (Figure 1)
  2. A bottleneck in the main thread that’s preventing work from being sent to the workers (Figure 3)

Utilize UV_THREADPOOL_SIZE to help avoid these two scenarios (Figure 2)

[screenshot: diagram of the scenarios above, by @mapsam]

@mapsam noticed that even though his machine has 8 possible threads (relative to its number of cores), performance flatlined at 4 threads when benchmarking vt-shaver:
[screenshot: vt-shaver benchmark results flatlining at 4 threads]

Confirmed that the worker (in node) sits in uv_cond_wait when the concurrency is lower (like --concurrency 3), whereas with --concurrency 6 the same worker is digging into AsyncShave:

[screenshot: call stacks at --concurrency 3 vs --concurrency 6]

This launched us into a discussion about why that is, and we discovered that the threads were waiting for work 50% of the time (in uv_cond_wait, which uses pthreads from the operating system).

Based on the two scenarios listed above, it seems like the findings apply to case #1. But I'm still curious about the flatlined graph and why performance maxed out at 4 threads.

@springmeyer @mikemorris any thoughts?


Awesome post on threads in Node/C++ that is essentially what would be great to have for node-cpp-skel performance docs.

@mikemorris

Confirmed that the worker (in node) sits in uv_cond_wait when the concurrency is lower (like --concurrency 3)

Okay, this makes more sense. I think uv_cond_wait is an async task that's waiting for an available libuv worker thread, so you would see this when the libuv threadpool is the bottleneck (figure 1).

If performance flatlines at 4 threads and the CPU is maxed out at 400% (on a dual-core, hyperthreading machine where 4 virtual cores would be available), then that sounds like figure 2.

If the CPU is under-utilized still when more threads are available, then that could be figure 3, indicating a bottleneck in the main loop somewhere, preventing work from being passed on to the threadpool.

GretaCB commented Oct 18, 2016

Capturing chat with @springmeyer:

  • Add a benchmark to demonstrate the cost of interacting with libuv and the threadpool. Demonstrate when not to use async functions, in the case where the function's work is faster than libuv's ability to interact with the threadpool (adding this to the checklist in the comment above).

Possible benchmark idea in node-cpp-skel

  1. Expose a sync C++ function that allocates a really large chunk of memory (or a few) by calling new
  2. Expose an async function that does the same thing
  3. Benchmark which is faster (likely the sync function)
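
A rough sketch of what those two exposed functions might look like in a Nan-based addon (the allocateSync/allocateAsync names are hypothetical and the allocation size is arbitrary; the point is that the async version pays for a round trip through the libuv threadpool even though the work itself is cheap):

```cpp
#include <nan.h>
#include <memory>

static void allocate_something() {
  // Allocate a large chunk of memory and immediately release it.
  std::unique_ptr<char[]> block(new char[50 * 1024 * 1024]);
  block[0] = 1;  // touch it so the allocation isn't optimized away
}

// Sync version: runs on the main thread, no libuv involved.
NAN_METHOD(allocateSync) {
  allocate_something();
  info.GetReturnValue().Set(Nan::True());
}

// Async version: the same work, dispatched to the libuv threadpool.
class AllocateWorker : public Nan::AsyncWorker {
 public:
  explicit AllocateWorker(Nan::Callback* callback) : Nan::AsyncWorker(callback) {}
  void Execute() override { allocate_something(); }
};

NAN_METHOD(allocateAsync) {
  Nan::AsyncQueueWorker(
      new AllocateWorker(new Nan::Callback(info[0].As<v8::Function>())));
}
```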

Real-world situation in api-gl

We want to:

  1. pass a gl buffer representing pixels to node-mapnik
  2. and then use node-mapnik to encode those as PNG

Step 1 is so fast that we do it sync rather than async (no libuv needed) to avoid the potential slowdown of interacting with the threadpool.

GretaCB commented Oct 20, 2016

Per chat with @springmeyer:

The sleep scenario might not actually mimic anything you'd see in the real world (since few apps actually sleep). So here is another possible bench scenario:

  • Add a bench scenario where the code running inside the threadpool locks a mutex. This certainly happens in node-mbgl and node-mapnik. In this situation all threads are full of work, the work is not CPU intensive, and everything is really slow (assuming lock contention is happening). We can create lock contention by having each thread attempt to acquire a global lock. Perf will be horrible.

mapsam commented Oct 24, 2016

Thanks for capturing all of the conversation @GretaCB - great to follow along!

@springmeyer

@GretaCB okay if we close this? My recap would be:

In retrospect (and per #29 (comment)), I think the "scenarios" above are a bit too advanced to put in node-cpp-skel directly. They are rather things we could learn about and document as they come up in real-world scenarios as we use node-cpp-skel.

@springmeyer

@GretaCB Seeing anything more needed before closing? Looks like mapbox/cpp#43 finished moving some of the great graphics from here into formal docs 💯

GretaCB commented Nov 27, 2017

Thanks for the follow-up @springmeyer. Yup, good to close. This ticket will be here for future reference, if needed.

GretaCB closed this as completed Nov 27, 2017