[discuss] event loop idle metrics #33026

jasnell · 2020-04-23T17:31:43Z

@trevnorris is working on landing a change to libuv that will track and report the amount of time the event loop spends in idle time. It's an extremely useful metric that can provide for us a measurement of "event loop utilization". In a world of worker threads, monitoring CPU no longer becomes an effective way of monitoring performance and event loop delay is not enough on it's own, so having a built in mechanism for measuring event loop utilization would be fantastic. While there is some work still to be done to get the PR landed in libuv and get that new libuv version landed in core, I did want to briefly discuss how the new metric should be exposed in core.

In this comment @trevnorris suggests a simple performance.idleTime() that returns the direct value of this metric, which records the cumulative time spent in idle since the loop was configured to track. To calculate event loop utilization, however, we also need to know how long the event loop has been running (well, to be specific, how long it's been since the loop was configured to collect the data, which can be turned on but not turned off). Assuming we started the loop and started collecting the metric from the start, we do already record the start time of the event loop (using the performance milestones) so someone could calculate the utilization on their own by accessing those values. However, I think it might make more sense for us to just do the calculation for users and provide an API like performance.idleTime() that returns an object with two values { idle: n, utilization: y } where idle is the raw idle time and utilization is the calculated utilization value. The API should be very low cost to sample using AliasedArray or AliasedStruct as a backing.

/cc @nodejs/diagnostics @addaleax @mcollina

The text was updated successfully, but these errors were encountered:

sam-github · 2020-04-23T17:57:42Z

I think its important that what the Node.js API returns is mappable to Prometheus in as high-value a way as possible, I'll take a look at this in the next day or so.

jasnell · 2020-04-23T20:55:26Z

@sam-github >>> nodejs/TSC#853

Flarna · 2020-04-23T21:14:01Z

Is utilization here the wall clock time since measurement starts or the actual CPU time consumed by the thread executing the loop tasks?

trevnorris · 2020-04-23T21:21:32Z

@Flarna It's a bit in depth, but you should read my post on libuv/libuv#2725 (comment) about the differences.

TL;DR "idle time" is measured using uv_hrtime(), which I believe defaults to monotonic, and represents the amount of time the event loop was idle, not the amount of time the event loop was spent in the event provider (e.g. poll). "event loop utilization" is "idle time" divided by the amount of time the event loop has been running (also measured using uv_hrtime()).

Flarna · 2020-04-23T22:21:17Z

@trevnorris Good post!

I think a common use case would be to calculate utilization for regular intervals not always from loop startup. Therefore not sure if adding this overall utilization from loop start is that useful.
But as long as a call to performance.idleTime() isn't stateful this is possible for everyone to do.

Looks very useful to me.

jasnell · 2020-04-23T22:32:14Z

Currently this mechanism uses uv_configure_loop() to enable the metrics tracking which enables configuration options but currently does not implement any semantics for disabling those options. Once enabled, the idle stats in libuv are calculated continuously over the lifetime of the event loop and are not reset when it is read. The idle time would accumulate indefinitely until the uint64_t overflows.

In theory, we could record the moment performance.idleTime() is called internally in Node.js and track that in the response and as part of the calculation of the utilization but that starts to make the api more complicated... for instance...

// Get the idle time since the start of the event loop
let idle = performance.idleTime()
// idle = { idle: n, mark: x, utilization: y }
// where mark == the uv_hrtime() when performance.idleTime() was processed.

// then later...
idle = performance.idleTime(idle)
// idle = { idle: n, mark: x, utilitization: y }
// where idle is now the difference between the current idle time and the provided idle.idle.
// mark is the new uv_hrtime()
// and utilization is the new n / the time since the provided idle.mark

Definitely possible but obviously starts to increase the complexity of the API so we need to be careful.

trevnorris · 2020-04-23T22:33:20Z

@Flarna Instead of reporting "utilization", if idleTime() returned { idle: n, active: m }, where active is the cumulative time spent in uv_run(), then it would be easy enough for users to implement their own.

A simple example:

function loopUtilization(last, current = performance.idleTime()) {
  return (current.idle - last.idle) / (current.active - last.active);
}

const now = performance.idleTime();
const t = Date.now();
while (Date.now() - t < 3000);
console.log(loopUtilization(now));

Though if active is included then maybe it shouldn't be called idleTime() anymore. :P

jasnell · 2020-04-23T22:36:00Z

Yep, so the key question here is: do we provide the calculation of utilization for users or do we just provide the idle time and leave it to them to calculate.

trevnorris · 2020-04-23T22:39:34Z

If it's similar to my example above, to prevent yet another one-liner module we might just as well add it. So maybe it would operate something like so:

// This simply logs out the loop utilization for a three second duration
setTimeout((last) => {
  console.log(performance.loopUtilization(last));
}, 3000, performance.idleTime());

Internally we just store active_time = uv_hrtime(); just before calling uv_run(). So, some pseudo code to show the calls between JS and C++, the return value of idleTime() would be something like:

{
  idle: uv_metrics_idle_time(event_loop()),
  active: active_time,
}

Flarna · 2020-04-23T22:42:36Z

I'm in favor of an easy, stateless API. If we add state we have to think about multi-user support.
As shown above utilization is easy calculated if idle time is available. I assume active above is actually the wall clock time in the same unit as idle time.

trevnorris · 2020-04-23T22:43:48Z

I assume active above is actually the wall clock time in the same unit as idle time.

Yes. It is the same unit as idle time.

jasnell · 2020-04-23T23:04:22Z

Given the likely complexity of any other option, I'm +1 on the really simple Just Give Me The Idle Time API. We can calculate active from the existing perf milestones, we don't actually need the new API to give us that.

addaleax · 2020-04-24T14:18:44Z

@jasnell Just as a note here, we should probably make idle time include Atomics.wait() once we implement an API for this in Node.js. That’s not hard to do, just something that seems like it’s easy to overlook :)

jasnell · 2020-04-24T14:56:01Z

I'm not super familiar with how we'd be able to observe waits but +1 to including those. @trevnorris , have you looked into that at all?

addaleax · 2020-04-24T15:01:20Z

@jasnell Luckily, I’ve done quite a bit of work around that in the past: https://github.com/v8/v8/blob/1d00b7856f9071c2c62ad66dbdbfdebff6dc370c/include/v8.h#L8867-L8944

Adding support for measuring that timing should be fairly straightforward, we basically register a callback that’s called once when Atomics.wait() starts and once when it ends.

(I’ve also been meaning to add a logging API for those events to Node.js, for better debugging Atomics.wait() deadlocks.)

Flarna · 2020-04-24T15:03:47Z

To my understanding the time spent in Atomic.wait is no idle time in the sense that this thread can process something else. Therefore I don't think we should add the time here.
But monitoring the time spent in Atomic.wait is for sure a good enhancement.

addaleax · 2020-04-24T15:15:31Z

@Flarna Well … that’s a good point, but I don’t think it’s a bit more complex than that. Maybe we need to make this customizable.

I’m currently working on code that uses Atomics.wait() when nothing else is happening on the event loop in a Worker, as a way of reducing latency and overall overhead. That means that the Atomics.wait() call effectively fulfills the function of the event loop wait, and should count as idle time.

trevnorris · 2020-04-24T18:51:03Z

@addaleax

That means that the Atomics.wait() call effectively fulfills the function of the event loop wait, and should count as idle time.

The metric should only measure idle time in the event provider. Specifically the amount of time idle while it waits for an event to be received (reason for this is explained in libuv/libuv#2725 (comment)). Looking at AtomicsWaitEvent, it seems like Atomic.wait() could be considered a type of event provider. Though if it's working in conjunction with uv_run() it'd probably be better to not include it.

Well, I guess that actually depends on whether the Worker's event loop has any handles from which it expects to receive an event. If not then Atomic.wait() would be considered the event provider. And since signal handlers can't be added from a Worker thread, there wouldn't be another way to interrupt the event provider. And if you're running with UV_RUN_NOWAIT then there technically wouldn't ever be idle time in uv_run()'s event provider. So I can see what you're getting at.

That being said, measuring the amount of time spent in Atomic.wait() would be a great step towards calculating total CPU idle time per thread. Since there isn't a platform agnostic way to do getrusage(RUSAGE_THREAD, ..., being able to include all other types of idle time is immensely helpful for analysis.

So for usability maybe just provide both times as separate values? IMO that would be the most helpful.

TL;DR idleTime() should only return the amount of idle time within the event provider. Not just idle time in uv_run(). So if Atomic.wait() is the acting event provider, and uv_run() is running with UV_RUN_NOWAIT, then idleTime() should return the amount of time in wait().

mhdawson · 2020-04-24T20:13:05Z

In respect to the question about whether to just return the raw number since idle time metric was enabled or trying to return a calculated value over an interval, my preference is to leave it as the raw number.

Otherwise it's a slippery slope of people wanting to be able to specify the interval etc. so probably better to just let them calculate from the raw number in the first place and keep it as simple as possible.

jasnell · 2020-04-24T21:32:54Z

Just had a great catch up with @trevnorris... it's been too long since we've chatted! On this specific API, I'm going to move forward with a draft PR that pulls in his libuv changes and adds a field to the existing Performance API nodeTiming object to get the current idle time. Using that, and the already existing loopStart timestamp, it ought to be trivial to calculate the utilization value.

nodeTiming.idleTime / (process.hrtime() - nodeTiming.loopStart)

We can provide the cumulative Atomics.wait time as a separate property off nodeTiming.

trevnorris · 2020-08-25T18:35:16Z

Now that the patch has landed in libuv v1.39.0 I'll create a PR with this feature ~~once it has~~ now since it has landed in node.

cjihrig · 2020-08-25T18:48:29Z

It has already landed in Node in #34915.

EDIT: Ah, I see the original comment was already updated. Ignore this comment 😄

trevnorris · 2020-08-26T19:03:08Z

PR created: #34938

mhdawson · 2020-09-02T22:04:16Z

PR #34938 landed @trevnorris I assume this can be closed now?

trevnorris · 2020-09-17T20:31:26Z

Probably. :)

devSahinur · 2024-06-10T04:12:23Z

why show this ?

jasnell added perf_hooks Issues and PRs related to the implementation of the Performance Timing API. performance Issues and PRs related to the performance of Node.js. labels Apr 23, 2020

jasnell added the feature request Issues that request new features to be added to Node.js. label Jul 6, 2020

jasnell closed this as completed Sep 18, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[discuss] event loop idle metrics #33026

[discuss] event loop idle metrics #33026

jasnell commented Apr 23, 2020 •

edited

Loading

sam-github commented Apr 23, 2020

jasnell commented Apr 23, 2020

Flarna commented Apr 23, 2020

trevnorris commented Apr 23, 2020

Flarna commented Apr 23, 2020

jasnell commented Apr 23, 2020

trevnorris commented Apr 23, 2020 •

edited

Loading

jasnell commented Apr 23, 2020

trevnorris commented Apr 23, 2020 •

edited

Loading

Flarna commented Apr 23, 2020

trevnorris commented Apr 23, 2020 •

edited

Loading

jasnell commented Apr 23, 2020

addaleax commented Apr 24, 2020

jasnell commented Apr 24, 2020

addaleax commented Apr 24, 2020

Flarna commented Apr 24, 2020

addaleax commented Apr 24, 2020

trevnorris commented Apr 24, 2020 •

edited

Loading

mhdawson commented Apr 24, 2020

jasnell commented Apr 24, 2020

trevnorris commented Aug 25, 2020 •

edited

Loading

cjihrig commented Aug 25, 2020 •

edited

Loading

trevnorris commented Aug 26, 2020

mhdawson commented Sep 2, 2020

trevnorris commented Sep 17, 2020

devSahinur commented Jun 10, 2024

[discuss] event loop idle metrics #33026

[discuss] event loop idle metrics #33026

Comments

jasnell commented Apr 23, 2020 • edited Loading

sam-github commented Apr 23, 2020

jasnell commented Apr 23, 2020

Flarna commented Apr 23, 2020

trevnorris commented Apr 23, 2020

Flarna commented Apr 23, 2020

jasnell commented Apr 23, 2020

trevnorris commented Apr 23, 2020 • edited Loading

jasnell commented Apr 23, 2020

trevnorris commented Apr 23, 2020 • edited Loading

Flarna commented Apr 23, 2020

trevnorris commented Apr 23, 2020 • edited Loading

jasnell commented Apr 23, 2020

addaleax commented Apr 24, 2020

jasnell commented Apr 24, 2020

addaleax commented Apr 24, 2020

Flarna commented Apr 24, 2020

addaleax commented Apr 24, 2020

trevnorris commented Apr 24, 2020 • edited Loading

mhdawson commented Apr 24, 2020

jasnell commented Apr 24, 2020

trevnorris commented Aug 25, 2020 • edited Loading

cjihrig commented Aug 25, 2020 • edited Loading

trevnorris commented Aug 26, 2020

mhdawson commented Sep 2, 2020

trevnorris commented Sep 17, 2020

devSahinur commented Jun 10, 2024

why show this ?

jasnell commented Apr 23, 2020 •

edited

Loading

trevnorris commented Apr 23, 2020 •

edited

Loading

trevnorris commented Apr 23, 2020 •

edited

Loading

trevnorris commented Apr 23, 2020 •

edited

Loading

trevnorris commented Apr 24, 2020 •

edited

Loading

trevnorris commented Aug 25, 2020 •

edited

Loading

cjihrig commented Aug 25, 2020 •

edited

Loading