Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add eventloop utilization default metric #518

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

ivan-tymoshenko
Copy link

Close #506

@ivan-tymoshenko
Copy link
Author

@Eomm, I find you as a contributor of fastify-elu-scaler. If you have some time and interested, PTAL.

Copy link

@mcollina mcollina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@ivan-tymoshenko
Copy link
Author

To measure ELU, you need to get the first utilization value, wait for some timeout, get the second utilization value and then count the "difference". The are at least two ways to implement this.

  1. Implementation that you can see in this PR. It's pretty straightforward. When you call the collect method, it starts the measurement, waits for a timeout, and returns the result. It is an async metric, which means you will have to wait for the timeout to get the result, and because all other default metrics are sync, it might be confusing if you set up a big timeout.

  2. You can find this implementation here. It gets the elu metric all the time in setInterval. That means I can get the last measurement in a sync way. But it adds some side effect behavior (timeout) and makes observations without an explicit call of the collect function.

I don't know which way is more acceptable from user perspective.

@zbjornson
Copy link
Collaborator

Thank you both for the PR and review.

@trevnorris you wrote a nice blog post on this (and I think did a lot of the impl?), any recommendation whether we should use setInterval to measure continuously or setTimeout to only measure when probed? One possible difference is that on-probe will be biased by other prom-client collectors.

@trevnorris
Copy link

@zbjornson I'd recommend using a setInterval. The ELU is meant to be recorded as regular time series data, the same as CPU.

@ivan-tymoshenko
Copy link
Author

@zbjornson Does summary sliding window work for sum and count parameters? It looks like not. How I can count the average value for some time period?

@zeldrinn
Copy link

zeldrinn commented Dec 5, 2022

@zbjornson what's the status on getting this merged? thanks!

@johnytiago
Copy link

Hey @zbjornson @ivan-tymoshenko is this PR blocked? Anything we can do to help get this through? 🙏🏽

@kibertoad
Copy link

@glensc @SimenB Is there anything we can do to help push this over the finish line?

@ivan-tymoshenko
Copy link
Author

I don't remember why it was blocked. I will take a look this week.

@ivan-tymoshenko
Copy link
Author

@trevnorris @glensc @SimenB @johnytiago Please take a look if this implementation makes sence and works for you.

@simon-paris
Copy link

Hey, looking forward to seeing this merged! 🚀

I've been running my own implementation of this metric, using an interval that updates a guage, with a period of 1 second. You can see this chart is still very noisy, so IMO your default period of 100ms might be too small.

Screenshot 2024-01-12 at 2 23 38 PM

And also, this code I think could give incorrect data:

  • It could sample at the wrong frequency, if timeMs / intervalTimeout is stable but not a whole number. E.g. if it's consistently 1.4 it would sample too slowly and if it's consistently 1.6 it would sample too often.
  • It could mistakenly report a small utilization value multiple times if a long blocking event happens near the end of an interval period. E.g. if it's idle for 99ms, then blocks for 101ms, you'd report [0.505, 0.505] when you should report [0.01, 1]

I'd appreciate it if you could also expose a guage version of this metric as well as the histogram, like how eventLoopLag does.

const blockedIntervalsNumber = Math.round(timeMs / intervalTimeout);
for (let i = 0; i < blockedIntervalsNumber; i++) {
	summary.observe(value);
	histogram.observe(value);
}

@ivan-tymoshenko
Copy link
Author

I've been running my own implementation of this metric, using an interval that updates a guage, with a period of 1 second. You can see this chart is still very noisy, so IMO your default period of 100ms might be too small.

I don't have much experience measuring elu, but I can see in articles/examples that people use a relatively small timeout for measuring elu around 50-100ms. Maybe @mcollina @trevnorris can help here.
https://nodesource.com/blog/event-loop-utilization-nodejs/
https://github.com/nearform/fastify-elu-scaler/blob/42efb4bf84ed1ffe7373e5f0d1ede7c92c2a7683/plugins/elu.js#L26

It could sample at the wrong frequency, if timeMs / intervalTimeout is stable but not a whole number. E.g. if it's consistently 1.4 it would sample too slowly and if it's consistently 1.6 it would sample too often.

The only posible situattion when timeMs > intervalTimeout is when event loop completelly blocked for timeMs and timeMs is bigger than intervalTimeout. In this case elu metrics equal 1. Of cource rounding this value is an approximation, but I simply don't see a better way to cover this case.

I'd appreciate it if you could also expose a guage version of this metric as well as the histogram, like how eventLoopLag does.

The main question for me here is when we should start and stop measuring elu in this case.

@simon-paris
Copy link

simon-paris commented Jan 15, 2024

I don't have much experience measuring elu, but I can see in articles/examples that people use a relatively small timeout for measuring elu around 50-100ms. Maybe @mcollina @trevnorris can help here.

You're right, I tried it out at 100ms and it looks good.

The only posible situattion when timeMs > intervalTimeout is when event loop completelly blocked for timeMs and timeMs is bigger than intervalTimeout. In this case elu metrics equal 1.

Here's some repro code that causes it to output small values multiple times. It happens when you've got blocking code in a timeout callback, it doesn't happen when it's in an io callback.

setInterval(() => {
  setTimeout(() => {
    const t1 = Date.now();
    while (Date.now() < t1 + 110) {}
  }, 50);
}, 100);

@ivan-tymoshenko
Copy link
Author

Here's some repro code that causes it to output small values multiple times. It happens when you've got blocking code in a timeout callback, it doesn't happen when it's in an io callback.

I understand. If you have a suggestion on how to measure elu in a more accurate way, you are welcome.

@kibertoad
Copy link

@ivan-tymoshenko is this mostly complete and just need conflicts resolved, or something is still missing?

@trevnorris
Copy link

@ivan-tymoshenko Just want to make sure it's understood, ELU as measured by Node isn't approximated. I patched libuv so it tracks it down to the system call. The interval length you choose to get the ELU won't change what it returns. It's always precise.

@SleepWalker
Copy link

Why not just go with implementation like this one?

#474

The majority of users just need an average value. What will you do with the whole distribution of loop utilization? This looks more like profiling rather then metric. You aren't interested in load spikes when you are running production code (will you scale on each spike? Or what's the use case?). But in case you are — it looks like out of scope of the default metrics.

@simon-paris
Copy link

simon-paris commented Nov 8, 2024

Why not just go with implementation like this one?
#474

I tried this out, it works, but it tends to give exactly same data as CPU usage because it gets averaged over a too long period.

(The same thing happens with the ELL metrics - which isn't a problem because quantile metrics are provided. https://github.com/siimon/prom-client/blob/master/lib/metrics/eventLoopLag.js#L74-L132)

The usefulness of the ELU metric is that it works well when measured over a small timespan (like 1s) so you can respond to performance drops very quickly.

Here's an example.

CPU usage
Screenshot 2024-11-08 at 12 16 09 PM

15s sum(rate(idleTime)) (its the same)
Screenshot 2024-11-08 at 12 16 15 PM

1s ELU window. This chart paints the correct picture, it shows that the server is actually not very healthy.
Screenshot 2024-11-08 at 12 16 23 PM

You aren't interested in load spikes when you are running production code

I am very interested in small load spikes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add event loop utilization as a default metrics
9 participants