
Update the resource-utilization calc: average over the last N ms #127

Merged

Conversation

@jfinken (Contributor) commented Oct 17, 2023

Overview

Previously, resource utilization (RU) was calculated as an average since program start. This PR changes the calculation to average over the last `period` milliseconds, where `period` is the sampling flag, e.g. --sampling 1000. (A minimal sketch of the per-period calculation follows the notes below.)

This change brings RU more in line with programs like htop, and follows an idea and request from @mauropasse.

  • Note the question below: should this behavior be exposed as a flag by irobot_benchmark, i.e. should users be able to choose between the two calculations?
  • Also note the change to the default resource-sampling period; it isn't strictly required by this PR.
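
For reference, here is a minimal sketch of the per-period calculation. This is an illustrative stand-alone loop, not the actual ResourceUsageLogger code: the helper name log_cpu_every_period and the loop structure are assumptions, the t1_user / t1_real names mirror the members visible in the review context further below, and the real logger's scaling of the final percentage may differ.

#include <chrono>
#include <ctime>
#include <thread>

// Sketch only: report CPU utilization over the *last* period by resetting the
// reference timestamps after every sample, instead of keeping the program-start ones.
void log_cpu_every_period(std::chrono::milliseconds period)
{
  auto t1_real = std::chrono::steady_clock::now();  // wall-clock reference
  std::clock_t t1_user = std::clock();              // process CPU-time reference
  auto next_wakeup = t1_real + period;

  while (true) {
    std::this_thread::sleep_until(next_wakeup);

    const auto t2_real = std::chrono::steady_clock::now();
    const std::clock_t t2_user = std::clock();

    const double elapsed_real_ms =
      std::chrono::duration<double, std::milli>(t2_real - t1_real).count();
    const double elapsed_user_ms =
      1000.0 * static_cast<double>(t2_user - t1_user) / CLOCKS_PER_SEC;

    // CPU% over the last window only: process CPU time / wall-clock time.
    const double cpu_pct = 100.0 * elapsed_user_ms / elapsed_real_ms;
    (void)cpu_pct;  // the real logger writes this row to resources.txt

    // Key change: reset the references so the next denominator covers only `period`.
    t1_real = t2_real;
    t1_user = t2_user;
    next_wakeup += period;
  }
}

With this structure the first row still covers program start-up (hence the initial spike), while every later row reflects only the most recent window, which is why the cpu[%] column in the Tests section below falls off quickly once the system settles.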

Tests

Tests were conducted inside a ROS 2 Rolling docker container:

  • white-mountain topology on x86
  • 60sec test duration
  • sampling period of 1000ms
  • note, for example, the differences in CPU utilization (cpu[%]):

this branch:

time[ms]       cpu[%]    arena[KB]      in_use[KB]     mmap[KB]       rss[KB]        vsz[KB]
0              0         0              0              0              0              0      
1000           2.60      45296          39335          129552         120092         3588216
2000           0.40      45428          39345          129552         120092         3588216
3000           0.25      45428          39349          129552         120092         3588216
4000           0.19      45428          39336          129552         120092         3588164
5000           0.16      45428          39352          129552         120092         3588216
6000           0.13      45428          39355          129552         120092         3588216
7000           0.12      45428          39339          129552         120092         3588164
8000           0.27      45428          39339          129552         120092         3588164
9000           0.23      45428          39357          129552         120092         3588216
10000          0.23      45428          39341          129552         120092         3588164
...
59000          0.013     45428          39342          129552         120092         3588164
60000          0.035     45428          39342          129552         120092         3588164
61000          0.0092    45560          39218          129552         121960         3522564

the target master branch:

time[ms]       cpu[%]    arena[KB]      in_use[KB]     mmap[KB]       rss[KB]        vsz[KB]
0              0         0              0              0              0              0
1000           2.5       45236          39333          129552         119924         3588148
2000           1.7       45368          39343          129552         119924         3588148
3000           1.6       45368          39348          129552         119924         3588148
4000           1.4       45368          39334          129552         119924         3588096
5000           1.3       45368          39352          129552         119924         3588148
6000           1.2       45368          39353          129552         119924         3588148
7000           1.3       45368          39337          129552         119924         3588096
8000           1.3       45368          39339          129552         119924         3588096
9000           1.2       45368          39354          129552         119924         3588148
10000          1.2       45368          39338          129552         119924         3588096
...
59000          1.2       45368          39340          129552         119924         3588096
60000          1.2       45368          39338          129552         119924         3588096
61000          1.2       45368          39214          129552         122084         3456960
  • on this branch the average cpu[%] falls off much more quickly
  • the final numbers are no longer skewed by the initial spike (object construction, initialization, etc.)
  • although with this change one may want to look at the full set of numbers in resources.txt rather than simply the last line or two

Update the resource-utilization calc: average over the last N milliseconds.

Previously, RU was calculated since program start. This commit changes the calculation to average over the last N milliseconds.
@@ -49,7 +49,9 @@ void ResourceUsageLogger::start(std::chrono::milliseconds period)
std::cout << "[ResourceUsageLogger]: Logging to " << m_filename << std::endl;

m_t1_real_start = std::chrono::steady_clock::now();
m_t1_user = std::clock();
// Question: make this a flag? That is, calc RU since program start or over
Collaborator:

If we have a situation in which we need the previous behavior, we could add a flag. I don't foresee a use case for it, though. As for backward compatibility, we have always looked at the last CPU values right before the program finishes, and with this change we can get those more quickly (after maybe just 5 seconds instead of 60).

Contributor Author:

Sounds good, I'll remove the question/comment. Thanks.

@mauropasse (Collaborator):

I'm going to give it a try, but looks good!

@mauropasse (Collaborator):

I just tried it, and it didn't seem to work well.
[screenshot: top output showing ~23% CPU]
top reported ~23% CPU, while resources.txt shows:

root@DDB-379b20:/home/logger/logs/ros2-application-dart-2023_10_18/irobot_benchmark
# cat white_mountain_log/resources.txt 
time[ms]       cpu[%]    arena[KB]      in_use[KB]     mmap[KB]       rss[KB]        vsz[KB]        
0              0         0              0              0              0              0              
500            25        6176           5860           768            24120          124012         
1000           14        16116          15718          20700          33896          153752         
1500           9.9       23268          22295          20700          39984          492312         
2000           1.3       23268          22306          20700          39992          492312         
2500           1.2       23268          22313          20700          39992          492312         
3000           1.1       23268          22317          20700          39992          492312         
3500           0.88      23268          22317          20700          39992          492312         
4000           0.82      23268          22318          20700          39992          492312    

Only the first CPU line looks correct, but then the values go very low.

@jfinken (Contributor, Author) commented Oct 18, 2023

> I just tried it, and it didn't seem to work well. top reported ~23% CPU, while resources.txt shows:
>
> Only the first CPU line looks correct, but then the values go very low.

Yep it isn't correct. Looking again, that denominator continues to increase, which is why the result gets small very quickly. Working on it...
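
To illustrate the effect with made-up numbers (not from this run): suppose the process uses roughly 5 ms of CPU in every 500 ms window, but the elapsed-real-time denominator keeps measuring time since program start. The reported values then decay as 5/500 = 1%, 5/1000 = 0.5%, 5/1500 ≈ 0.33%, and so on, even though the per-window usage is constant, which is the same shrinking pattern as the cpu[%] column above.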

// resource utilization only over the last `period` milliseconds, *not*
// since program start.
m_t1_user = std::clock();
m_t1_real = std::chrono::steady_clock::now();
@jfinken (Contributor, Author) commented Oct 18, 2023

@mauropasse

OK, note here how both of these members are updated right before sleep_until. Now the denominator, time_elapsed_real_ms, should reflect only the period duration.

Note that I don't think you can directly compare the value from top to what you see in the resources.txt file. As I understand it, top expresses %CPU as a percentage of total CPU time, whereas here we express cpu% as the approximate CPU time used by our process divided by monotonic wall-clock time, e.g.

// cpu time of our process   /  monotonic wall clock time, using these
std::clock()                 / std::chrono::steady_clock::now();

I think top and our calculation are somewhat different, but I could certainly be wrong.

@mauropasse (Collaborator) commented Oct 19, 2023

Yes, you're right, top and our calculation are a bit different. If you multiply the CPU% we get our way by the number of cores of your system, the values are quite similar.
I was mostly referring to the top approach of averaging every 1 or 2 seconds, so it doesn't reflect what happened before that window.
For that, the CPU load is what helps; I don't think we care about it, so our approach is fine!
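
As a rough sanity check (the core count here is a hypothetical figure, not stated in the thread): on a machine with roughly 18-20 cores, the ~1.2-1.3% values in resources.txt multiplied by the core count land around 22-26%, i.e. in the ballpark of the ~23% that top reported earlier.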

@mauropasse (Collaborator) commented Oct 19, 2023

Nice, with your latest commit we see the correct values. The CPU now converges right away to the steady CPU usage.
Here's a plot (left) of the difference between the new and previous CPU computation approaches (sampling every 500 ms):
[plot: new vs. previous CPU computation, 500 ms sampling (left) and 10 ms sampling (right)]
After 30 seconds, the original CPU computation still hadn't reached the steady value.
The image on the right uses 10 ms sampling. We can now see in fine detail what is going on with the CPU usage when the program starts, which looks cool!

@mauropasse (Collaborator):

Still playing with the sampling rate: using 1 ms we can get nice memory plots, which let us see how memory scales over time as the system is created, and the memory steps when shared libs are loaded, etc.
In this plot, the RSS of the white_mountain topology:
[plot: RSS of the white_mountain topology over time, 1 ms sampling]
This could be useful to compare the memory used by different ROS 2 distros when the system is the same.

@jfinken (Contributor, Author) commented Oct 23, 2023

@mauropasse mauropasse merged commit f1e7eba into irobot-ros:master Oct 30, 2023
1 check passed