Sometimes heavy CPU load #749

Barry-Xu-2018 · 2024-03-19T06:49:15Z

Bug report

Required Info:

Operating System:
- ubuntu 22.04
Installation type:
- binary
Version or commit hash:
- ros-humble-fastrtps 2.6.7-1jammy.20240217.034748
- ros-humble-rmw-fastrtps-cpp 6.2.6-1jammy.20240217.063351
- ros-humble-rmw-fastrtps-shared-cpp 6.2.6-1jammy.20240217.062037
DDS implementation:
- FastDDS
Client library (if applicable):
- rclcpp

Steps to reproduce issue

The reproduced code is located at https://github.com/Barry-Xu-2018/test_pkg.

It will load 60 publishers in different process and one component to run 60 subscribers.
They use the same topic. Each publisher sends a message every second.

Please build it in humble environment

$ source /opt/ros/humble/setup.bash
$ mkdir -p ros2_humble_ws/src && cd ros2_humble_ws/src
$ git clone https://github.com/Barry-Xu-2018/test_pkg
$ cd ..
$ colcon build --merge-install --symlink-install --cmake-args -DCMAKE_BUILD_TYPE=Release

Run

$ source install/setup.bash
$ ros2 launch test_pkg node_to_container.launch.py

While you terminate program by ctrl + c, shared memory maybe not cleared (Shared memory will be used in this program).
You can do it by manual. (e.g. rm -rf /dev/shm/*)

Sometimes you will see below warning. Please terminate this program, clear /dev/shm and run it again.

[component_container_isolated-61] [WARN] [1710829092.513748843] [component_container.rclcpp]: failed to send response to /component_container/_container/load_node (timeout): client will not receive response, at ./src/rmw_response.cpp:154, at ./src/rcl/service.c:314

Expected behavior

Low CPU load for component process

Actual behavior

Sometimes, you will find heavy CPU load for componet process.

The reproducibility rate varies on different machines.

Additional information

In my environment, if I compile the program with debug version, I find that the issue cannot be reproduced.
I found a simple method to reproduce a similar phenomenon.
After ros2 launch test_pkg node_to_container.launch.py, you use gdb attach component_container_isolated process and then continue to run.
Using this method, it's easy to reproduce the issue regardless of whether running the Debug or Release version.

For ROS2 rolling version, I've done more than 30 tests, and I still haven't been able to reproduce this issue. But if I use gdb way, this problem can be reproduced.

The text was updated successfully, but these errors were encountered:

Barry-Xu-2018 · 2024-03-19T07:12:40Z

After deep investigation, I find why CPU usage is high.

While issue occur, rcl_wait() in every thread is constantly awakened by two events: one is the subscription, and the other is the guard_condition.
However, during the processing of the subscription event, I found that rcl_take() always reports RCL_RET_SUBSCRIPTION_TAKE_FAILED.

The reason for the error returned by rcl_take() is
eProsima/Fast-DDS/src/cpp/rtps/reader/StatefulReader.cpp

bool StatefulReader::begin_sample_access_nts(
        CacheChange_t* change,
        WriterProxy*& wp,
        bool& is_future_change)
{
    const GUID_t& writer_guid = change->writerGUID;
    is_future_change = false;

    if (matched_writer_lookup(writer_guid, &wp))
    {
        SequenceNumber_t seq;
        seq = wp->available_changes_max();
        if (seq < change->sequenceNumber)      // change->sequenceNumber is alwasy bigger than wp->available_changes_max()
                                               // So this is future change. This leads that nothing is taken.
       {
            is_future_change = true;     
        }
    }

    return true;
}

About rcl_wait() is constantly awakened, it is related to

The subscription already has data (But the data is not available). This leads to skip rmw_wait().

rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_wait.cpp

Lines 66 to 68 in 0bbb772

    
           skip_wait |= 
        
             (ReturnCode_t::RETCODE_OK == 
        
             custom_subscriber_info->data_reader_->get_first_untaken_info(&sample_info));

After executing the executable's processing (After subscription process), the interrupt_guard_condition is triggered.
https://github.com/ros2/rclcpp/blob/47c977d1bc82fc76dd21f870bcd3ea473eca2f59/rclcpp/src/rclcpp/executor.cpp#L540

void
Executor::execute_any_executable(AnyExecutable & any_exec)
{
  ...
  interrupt_guard_condition_.trigger();
  ...
}

Barry-Xu-2018 · 2024-03-19T07:33:37Z

@MiguelCompany @EduPonz

Could you help to support this issue ?
What circumstances would cause WriteProxy::changes_from_writer_low_mark_ to update abnormally?

I add below log，

bool StatefulReader::begin_sample_access_nts(
        CacheChange_t* change,
        WriterProxy*& wp,
        bool& is_future_change)
{
    const GUID_t& writer_guid = change->writerGUID;
    is_future_change = false;

    if (matched_writer_lookup(writer_guid, &wp))
    {
        SequenceNumber_t seq;
        seq = wp->available_changes_max();
       std::cout << writer_guid << " : " << seq << " --- " << change->sequenceNumber << std::endl;
        if (seq < change->sequenceNumber)  
       {
            is_future_change = true;     
        }
    }

    return true;
}

While the issue occurs, the output as below

[component_container_isolated-61] 01.0f.3a.2c.fc.cc.07.8c.01.00.00.00|0.0.12.3 : 65 --- 70
[component_container_isolated-61] 01.0f.3a.2c.55.cd.90.36.01.00.00.00|0.0.12.3 : 62 --- 67
[component_container_isolated-61] 01.0f.3a.2c.72.cd.2f.e4.01.00.00.00|0.0.12.3 : 62 --- 67
[component_container_isolated-61] 01.0f.3a.2c.c4.cc.29.37.01.00.00.00|0.0.12.3 : 65 --- 70
[component_container_isolated-61] 01.0f.3a.2c.de.cc.91.14.01.00.00.00|0.0.12.3 : 65 --- 70
[component_container_isolated-61] 01.0f.3a.2c.fa.cc.2a.ef.01.00.00.00|0.0.12.3 : 65 --- 70
[component_container_isolated-61] 01.0f.3a.2c.f6.cc.e1.49.01.00.00.00|0.0.12.3 : 62 --- 67
[component_container_isolated-61] 01.0f.3a.2c.fc.cc.07.8c.01.00.00.00|0.0.12.3 : 65 --- 69
[component_container_isolated-61] 01.0f.3a.2c.c0.cc.90.6e.01.00.00.00|0.0.12.3 : 66 --- 70
[component_container_isolated-61] 01.0f.3a.2c.72.cd.2f.e4.01.00.00.00|0.0.12.3 : 62 --- 66
[component_container_isolated-61] 01.0f.3a.2c.c4.cc.29.37.01.00.00.00|0.0.12.3 : 65 --- 69
[component_container_isolated-61] 01.0f.3a.2c.de.cc.91.14.01.00.00.00|0.0.12.3 : 65 --- 69
[component_container_isolated-61] 01.0f.3a.2c.fa.cc.2a.ef.01.00.00.00|0.0.12.3 : 65 --- 69
[component_container_isolated-61] 01.0f.3a.2c.f6.cc.e1.49.01.00.00.00|0.0.12.3 : 62 --- 66
[component_container_isolated-61] 01.0f.3a.2c.dc.cc.2b.ca.01.00.00.00|0.0.12.3 : 66 --- 70
[component_container_isolated-61] 01.0f.3a.2c.29.ce.5a.f0.01.00.00.00|0.0.12.3 : 61 --- 65

fujitatomoya · 2024-03-22T00:34:54Z

@Barry-Xu-2018 thanks for creating issue.

as far as i tried on my environment, rolling ros2/ros2@30e9231 still also has the problem. that is so unlikely but i did see the problem once in 30 times. the problem is, once this happens it does not go away, it just keeps running the CPU resource and does not get back to normal state.

fujitatomoya · 2024-04-03T04:07:00Z

@MiguelCompany @EduPonz friendly ping.

MiguelCompany · 2024-04-03T07:11:56Z

@fujitatomoya We will try to reproduce and investigate this soon. According to the information provided by @Barry-Xu-2018 I think this will only happen with reliable topics.

Barry-Xu-2018 · 2024-04-03T09:33:11Z

@MiguelCompany Thank you for your help.

I think this will only happen with reliable topics.

Yes.

BTW, I discovered that the transport never invokes ReceiverResource::OnDataReceived().

fujitatomoya · 2024-04-17T22:57:24Z

@MiguelCompany @EduPonz

Could you help to support this issue ? What circumstances would cause WriteProxy::changes_from_writer_low_mark_ to update abnormally?

I add below log，

bool StatefulReader::begin_sample_access_nts(
        CacheChange_t* change,
        WriterProxy*& wp,
        bool& is_future_change)
{
    const GUID_t& writer_guid = change->writerGUID;
    is_future_change = false;

    if (matched_writer_lookup(writer_guid, &wp))
    {
        SequenceNumber_t seq;
        seq = wp->available_changes_max();
       std::cout << writer_guid << " : " << seq << " --- " << change->sequenceNumber << std::endl;
        if (seq < change->sequenceNumber)  
       {
            is_future_change = true;     
        }
    }

    return true;
}

While the issue occurs, the output as below

[component_container_isolated-61] 01.0f.3a.2c.fc.cc.07.8c.01.00.00.00|0.0.12.3 : 65 --- 70
[component_container_isolated-61] 01.0f.3a.2c.55.cd.90.36.01.00.00.00|0.0.12.3 : 62 --- 67
[component_container_isolated-61] 01.0f.3a.2c.72.cd.2f.e4.01.00.00.00|0.0.12.3 : 62 --- 67
[component_container_isolated-61] 01.0f.3a.2c.c4.cc.29.37.01.00.00.00|0.0.12.3 : 65 --- 70
[component_container_isolated-61] 01.0f.3a.2c.de.cc.91.14.01.00.00.00|0.0.12.3 : 65 --- 70
[component_container_isolated-61] 01.0f.3a.2c.fa.cc.2a.ef.01.00.00.00|0.0.12.3 : 65 --- 70
[component_container_isolated-61] 01.0f.3a.2c.f6.cc.e1.49.01.00.00.00|0.0.12.3 : 62 --- 67
[component_container_isolated-61] 01.0f.3a.2c.fc.cc.07.8c.01.00.00.00|0.0.12.3 : 65 --- 69
[component_container_isolated-61] 01.0f.3a.2c.c0.cc.90.6e.01.00.00.00|0.0.12.3 : 66 --- 70
[component_container_isolated-61] 01.0f.3a.2c.72.cd.2f.e4.01.00.00.00|0.0.12.3 : 62 --- 66
[component_container_isolated-61] 01.0f.3a.2c.c4.cc.29.37.01.00.00.00|0.0.12.3 : 65 --- 69
[component_container_isolated-61] 01.0f.3a.2c.de.cc.91.14.01.00.00.00|0.0.12.3 : 65 --- 69
[component_container_isolated-61] 01.0f.3a.2c.fa.cc.2a.ef.01.00.00.00|0.0.12.3 : 65 --- 69
[component_container_isolated-61] 01.0f.3a.2c.f6.cc.e1.49.01.00.00.00|0.0.12.3 : 62 --- 66
[component_container_isolated-61] 01.0f.3a.2c.dc.cc.2b.ca.01.00.00.00|0.0.12.3 : 66 --- 70
[component_container_isolated-61] 01.0f.3a.2c.29.ce.5a.f0.01.00.00.00|0.0.12.3 : 61 --- 65

@MiguelCompany

this does not fix the problem here, but is this supposed to be like this? otherwise, it always returns true...

diff --git a/src/cpp/rtps/reader/StatefulReader.cpp b/src/cpp/rtps/reader/StatefulReader.cpp
index ef4a19913..e47c7ced5 100644
--- a/src/cpp/rtps/reader/StatefulReader.cpp
+++ b/src/cpp/rtps/reader/StatefulReader.cpp
@@ -1449,6 +1449,11 @@ bool StatefulReader::begin_sample_access_nts(
             is_future_change = true;
         }
     }
+    else
+    {
+      // if matched writer is not available, remove this from history
+      return false;
+    }

     return true;
 }

MiguelCompany · 2024-04-18T07:05:38Z

@Barry-Xu-2018 @fujitatomoya eProsima/Fast-DDS#4696 should address this

MiguelCompany · 2024-04-18T07:18:57Z

this does not fix the problem here, but is this supposed to be like this? otherwise, it always returns true...

What you suggest was an error that was fixed in eProsima/Fast-DDS#2363. Let me explain:

If the reader receives a sample with a correct sequence (i.e. not in the future) from a writer, and the writer is gone before the user takes the sample, we should keep the sample there for the user to take it.

What we did in eProsima/Fast-DDS#2363 was to remove all future changes from the reader's history when the writer is unmatched, but keep the already notified changes so they can be taken

Barry-Xu-2018 · 2024-04-18T08:49:09Z

@MiguelCompany

Thank you for your fix.
After 30 tests in the Humble environment, the issue did not recur.

MiguelCompany · 2024-04-18T09:16:47Z

@MiguelCompany

Thank you for your fix. After 30 tests in the Humble environment, the issue did not recur.

Kudos to @Mario-DL who did all the work

fujitatomoya · 2024-04-18T15:58:23Z

eProsima/Fast-DDS#4696 works like a magic, i confirmed on my local env as well.

@Mario-DL @MiguelCompany thanks a bunch 🚀🚀🚀

Mario-DL mentioned this issue Apr 17, 2024

[20706] Make reader get_first_untaken_info() coherent with read()/take() eProsima/Fast-DDS#4696

Merged

13 tasks

KKSTB mentioned this issue Jul 8, 2024

Timer hanging and high CPU load when using MultiThreadedExecutor ros2/rclpy#1223

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sometimes heavy CPU load #749

Sometimes heavy CPU load #749

Barry-Xu-2018 commented Mar 19, 2024

Barry-Xu-2018 commented Mar 19, 2024

Barry-Xu-2018 commented Mar 19, 2024

fujitatomoya commented Mar 22, 2024

fujitatomoya commented Apr 3, 2024

MiguelCompany commented Apr 3, 2024

Barry-Xu-2018 commented Apr 3, 2024 •

edited

Loading

fujitatomoya commented Apr 17, 2024

MiguelCompany commented Apr 18, 2024

MiguelCompany commented Apr 18, 2024

Barry-Xu-2018 commented Apr 18, 2024

MiguelCompany commented Apr 18, 2024

fujitatomoya commented Apr 18, 2024

Sometimes heavy CPU load #749

Sometimes heavy CPU load #749

Comments

Barry-Xu-2018 commented Mar 19, 2024

Bug report

Steps to reproduce issue

Expected behavior

Actual behavior

Additional information

Barry-Xu-2018 commented Mar 19, 2024

Barry-Xu-2018 commented Mar 19, 2024

fujitatomoya commented Mar 22, 2024

fujitatomoya commented Apr 3, 2024

MiguelCompany commented Apr 3, 2024

Barry-Xu-2018 commented Apr 3, 2024 • edited Loading

fujitatomoya commented Apr 17, 2024

MiguelCompany commented Apr 18, 2024

MiguelCompany commented Apr 18, 2024

Barry-Xu-2018 commented Apr 18, 2024

MiguelCompany commented Apr 18, 2024

fujitatomoya commented Apr 18, 2024

Barry-Xu-2018 commented Apr 3, 2024 •

edited

Loading