Global logger mutex acquisition should be scoped smaller than entire node initialization #2147

sloretz · 2023-03-30T21:11:06Z

#1125 made NodeBase acquire the global logging mutex before calling rcl_node_init().

rclcpp/rclcpp/src/rclcpp/node_interfaces/node_base.cpp

Lines 58 to 63 in a5368e6

    
    std::lock_guard<std::recursive_mutex> guard(*logging_mutex);
    // TODO(ivanpauno): /rosout Qos should be reconfigurable.
    // TODO(ivanpauno): Instead of mutually excluding rcl_node_init with the global logger mutex,
    // rcl_logging_rosout_init_publisher_for_node could be decoupled from there and be called
    // here directly.
    ret = rcl_node_init(

The rclcpp output handler also acquires the global logging mutex.

rclcpp/rclcpp/src/rclcpp/context.cpp

Line 133 in a5368e6

std::lock_guard<std::recursive_mutex> guard(*logging_mutex);

ros2/rmw_fastrtps#671 added logging in a callback in CustomParticipantInfo, meaning when that callback is called, it will try to acquire the global logging mutex.

https://github.com/ros2/rmw_fastrtps/blob/901339f274fc07fad757fb32bd16f00815217302/rmw_fastrtps_shared_cpp/include/rmw_fastrtps_shared_cpp/custom_participant_info.hpp#L214-L220

In eProsima's PDP class, there's another mutex that gets acquired. One way it's acquired is before calling the above callback in CustomParticipantInfo.

https://github.com/eProsima/Fast-DDS/blob/8a5a9160482b1543495c1ba49f3100fcceda12d9/src/cpp/rtps/builtin/discovery/participant/PDP.cpp#L773-L847

Another way it gets acquired is when creating a datawriter, as rmw_fastrtps_cpp does while creating the ros_discovery_info topic.

https://github.com/ros2/rmw_fastrtps/blob/901339f274fc07fad757fb32bd16f00815217302/rmw_fastrtps_cpp/src/init_rmw_context_impl.cpp#L91-L97

This leads to the mutex being acquired here:

https://github.com/eProsima/Fast-DDS/blob/8a5a9160482b1543495c1ba49f3100fcceda12d9/src/cpp/rtps/builtin/discovery/participant/PDP.cpp#L858-L869

I'm seeing a case where a cyclonedds subscriber is already running and I start a FastDDS publisher. The main thread acquires the logging mutex, begins initializing the node, and then blocks trying to acquire the PDP mutex while creating the ros_discovery_info topic. It blocks because another thread has learned of the cyclonedds subscriber, acquired the PDP mutex, notified the custom participant listener, and tried to log a message about a type hash mismatch. That log call waits on the logging mutex, which the main thread is still holding.
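The lock-order inversion described above can be reproduced in isolation. The following is a minimal sketch, not the rclcpp or Fast DDS code: both mutexes, the `InversionResult` struct, and `demonstrate_lock_inversion()` are hypothetical stand-ins, and timed lock attempts are used so that the program reports the inversion instead of hanging.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-ins: neither mutex is the real rclcpp or Fast DDS object.
struct InversionResult {
  bool main_got_pdp;       // main thread acquired the PDP mutex in time
  bool discovery_got_log;  // discovery thread acquired the logging mutex in time
};

InversionResult demonstrate_lock_inversion() {
  std::timed_mutex logging_mutex;  // stands in for the global logging mutex
  std::timed_mutex pdp_mutex;      // stands in for the PDP mutex
  std::atomic<int> holding{0};     // threads that hold their first lock
  std::atomic<int> attempted{0};   // threads that finished their second attempt
  InversionResult result{false, false};

  // Two-thread rendezvous: returns once both threads have arrived.
  auto wait_for_both = [](std::atomic<int> & counter) {
    counter.fetch_add(1);
    while (counter.load() < 2) {
      std::this_thread::yield();
    }
  };

  // Discovery thread: PDP mutex first, then logging (listener callback logs).
  std::thread discovery([&]() {
    std::lock_guard<std::timed_mutex> pdp_guard(pdp_mutex);
    wait_for_both(holding);
    if (logging_mutex.try_lock_for(std::chrono::milliseconds(50))) {
      result.discovery_got_log = true;
      logging_mutex.unlock();
    }
    wait_for_both(attempted);  // keep holding the PDP mutex until both tried
  });

  // Main thread: logging mutex first, then PDP (node init creates a datawriter).
  {
    std::lock_guard<std::timed_mutex> log_guard(logging_mutex);
    wait_for_both(holding);
    if (pdp_mutex.try_lock_for(std::chrono::milliseconds(50))) {
      result.main_got_pdp = true;
      pdp_mutex.unlock();
    }
    wait_for_both(attempted);  // keep holding the logging mutex until both tried
  }

  discovery.join();
  return result;
}
```

Calling `demonstrate_lock_inversion()` returns with both flags false: each thread times out waiting for the lock the other one holds. With plain blocking locks, as in the real code paths, the same ordering would wait forever.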

I think the scope of the logging mutex acquisition should be reduced so that it doesn't cover the entire rcl_node_init() call. That would avoid deadlocks like this one when logging happens during initialization.
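To illustrate the narrower scope, here is a rough sketch in the spirit of the TODO comment quoted earlier. It is not the actual rclcpp or rcl change: `FakeMiddleware` and all of its members are hypothetical stand-ins, with `init_node()` playing the role of `rcl_node_init()` and `init_rosout_publisher()` playing the role of the rosout publisher setup. The point it shows is that the initializing thread never holds the logging mutex and the PDP mutex at the same time, so a concurrent discovery callback that logs cannot deadlock against it.

```cpp
#include <mutex>
#include <thread>

// Hypothetical stand-in for the pieces involved; not a real ROS 2 or Fast DDS type.
struct FakeMiddleware {
  std::recursive_mutex logging_mutex;  // stands in for the global logging mutex
  std::mutex pdp_mutex;                // stands in for the PDP mutex
  int log_calls = 0;                   // guarded by logging_mutex
  bool node_ready = false;             // guarded by pdp_mutex
  bool rosout_ready = false;           // guarded by logging_mutex

  // Output handler: takes the logging mutex, like the rclcpp output handler.
  void log(const char * /*message*/) {
    std::lock_guard<std::recursive_mutex> guard(logging_mutex);
    ++log_calls;
  }

  // Discovery callback: holds the PDP mutex while it logs.
  void on_participant_discovered() {
    std::lock_guard<std::mutex> guard(pdp_mutex);
    log("type hash mismatch");
  }

  // Stand-in for node initialization, which needs the PDP mutex.
  void init_node() {
    std::lock_guard<std::mutex> guard(pdp_mutex);
    node_ready = true;
  }

  // Stand-in for rosout publisher setup, the only step that needs the logging mutex.
  void init_rosout_publisher() {
    std::lock_guard<std::recursive_mutex> guard(logging_mutex);
    rosout_ready = true;
  }
};

// Initializes the fake node while a discovery thread logs concurrently.
bool create_node_with_narrow_lock(FakeMiddleware & mw) {
  std::thread discovery([&mw]() {
    for (int i = 0; i < 100; ++i) {
      mw.on_participant_discovered();
    }
  });

  mw.init_node();              // logging mutex is NOT held here
  mw.init_rosout_publisher();  // logging mutex held only for this step

  discovery.join();
  // After join() no other thread touches the fields, so plain reads are safe.
  return mw.node_ready && mw.rosout_ready && mw.log_calls == 100;
}
```

With a default-constructed `FakeMiddleware`, `create_node_with_narrow_lock()` returns true regardless of how the two threads interleave, because the only nested acquisition left is PDP mutex then logging mutex, and nothing takes them in the opposite order.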

A workaround for the deadlock is to make rmw_fastrtps not use RCUTILS logging, so that it won't try to acquire the global logging mutex.


fujitatomoya · 2023-04-27T22:38:53Z

@sloretz As you and @ivanpauno mentioned in the doc section, those can be decoupled, and the scope of the logging mutex acquisition should be reduced to avoid the deadlock. I created a few PRs to address this issue. Can you take a look when you have time?

CC: @clalancette

sloretz added the bug label Mar 30, 2023

clalancette added the backlog label Apr 6, 2023

This was referenced Apr 27, 2023

Decouple rosout publisher init from node init. ros2/rcl#1065

Merged

Decouple rosout publisher init from node init. ros2/rclpy#1121

Merged

Decouple rosout publisher init from node init. #2174

Merged

This was referenced Apr 27, 2023

Decouple rosout publisher init from node init. ros2/rclc#351

Merged

Revert "Revert "Decouple rosout publisher init from node init. (#351)… ros2/rclc#353

Closed

fujitatomoya mentioned this issue Sep 27, 2023

Revert "Revert "Decouple rosout publisher init from node init. (#351)… ros2/rclc#392

Open

hwoithe mentioned this issue Jun 5, 2024

Hang loading components ros2/rmw_zenoh#182

Closed
