Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Global logger mutex acquisition should be scoped smaller than entire node initialization #2147

Open
sloretz opened this issue Mar 30, 2023 · 1 comment
Labels
backlog bug Something isn't working

Comments

@sloretz
Copy link
Contributor

sloretz commented Mar 30, 2023

#1125 made NodeBase acquire the global logging mutex before calling rcl_node_init().

std::lock_guard<std::recursive_mutex> guard(*logging_mutex);
// TODO(ivanpauno): /rosout Qos should be reconfigurable.
// TODO(ivanpauno): Instead of mutually excluding rcl_node_init with the global logger mutex,
// rcl_logging_rosout_init_publisher_for_node could be decoupled from there and be called
// here directly.
ret = rcl_node_init(

The rclcpp output handler also acquires the global logging mutex.

std::lock_guard<std::recursive_mutex> guard(*logging_mutex);

ros2/rmw_fastrtps#671 added logging in a callback in CustomParticipantInfo, meaning when that callback is called, it will try to acquire the global logging mutex.

https://github.com/ros2/rmw_fastrtps/blob/901339f274fc07fad757fb32bd16f00815217302/rmw_fastrtps_shared_cpp/include/rmw_fastrtps_shared_cpp/custom_participant_info.hpp#L214-L220

In eProsima's PDP class, there's another mutex that gets acquired. One way it's acquired is before calling the above callback in CustomParticipantInfo.

https://github.com/eProsima/Fast-DDS/blob/8a5a9160482b1543495c1ba49f3100fcceda12d9/src/cpp/rtps/builtin/discovery/participant/PDP.cpp#L773-L847

Another way it get's acquired is when creating a datawriter, as rmw_fastrtps_cpp does while creating the ros_discovery_info topic.

https://github.com/ros2/rmw_fastrtps/blob/901339f274fc07fad757fb32bd16f00815217302/rmw_fastrtps_cpp/src/init_rmw_context_impl.cpp#L91-L97

Which leads to the mutex being acquired here:

https://github.com/eProsima/Fast-DDS/blob/8a5a9160482b1543495c1ba49f3100fcceda12d9/src/cpp/rtps/builtin/discovery/participant/PDP.cpp#L858-L869

I'm seeing a case where a cyclonedds subscriber is already started, and I'm starting a FastDDS publisher. The main thread acquires the logging mutex, tries to init the node, and is blocked trying to acquire the PDP mutex while creating the ros_discovery_info topic. The reason it's blocked is there's another thread that learned of the cyclonedds subscriber, acquired the PDP mutex, notified the custom participant listener, tried to log a message about a type hash mismatch, which tries to acquire the logging mutex that's held in the main thread.

I think the acquisition of the logging mutex should be reduced in scope so that it doesn't cover the entire rcl_node_init() call, so that deadlock like this is avoided when logging happens during the initialization process.

A workaround for the deadlock is to make rmw_fastrtps not use RCUTILS logging, so that it won't try to acquire the global logging mutex.

@fujitatomoya
Copy link
Collaborator

@sloretz as you and @ivanpauno mentioned in the doc section, those can be decoupled and acquisition of the logging mutex should be reduced to avoid deadlock. i created a few PRs to address this issue, can you take a look when you have time?

CC: @clalancette

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backlog bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants