Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rclcpp::Node constructor and destructor crash with multithreading #1042

Closed
rotu opened this issue Mar 31, 2020 · 13 comments · Fixed by #1125
Closed

rclcpp::Node constructor and destructor crash with multithreading #1042

rotu opened this issue Mar 31, 2020 · 13 comments · Fixed by #1125
Assignees
Labels
bug Something isn't working

Comments

@rotu
Copy link
Contributor

rotu commented Mar 31, 2020

Use of rclcpp::Node is unsafe in an asynchronous setting and can cause segfaults and undefined behavior if such objects are constructed and/or destructed simultaneously.

See ros2/rosbag2#329 where this directly caused a reproducible crash. While the particular crash was averted in ros2/rosbag2#338, the constructor/destructor still use the unsafe functions rcl_node_init/rcl_node_fini.

Either these functions should be made threadsafe or they should be documented as unsafe and audited/instrumented to detect such unsafe usage.

@rotu rotu changed the title rclcpp::Node constructor and destructor are not threadsafe rclcpp::Node constructor and destructor crash with multithreading Apr 1, 2020
@rotu
Copy link
Contributor Author

rotu commented Apr 3, 2020

@thomas-moulard re: https://github.com/ros-tooling/aws-oncall/issues/107 (can't comment on my own ticket in aws-oncall repo since I'm not a collaborator)

@hidmic hidmic added the bug Something isn't working label Apr 16, 2020
@fujitatomoya
Copy link
Collaborator

@rotu

does anybody work on the fix?

@rotu
Copy link
Contributor Author

rotu commented May 20, 2020

As far as I can tell, no. @dirk-thomas is listed as the package maintainer so maybe he knows.

@ivanpauno
Copy link
Member

@rotu how it was concluded that the thread safety issue is in Node constructor/destructor?
From the information available in ros2/rosbag2#329, it's not obvious to me.
Can you clarify how that conclusion was achieved?

I will try to track down the issue, but it would be great to have clearer information (and preferably, a simple example with reproduction steps).

@ivanpauno ivanpauno added the more-information-needed Further information is required label May 20, 2020
@dirk-thomas
Copy link
Member

@dirk-thomas is listed as the package maintainer

While I am listed in the manifest this repo is maintainer by the whole @ros2/team.

@ivanpauno
Copy link
Member

I think I understand the issue:

We really should get rid of those global variables in rcl ...

I don't see a way of completely fixing the issue, as the rosout output handler does access the same map (https://github.com/ros2/rcl/blob/94b5a1d7d0899aa84a1026e21488bee95e67bbd8/rcl/src/rcl/logging_rosout.c#L241), and adding a lock to mutually exclude all node constructor calls, node destructor calls and all logs seems like a pretty bad idea.

For the time being, I think that adding a global mutex protecting node construction/destruction will greatly reduce the issue (there might be a multithread crash between one thread constructing/destructing a node and another one logging something).

@wjwwood @hidmic any ideas?

@ivanpauno ivanpauno removed the more-information-needed Further information is required label May 20, 2020
@ivanpauno
Copy link
Member

For the time being, I think that adding a global mutex protecting node construction/destruction will greatly reduce the issue (there might be a multithread crash between one thread constructing/destructing a node and another one logging something).

Thinking about this again, I think we should actually add mutexes in rcl itself, protecting its global variables (e.g. https://github.com/ros2/rcl/blob/85ef95514d6ee47bb8285009add84e956595c018/rcl/src/rcl/logging_rosout.c#L83 and https://github.com/ros2/rcl/blob/85ef95514d6ee47bb8285009add84e956595c018/rcl/src/rcl/logging.c#L37).

I know there was some resistance about adding locks in rcl, but until we get rid of all the global state I think it's the best solution (loggers seem to be the only place where we have global state).

@rotu
Copy link
Contributor Author

rotu commented May 20, 2020

I think that locks in rcl are an excellent idea. If you instrument rcl for it, you could even add an option to compile with or without locks, or to verbosely log lock contention to report unsafe usage of the API.

@rotu
Copy link
Contributor Author

rotu commented May 20, 2020

@dirk-thomas is listed as the package maintainer

While I am listed in the manifest this repo is maintainer by the whole @ros2/team.

Please update the package to reflect reality

https://www.ros.org/reps/rep-0149.html#maintainer-multiple-but-at-least-one

@wjwwood
Copy link
Member

wjwwood commented May 21, 2020

👎 to adding locks to rcl. Why not add the locking in rclcpp?

We really should get rid of those global variables in rcl ...

This is the right solution that I have pointed out several times during the development of the logging API's. It is also needed to address long standing issues with static destruction order... Any other solutions are band-aids in my opinion and this particular proposed one happens to also be a step backwards in design as well, in my opinion.

@wjwwood
Copy link
Member

wjwwood commented May 21, 2020

and adding a lock to mutually exclude all node constructor calls, node destructor calls and all logs seems like a pretty bad idea.

Why?

@ivanpauno
Copy link
Member

This is the right solution that I have pointed out several times during the development of the logging API's. It is also needed to address long standing issues with static destruction order...

Yes, I regret having worked around a similar issue recently, instead of have pushed for a refactor of the loggers API.
Doing that refactor is not completely trivial, and we should push for that in early G-Turtle.

Any other solutions are band-aids in my opinion

Completely agreed.

and this particular proposed one happens to also be a step backwards in design as well, in my opinion.

I agree that writing "object oriented like" in rcl and then adding mutually exclusive protection of that object in the specific client library (rclpy/rclcpp/etc) is the way to go, and should be the goal.
In this case, I think it's not trivially fixable in an upper layer.

and adding a lock to mutually exclude all node constructor calls, node destructor calls and all logs seems like a pretty bad idea.

Why?

I will expand on this point ...
If we want to protect the global state in rcl logging (1, 2), we need to protect in rclcpp:

👎 to adding locks to rcl. Why not add the locking in rclcpp?

In general, I also vote 👎 for adding this kind of thing to rcl, but I really don't think this is fixable from an upper client layer without a thorough redesign of rcl logging API (and maybe, some changes in rcutilis logging API too).
And unfortunately, I don't think we have time for a thorough redesign before Foxy release.

@hidmic
Copy link
Contributor

hidmic commented May 21, 2020

I have to agree with @ivanpauno. For Foxy, assuming this issue is strictly related to rcl_logging's global state, we're past the time to land a proper a fix i.e. to remove all globals. Limiting the scope of the workaround by locking at the rcl level, as close to this global state as possible, is unfortunate but doable. That workaround can also be backported.

sloretz added a commit to ros2/rclpy that referenced this issue Mar 11, 2021
Reverts most of #562

ros2/rclcpp#1042 describes a crash that can occur in rclcpp when rcl
logging functions are called in different threads. This was fixed in
ros2/rclcpp#1125, and a similar fix was made for rclpy in #562.

This fix is unnecessary in rclpy because it cannot call the logging
macros from multiple threads unless the GIL is released. The only place
the GIL is released is around rcl_wait(), so the logging methods are
already protected.

Removing this code makes it a little easier to divide the remaining work
of porting _rclpy.c to pybind11. If for some reason we decide to release
the GIL around logging methods in the future, then they can be protected
in the future using `pybind11::call_guard<T>` with a type that locks a
global logging mutex when it is default constructed and unlocks it when
its destructed.

Signed-off-by: Shane Loretz <[email protected]>
Signed-off-by: Shane Loretz <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants