TestInteractiveMarkerClient.states fails when run with rmw_connextdds #40
Comments
@ivanpauno @clalancette please take a look at the analysis and let me know if you have any thoughts on how to resolve the issue.
Uff 🤦♂️
The suggested patch seems to be only a workaround.
I think ros2/rclcpp#1668 should fix the issue.
@ivanpauno thank you for looking into this. Just wondering if you validated the fix with I don't see CI results (the one in ros2/rclcpp#1668 only tested Cyclone) and this error only occurred when
Good question, I didn't, sorry.
Great, looks like it fixed the issue. Thanks again!
Test `TestInteractiveMarkerClient.states` from package ros-visualization/interactive_markers fails when run with `rmw_connextdds` (ref):

Both failures [1][2] are caused by an `interactive_markers::InteractiveMarkerClient` object not transitioning to `ClientState::STATE_IDLE` within a pretty generous timeout (3s). Both times, these state transitions would normally be triggered by `rclcpp::Client::service_is_ready()`, called on member `get_interactive_markers_client_`, returning `false` [1][2]. This is expected to occur because both times the `mock_server` has been removed from the `executor` and deleted.

Unfortunately, the client never seems to "lose service", and the transitions never happen.
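The failure mode described above can be sketched without ROS. The following is a minimal, hypothetical model (the names `MockClient`, `update()` and `wait_for_state()` are illustrative assumptions, not the `interactive_markers` API): a client drops to idle only once its readiness check returns `false`, and a test polls for that state with a timeout.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical, simplified stand-in for the client's state machine.
enum class ClientState { IDLE, RUNNING };

struct MockClient {
  explicit MockClient(std::function<bool()> ready)
  : service_is_ready(std::move(ready)) {}

  // Stand-in for the readiness check made on the real service client.
  std::function<bool()> service_is_ready;
  ClientState state = ClientState::RUNNING;

  // Drop back to IDLE once the service is no longer available.
  void update() {
    if (!service_is_ready()) {
      state = ClientState::IDLE;
    }
  }
};

// Poll the client until it reaches `expected` or `timeout` elapses.
bool wait_for_state(
  MockClient & client, ClientState expected, std::chrono::milliseconds timeout)
{
  assert(timeout.count() >= 0);
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (std::chrono::steady_clock::now() < deadline) {
    client.update();
    if (client.state == expected) {
      return true;
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  return false;
}
```

In this model, a client whose readiness check keeps returning `true` (as happens when the server's endpoints are never deleted) makes `wait_for_state(client, ClientState::IDLE, ...)` time out and return `false`, which mirrors the test failure reported here.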
The key to why this is happening lies in the `ERROR` line:

This error is thrown by rclcpp when the node associated with an `rcl_service_t` has already been deleted at the time of the service's deletion [1]. Since the node handle is not valid anymore, `rcl_service_fini()` and `rmw_destroy_service()` are never called for that service.

I have been doing some digging around and I've determined that the reason for this leak lies in the following sequence of events:
1. The `rcl_service_t` becomes active while the test is spinning the executor [1][2][3].
2. The `rcl_service_t` remains referenced by `AllocatorMemoryStrategy::service_handles_`.
3. The client transitions to the `RUNNING` state [1], which causes the test to delete the `mock_server`. Since the `rcl_service_t` is still referenced, its "destructor" is not run.
4. The test then waits for the client to transition to `IDLE`. This in turn causes the executor to finally notice that the `rcl_service_t` is associated with a deleted node, and finally clears its reference, causing the destructor to run, and the error to be printed [1].

Unfortunately the client will never transition to IDLE because in order to do this, its "request publisher" and "reply subscriber" would have to unmatch all endpoints from matching services [1], but this will never happen thanks to the leaked `rcl_service_t` (since `rmw_destroy_service()` was not called and the underlying DDS endpoints will never be deleted). I believe these leaked endpoints are also the reason for the finalization errors reported by Connext and `rmw_connextdds` at the end of the test.

Fix/Workaround
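To illustrate the ownership problem analyzed above and the effect of the check order discussed in this section, here is a minimal standalone model. It is only a sketch under stated assumptions: `Handle`, `ReadySets`, `CheckOrder`, `take_next()` and `service_finalized_while_node_alive()` are hypothetical names, and the code does not reproduce the actual rclcpp implementation or the actual patch. It only models the idea that a ready handle stays referenced by the executor's bookkeeping until it is picked, so its destructor may run after its node is gone.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an rcl handle; runs a hook when destroyed.
struct Handle {
  explicit Handle(std::function<void()> on_destroy = {})
  : on_destroy_(std::move(on_destroy)) {}
  ~Handle() { if (on_destroy_) { on_destroy_(); } }
  std::function<void()> on_destroy_;
};

// Toy model of the handles a memory strategy still references after a wait.
struct ReadySets {
  std::vector<std::shared_ptr<Handle>> subscriptions;
  std::vector<std::shared_ptr<Handle>> services;
};

enum class CheckOrder { SubscriptionsFirst, ServicesFirst };

// Take one ready handle, removing it from its list; the other list is untouched.
std::shared_ptr<Handle> take_next(ReadySets & ready, CheckOrder order) {
  auto pop = [](std::vector<std::shared_ptr<Handle>> & v) {
    std::shared_ptr<Handle> h;
    if (!v.empty()) {
      h = v.front();
      v.erase(v.begin());
    }
    return h;
  };
  const bool services_first = (order == CheckOrder::ServicesFirst);
  auto & first = services_first ? ready.services : ready.subscriptions;
  auto & second = services_first ? ready.subscriptions : ready.services;
  if (auto h = pop(first)) {
    return h;
  }
  return pop(second);
}

// Returns true if the service handle was destroyed while its node was still alive.
bool service_finalized_while_node_alive(CheckOrder order) {
  bool node_alive = true;
  bool finalized_cleanly = false;

  // The owner (the mock server in this model) holds one reference to each entity.
  auto service = std::make_shared<Handle>([&] { finalized_cleanly = node_alive; });
  auto subscription = std::make_shared<Handle>();

  // After a wait, both entities are ready and referenced by the bookkeeping.
  ReadySets ready;
  ready.subscriptions.push_back(subscription);
  ready.services.push_back(service);

  // The executor picks exactly one executable, runs it, and drops its reference.
  auto taken = take_next(ready, order);
  assert(taken != nullptr);
  taken.reset();

  // Side effect of the callback: the owner and then its node are deleted.
  service.reset();
  subscription.reset();
  node_alive = false;

  // Only later does the executor clear whatever it still references.
  ready.subscriptions.clear();
  ready.services.clear();
  return finalized_cleanly;
}
```

In this model, `service_finalized_while_node_alive(CheckOrder::SubscriptionsFirst)` returns `false` (the service handle outlives its node), while `CheckOrder::ServicesFirst` returns `true`. This only mirrors the reasoning in this issue; it says nothing about whether reordering is the right fix in rclcpp itself.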
No changes are required in `rmw_connextdds`.

The only change required to make the test pass consistently is to modify `Executor::get_next_ready_executable_from_map()` in rclcpp/src/rclcpp/executor.cpp so that services are checked before subscriptions:

The reason why this small change fixes the leak, and the test in general, is that if the `rcl_service_t` is detected as active, it ends up being removed from `AllocatorMemoryStrategy::service_handles_` and it can be properly finalized alongside its parent node. The subscription is still detected on the following wait, making the test progress as expected.

This is probably not The Fix, since this issue seems to outline a general memory management problem (i.e. the leak of `rcl_service_t`). While making the change in `rclcpp` will make the test pass, I suspect more problems will inevitably arise until the code is refactored so that all RMW resources are always guaranteed to be correctly finalized.

Sidenote
Turns out these errors have nothing to do with the reliability protocol's configuration, as I had suspected when I opened #26. I'm not quite sure why a very fast heartbeat rate is also able to make the test pass, but I suspect it has something to do with altering the timing and ordering in which the executor clears its associated objects.
EDIT: grammar and link to the correct related PR (#26)