MessageFilter stuck due to transform timeouts #3352
Comments
Can you try using Cyclone DDS? I'm not sure if it would fix your issue, but I suspect this doesn't really have much to do with Nav2 or TF so much as with networking getting bogged down or not working properly. We have other issue ticket reports about Fast-DDS causing issues. This is a new one (to me), but it's a good, easy first step. I don't see anything of this nature happening on my side (I develop on the main branch every day). Have you tried in Docker or on a Linux machine? Perhaps it's something Windows-y. I've never used a Windows machine for programming.
Thanks a lot for the quick reply! I've been running the tb3 simulation with Cyclone DDS now for over an hour without any issue. Also my own robot setup seems fine for the moment. I'll observe the situation for a week or two and then post my findings here. And no, I haven't tried on a Linux machine or Docker yet, but I will also do that to see if it has an influence. Regarding Docker, I wouldn't expect any difference, since the WSL2-based engine is the preferred choice for Docker on Windows. Anyhow, if Cyclone DDS proves to be more stable, I'll stick with it...
Hi @gislers, please check with PR eProsima/Fast-DDS/pull/3195, the backport to Fast-DDS 2.6.x (Humble). This could produce an internal delay which, in turn, can potentially result in TF2 and tf2 MessageFilters losing their time tolerances. Friendly ping @SteveMacenski
Thanks for the info!
Same here... I was at a loss!
Hi @gislers Can you share your cyclonedds.xml configuration (with IP addresses removed)? I am currently facing the same issue as you, and I tried Cyclone DDS, but I haven't managed to configure cyclonedds.xml correctly yet. I hope I can use your cyclonedds.xml as a reference. Best,
Hi @HappySamuel, in order to solve the MessageFilter issue I simply activated Cyclone DDS by following the official docs: install Cyclone DDS, then export the environment variable that selects it (a sketch of the commands follows below).
Note:
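A minimal sketch of those two steps, assuming a Debian/Ubuntu-based ROS 2 Humble binary install (package and variable names as given in the official docs):

```bash
# Install the Cyclone DDS RMW implementation for ROS 2 Humble
sudo apt install ros-humble-rmw-cyclonedds-cpp

# Tell ROS 2 to use it (set this in every shell that starts nodes, or add it to ~/.bashrc)
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
```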
Bug report
Required Info:
Steps to reproduce issue
Run the TurtleBot3 simulation from a binary ROS 2 Humble installation, set the 2D pose in rviz, and wait for the bug to appear (the typical bringup command is sketched below):
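A minimal sketch of the standard simulation bringup, following the Nav2 getting-started docs (assumes the Humble binaries and the turtlebot3_gazebo package are installed):

```bash
source /opt/ros/humble/setup.bash
export TURTLEBOT3_MODEL=waffle
export GAZEBO_MODEL_PATH=$GAZEBO_MODEL_PATH:/opt/ros/humble/share/turtlebot3_gazebo/models

# Brings up Gazebo, the TB3 robot, Nav2, and rviz in one go
ros2 launch nav2_bringup tb3_simulation_launch.py headless:=False
```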
Expected behavior
Message filters used in global and local costmaps should provide transformed messages and not constantly drop messages.
Actual behavior
After a random amount of time (can be 1 minute, can be 15 minutes), the global costmap message filter starts dropping messages with the following log:
As a consequence the global footprint as seen in rviz freezes, and navigation doesn't work anymore (wait for the second run):
https://user-images.githubusercontent.com/54321736/211038437-4a03465b-0006-4819-ad78-fe3705b86da2.mp4
The issue can also be observed for the local costmap in the same manner as for the global costmap. Sometimes both nodes show this behavior at the same time.
Additional information
Full log of an example run where the global costmap starts failing at t=81.555. As far as I understand, the initial message filter logs are normal and disappear after a while: full-log1.txt
Sometimes, I do observe similar messages coming from rviz (note the different reason):
Here is a full log of a situation where the global costmap, amcl and rviz were all spamming message filter logs at the same time: full-log2.txt
I then continued trying to solve the issue by enabling debug logs for the controller and planner servers (one way to do this with the ROS 2 CLI is sketched below). Here is a sample debug log from the controller server running from source with a couple of added log statements, but no code changes otherwise (git hash: 7657f2f).
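For reference, a sketch of one way to raise a node's log level with the ROS 2 CLI; the exact invocation used for the logs below isn't stated, so the executable and logger names are just the stock Humble ones:

```bash
# Raise everything in the process to debug
ros2 run nav2_controller controller_server --ros-args --log-level debug

# Or raise only the controller_server logger
ros2 run nav2_controller controller_server --ros-args --log-level controller_server:=debug
```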
Full log: controller_server_12897_1673021602748.log
Excerpt showing the transition to faulty behavior:
The following points can be observed:
- Messages are added to the MessageFilter every 200 ms (which makes sense, as the lidar is simulated at 5 Hz: https://github.com/ros-planning/navigation2/blob/da53ff53744dd3d653092c56ff9aedbb6bcb0272/nav2_bringup/worlds/waffle.model#L137).
- At t=1673022427.636889009 a second message is added to the queue. Afterwards the error starts appearing.
- The frame at t=819.758 is discarded by the message filter because of a timeout. The timeout is configured by the `transform_tolerance` member in `ObstacleLayer`. The standard value is 0.3 seconds, and it can be observed that the t=819.758 message is discarded approximately after that time.
- I've had a look at the TF callback jungle, and what ultimately happens is that `MessageFilter::transformReadyCallback()` catches the exception set by the TF buffer.
  - `message_filter.h`, where the exception is caught: https://github.com/ros2/geometry2/blob/52f079339f60ec656e8d0e0f5e4042dfef1f0a09/tf2_ros/include/tf2_ros/message_filter.h#L552
  - `buffer.cpp`, where the exception is set: https://github.com/ros2/geometry2/blob/52f079339f60ec656e8d0e0f5e4042dfef1f0a09/tf2_ros/src/buffer.cpp#L304

I guess it's natural that, depending on CPU scheduling, a message filter can queue up. What I don't understand is why the queue is never reduced to zero again. It can be observed from the logs that the discarding only happens at the rate of the subscribed laser scan topic, which is 5 Hz. It is unlikely that the TF node is overloaded, as I have also tested with a lidar scan rate of 30 Hz, with which the messages were transformed and ready within the timeout (until the error occurs, obviously).
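One quick way to check whether the transforms themselves are arriving late, rather than the filter misbehaving, is to watch TF directly; a sketch using the stock tf2 command-line tools (the frame names are assumed from the TB3 setup, adjust as needed):

```bash
# Print the average delay and publish rate of the transform chain between two frames
ros2 run tf2_ros tf2_monitor map base_link

# Dump the whole TF tree, including timing statistics, to a PDF
ros2 run tf2_tools view_frames
```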
Finally, because no message ever comes through again, the global costmap doesn't have an up-to-date footprint. It keeps on publishing to /global_costmap/published_footprint, but the timestamp freezes.
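The frozen stamp can be confirmed from the command line; a small sketch, assuming the default topic name from the TB3 bringup:

```bash
# The footprint keeps arriving at the normal rate,
# but header.stamp stops advancing once the filter is stuck
ros2 topic echo /global_costmap/published_footprint
ros2 topic hz /global_costmap/published_footprint
```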
I originally thought that this problem was related only to my specific robot setup, and I tried configuring my node rates following the advice here: post by @SteveMacenski in SteveMacenski/slam_toolbox#391 (comment)
However, I wasn't able to get rid of the error, and I then realized that it's also present in the standard tb3 simulation. I don't know if my environment somehow influences this, but I doubt it. Also, I'm only 4 months into ROS and C++ programming, coming from the Java world, so I'm happy to learn if I have overlooked something.
Further, I'm aware that this problem might be in tf2 and not in nav2. However, at the moment I don't know if the error comes from tf2 itself or just from suboptimal usage of it (e.g. incompatible transform_tolerances, node rates, etc.), so I'm posting it here.
May be related to: