MessageFilter stuck due to transform timeouts #3352

Closed

gislers opened this issue Jan 6, 2023 · 8 comments

gislers commented Jan 6, 2023

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04 inside WSL2 on Windows 11
  • ROS2 Version:
    • ROS Humble from binary and from source (running from source for debugging only)
  • Version or commit hash:
    • ROS Humble binary version: 1.1.3-1jammy.20221207.081347
    • ROS Humble source version: 7657f2f
  • DDS implementation:
    • rmw_fastrtps_cpp

Steps to reproduce issue

Run the turtlebot3 simulation from the binary ROS Humble installation, set the 2D pose estimate in rviz, and wait for the bug to appear:

export TURTLEBOT3_MODEL=waffle
export GAZEBO_MODEL_PATH=$GAZEBO_MODEL_PATH:/opt/ros/humble/share/turtlebot3_gazebo/models
ros2 launch nav2_bringup tb3_simulation_launch.py

Expected behavior

The message filters used in the global and local costmaps should deliver transformed messages and not constantly drop them.

Actual behavior

After a random amount of time (can be 1 minute, can be 15 minutes), the global costmap message filter starts dropping messages with the following log:

[component_container_isolated-5] [INFO] [1673002777.338380503] [global_costmap.global_costmap]: Message Filter dropping message: frame 'base_scan' at time 88.555 for reason 'the timestamp on the message is earlier than all the data in the transform cache'

As a consequence, the global footprint as seen in rviz freezes and navigation no longer works (wait for the second run):
https://user-images.githubusercontent.com/54321736/211038437-4a03465b-0006-4819-ad78-fe3705b86da2.mp4

The issue can also be observed for the local costmap in the same manner as for the global costmap. Sometimes both nodes show this behavior at the same time.

Additional information

Here is the full log of an example run where the global costmap starts failing at t=81.555. As far as I understand, the initial message filter logs are normal and they disappear after a while: full-log1.txt

Sometimes, I do observe similar messages coming from rviz (note the different reason):

[rviz2-4] [INFO] [1673002305.454345754] [rviz]: Message Filter dropping message: frame 'odom' at time 219.881 for reason 'discarding message because the queue is full'
[rviz2-4] [INFO] [1673002305.454443322] [rviz]: Message Filter dropping message: frame 'odom' at time 223.941 for reason 'discarding message because the queue is full'

Here is a full log of a situation where the global costmap, amcl, and rviz were all spamming message filter logs: full-log2.txt

I then continued trying to solve the issue by enabling debug logs for the controller and planner servers. Here is a sample debug log from the controller server running from source with a couple of added log statements, but no code changes otherwise (git hash: 7657f2f).

Full log: controller_server_12897_1673021602748.log

Excerpt showing the transition to faulty behavior:

[DEBUG] [1673022427.230778145] [local_costmap.local_costmap]: Updating map...
[DEBUG] [1673022427.231428087] [nav2_costmap_2d]: Updating area x: [0, 59] y: [0, 60]
[DEBUG] [1673022427.231544470] [local_costmap.local_costmap]: Publishing footprint
[DEBUG] [1673022427.231556228] [local_costmap.local_costmap]: Timestamp robot footprint: 819.547000000s
[DEBUG] [1673022427.231649795] [local_costmap.local_costmap]: Map update time: 0.000883419
[DEBUG] [1673022427.236191001] [tf2_ros_message_filter]: MessageFilter [target=odom ]: Added message in frame base_scan at time 819.557, count now 1
[DEBUG] [1673022427.293432387] [tf2_ros_message_filter]: MessageFilter [target=odom ]: Message ready in frame base_scan at time 819.557, count now 0
[DEBUG] [1673022427.293514857] [local_costmap.local_costmap]: laserScanCallback. LaserScan timestamp: 819.557000000s
[DEBUG] [1673022427.430653178] [local_costmap.local_costmap]: Updating map...
[DEBUG] [1673022427.431679220] [nav2_costmap_2d]: Updating area x: [0, 59] y: [0, 60]
[DEBUG] [1673022427.431993682] [local_costmap.local_costmap]: Publishing footprint
[DEBUG] [1673022427.432022208] [local_costmap.local_costmap]: Timestamp robot footprint: 819.649000000s
[DEBUG] [1673022427.432274594] [local_costmap.local_costmap]: Map update time: 0.001626264
[DEBUG] [1673022427.438137393] [tf2_ros_message_filter]: MessageFilter [target=odom ]: Added message in frame base_scan at time 819.758, count now 1
[DEBUG] [1673022427.630638633] [local_costmap.local_costmap]: Updating map...
[DEBUG] [1673022427.631086143] [nav2_costmap_2d]: Updating area x: [0, 59] y: [0, 60]
[DEBUG] [1673022427.631188648] [local_costmap.local_costmap]: Publishing footprint
[DEBUG] [1673022427.631197019] [local_costmap.local_costmap]: Timestamp robot footprint: 819.649000000s
[DEBUG] [1673022427.631288983] [local_costmap.local_costmap]: Map update time: 0.000657425
[DEBUG] [1673022427.631313951] [local_costmap.local_costmap]: Publish costmap at local_costmap
[DEBUG] [1673022427.636889009] [tf2_ros_message_filter]: MessageFilter [target=odom ]: Added message in frame base_scan at time 819.957, count now 2
[DEBUG] [1673022427.696681497] [tf2_ros_message_filter]: MessageFilter [target=odom ]: Discarding message in frame base_scan at time 819.758, count now 1
[INFO] [1673022427.696753595] [local_costmap.local_costmap]: Message Filter dropping message: frame 'base_scan' at time 819.758 for reason 'the timestamp on the message is earlier than all the data in the transform cache'
[DEBUG] [1673022427.830632559] [local_costmap.local_costmap]: Updating map...
[DEBUG] [1673022427.831227103] [nav2_costmap_2d]: Updating area x: [0, 59] y: [0, 60]
[DEBUG] [1673022427.831412084] [local_costmap.local_costmap]: Publishing footprint
[DEBUG] [1673022427.831427490] [local_costmap.local_costmap]: Timestamp robot footprint: 819.649000000s
[DEBUG] [1673022427.831556301] [local_costmap.local_costmap]: Map update time: 0.000927655
[DEBUG] [1673022427.838045591] [tf2_ros_message_filter]: MessageFilter [target=odom ]: Added message in frame base_scan at time 820.157, count now 2
[DEBUG] [1673022427.896762256] [tf2_ros_message_filter]: MessageFilter [target=odom ]: Discarding message in frame base_scan at time 819.957, count now 1
[INFO] [1673022427.896820250] [local_costmap.local_costmap]: Message Filter dropping message: frame 'base_scan' at time 819.957 for reason 'the timestamp on the message is earlier than all the data in the transform cache'
[DEBUG] [1673022428.030806723] [local_costmap.local_costmap]: Updating map...
[DEBUG] [1673022428.031333237] [nav2_costmap_2d]: Updating area x: [0, 59] y: [0, 60]
[DEBUG] [1673022428.031520489] [local_costmap.local_costmap]: Publishing footprint
[DEBUG] [1673022428.031529601] [local_costmap.local_costmap]: Timestamp robot footprint: 819.649000000s
[DEBUG] [1673022428.031653320] [local_costmap.local_costmap]: Map update time: 0.000851960
[DEBUG] [1673022428.039223554] [tf2_ros_message_filter]: MessageFilter [target=odom ]: Added message in frame base_scan at time 820.357, count now 2
[DEBUG] [1673022428.097574994] [tf2_ros_message_filter]: MessageFilter [target=odom ]: Discarding message in frame base_scan at time 820.157, count now 1
[INFO] [1673022428.097632462] [local_costmap.local_costmap]: Message Filter dropping message: frame 'base_scan' at time 820.157 for reason 'the timestamp on the message is earlier than all the data in the transform cache'

The following points can be observed:

Messages are added to the MessageFilter every 200 ms (which makes sense, as the lidar is simulated at 5 Hz: https://github.com/ros-planning/navigation2/blob/da53ff53744dd3d653092c56ff9aedbb6bcb0272/nav2_bringup/worlds/waffle.model#L137)

At t=1673022427.636889009 a second message is added to the queue, and afterwards the error starts appearing. The message at t=819.758 is discarded by the message filter because of a timeout. The timeout is configured via the transform_tolerance member in ObstacleLayer; the default value is 0.3 seconds, and it can be observed that the t=819.758 message is discarded approximately after that time. I've had a look at the TF callback jungle, and what ultimately happens is that MessageFilter::transformReadyCallback() catches the exception set by the TF buffer (a minimal sketch of this wiring follows these observations).

I guess it's natural that, depending on CPU scheduling, messages can queue up in the filter. What I don't understand is why the queue is never drained back to zero. The logs show that discarding only happens at the rate of the subscribed laser scan topic, which is 5 Hz. It is unlikely that the tf node is overloaded: I have also tested with a lidar scan rate of 30 Hz, and the messages were transformed and ready within the timeout (until the error occurs, obviously).

Finally, because no message ever comes through again, the global costmap doesn't have an up-to-date footprint. It keeps publishing to /global_costmap/published_footprint, but the timestamp freezes.
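To make the mechanism above easier to follow, here is a minimal, self-contained sketch of how a costmap-style node typically wires a laser scan subscription through tf2_ros::MessageFilter. This is not the actual nav2 ObstacleLayer code: the node name, topic, target frame ("odom"), queue size of 50, and the 0.3 s timeout (standing in for transform_tolerance) are illustrative assumptions. Scans whose transform does not become available within the timeout are discarded by the filter, which is what produces the "Message Filter dropping message" log shown above.

// Hypothetical example node; names and parameters are illustrative, not nav2's.
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "sensor_msgs/msg/laser_scan.hpp"
#include "message_filters/subscriber.h"
#include "tf2/time.h"
#include "tf2_ros/buffer.h"
#include "tf2_ros/create_timer_ros.h"
#include "tf2_ros/message_filter.h"
#include "tf2_ros/transform_listener.h"

class ScanFilterNode : public rclcpp::Node
{
public:
  ScanFilterNode()
  : Node("scan_filter_node")
  {
    tf_buffer_ = std::make_shared<tf2_ros::Buffer>(get_clock());
    tf_listener_ = std::make_shared<tf2_ros::TransformListener>(*tf_buffer_);

    // The buffer needs a timer interface so that the per-message timeout
    // used by the MessageFilter below can actually be enforced.
    tf_buffer_->setCreateTimerInterface(
      std::make_shared<tf2_ros::CreateTimerROS>(
        get_node_base_interface(), get_node_timers_interface()));

    // Subscribe to the laser scan with sensor-data QoS.
    scan_sub_.subscribe(this, "scan", rmw_qos_profile_sensor_data);

    // Queue up to 50 scans and wait at most 0.3 s per scan for the transform
    // into the target frame ("odom"); 0.3 s stands in for transform_tolerance.
    tf_filter_ = std::make_shared<tf2_ros::MessageFilter<sensor_msgs::msg::LaserScan>>(
      scan_sub_, *tf_buffer_, "odom", 50,
      get_node_logging_interface(), get_node_clock_interface(),
      tf2::durationFromSec(0.3));

    // Only scans whose transform became available within the timeout reach
    // this callback; the others are dropped with a "dropping message" log.
    tf_filter_->registerCallback(
      [this](const sensor_msgs::msg::LaserScan::ConstSharedPtr & scan) {
        RCLCPP_INFO(
          get_logger(), "Scan at %d.%09u transformed in time",
          scan->header.stamp.sec, scan->header.stamp.nanosec);
      });
  }

private:
  message_filters::Subscriber<sensor_msgs::msg::LaserScan> scan_sub_;
  std::shared_ptr<tf2_ros::MessageFilter<sensor_msgs::msg::LaserScan>> tf_filter_;
  std::shared_ptr<tf2_ros::Buffer> tf_buffer_;
  std::shared_ptr<tf2_ros::TransformListener> tf_listener_;
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::spin(std::make_shared<ScanFilterNode>());
  rclcpp::shutdown();
  return 0;
}

The two relevant knobs here are the queue size and the buffer timeout: if transform lookups stop completing within the timeout, every queued scan eventually times out and the callback is never reached, which matches the behavior in the logs above.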

I originally thought that this problem was related only to my specific robot setup, and I tried configuring my node rates following the advice in this post by @SteveMacenski in SteveMacenski/slam_toolbox#391 (comment).

However, I wasn't able to get rid of the error, and I then realized that it's also present in the standard tb3 simulation. I don't know if my environment influences this somehow, but I doubt it. Also, I'm only 4 months into ROS and C++ programming, coming from the Java world, so I'm happy to learn if I have overlooked something.

Further, I'm aware that this problem might be in tf2 and not in nav2. However, at the moment I don't know whether the error comes from tf2 itself or just from suboptimal usage of it (e.g. incompatible transform_tolerances, node rates, etc.), so I'm posting it here.

May be related to:

@SteveMacenski (Member)

Can you try using Cyclone DDS? I'm not sure if it would fix your issue, but I suspect this doesn't have much to do with Nav2 or TF so much as with networking getting bogged down or not working properly. We have other issue reports about Fast-DDS causing problems. This is a new one (to me), but it's a good, easy first step.

I don't see anything of this nature happening on my side (I develop on the main branch every day). Have you tried in Docker or on a Linux machine? Perhaps it's something Windows-y. I've never used a Windows machine for programming.

gislers (Author) commented Jan 6, 2023

Thanks a lot for the quick reply! I've been running the tb3 simulation with Cyclone DDS now for over an hour without any issue. Also my own robot setup seems fine for the moment. I'll observe the situation for a week or two and then post my findings here.

And no, I haven't tried on a Linux machine or in Docker yet, but I will also do that to see if it has an influence. Regarding Docker, I wouldn't expect any difference, since the WSL2-based engine is the preferred choice for Docker on Windows. Anyhow, if Cyclone DDS proves to be more stable, I'll stick with it...

SteveMacenski (Member) commented Jan 6, 2023

@EduPonz FYI, seems like Humble's having an issue with TF2

@gislers you may want to file this with Fast-DDS / RMW. I'm going to close this out here since it's not a nav2-related issue we can resolve. But it would be good for you to characterize this for them, since it is new (to me, at least).

@Mario-DL

Hi @gislers

Please check PR eProsima/Fast-DDS/pull/3195 for the backport to Fast-DDS 2.6.x (Humble). The issue it addresses could produce an internal delay which, in turn, can potentially result in TF2 and the tf2 MessageFilters missing their time tolerances.

Friendly ping @SteveMacenski

@SteveMacenski (Member)

Thanks for the info!

corot commented Jan 11, 2023

Same here... I was at a loss!
I confirm that with Cyclone DDS the problem is gone.

@HappySamuel

> Thanks a lot for the quick reply! I've been running the tb3 simulation with Cyclone DDS now for over an hour without any issue. Also my own robot setup seems fine for the moment. I'll observe the situation for a week or two and then post my findings here.
>
> And no, I haven't tried on a Linux machine or in Docker yet, but I will also do that to see if it has an influence. Regarding Docker, I wouldn't expect any difference, since the WSL2-based engine is the preferred choice for Docker on Windows. Anyhow, if Cyclone DDS proves to be more stable, I'll stick with it...

Hi @gislers

Can you share your cyclonedds.xml configuration (with the IP addresses removed)? I am currently facing the same issue as you. I tried Cyclone DDS, but I haven't managed to configure cyclonedds.xml correctly yet. I hope I can use your cyclonedds.xml as a reference.

Best,
Samuel

gislers (Author) commented Aug 27, 2024

Hi @HappySamuel,

In order to solve the MessageFilter issue, I simply activated Cyclone DDS by following the official docs.

Install Cyclone DDS:
sudo apt install ros-humble-rmw-cyclonedds-cpp

Then export the environment variable:
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp

Note:

  • I haven't checked the behavior with Fast-DDS again. But as you describe, it is still buggy. :-(
  • You don't need a cyclonedds.xml to get the tb3 simulation running. If you face problems with your LAN setup or DDS discovery, you might want to ask on Robotics Stack Exchange, but that is unrelated to the issue here.
