Solitary messages published from ROS 1 publishers that do not latch will sometimes not arrive at the ROS 2 subscriber #130

Closed
dljsjr opened this issue Jul 30, 2018 · 8 comments
Labels: more-information-needed (Further information is required), question (Further information is requested)

Comments


dljsjr commented Jul 30, 2018

Bug report

Required Info:

  • Operating System: Ubuntu 16.04
  • Installation type: Binaries
  • Version or commit hash: Ardent
  • DDS implementation: Fast-RTPS
  • Client library (if applicable): N/A

Steps to reproduce issue

Create a ROS 2 subscriber and start it, then start a ros1_bridge dynamic_bridge WITHOUT --bridge-all-topics.
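As a minimal sketch (not the actual test code; the topic name and message type are hypothetical stand-ins, and this uses the current rclpy API rather than the Ardent-era one), the ROS 2 subscriber could look like:

```python
# Minimal ROS 2 subscriber sketch. '/chatter' and std_msgs/String are
# hypothetical stand-ins for the actual topic and message type.
import rclpy
from rclpy.node import Node
from std_msgs.msg import String


class Listener(Node):
    def __init__(self):
        super().__init__('listener')
        # Keep a reference to the subscription so it is not garbage collected.
        self.sub = self.create_subscription(
            String, '/chatter',
            lambda msg: self.get_logger().info('received: %s' % msg.data),
            10)


rclpy.init()
rclpy.spin(Listener())
```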

Create a ROS 1 node with a non-latching publisher targeting the ROS 2 subscriber's topic, one that publishes only a single message (it does not stream messages) and exits after publishing. Run this node.
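A sketch of such a one-shot ROS 1 node (again with a hypothetical topic and message type):

```python
# Minimal one-shot ROS 1 publisher sketch: no latch, publishes once, exits.
import rospy
from std_msgs.msg import String

rospy.init_node('one_shot_publisher')
pub = rospy.Publisher('/chatter', String, queue_size=1)  # latch defaults to False
pub.publish(String(data='hello'))  # fire-and-forget; may race connection setup
rospy.sleep(0.5)  # brief grace period, then the node exits
```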

Expected behavior

The ROS 2 subscriber should receive the message via the bridge.

Actual behavior

Sometimes, the bridge will accept the message but the subscriber will never receive it. It's not reliably reproducible, but it DOES reliably STOP happening if you use --bridge-all-topics. It is also reliably fixed by using a latching publisher.

@dirk-thomas added the question and more-information-needed labels on Jul 30, 2018
dirk-thomas (Member) commented

From the description I have my doubts that the single message would be received by any ROS 1 subscriber / node either. There is a non-zero time between creating a publisher and establishing connections to all interested subscribers. During that interval, "early" published messages are expected to be lost. That is by design in an asynchronous publish / subscribe system (without any kind of caching, which latching does provide). If you can confirm that this is the source of your problem, I don't think there is anything the bridge can do to improve this behavior.
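For reference, the latching workaround the reporter mentions amounts to a one-line change on the ROS 1 side (a sketch, reusing the hypothetical topic from above; the sleep duration is arbitrary):

```python
import rospy
from std_msgs.msg import String

rospy.init_node('one_shot_publisher')
# latch=True caches the last message and re-delivers it to subscribers that
# connect later (such as the bridge once it has created its subscription).
pub = rospy.Publisher('/chatter', String, queue_size=1, latch=True)
pub.publish(String(data='hello'))
# The latched message is only re-delivered while this node is alive, so it
# must not exit immediately after publishing.
rospy.sleep(5.0)
```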

dljsjr (Author) commented Jul 31, 2018

Just to avoid confusion: this issue is orthogonal to the hard-hang issue, which we're still investigating. It was just something we discovered during testing.

The scripts we're using to test (the ones without the latching publishers) are acceptance-testing scripts from the University of Edinburgh that have worked fine in a pure ROS 1 environment in the past; they use them to shake out their robot every time they have a maintenance visit from NASA, and NASA has used them a bit in the past as well. You can find them here: https://github.com/ipab-slmc/valkyrie_testing_edi. Our API is moving from custom comms with an optional custom ROS 1 translator to DDS + ROS 2-compliant conventions with an optional ROS 1 bridge, so we're updating their scripts to use as an acceptance test for the ROS 1 bridge layer. On the ROS 2 side of things we're using Reliable QoS configurations for Fast-RTPS.

It just seems like a regression that we have to make the publisher scripts latch when using the bridge when we didn't have to before. But I understand that the bridge is fundamentally different from pure ROS 1, so it might be more useful to document this formally in the bridge usage instructions than to "fix" the "issue", which might not be an issue at all, just a side effect of the technology. It could also be specific to Fast-RTPS; we don't currently have the ability to change our DDS implementation, but we're investigating that.

dirk-thomas (Member) commented

But I understand that the bridge is fundamentally different from pure ROS 1

The ros1_bridge is a "normal" ROS 1 node that polls the master frequently for information about available topics / publishers / subscribers / servers / clients and then creates ROS 1 publishers / subscribers / servers / clients on demand. I would say there is nothing "fundamentally different" about it; it is just "pure ROS 1".

Since the master needs to be polled for information, the delay between your publisher getting created and the bridge actually subscribing to the topic will likely be longer than in the case where you already have a subscriber running. I would guess that the increased delay is causing your message loss.

Either update your code not to rely on the connections being established very quickly, or start the bridge with the topic in question explicitly bridged, without relying on the polled information from the master.
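The first option can be implemented by waiting for at least one connection before publishing, e.g. via rospy's get_num_connections() (a sketch; the timeout value is arbitrary):

```python
import rospy
from std_msgs.msg import String

rospy.init_node('one_shot_publisher')
pub = rospy.Publisher('/chatter', String, queue_size=1)

# Block until a subscriber (e.g. the bridge) has actually connected instead
# of assuming the connection is established by the time we publish.
deadline = rospy.Time.now() + rospy.Duration(10.0)
while pub.get_num_connections() == 0 and rospy.Time.now() < deadline:
    rospy.sleep(0.1)

pub.publish(String(data='hello'))
```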

dljsjr (Author) commented Jul 31, 2018

Maybe a better way to phrase it is that I acknowledge that using the bridge to connect a ROS 1 publisher to a ROS 2 subscriber is a very different beast than publishing from a ROS 1 node to a ROS 1 subscriber with no middleman.

I guess where I'm coming from is that the docs, as written, seem to imply that flags like --bridge-all-topics should be used sparingly. They mention that the flag is useful for things like rqt and listing topics and that it's off by default for efficiency reasons, but there is no discussion of the tradeoff when running in the purely dynamic mode. The issue I'm experiencing is admittedly not a bug based on the way you are describing it, but it's also not intuitive behavior for a user who doesn't have a firm understanding of the bridge internals and who might try to avoid the forced-bridging flags based on the language in the docs. (We spent almost a week of our testing time in TX avoiding those flags because we thought we "weren't supposed to" use them, and about 60% of our issues went away when we enabled them.)

dirk-thomas (Member) commented

Using --bridge-all-topics implies a significant overhead since you potentially bridge a lot of messages unnecessarily; that is why it is not usually recommended.

I don't think code that creates a publisher and publishes one message immediately is a common use case, simply because ROS doesn't guarantee that this works all the time, even without the bridge being involved. Please feel free to update the docs with a paragraph (or more) about this scenario to guide future readers.

dljsjr (Author) commented Jul 31, 2018

I don't have a good intuition for what exactly the overhead is here: is it computational or memory? We haven't noticed any issues when running the bridge with those flags, but we have relatively beefy systems.

I'd love to make a PR against the README, but I'm far from an expert on how the bridge internals work, so I just want to make sure I have a firm grasp on it and that we're using it correctly.

dirk-thomas (Member) commented

I don't have a good intuition for what exactly the overhead is here: is it computational or memory?

If your ROS graph has many topics / services that you are not interested in, bridging all of them means each one will be subscribed to and its messages will be sent to the bridge unnecessarily. Depending on the size of your system and the size and frequency of the messages, that can pose a significant overhead: just consider a camera node advertising raw as well as compressed topics. Without the bridge, most of them may not even be used, so the node doesn't perform any computation for them. With the bridge using --bridge-all-topics there are now subscribers for every topic, so the messages need to be generated, serialized, transferred, and deserialized just to be thrown away.

dljsjr (Author) commented Jul 31, 2018

That makes sense, thanks for discussing it with me. I might take a pass at the READMEs later this week after I play around with it a bit more.
