Solitary messages published from ROS 1 publishers that do not latch will sometimes not arrive at the ROS 2 subscriber #130

Closed
dljsjr opened this issue Jul 30, 2018 · 8 comments
Labels: more-information-needed (Further information is required), question (Further information is requested)

Comments


dljsjr commented Jul 30, 2018

Bug report

Required Info:

  • Operating System: Ubuntu 16.04
  • Installation type: Binaries
  • Version or commit hash: Ardent
  • DDS implementation: Fast-RTPS
  • Client library (if applicable): N/A

Steps to reproduce issue

Create a ROS 2 subscriber and start it, then start a ros1_bridge dynamic_bridge WITHOUT --bridge-all-topics.
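As a minimal sketch (not the actual test code; the topic name and message type are hypothetical stand-ins, and this uses the current rclpy API rather than the Ardent-era one), the ROS 2 subscriber could look like:

```python
# Minimal ROS 2 subscriber sketch. '/chatter' and std_msgs/String are
# hypothetical stand-ins for the actual topic and message type.
import rclpy
from rclpy.node import Node
from std_msgs.msg import String


class Listener(Node):
    def __init__(self):
        super().__init__('listener')
        # Keep a reference to the subscription so it is not garbage collected.
        self.sub = self.create_subscription(
            String, '/chatter',
            lambda msg: self.get_logger().info('received: %s' % msg.data),
            10)


rclpy.init()
rclpy.spin(Listener())
```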

Create a ROS 1 node with a non-latching publisher targeting the ROS 2 subscriber's topic, one that publishes only a single message (it does not stream messages) and exits after publishing. Run this node.
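A sketch of such a one-shot ROS 1 node (again with a hypothetical topic and message type):

```python
# Minimal one-shot ROS 1 publisher sketch: no latch, publishes once, exits.
import rospy
from std_msgs.msg import String

rospy.init_node('one_shot_publisher')
pub = rospy.Publisher('/chatter', String, queue_size=1)  # latch defaults to False
pub.publish(String(data='hello'))  # fire-and-forget; may race connection setup
rospy.sleep(0.5)  # brief grace period, then the node exits
```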

Expected behavior

The ROS 2 subscriber should receive the message via the bridge.

Actual behavior

Sometimes, the bridge will accept the message but the subscriber will never receive it. It's not reliably reproducible, but it DOES reliably STOP happening if you use --bridge-all-topics. It is also reliably fixed by using a latching publisher.

@dirk-thomas added the question and more-information-needed labels on Jul 30, 2018
dirk-thomas (Member) commented

From the description I have my doubts that the single message would be received by any ROS 1 subscriber / node either. There is a non-zero time between creating a publisher and establishing connections to all interested subscribers. During that interval, "early" published messages are expected to be lost. That is by design in an asynchronous publish / subscribe system (without any kind of caching, which latching does provide). If you can confirm that this is the source of your problem, I don't think there is anything the bridge can do to improve this behavior.
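For reference, the latching workaround the reporter mentions amounts to a one-line change on the ROS 1 side (a sketch, reusing the hypothetical topic from above; the sleep duration is arbitrary):

```python
import rospy
from std_msgs.msg import String

rospy.init_node('one_shot_publisher')
# latch=True caches the last message and re-delivers it to subscribers that
# connect later (such as the bridge once it has created its subscription).
pub = rospy.Publisher('/chatter', String, queue_size=1, latch=True)
pub.publish(String(data='hello'))
# The latched message is only re-delivered while this node is alive, so it
# must not exit immediately after publishing.
rospy.sleep(5.0)
```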

dljsjr (Author) commented Jul 31, 2018

Just to avoid confusion: this issue is orthogonal to the hard-hang issue, which we're still investigating. It was just something we discovered during testing.

The scripts we're using to test (the ones without the latching publishers) are acceptance-testing scripts from the University of Edinburgh that have worked fine in a pure ROS 1 environment in the past; they use them to shake out their robot every time they have a maintenance visit from NASA, and NASA has used them a bit in the past as well. You can find them here: https://github.com/ipab-slmc/valkyrie_testing_edi. Our API is moving from custom comms with an optional custom ROS 1 translator to DDS + ROS 2-compliant conventions with an optional ROS 1 bridge, so we're updating their scripts to use as an acceptance test for the ROS 1 bridge layer. On the ROS 2 side of things we're using Reliable QoS configurations for Fast-RTPS.

It just seems like a regression that we have to make the publisher scripts latch when using the bridge when we didn't have to before. But I understand that the bridge is fundamentally different from pure ROS 1, so it might be more useful to document this formally in the bridge usage instructions than to "fix" the "issue", which might not be an issue at all, just a side effect of the technology. It could also be specific to Fast-RTPS; we don't currently have the ability to change our DDS implementation, but we're investigating that.

dirk-thomas (Member) commented

But I understand that the bridge is fundamentally different from pure ROS 1

The ros1_bridge is a "normal" ROS 1 node that polls the master frequently for information about available topics / publishers / subscribers / servers / clients and then creates ROS 1 publishers / subscribers / servers / clients on demand. I would say there is nothing "fundamentally different" about it; it is just "pure ROS 1".

Since the master needs to be polled for information, the delay between your publisher getting created and the bridge actually subscribing to the topic will likely be longer than in the case where you already have a subscriber running. I would guess that the increased delay is causing your message loss.

Either update your code not to rely on the connections being established very quickly, or start the bridge with the topic in question explicitly bridged, without relying on the polled information from the master.
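The first option can be implemented by waiting for at least one connection before publishing, e.g. via rospy's get_num_connections() (a sketch; the timeout value is arbitrary):

```python
import rospy
from std_msgs.msg import String

rospy.init_node('one_shot_publisher')
pub = rospy.Publisher('/chatter', String, queue_size=1)

# Block until a subscriber (e.g. the bridge) has actually connected instead
# of assuming the connection is established by the time we publish.
deadline = rospy.Time.now() + rospy.Duration(10.0)
while pub.get_num_connections() == 0 and rospy.Time.now() < deadline:
    rospy.sleep(0.1)

pub.publish(String(data='hello'))
```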

dljsjr (Author) commented Jul 31, 2018

Maybe a better way to phrase it is that I acknowledge that using the bridge to connect a ROS 1 publisher to a ROS 2 subscriber is a very different beast than publishing from a ROS 1 node to a ROS 1 subscriber with no middleman.

I guess where I'm coming from is that the docs, as written, seem to imply that flags like --bridge-all-topics should be used sparingly. They mention that the flag is useful for things like rqt and listing topics and that it's off by default for efficiency reasons, but there is no discussion of the tradeoff when running in the purely dynamic mode. The issue I'm experiencing is admittedly not a bug based on the way you are describing it, but it's also not intuitive behavior for a user who doesn't have a firm understanding of the bridge internals and who might try to avoid the forced-bridging flags based on the language in the docs. (We spent almost a week of our testing time in TX avoiding those flags because we thought we "weren't supposed to" use them, and about 60% of our issues went away when we enabled them.)

dirk-thomas (Member) commented

Using --bridge-all-topics implies a significant overhead since you potentially bridge a lot of messages unnecessarily; that is why it is not usually recommended.

I don't think code that creates a publisher and publishes one message immediately is a common use case, simply because ROS doesn't guarantee that this works all the time, even without the bridge being involved. Please feel free to update the docs with a paragraph (or more) about this scenario to guide future readers.

dljsjr (Author) commented Jul 31, 2018

I don't have a good intuition for what exactly the overhead is here: is it computational or memory? We haven't noticed any issues when running the bridge with those flags, but we have relatively beefy systems.

I'd love to make a PR against the README, but I'm far from an expert on how the bridge internals work, so I just want to make sure I have a firm grasp on it and that we're using it correctly.

dirk-thomas (Member) commented

I don't have a good intuition for what exactly the overhead is here: is it computational or memory?

If your ROS graph has many topics / services that you are not interested in, bridging all of them means each one will be subscribed to and its messages will be sent to the bridge unnecessarily. Depending on the size of your system and the size and frequency of the messages, that can pose a significant overhead: just consider a camera node advertising raw as well as compressed topics. Without the bridge, most of them may not even be used, so the node doesn't perform any computation for them. With the bridge using --bridge-all-topics there are now subscribers for every topic, so the messages need to be generated, serialized, transferred, and deserialized just to be thrown away.

dljsjr (Author) commented Jul 31, 2018

That makes sense, thanks for discussing it with me. I might take a pass at the READMEs later this week after I play around with it a bit more.
