Discovery: too slow and high network usage #281

alsora · 2019-05-23T17:12:05Z

Hi,

After updating to Fast-RTPS 1.8.0, I am again seeing issues during discovery when running applications with approximately 20 nodes.

The behavior is the same as the one I found in Fast-RTPS 1.7.0 (#249).

I can tell you that all the nodes discover each other, however the Endpoint Discovery Phase hangs up forever.

Moreover, during the discovery, I can see a network usage of approximately 50Kb per second in upload.

The application I'm trying to run has 20 nodes, 23 publishers and 35 subscriptions.
https://github.com/irobot-ros/ros2-performance/tree/master/performances/benchmark

Could it be related to this change?
eProsima/Fast-DDS@af648ac


MiguelCompany · 2019-05-24T07:46:49Z

Hi @alsora

> I can tell you that all the nodes discover each other, however the Endpoint Discovery Phase hangs up forever.

How have you checked that the nodes have discovered each other? If they haven't, there is a chance that discovery depends on the participant's announcement period.

> Moreover, during the discovery, I can see a network usage of approximately 50Kb per second in upload.

Would you be so kind as to send Wireshark captures of the working and non-working cases, so we can better understand where the problem may be?

> Could it be related to this change?
> eProsima/Fast-RTPS@af648ac

I don't think so, since that change restored the timings to match those of v1.7.2. The constructor of Duration_t has changed to take nanoseconds instead of a fraction, so that commit converted the timings to the correct nanosecond values.
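To illustrate the unit change: in RTPS, the sub-second part of a time value is traditionally a 32-bit "fraction" in units of 1/2^32 of a second. The following is a minimal sketch of the conversion, not Fast-RTPS code; the helper names `fraction_to_nanosec` and `nanosec_to_fraction` are hypothetical and chosen only for illustration.

```cpp
#include <cstdint>

// One second expressed in nanoseconds.
constexpr std::uint64_t kNanosPerSecond = 1000000000ULL;

// Hypothetical helper: convert an RTPS-style fraction (units of 1/2^32 s)
// to nanoseconds, rounding down.
constexpr std::uint32_t fraction_to_nanosec(std::uint32_t fraction)
{
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(fraction) * kNanosPerSecond) >> 32);
}

// Hypothetical helper: convert nanoseconds to an RTPS-style fraction,
// rounding down. The input is assumed to be below one second (< 1e9).
constexpr std::uint32_t nanosec_to_fraction(std::uint32_t nanosec)
{
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(nanosec) << 32) / kNanosPerSecond);
}

// Half a second is 2^31 fraction units, i.e. 500,000,000 ns.
static_assert(fraction_to_nanosec(0x80000000u) == 500000000u, "0.5 s");
static_assert(nanosec_to_fraction(500000000u) == 0x80000000u, "0.5 s");
```

Under this encoding, a 100 ms period is roughly 429,496,729 fraction units. If such a value were passed to a constructor that expects nanoseconds, it would be read as about 429 ms, which is the kind of mismatch a unit change like this can introduce.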

We will try to reproduce the issue with the simple example shown in #249 and will look for the commit that caused the discovery regression.

alsora · 2019-05-24T10:17:37Z

@MiguelCompany Thank you for replying.

Here you can see the functions that I'm using to check PDP and EDP:
https://github.com/irobot-ros/ros2-performance/blob/master/performances/performance_test/src/ros2/system.cpp#L110

I use the following APIs: Node::get_node_names() for PDP and Node::count_subscribers(topic_name) for EDP.
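The checks described above amount to polling a condition until it holds or a timeout expires. Below is a minimal, generic sketch of such a loop; `wait_until` is a hypothetical helper written for illustration, not the code in the linked system.cpp. With rclcpp, the condition for PDP would presumably compare the size of `Node::get_node_names()` against the expected node count, and the one for EDP would compare `Node::count_subscribers(topic_name)` against the expected number of subscriptions.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: poll `condition` until it returns true or `timeout`
// elapses. Returns true if the condition was met, false on timeout.
inline bool wait_until(
    const std::function<bool()>& condition,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds poll_period = std::chrono::milliseconds(10))
{
    assert(poll_period.count() > 0);  // a zero period would busy-spin

    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!condition()) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // timed out before the condition became true
        }
        std::this_thread::sleep_for(poll_period);
    }
    return true;
}
```

Because the condition is evaluated before the deadline is checked, a condition that is already true returns immediately even with a zero timeout.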

Regarding reproducing the issue: note that I am testing 2 applications:

10 nodes 10 pub 13 sub
20 nodes 23 pub 35 sub

In the first one I don't see any issues.

At the moment, the workaround I'm using to run the second one is to wait 1 second between the creation of each node.

I will get you some data from Wireshark as soon as possible.

MiguelCompany · 2019-05-24T11:15:53Z

> I use the following APIs: Node::get_node_names() for PDP and Node::count_subscribers(topic_name) for EDP.

The nightly sanitizer jobs have found a data race in custom_participant_info.hpp (discovered_names and discovered_namespaces are not properly protected). This means Node::get_node_names() could be giving wrong results.

> At the moment, the workaround I'm using to run the second one is to wait 1 second between the creation of each node.

So, let's summarize ...

1. Creating 10 nodes, 10 pub, 13 sub: everything works.
2. Creating 20 nodes, 23 pub, 35 sub and waiting 1 second between the creation of each node: everything works.
3. Creating 20 nodes, 23 pub, 35 sub without waiting:
   a. wait_pdp_discovery returns
   b. wait_edp_discovery waits for more than 30 seconds

Is this what happens? If so, how much time does it take for wait_pdp_discovery to return?

Thank you for helping us understand the issue.

MiguelCompany · 2019-05-24T11:22:01Z

By the way, we are trying to reproduce the problem with the example you provided in #249. We added the example to the ros2 demos repo here and haven't been able to reproduce the problem.

We are also adding a blackbox test for a similar situation: 30 participants, each creating one publisher and one subscriber on the same topic, here (still WIP).

alsora · 2019-05-24T11:51:37Z

Yes, that's exactly what happens.

wait_pdp_discovery returns almost immediately (less than 100 milliseconds).

After adding some logs here and there, I always see a small number of subscriptions not matched (fewer than 3).

In any case, I think that adding tests like this can be really useful for the future as well!

However, keep in mind that each ROS2 node also creates a Parameter Server, i.e. 6 RTPSReaders and 6 RTPSWriters.
For example, if I disable the Parameter Server in my application, the discovery works.
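To put numbers on this, here is a small arithmetic sketch of how many endpoints each application asks the Endpoint Discovery Phase to match. It assumes one RTPSWriter per publisher and one RTPSReader per subscription, and it ignores any other built-in endpoints; `count_endpoints` and `EndpointCount` are hypothetical names used only for this illustration.

```cpp
#include <cstddef>

// Endpoints contributed by each node's Parameter Server, as stated above.
constexpr std::size_t kParamServerWriters = 6;
constexpr std::size_t kParamServerReaders = 6;

struct EndpointCount
{
    std::size_t writers;
    std::size_t readers;
};

// Hypothetical helper: total user-visible RTPS endpoints for an application,
// assuming one writer per publisher and one reader per subscription.
constexpr EndpointCount count_endpoints(
    std::size_t nodes, std::size_t pubs, std::size_t subs, bool parameter_server)
{
    return EndpointCount{
        pubs + (parameter_server ? nodes * kParamServerWriters : 0),
        subs + (parameter_server ? nodes * kParamServerReaders : 0)};
}

// 20 nodes, 23 pub, 35 sub: the Parameter Servers add 120 writers and 120 readers.
static_assert(count_endpoints(20, 23, 35, true).writers == 143, "writers");
static_assert(count_endpoints(20, 23, 35, true).readers == 155, "readers");
```

Under these assumptions, the 20-node application goes from 58 endpoints without Parameter Servers to 298 with them, while the 10-node application has 143 in total.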

alsora · 2019-05-24T11:59:07Z

I tried the old "stress test" again: I start seeing problems when I have 1 publisher and 50 subscribers.
However, I think it's more interesting to wait for the discovery to complete rather than checking whether all the publishers receive messages.

alsora · 2019-05-24T14:08:13Z

@MiguelCompany Here is the Wireshark data.
For these tests I set the discovery timeout to 50 seconds.

TEST 1: no wait between nodes creation

PDP time: 50ms
EDP time: timeout

TEST 2: 1 sec wait between nodes creation

PDP time: 0
EDP time: 0

In the second test, node creation takes 20 seconds (1 sec per node). During this time, Wireshark shows a network usage of 10Kb per second.
In the same situation, Linux System Monitor shows the 50Kb I was referring to before.

wireshark_captures.tar.gz

MiguelCompany · 2019-05-27T13:24:55Z

We found the issue. It was related to a change necessary for the implementation of the lifespan QoS. A fix is on the way in eProsima/Fast-DDS#541, a new blackbox test is being added in eProsima/Fast-DDS#542, and a new unit test is under development.

dirk-thomas · 2019-05-27T16:53:10Z

@MiguelCompany great news!

@alsora can you please retest with the latest code, including the fix?

alsora · 2019-05-28T10:24:21Z

I tested again with the latest updates.

The situation has definitely improved, but it's not fixed.

Considering 10 runs:

8/10 discovery is completed within 1 second
2/10 PDP hangs up (timeout 50 seconds)

This is different from what I saw last week, when PDP was working and EDP was not.

MiguelCompany · 2019-05-28T13:59:32Z

> PDP hangs up

This may be related to the data race I mentioned in my previous comment.

The data race affects the data structures consulted by Node::get_node_names(), so that may be the reason for the hang.

I created #283 with a proposed fix.
@alsora Would you be so kind as to test against it?

alsora · 2019-05-28T14:37:12Z

@MiguelCompany The problem persists even after fixing the data race.

MiguelCompany · 2020-09-07T10:16:19Z

@alsora I don't know if this issue is still relevant or not. Do you think it can be closed?

alsora · 2020-09-07T14:06:46Z

Yes, I think it can be closed.
At least in Foxy I haven't seen this issue anymore.

alsora mentioned this issue May 23, 2019

FastRTPS 1.8.0 causes hangs in Navigation2 #280

Closed

dirk-thomas added the bug label May 23, 2019

MiguelCompany mentioned this issue May 27, 2019

Added discovery regression test [5479] eProsima/Fast-DDS#542

Merged

richiware mentioned this issue May 27, 2019

Fix on get_change with new History ordering [5477] eProsima/Fast-DDS#541

Merged

MiguelCompany mentioned this issue May 28, 2019

Fixing data race on custom_participant_info. #283

Merged

dirk-thomas closed this as completed in #283 May 28, 2019

nuclearsandwich reopened this May 28, 2019

alsora closed this as completed Sep 7, 2020
