
Optimize QoS to improve responsiveness of reliable endpoints #26

Merged: 2 commits into rolling on Aug 1, 2023

Conversation

asorbini (Collaborator)

This PR introduces some "on by default" QoS optimizations for the RTPS reliability protocol parameters used by reliable DataWriters and DataReaders.

By default, Connext uses a rather slow "heartbeat period" of 3 seconds, which makes the reliability protocol slow to react to missed samples.

With these changes, rmw_connextdds will reduce the heartbeat period to 100ms, with a minimum of 20ms. This should be in line with the default settings of other DDS vendors.

The optimizations can be disabled with environment variable RMW_CONNEXT_DISABLE_RELIABILITY_OPTIMIZATIONS, for example if a user prefers to customize these parameters via XML (which is ultimately always the best solution for a complex system).
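
For reference, here is a minimal sketch of the kind of adjustment described above, written against the RTI Connext C API field names (protocol.rtps_reliable_writer.heartbeat_period / fast_heartbeat_period). It is illustrative only: the exact fields, values, and environment-variable handling used by this PR are in the diff, not in this sketch.

```cpp
#include <cstdlib>

#include "ndds/ndds_c.h"

// Illustrative only: roughly the adjustment described above, not the PR's code.
static void apply_reliability_optimizations(DDS_DataWriterQos * const qos)
{
  // Let users opt out, e.g. because they tune these parameters via XML.
  if (std::getenv("RMW_CONNEXT_DISABLE_RELIABILITY_OPTIMIZATIONS") != nullptr) {
    return;
  }
  DDS_RtpsReliableWriterProtocol_t * const rw =
    &qos->protocol.rtps_reliable_writer;
  // Regular heartbeat every 100ms instead of Connext's 3s default.
  rw->heartbeat_period.sec = 0;
  rw->heartbeat_period.nanosec = 100 * 1000 * 1000;
  // Faster 20ms heartbeat while a reader is lagging behind.
  rw->fast_heartbeat_period.sec = 0;
  rw->fast_heartbeat_period.nanosec = 20 * 1000 * 1000;
}
```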

@asorbini (Collaborator, Author)

This PR resolves some test failures in interactive_markers which were identified in rmw_connextdds#22.

@ivanpauno (Member) left a comment

LGTM!


It would be great if there were a way to check whether the user set these QoS settings in the QoS profile file, and to only apply these defaults if they were not set.
That way, RMW_CONNEXT_ENV_DISABLE_RELIABILITY_OPTIMIZATIONS wouldn't be needed.

We have been using the environment variable trick for other things though, so applying it here sounds fine.

@asorbini (Collaborator, Author)

It would be great if there were a way to check whether the user set these QoS settings in the QoS profile file, and to only apply these defaults if they were not set.

Unfortunately there is no way to determine where the value of a given QoS field returned by get_default_qos() came from (Connext default, user QoS file, etc.), so the only way to "detect the default" would be to hard-code checks against the exact "factory defaults", which would probably be very brittle. It would also still have to be an "all or nothing" kind of thing (i.e. if any of the fields we want to modify has been customized, then don't touch any of the others).
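
To illustrate the brittleness, a hypothetical "detect the factory default" check would have to look something like the sketch below, hard-coding Connext's current defaults (e.g. the 3s heartbeat period mentioned earlier) for every field we want to touch, and bailing out entirely if any one of them differs.

```cpp
// Hypothetical sketch only: why "detecting the default" is brittle.
// The constants must exactly mirror Connext's factory defaults, and the
// check has to cover every field the optimization wants to modify.
static bool reliable_writer_protocol_is_factory_default(
  const DDS_DataWriterQos & qos)
{
  const DDS_RtpsReliableWriterProtocol_t & rw =
    qos.protocol.rtps_reliable_writer;
  // 3s heartbeat period is the factory default mentioned above; every other
  // field we plan to override would need an equivalent hard-coded comparison.
  return rw.heartbeat_period.sec == 3 && rw.heartbeat_period.nanosec == 0;
}
```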

We have been using the environment variable trick for other things though, so applying it here sounds fine.

Yeah, the list is growing, but at the moment it's the "cleanest" strategy I can think of.

@asorbini (Collaborator, Author)

I have reviewed these changes with @GerardoPardo and we both agreed that it would be better to align these settings with the built-in profile Optimization.ReliabilityProtocol.Common (available in 6.x; it used to be included in Generic.StrictReliable in 5.3.1).

The reason is that this profile has been reviewed and vetted, and it has been successfully deployed in customer scenarios as the de facto new "default settings" for the reliability protocol. The main differences from the current settings are that

  • it uses a HB period of 200ms (instead of 100ms)
  • it fixes the size of the "send window" to a limited amount (40 samples) and enables piggyback HBs (1 every 4 samples).
  • it takes better advantage of the high/low watermarks settings.

The last point brings me to a couple of issues that I have identified in my current proposal:

  • The default setting for high/low watermarks is 1/0. This means that a DataWriter will immediately switch to the fast HB period, as soon as it has at least 1 unacknowledged sample in its queue. Basically, it means the DW will always use the fast period (in my changes: 20ms).
  • I misunderstood the purpose of max_nack_response_delay: I thought it was a time during which a DW will ignore consecutive ACKNACKs from a DR. It is instead a delay that the DW applies before sending a HB in reply to an ACKNACK, mostly meant to "coalesce" multiple ACKNACKs coming from readers receiving data over multicast and avoid re-transmitting the same sample multiple times unnecessarily. It is actually best to set this value to 0 for optimal performance, and only consider tuning it if necessary (e.g. poor system performance over multicast). See the sketch after this list.
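
Here is a rough sketch of what the "further modified" QoS discussed in these two lists might look like, again using the Connext C API field names. The 200ms heartbeat and the 40-sample send window come from the bullets above; the heartbeats_per_max_samples value (which yields roughly one piggyback HB every 4 samples with a 40-sample window), the watermark values, and the zero max_nack_response_delay are illustrative assumptions, not the exact contents of Optimization.ReliabilityProtocol.Common.

```cpp
// Illustrative sketch only; see Optimization.ReliabilityProtocol.Common for
// the vetted values.
static void apply_common_reliability_profile(DDS_DataWriterQos * const qos)
{
  DDS_RtpsReliableWriterProtocol_t * const rw =
    &qos->protocol.rtps_reliable_writer;
  // 200ms regular heartbeat (instead of the 100ms currently in this PR).
  rw->heartbeat_period.sec = 0;
  rw->heartbeat_period.nanosec = 200 * 1000 * 1000;
  // Fixed 40-sample send window with piggyback HBs (~1 every 4 samples).
  rw->min_send_window_size = 40;
  rw->max_send_window_size = 40;
  rw->heartbeats_per_max_samples = 10;  // assumed value for "1 every 4"
  // Non-trivial watermarks so the fast HB period is not always active
  // (placeholder values, not the profile's).
  rw->low_watermark = 10;
  rw->high_watermark = 25;
  // Reply to ACKNACKs immediately; only tune this if multicast repair
  // traffic becomes a problem.
  rw->max_nack_response_delay.sec = 0;
  rw->max_nack_response_delay.nanosec = 0;
}
```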

With all of this being said, I have been re-running TestInteractiveMarkerClient.states with this (further) modified QoS, and I realize that the reason the current changes in the PR make the test pass is that they effectively impose a HB period of 20ms (because of the high/low watermarks). I have verified that with a HB period of 50ms the test instead fails about 3% of the time, while 100ms is enough to make it fail most of the time.

I have experimented with changing different QoS settings, but only very aggressive reliability settings (which I wouldn't want to set as the global default) get the test to pass.

I still haven't been able to nail down the exact reason why the test is failing in the first place. AFAICT the test has a timeout of 3 seconds, which should be plenty of time for samples to be repaired, even with "slower" HB periods of 100 or 200 ms (certainly not Connext's default of 3s though...). The test also fails on a check which, again AFAICT, only depends on a client::is_service_available() call, which should only be affected by built-in discovery endpoints (since it checks matched publications and subscriptions on the client), and not be affected by changing the configuration of user endpoints.

So, at this point, I'd like to put this PR on hold, until I can identify the exact reason why the test is failing, and an exact configuration that makes it pass consistently.

In the meantime, maybe we can disable the test for rmw_connextdds.

@asorbini asorbini marked this pull request as draft April 14, 2021 00:31
@asorbini asorbini added the test-failure The issue or PR relates to a test failure label Apr 14, 2021
@asorbini (Collaborator, Author)

Turns out the failures in TestInteractiveMarkerClient.states are unrelated to the reliability protocol configuration (see #40).

I'm still leaving this PR open for now, because I think it would be good to change the reliability protocol settings regardless, since the default ones are pretty bad.

@asorbini asorbini added humble PR scheduled for the H-turtle and removed test-failure The issue or PR relates to a test failure labels Apr 15, 2021
README.md: review comment (outdated, resolved)
@hidmic (Contributor) commented Jun 15, 2021

This needs a rebase and CI to get into Rolling.

@audrow audrow changed the base branch from master to rolling June 28, 2022 14:22
@clalancette clalancette force-pushed the asorbini/reliability-optimizations branch from bac4df4 to 92cd8d7 Compare July 31, 2023 14:02
…ReliabilityProtocol.Common

Signed-off-by: Andrea Sorbini <[email protected]>
@clalancette clalancette force-pushed the asorbini/reliability-optimizations branch from 92cd8d7 to 1d9cad3 Compare July 31, 2023 14:03
@clalancette (Contributor)

I've found that using this commit improves the speed of a bunch of our tests, which is a priority right now.

So I've gone ahead and rebased this onto the latest. I'm going to run CI here to see how this does:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@clalancette (Contributor)

CI is yellow, but that may just be a flake. Here is another run to check:

  • Windows Build Status

@clalancette clalancette marked this pull request as ready for review August 1, 2023 12:51
@clalancette (Contributor)

OK, I'm going to go ahead and merge this one in, since CI looks green and this should improve our test times a bit.

@clalancette clalancette merged commit b57d032 into rolling Aug 1, 2023
1 check passed