
Optimize QoS to improve responsiveness of reliable endpoints #26

Merged: 2 commits into rolling on Aug 1, 2023

Conversation

asorbini (Collaborator)

This PR introduces some "on by default" QoS optimizations for the RTPS reliability protocol parameters used by reliable DataWriters and DataReaders.

By default, Connext uses a rather slow "heartbeat period" of 3 seconds, which makes the reliability protocol slow to react to missed samples.

With these changes, rmw_connextdds will reduce the heartbeat period to 100ms, with a minimum of 20ms. This should be in line with the default settings of other DDS vendors.

The optimizations can be disabled with environment variable RMW_CONNEXT_DISABLE_RELIABILITY_OPTIMIZATIONS, for example if a user prefers to customize these parameters via XML (which is ultimately always the best solution for a complex system).
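
For reference, here is a minimal sketch of the kind of adjustment described above, written against the RTI Connext C API field names (protocol.rtps_reliable_writer.heartbeat_period / fast_heartbeat_period). It is illustrative only: the exact fields, values, and environment-variable handling used by this PR are in the diff, not in this sketch.

```cpp
#include <cstdlib>

#include "ndds/ndds_c.h"

// Illustrative only: roughly the adjustment described above, not the PR's code.
static void apply_reliability_optimizations(DDS_DataWriterQos * const qos)
{
  // Let users opt out, e.g. because they tune these parameters via XML.
  if (std::getenv("RMW_CONNEXT_DISABLE_RELIABILITY_OPTIMIZATIONS") != nullptr) {
    return;
  }
  DDS_RtpsReliableWriterProtocol_t * const rw =
    &qos->protocol.rtps_reliable_writer;
  // Regular heartbeat every 100ms instead of Connext's 3s default.
  rw->heartbeat_period.sec = 0;
  rw->heartbeat_period.nanosec = 100 * 1000 * 1000;
  // Faster 20ms heartbeat while a reader is lagging behind.
  rw->fast_heartbeat_period.sec = 0;
  rw->fast_heartbeat_period.nanosec = 20 * 1000 * 1000;
}
```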

@asorbini (Collaborator, Author)

This PR resolves some test failures in interactive_markers which were identified in rmw_connextdds#22.

@ivanpauno (Member) left a comment

LGTM!


It would be great if there were a way to check whether the user set these QoS settings in the QoS profile file, and to only apply these defaults if they were not set.
That way, RMW_CONNEXT_ENV_DISABLE_RELIABILITY_OPTIMIZATIONS wouldn't be needed.

We have been using the environment variable trick for other things though, so applying it here sounds fine.

@asorbini (Collaborator, Author)

It would be great if there were a way to check whether the user set these QoS settings in the QoS profile file, and to only apply these defaults if they were not set.

Unfortunately there is no way to determine where the value of a given QoS field returned by get_default_qos() came from (Connext default, user QoS file, etc.), so the only way to "detect the default" would be to hard-code checks against the exact "factory defaults", which would probably be very brittle. It would also still have to be an "all or nothing" kind of thing (i.e. if any of the fields we want to modify has been customized, then don't touch any of the others).
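
To illustrate the brittleness, a hypothetical "detect the factory default" check would have to look something like the sketch below, hard-coding Connext's current defaults (e.g. the 3s heartbeat period mentioned earlier) for every field we want to touch, and bailing out entirely if any one of them differs.

```cpp
// Hypothetical sketch only: why "detecting the default" is brittle.
// The constants must exactly mirror Connext's factory defaults, and the
// check has to cover every field the optimization wants to modify.
static bool reliable_writer_protocol_is_factory_default(
  const DDS_DataWriterQos & qos)
{
  const DDS_RtpsReliableWriterProtocol_t & rw =
    qos.protocol.rtps_reliable_writer;
  // 3s heartbeat period is the factory default mentioned above; every other
  // field we plan to override would need an equivalent hard-coded comparison.
  return rw.heartbeat_period.sec == 3 && rw.heartbeat_period.nanosec == 0;
}
```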

We have been using the environment variable trick for other things though, so applying it here sounds fine.

Yeah, the list is growing, but at the moment it's the "cleanest" strategy I can think of.

@asorbini (Collaborator, Author)

I have reviewed these changes with @GerardoPardo and we both agreed that it would be better to align these settings with the built-in profile Optimization.ReliabilityProtocol.Common (available in 6.x; it used to be included in Generic.StrictReliable in 5.3.1).

The reason is that this profile has been reviewed and vetted, and it has been successfully deployed in customer scenarios as the de facto new "default settings" for the reliability protocol. The main differences from the current settings are that

  • it uses a HB period of 200ms (instead of 100ms)
  • it fixes the size of the "send window" to a limited amount (40 samples) and enables piggyback HBs (1 every 4 samples).
  • it takes better advantage of the high/low watermarks settings.

The last point brings me to a couple of issues that I have identified in my current proposal:

  • The default setting for high/low watermarks is 1/0. This means that a DataWriter will immediately switch to the fast HB period, as soon as it has at least 1 unacknowledged sample in its queue. Basically, it means the DW will always use the fast period (in my changes: 20ms).
  • I misunderstood the purpose of max_nack_response_delay: I thought it was a time during which a DW will ignore consecutive ACKNACKs from a DR. It is instead a delay that the DW applies before sending a HB in reply to an ACKNACK, mostly meant to "coalesce" multiple ACKNACKs coming from readers receiving data over multicast and avoid re-transmitting the same sample multiple times unnecessarily. It is actually best to set this value to 0 for optimal performance, and only consider tuning it if necessary (e.g. poor system performance over multicast). See the sketch after this list.
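
Here is a rough sketch of what the "further modified" QoS discussed in these two lists might look like, again using the Connext C API field names. The 200ms heartbeat and the 40-sample send window come from the bullets above; the heartbeats_per_max_samples value (which yields roughly one piggyback HB every 4 samples with a 40-sample window), the watermark values, and the zero max_nack_response_delay are illustrative assumptions, not the exact contents of Optimization.ReliabilityProtocol.Common.

```cpp
// Illustrative sketch only; see Optimization.ReliabilityProtocol.Common for
// the vetted values.
static void apply_common_reliability_profile(DDS_DataWriterQos * const qos)
{
  DDS_RtpsReliableWriterProtocol_t * const rw =
    &qos->protocol.rtps_reliable_writer;
  // 200ms regular heartbeat (instead of the 100ms currently in this PR).
  rw->heartbeat_period.sec = 0;
  rw->heartbeat_period.nanosec = 200 * 1000 * 1000;
  // Fixed 40-sample send window with piggyback HBs (~1 every 4 samples).
  rw->min_send_window_size = 40;
  rw->max_send_window_size = 40;
  rw->heartbeats_per_max_samples = 10;  // assumed value for "1 every 4"
  // Non-trivial watermarks so the fast HB period is not always active
  // (placeholder values, not the profile's).
  rw->low_watermark = 10;
  rw->high_watermark = 25;
  // Reply to ACKNACKs immediately; only tune this if multicast repair
  // traffic becomes a problem.
  rw->max_nack_response_delay.sec = 0;
  rw->max_nack_response_delay.nanosec = 0;
}
```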

With all of this being said, I have been re-running TestInteractiveMarkerClient.states with this (further) modified QoS, and I realize that the reason the current changes in the PR make the test pass is that they effectively impose a HB period of 20ms (because of the high/low watermarks). I have verified that with a HB period of 50ms the test instead fails about 3% of the time, while 100ms is enough to make it fail most of the time.

I have experimented with changing different QoS settings, but only very aggressive reliability settings (which I wouldn't want to set as the global default) get the test to pass.

I still haven't been able to nail down the exact reason why the test is failing in the first place. AFAICT the test has a timeout of 3 seconds, which should be plenty of time for samples to be repaired, even with "slower" HB periods of 100 or 200 ms (certainly not Connext's default of 3s though...). The test also fails on a check which, again AFAICT, only depends on a client::is_service_available() call, which should only be affected by built-in discovery endpoints (since it checks matched publications and subscriptions on the client), and not be affected by changing the configuration of user endpoints.

So, at this point, I'd like to put this PR on hold, until I can identify the exact reason why the test is failing, and an exact configuration that makes it pass consistently.

In the meantime, maybe we can disable the test for rmw_connextdds.

@asorbini asorbini marked this pull request as draft April 14, 2021 00:31
@asorbini asorbini added the test-failure The issue or PR relates to a test failure label Apr 14, 2021
@asorbini (Collaborator, Author)

Turns out the failures in TestInteractiveMarkerClient.states are unrelated to the reliability protocol configuration (see #40).

I'm still leaving this PR open for now, because I think it would be good to change the reliability protocol settings regardless, since the default ones are pretty bad.

@asorbini asorbini added humble PR scheduled for the H-turtle and removed test-failure The issue or PR relates to a test failure labels Apr 15, 2021
README.md: review comment (outdated, resolved)
@hidmic (Contributor) commented Jun 15, 2021

This needs a rebase and CI to get into Rolling.

@audrow audrow changed the base branch from master to rolling June 28, 2022 14:22
@clalancette clalancette force-pushed the asorbini/reliability-optimizations branch from bac4df4 to 92cd8d7 Compare July 31, 2023 14:02
…ReliabilityProtocol.Common

Signed-off-by: Andrea Sorbini <[email protected]>
@clalancette clalancette force-pushed the asorbini/reliability-optimizations branch from 92cd8d7 to 1d9cad3 Compare July 31, 2023 14:03
@clalancette (Contributor)

I've found that using this commit improves the speed of a bunch of our tests, which is a priority right now.

So I've gone ahead and rebased this onto the latest. I'm going to run CI here to see how this does:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@clalancette (Contributor)

CI is yellow, but that may just be a flake. Here is another run to check:

  • Windows Build Status

@clalancette clalancette marked this pull request as ready for review August 1, 2023 12:51
@clalancette (Contributor)

OK, I'm going to go ahead and merge this one in, since CI looks green and this should improve our test times a bit.

@clalancette clalancette merged commit b57d032 into rolling Aug 1, 2023
1 check passed