
XDS backoff is broken #461

Closed
iffyio opened this issue Jan 7, 2022 · 5 comments · Fixed by #465
Labels: kind/bug (Something isn't working), priority/high (Issues that should be addressed as soon as possible)

Comments

iffyio (Collaborator) commented Jan 7, 2022

Backing off when the xds management server is down no longer works as intended after introducing the tryhard crate. Currently we do retry, but with a constant, short 500ms delay, which results in spamming both the logs and the xds server.

{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-07T16:05:14.220039213+01:00","source":"xds::AdsClient","source":"server::DynamicResourceManager","source":"server::Server","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-07T16:05:14.723603366+01:00","source":"xds::AdsClient","source":"server::DynamicResourceManager","source":"server::Server","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-07T16:05:15.226190241+01:00","source":"xds::AdsClient","source":"server::DynamicResourceManager","source":"server::Server","error":"transport error"}

More importantly, looking at the crate's implementation of exponential backoff, it looks like it doubles the delay time on every attempt, which would be quite wrong and dangerous, as we'll end up driving the proxy into an unresponsive state where it waits for an unrealistic amount of time.

I think we need to resolve the latter issue especially before cutting a new release, since it would be a regression that can potentially cause an outage. I haven't looked much into the tryhard crate; if there isn't another way and we can't fix this upstream in time, we should swap back to the previous backoff or to another crate that has a reasonable backoff implementation.

To repro, run quilkin locally with this config (the server doesn't need to exist):

version: v1alpha1
proxy:
  port: 7001
dynamic:
  management_servers:
  - address: http://localhost:18000

Expected behavior is that we back off exponentially, up to a max delay, like in the following (note the ts timestamp field vs. the previous snippet):

{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-07T15:59:00.369060583+01:00","source":"xds::AdsClient","source":"server::DynamicResourceManager","source":"server::Server","error":"transport error","address":"http://localhost:18000"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-07T15:59:01.156613480+01:00","source":"xds::AdsClient","source":"server::DynamicResourceManager","source":"server::Server","error":"transport error","address":"http://localhost:18000"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-07T15:59:02.072053317+01:00","source":"xds::AdsClient","source":"server::DynamicResourceManager","source":"server::Server","error":"transport error","address":"http://localhost:18000"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-07T15:59:04.332760360+01:00","source":"xds::AdsClient","source":"server::DynamicResourceManager","source":"server::Server","error":"transport error","address":"http://localhost:18000"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-07T15:59:08.005171575+01:00","source":"xds::AdsClient","source":"server::DynamicResourceManager","source":"server::Server","error":"transport error","address":"http://localhost:18000"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-07T15:59:11.584600083+01:00","source":"xds::AdsClient","source":"server::DynamicResourceManager","source":"server::Server","error":"transport error","address":"http://localhost:18000"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-07T15:59:16.652149371+01:00","source":"xds::AdsClient","source":"server::DynamicResourceManager","source":"server::Server","error":"transport error","address":"http://localhost:18000"}
iffyio added the kind/bug and priority/high labels on Jan 7, 2022
XAMPPRocky (Collaborator) commented Jan 7, 2022

Replacing the backoff strategy should be straightforward to fix in the project without upstream changes to tryhard (though we should make those improvements). We're already using a custom backoff strategy that calls the ExponentialBackoff based on what error we got back. So all that should be required is implementing OnRetry and calling that in the function (or writing it entirely in that custom_backoff fn).

let mut backoff = ExponentialBackoff::new(std::time::Duration::from_millis(500));

As for why it's always a flat 500ms, I'm less sure.
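
For reference, here is a minimal sketch of that approach, assuming tryhard's custom_backoff hook, its BackoffStrategy trait, and its RetryPolicy enum; connect_to_ads_server is a hypothetical stand-in for the real connect call, not the project's actual code:

use std::time::Duration;

use tryhard::{
    backoff_strategies::{BackoffStrategy, ExponentialBackoff},
    RetryPolicy,
};

// Hypothetical stand-in for the real AdsClient connect call.
async fn connect_to_ads_server() -> Result<(), tonic::transport::Error> {
    unimplemented!()
}

async fn connect_with_backoff() -> Result<(), tonic::transport::Error> {
    // Create the strategy once, outside the retried closure, so its state
    // carries across attempts instead of restarting at 500ms every time.
    let mut backoff = ExponentialBackoff::new(Duration::from_millis(500));

    tryhard::retry_fn(|| connect_to_ads_server())
        .retries(u32::MAX)
        .custom_backoff(move |attempt, error: &tonic::transport::Error| {
            // Delegate the delay calculation to ExponentialBackoff; a real
            // implementation could also inspect the error and return
            // RetryPolicy::Break for non-retryable failures.
            RetryPolicy::Delay(backoff.delay(attempt, error))
        })
        .await
}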

markmandel (Member) commented

> More importantly, looking at the crate's implementation of exponential backoff, it looks like it doubles the delay time on every attempt, which would be quite wrong and dangerous, as we'll end up driving the proxy into an unresponsive state where it waits for an unrealistic amount of time.

Doubling the delay each time is an exponential backoff with a base of 2, or am I missing something here?

If we're looking at 10 retries, for example, we end up with:

Run  Seconds  Timestamp
  0    0.000  1/8/2022, 05:33:28:972 UTC
  1    0.500  1/8/2022, 05:33:29:472 UTC
  2    1.500  1/8/2022, 05:33:30:472 UTC
  3    3.500  1/8/2022, 05:33:32:472 UTC
  4    7.500  1/8/2022, 05:33:36:472 UTC
  5   15.500  1/8/2022, 05:33:44:472 UTC
  6   31.500  1/8/2022, 05:34:00:472 UTC
  7   63.500  1/8/2022, 05:34:32:472 UTC
  8  127.500  1/8/2022, 05:35:36:472 UTC
  9  255.500  1/8/2022, 05:37:44:472 UTC

(grabbed from https://exponentialbackoffcalculator.com/ because I'm too lazy to calculate it myself 😄 )

But agree we should fix before release 👍🏻

iffyio (Collaborator, Author) commented Jan 10, 2022

> Doubling the delay each time is an exponential backoff with a base of 2, or am I missing something here?

Yeah, it is exponential, but that's not practical because unbounded exponential increases quickly become too large to back off with - we almost never want to wait minutes to retry an error (anything beyond ~30s between retries would be a niche use case). In practice, implementations of exponential backoff include some randomization (jitter) to avoid a thundering herd, as well as an upper bound on the interval (e.g. max 20s) to avoid waiting too long to retry, since that's bad for MTTR.
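
To make that concrete, here is a small standard-library-only sketch (not the project's code) of a capped, jittered exponential delay calculation; the jitter value is supplied by the caller, e.g. a random duration in 0..=2s, so the example needs no extra dependencies:

use std::time::Duration;

// Hypothetical helper: exponential backoff with an upper bound and
// caller-supplied jitter.
fn capped_backoff(attempt: u32, base: Duration, max: Duration, jitter: Duration) -> Duration {
    let exponential = base
        .checked_mul(2u32.saturating_pow(attempt))
        .unwrap_or(max);
    exponential.min(max) + jitter
}

fn main() {
    // 500ms base, 20s cap: 0.5s, 1s, 2s, 4s, 8s, 16s, 20s, 20s, ...
    for attempt in 0..8 {
        let delay = capped_backoff(
            attempt,
            Duration::from_millis(500),
            Duration::from_secs(20),
            Duration::ZERO, // plug real random jitter in here
        );
        println!("attempt {attempt}: wait {delay:?}");
    }
}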

markmandel (Member) commented

Very big +1 on wanting some jitter! I don't immediately see a good way to do that with tryhard without a custom config (seems like a good feature to add to the crate -- shall we file an issue?).

If we want a quick fix/hack with the current implementation, we could randomise the initial_delay within a range -- it's not perfect, but at least it's a step in the right direction.

And capping exponential backoff at 30s like you suggested would seem like a reasonable thing to do as well 👍🏻 (max_delay does seem to cover this).
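
A rough sketch of that quick fix, assuming tryhard's exponential_backoff and max_delay builder methods and the rand crate for the randomised starting delay; connect_to_ads_server is again a hypothetical stand-in for the real connect call:

use std::time::Duration;

use rand::Rng;

// Hypothetical stand-in for the real AdsClient connect call.
async fn connect_to_ads_server() -> Result<(), tonic::transport::Error> {
    unimplemented!()
}

async fn connect_with_backoff() -> Result<(), tonic::transport::Error> {
    // Randomise the first delay within 250-750ms so a fleet of proxies
    // restarted at the same time doesn't retry in lockstep, then let the
    // exponential strategy grow it per attempt up to a 30s ceiling.
    let initial_delay = Duration::from_millis(rand::thread_rng().gen_range(250..=750));

    tryhard::retry_fn(|| connect_to_ads_server())
        .retries(u32::MAX)
        .exponential_backoff(initial_delay)
        .max_delay(Duration::from_secs(30))
        .await
}

This only randomises once per process rather than per attempt, which is why it's a step in the right direction rather than proper jitter.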

iffyio (Collaborator, Author) commented Jan 11, 2022

Both the jitter and the max backoff are things that should be covered by the crate, yeah.
Re: why it's currently always 500ms, it seems that was due to a new backoff instance being created for each retry_fn call, so the previous state wasn't being reused. Moving that out of the closure works as intended:

let mut backoff = ExponentialBackoff::new(std::time::Duration::from_millis(500));

One other thing that's broken is resetting the backoff. If we retry and reconnect, we want to reset the backoff state so that the next time we hit an issue, we start counting from scratch. We no longer have that, so currently if we previously counted up to, say, 30s, then the next time we hit an error we'll wait 30s before the first retry, which would be problematic.
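
For illustration, a rough standard-library-plus-tokio sketch (not the actual fix) of that shape: the backoff state lives outside the retry loop so it accumulates across failed attempts, but it is dropped and recreated once a connection succeeds, so the next outage starts again from the initial delay:

use std::time::Duration;

// Hypothetical stand-ins for the real client types and connect call.
struct AdsStream;
async fn connect_to_ads_server() -> Result<AdsStream, tonic::transport::Error> {
    unimplemented!()
}
async fn handle_stream(_stream: AdsStream) {}

async fn run_ads_client() {
    loop {
        // Fresh backoff state per connection "session": after a successful
        // connect, the next failure starts counting from 500ms again.
        let mut delay = Duration::from_millis(500);
        let max_delay = Duration::from_secs(30);

        loop {
            match connect_to_ads_server().await {
                Ok(stream) => {
                    handle_stream(stream).await; // runs until the stream drops
                    break; // discard the accumulated delay and start over
                }
                Err(err) => {
                    eprintln!("Unable to connect to the XDS server: {err}; retrying in {delay:?}");
                    tokio::time::sleep(delay).await;
                    // Double up to the cap; real code would also add jitter.
                    delay = (delay * 2).min(max_delay);
                }
            }
        }
    }
}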

markmandel added a commit to markmandel/quilkin that referenced this issue Jan 11, 2022
Solves the bug where on failure to connect, the retry operation would
occur every 500ms.

* Moved the ExponentialBackoff outside the backoff loop, so it didn't
 get recreated on each retry.
* Reset the backoff back to initial state on the first retry.
* Max the delay at 30s.
* Add jitter of 0-2s to each delay.

Closes googleforgames#461

Sample log output (slog, edited for clarity), with delay written inline:

{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:33:25.094550425-08:00","delay":"1.782s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:33:26.880956106-08:00","delay":"2.903s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:33:29.786490322-08:00","delay":"3.172s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:33:32.962315902-08:00","delay":"4.333s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:33:37.299096949-08:00","delay":"8.757s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:33:46.060135438-08:00","delay":"17.473s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:34:03.535624597-08:00","delay":"30.706s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:34:34.243188450-08:00","delay":"30.808s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:35:05.055083825-08:00","delay":"30.966s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:35:36.025452758-08:00","delay":"31.897s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:36:07.925738599-08:00","delay":"30.413s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:36:38.343017008-08:00","delay":"30.799s","error":"transport error"}