
XDS backoff is broken #461

Closed
iffyio opened this issue Jan 7, 2022 · 5 comments · Fixed by #465
Labels: kind/bug (Something isn't working), priority/high (Issues that should be addressed as soon as possible)

Comments

iffyio (Collaborator) commented Jan 7, 2022

Backing off when the xds management server is down no longer works as intended after introducing the tryhard crate. Currently we do retry, but with a constant, short 500ms delay, which results in spamming both the logs and the xds server.

{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-07T16:05:14.220039213+01:00","source":"xds::AdsClient","source":"server::DynamicResourceManager","source":"server::Server","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-07T16:05:14.723603366+01:00","source":"xds::AdsClient","source":"server::DynamicResourceManager","source":"server::Server","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-07T16:05:15.226190241+01:00","source":"xds::AdsClient","source":"server::DynamicResourceManager","source":"server::Server","error":"transport error"}

More importantly, looking at the crate's implementation of exponential backoff, it looks like it doubles the delay time on every attempt, which would be quite wrong and dangerous, as we'll end up driving the proxy into an unresponsive state where it waits for an unrealistic amount of time.

I think we need to resolve the latter issue especially before cutting a new release, since it would be a regression that can potentially cause an outage. I haven't looked much into the tryhard crate; if there isn't another way and we can't fix this upstream in time, we should swap back to the previous backoff or to another crate that has a reasonable backoff implementation.

To repro, run quilkin locally with this config (the server doesn't need to exist):

version: v1alpha1
proxy:
  port: 7001
dynamic:
  management_servers:
  - address: http://localhost:18000

Expected behavior is that we back off exponentially, up to a max delay, like in the following (note the ts timestamp field vs. the previous snippet):

{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-07T15:59:00.369060583+01:00","source":"xds::AdsClient","source":"server::DynamicResourceManager","source":"server::Server","error":"transport error","address":"http://localhost:18000"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-07T15:59:01.156613480+01:00","source":"xds::AdsClient","source":"server::DynamicResourceManager","source":"server::Server","error":"transport error","address":"http://localhost:18000"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-07T15:59:02.072053317+01:00","source":"xds::AdsClient","source":"server::DynamicResourceManager","source":"server::Server","error":"transport error","address":"http://localhost:18000"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-07T15:59:04.332760360+01:00","source":"xds::AdsClient","source":"server::DynamicResourceManager","source":"server::Server","error":"transport error","address":"http://localhost:18000"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-07T15:59:08.005171575+01:00","source":"xds::AdsClient","source":"server::DynamicResourceManager","source":"server::Server","error":"transport error","address":"http://localhost:18000"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-07T15:59:11.584600083+01:00","source":"xds::AdsClient","source":"server::DynamicResourceManager","source":"server::Server","error":"transport error","address":"http://localhost:18000"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-07T15:59:16.652149371+01:00","source":"xds::AdsClient","source":"server::DynamicResourceManager","source":"server::Server","error":"transport error","address":"http://localhost:18000"}
iffyio added the kind/bug and priority/high labels on Jan 7, 2022
XAMPPRocky (Collaborator) commented Jan 7, 2022

Replacing the backoff strategy should be straightforward to fix in the project without upstream changes to tryhard (though we should make those improvements). We're already using a custom backoff strategy that calls the ExponentialBackoff based on what error we got back. So all that should be required is implementing OnRetry and calling that in the function (or writing it entirely in that custom_backoff fn).

let mut backoff = ExponentialBackoff::new(std::time::Duration::from_millis(500));

As for why it's always a flat 500ms, I'm less sure.
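
For reference, here is a minimal sketch of that approach, assuming tryhard's custom_backoff hook, its BackoffStrategy trait, and its RetryPolicy enum; connect_to_ads_server is a hypothetical stand-in for the real connect call, not the project's actual code:

use std::time::Duration;

use tryhard::{
    backoff_strategies::{BackoffStrategy, ExponentialBackoff},
    RetryPolicy,
};

// Hypothetical stand-in for the real AdsClient connect call.
async fn connect_to_ads_server() -> Result<(), tonic::transport::Error> {
    unimplemented!()
}

async fn connect_with_backoff() -> Result<(), tonic::transport::Error> {
    // Create the strategy once, outside the retried closure, so its state
    // carries across attempts instead of restarting at 500ms every time.
    let mut backoff = ExponentialBackoff::new(Duration::from_millis(500));

    tryhard::retry_fn(|| connect_to_ads_server())
        .retries(u32::MAX)
        .custom_backoff(move |attempt, error: &tonic::transport::Error| {
            // Delegate the delay calculation to ExponentialBackoff; a real
            // implementation could also inspect the error and return
            // RetryPolicy::Break for non-retryable failures.
            RetryPolicy::Delay(backoff.delay(attempt, error))
        })
        .await
}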

markmandel (Member) commented

> More importantly, looking at the crate's implementation of exponential backoff, it looks like it doubles the delay time on every attempt, which would be quite wrong and dangerous, as we'll end up driving the proxy into an unresponsive state where it waits for an unrealistic amount of time.

Doubling the delay each time is an exponential backoff with a base of 2, or am I missing something here?

If we're looking at 10 retries, for example, we end up with:

Run  Seconds  Timestamp
  0    0.000  1/8/2022, 05:33:28:972 UTC
  1    0.500  1/8/2022, 05:33:29:472 UTC
  2    1.500  1/8/2022, 05:33:30:472 UTC
  3    3.500  1/8/2022, 05:33:32:472 UTC
  4    7.500  1/8/2022, 05:33:36:472 UTC
  5   15.500  1/8/2022, 05:33:44:472 UTC
  6   31.500  1/8/2022, 05:34:00:472 UTC
  7   63.500  1/8/2022, 05:34:32:472 UTC
  8  127.500  1/8/2022, 05:35:36:472 UTC
  9  255.500  1/8/2022, 05:37:44:472 UTC

(grabbed from https://exponentialbackoffcalculator.com/ because I'm too lazy to calculate it myself 😄 )

But agree we should fix before release 👍🏻

iffyio (Collaborator, Author) commented Jan 10, 2022

> Doubling the delay each time is an exponential backoff with a base of 2, or am I missing something here?

Yeah, it is exponential, but that's not practical because unbounded exponential increases quickly become too large to back off with - we almost never want to wait minutes to retry an error (anything beyond ~30s between retries would be a niche use case). In practice, implementations of exponential backoff include some randomization (jitter) to avoid a thundering herd, as well as an upper bound on the interval (e.g. max 20s) to avoid waiting too long to retry, since that's bad for MTTR.
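
To make that concrete, here is a small standard-library-only sketch (not the project's code) of a capped, jittered exponential delay calculation; the jitter value is supplied by the caller, e.g. a random duration in 0..=2s, so the example needs no extra dependencies:

use std::time::Duration;

// Hypothetical helper: exponential backoff with an upper bound and
// caller-supplied jitter.
fn capped_backoff(attempt: u32, base: Duration, max: Duration, jitter: Duration) -> Duration {
    let exponential = base
        .checked_mul(2u32.saturating_pow(attempt))
        .unwrap_or(max);
    exponential.min(max) + jitter
}

fn main() {
    // 500ms base, 20s cap: 0.5s, 1s, 2s, 4s, 8s, 16s, 20s, 20s, ...
    for attempt in 0..8 {
        let delay = capped_backoff(
            attempt,
            Duration::from_millis(500),
            Duration::from_secs(20),
            Duration::ZERO, // plug real random jitter in here
        );
        println!("attempt {attempt}: wait {delay:?}");
    }
}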

markmandel (Member) commented

Very big +1 on wanting some jitter! I don't immediately see a good way to do that with tryhard without a custom config (seems like a good feature to add to the crate -- shall we file an issue?).

If we want a quick fix/hack with the current implementation, we could randomise the initial_delay within a range -- it's not perfect, but at least it's a step in the right direction.

And capping exponential backoff at 30s like you suggested would seem like a reasonable thing to do as well 👍🏻 (max_delay does seem to cover this).
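
A rough sketch of that quick fix, assuming tryhard's exponential_backoff and max_delay builder methods and the rand crate for the randomised starting delay; connect_to_ads_server is again a hypothetical stand-in for the real connect call:

use std::time::Duration;

use rand::Rng;

// Hypothetical stand-in for the real AdsClient connect call.
async fn connect_to_ads_server() -> Result<(), tonic::transport::Error> {
    unimplemented!()
}

async fn connect_with_backoff() -> Result<(), tonic::transport::Error> {
    // Randomise the first delay within 250-750ms so a fleet of proxies
    // restarted at the same time doesn't retry in lockstep, then let the
    // exponential strategy grow it per attempt up to a 30s ceiling.
    let initial_delay = Duration::from_millis(rand::thread_rng().gen_range(250..=750));

    tryhard::retry_fn(|| connect_to_ads_server())
        .retries(u32::MAX)
        .exponential_backoff(initial_delay)
        .max_delay(Duration::from_secs(30))
        .await
}

This only randomises once per process rather than per attempt, which is why it's a step in the right direction rather than proper jitter.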

iffyio (Collaborator, Author) commented Jan 11, 2022

Both the jitter and the max backoff are things that should be covered by the crate, yeah.
Re: why it's currently always 500ms, it seems that was due to a new backoff instance being created for each retry_fn call, so the previous state wasn't being reused. Moving that out of the closure works as intended:

let mut backoff = ExponentialBackoff::new(std::time::Duration::from_millis(500));

One other thing that's broken is resetting the backoff. If we retry and reconnect, we want to reset the backoff state so that the next time we hit an issue, we start counting from scratch. We no longer have that, so currently if we previously counted up to, say, 30s, then the next time we hit an error we'll wait 30s before the first retry, which would be problematic.
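
For illustration, a rough standard-library-plus-tokio sketch (not the actual fix) of that shape: the backoff state lives outside the retry loop so it accumulates across failed attempts, but it is dropped and recreated once a connection succeeds, so the next outage starts again from the initial delay:

use std::time::Duration;

// Hypothetical stand-ins for the real client types and connect call.
struct AdsStream;
async fn connect_to_ads_server() -> Result<AdsStream, tonic::transport::Error> {
    unimplemented!()
}
async fn handle_stream(_stream: AdsStream) {}

async fn run_ads_client() {
    loop {
        // Fresh backoff state per connection "session": after a successful
        // connect, the next failure starts counting from 500ms again.
        let mut delay = Duration::from_millis(500);
        let max_delay = Duration::from_secs(30);

        loop {
            match connect_to_ads_server().await {
                Ok(stream) => {
                    handle_stream(stream).await; // runs until the stream drops
                    break; // discard the accumulated delay and start over
                }
                Err(err) => {
                    eprintln!("Unable to connect to the XDS server: {err}; retrying in {delay:?}");
                    tokio::time::sleep(delay).await;
                    // Double up to the cap; real code would also add jitter.
                    delay = (delay * 2).min(max_delay);
                }
            }
        }
    }
}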

markmandel added a commit to markmandel/quilkin that referenced this issue Jan 11, 2022
Solves the bug where on failure to connect, the retry operation would
occur every 500ms.

* Moved the ExponentialBackoff outside the backoff loop, so it didn't
 get recreated on each retry.
* Reset the backoff back to initial state on the first retry.
* Max the delay at 30s.
* Add jitter of 0-2s to each delay.

Closes googleforgames#461

Sample log output (slog, edited for clarity), with delay written inline:

{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:33:25.094550425-08:00","delay":"1.782s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:33:26.880956106-08:00","delay":"2.903s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:33:29.786490322-08:00","delay":"3.172s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:33:32.962315902-08:00","delay":"4.333s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:33:37.299096949-08:00","delay":"8.757s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:33:46.060135438-08:00","delay":"17.473s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:34:03.535624597-08:00","delay":"30.706s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:34:34.243188450-08:00","delay":"30.808s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:35:05.055083825-08:00","delay":"30.966s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:35:36.025452758-08:00","delay":"31.897s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:36:07.925738599-08:00","delay":"30.413s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:36:38.343017008-08:00","delay":"30.799s","error":"transport error"}