XDS backoff is broken #461
Comments
Replacing the backoff strategy should be straightforward to fix in the project without upstream changes to tryhard (though we should make those improvements upstream too). We're already using a custom backoff strategy (line 100 in 1653ec4).
As for why it's always a flat 500ms, I'm less sure.
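For context, tryhard's retry builder exposes a `custom_backoff` hook, so the whole policy can live in this project; a minimal sketch, with a hypothetical connect function and error type standing in for the real ones:

```rust
use std::time::Duration;
use tryhard::RetryPolicy;

#[derive(Debug)]
struct ConnectError; // hypothetical stand-in for the real transport error

// Hypothetical stand-in for the real XDS connect call.
async fn connect_to_xds() -> Result<(), ConnectError> {
    Err(ConnectError)
}

async fn connect_with_retries() -> Result<(), ConnectError> {
    tryhard::retry_fn(connect_to_xds)
        .retries(10)
        // The policy is entirely ours: we decide the delay (or give up) per attempt.
        .custom_backoff(|attempt: u32, _err: &ConnectError| {
            let delay = Duration::from_millis(500) * attempt;
            RetryPolicy::Delay(delay.min(Duration::from_secs(30)))
        })
        .await
}
```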
Doubling the delay each time is an exponential backoff with a base of 2, or am I missing something here? If we're looking at 10 retries from the 500ms initial delay, for example, we end up with roughly: 0.5s, 1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s.
(grabbed from https://exponentialbackoffcalculator.com/ because I'm too lazy to calculate it myself 😄) But agree we should fix this before release 👍🏻
Yeah it is exponential, but that alone isn't practical: unbounded exponential increases quickly become too large to back off with, and we almost never want to wait minutes to retry an error (anything over ~30s between retries would be a niche use case). In practice, implementations of exponential backoff include some randomization to avoid a thundering herd, as well as an upper bound on the interval (e.g. max 20s) to avoid waiting too long to retry, since that's bad for MTTR.
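For illustration, a sketch of that kind of schedule, assuming a 500ms base, a 30s cap and 0-2s of jitter via the rand crate (the numbers are just examples, not what the proxy currently does):

```rust
use std::time::Duration;
use rand::Rng;

/// Delay before retry `attempt` (1-based): exponential growth from a 500ms
/// base, capped at 30s, plus 0-2s of random jitter to avoid a thundering herd.
fn backoff_delay(attempt: u32) -> Duration {
    let exponential = Duration::from_millis(500) * 2u32.saturating_pow(attempt.saturating_sub(1));
    let capped = exponential.min(Duration::from_secs(30));
    let jitter = Duration::from_millis(rand::thread_rng().gen_range(0..=2_000));
    capped + jitter
}
```

With the cap, the exponential growth stops mattering after the handful of attempts it takes to reach 30s, and the jitter spreads reconnecting proxies out so they don't hammer the management server in lockstep.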
Very big +1 on wanting some jitter for sure! I don't immediately see a good way to do that with tryhard as it stands. If we want a quick fix/hack with the current implementation, we could randomise the initial delay. And capping the exponential backoff at 30s like you suggested would seem like a reasonable thing to do as well 👍🏻
Both the jitter and the max backoff are things that should be covered by the crate, yeah (see line 100 in 1653ec4).
One other thing that's broken is resetting the backoff. If we retry and reconnect, we want to reset the backoff state so that the next time we hit an issue we start counting from scratch. We no longer have that, so currently if we counted up to say 30s previously, the next time we hit an error we'll wait 30s before the first retry, which would be problematic.
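A sketch of that shape, with illustrative names and numbers rather than the actual Quilkin code: the backoff state lives outside the retry loop so it persists across attempts, and it is reset whenever a connection succeeds so the next outage starts from the base delay again.

```rust
use std::time::Duration;

// Hypothetical connect call standing in for the real ADS/XDS client setup.
async fn connect_to_xds() -> Result<(), Box<dyn std::error::Error>> {
    Ok(())
}

async fn run_xds_client() {
    let base = Duration::from_millis(500);
    let max = Duration::from_secs(30);
    // Backoff state lives outside the loop so it persists across attempts...
    let mut delay = base;

    loop {
        match connect_to_xds().await {
            Ok(()) => {
                // ...and is reset after a successful (re)connect, so a later
                // outage starts counting from the base delay again.
                delay = base;
                // ...drive the stream until it errors, then fall through and retry.
            }
            Err(err) => {
                eprintln!("Unable to connect to the XDS server ({err}); retrying in {delay:?}");
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(max);
            }
        }
    }
}
```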
Solves the bug where on failure to connect, the retry operation would occur every 500ms.

* Moved the ExponentialBackoff outside the backoff loop, so it didn't get recreated on each retry.
* Reset the backoff back to initial state on the first retry.
* Max the delay at 30s.
* Add jitter of 0-2s to each delay.

Closes googleforgames#461

Sample log output (slog, edited for clarity), with delay written inline:

{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:33:25.094550425-08:00","delay":"1.782s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:33:26.880956106-08:00","delay":"2.903s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:33:29.786490322-08:00","delay":"3.172s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:33:32.962315902-08:00","delay":"4.333s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:33:37.299096949-08:00","delay":"8.757s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:33:46.060135438-08:00","delay":"17.473s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:34:03.535624597-08:00","delay":"30.706s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:34:34.243188450-08:00","delay":"30.808s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:35:05.055083825-08:00","delay":"30.966s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:35:36.025452758-08:00","delay":"31.897s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:36:07.925738599-08:00","delay":"30.413s","error":"transport error"}
{"msg":"Unable to connect to the XDS server","level":"ERRO","ts":"2022-01-11T12:36:38.343017008-08:00","delay":"30.799s","error":"transport error"}
Backing off when the xds management server is down no longer works as intended after introducing the tryhard crate. Currently we do retry but with a constant, short 500ms delay which results in spamming the logs and the xds server.
More importantly, looking at the crate's implementation of exponential backoff, it appears to double the delay on every attempt with no upper bound, which would be quite wrong and dangerous: we'd end up driving the proxy into an unresponsive state where it waits for an unrealistic amount of time between retries.
I think we need to resolve the latter issue especially before cutting a new release, since this would be a regression that can potentially cause an outage. I haven't looked much into the tryhard crate; if there isn't another way and we can't fix this upstream in time, we should swap back to the previous backoff or another crate that has a reasonable backoff implementation.
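For reference, this is the crate option being discussed; a sketch (not the project's actual call site) with a hypothetical connect function:

```rust
use std::time::Duration;

// Hypothetical stand-in for the real connect call.
async fn connect_to_xds() -> Result<(), std::io::Error> {
    Err(std::io::Error::new(std::io::ErrorKind::Other, "transport error"))
}

// tryhard's built-in exponential backoff doubles the delay on each failed
// attempt, so from a 500ms base the wait reaches several minutes within ten
// retries unless it is capped.
async fn retry_connect() -> Result<(), std::io::Error> {
    tryhard::retry_fn(connect_to_xds)
        .retries(10)
        .exponential_backoff(Duration::from_millis(500))
        .await
}
```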
To repro, run quilkin locally with this config (the server doesn't need to exist)
Expected behavior is that we should back off exponentially, with a max delay, like in the following (note the timestamp `ts` field vs the previous snippet).