Proxy load balancing retry issues #4245
Great points, thanks for bringing them up for discussion.
Hmm, yeah, that is tricky, because the idea is that you don't want the request to block for more than `lb_try_duration`. As for possible improvements:
❌ Let's try to avoid this, as the first backend that is tried still counts as a "try".
❌ I'd also like to avoid this, as it increases complexity; I'm also not sure this is ever what people actually want -- no one cares how many times backends are tried, they just want to establish the connection as quickly as possible with some cap on latency (`lb_try_duration`).
✔️ This is a good plan -- I'm actually surprised that other proxies' docs recommend > 3s, I would think that internal backends should be able to connect within even 1s -- 3 seems realllly slow, let alone 10. Let's definitely lower it. I'm not sure to what. 5s sounds good, but I'd probably be OK with going lower too. Who wants to wait even 2s just for a connection to be established to an internal backend?? (If only we had telemetry to know the percentile of backend connections that take longer than 2-3s. 🙄 )
Health checks absolutely allow for this, and very easily: you can expect a certain (class of) status code, as well as a body match or timeout (a minimal Caddyfile sketch is at the end of this comment): https://caddyserver.com/docs/modules/http.handlers.reverse_proxy#health_checks/active

As for these things:

✔️ Would definitely accept a contribution here from someone.

✔️ Similarly, this would be a good fix I'd welcome from anyone who submits a PR. (I'm a little swamped right now with the website projects.)
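For reference, an active health check along those lines can look like this in the Caddyfile (the upstream addresses and health endpoint are illustrative):

```
reverse_proxy app1:8080 app2:8080 {
	health_uri /healthz      # path to probe on each upstream
	health_interval 10s      # how often to probe
	health_timeout 2s        # mark unhealthy if no response in time
	health_status 2xx        # expected status code (or class)
	health_body "OK"         # expected substring/regex in the body
}
```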
I don't really agree; I'd expect it to be possible to easily configure Caddy to attempt at least one retry if the error was in connecting or completing a handshake with the first upstream. I don't think time is necessarily an absolute here. It's often more important for UX that as many requests as possible actually give a positive result than that an unexpected error causes disruption for the user. If other upstreams are available, I'd rather Caddy at least gave another one a try before serving the error to the client.
Right, but that's only for active health checks, not for retries. I was talking about retries here. Those are different features altogether (but they complement each other). Active health checks are useful to mark upstreams as unhealthy as soon as possible, out of band from client requests, but they don't help with ensuring that retries happen to avoid responding to the client with errors when another upstream may successfully deal with the request.
It does try once, it just won't retry in that case. If it doesn't retry, then that's due to a misconfiguration regarding the try duration and timeouts. If the timeout is longer than the whole try duration, then it makes sense that it won't retry. I do agree we should adjust the default dial timeout.
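For example, keeping the dial timeout shorter than the try duration leaves room for a retry (illustrative values, mirroring the scenario in this issue):

```
reverse_proxy 10.0.0.255:80 :7001 {
	lb_try_duration 5s
	transport http {
		dial_timeout 2s  # fails fast, well within the 5s retry window
	}
}
```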
Oh, I see. Is there an actual use case that requires determining retry logic by reading the response body?
Again, I don't agree. Timeouts and try duration are not a good enough guarantee that at least one more retry will be attempted in all potentially recoverable error situations.
Yes. Like I said, HAProxy allows for it -- it has different retry modes listed at the link I gave, including looking at the response code.
Why?
Because timeouts aren't the only source of possible recoverable errors. See HAProxy's list here: https://cbonte.github.io/haproxy-dconv/2.4/configuration.html#4.2-retry-on

If high availability is the goal, a robust and flexible retry mechanism is pretty important.
Ok. Let's adjust the default timeout, and consider adding new optional retry logic; maybe similar to how active health checks work, as opposed to "number of times", though. Will have to give it more thought when I have a chance (am running out the door now).
Related to some of the issues in #4245, a complaint about the proxy transport defaults not being properly documented in https://caddy.community/t/default-values-for-directives/14254/6.

- Dug into the stdlib to find the actual defaults for some of the timeouts and buffer limits, documenting them in godoc so the JSON docs get them next release.
- Moved the keep-alive and dial-timeout defaults from `reverseproxy.go` to `httptransport.go`. It doesn't make sense to set defaults in the proxy, because then any time the transport is configured with non-defaults, the keep-alive and dial-timeout defaults are lost!
- Sped up the dial timeout from 10s to 3s; in practice it rarely makes sense to wait a whole 10s for dialing. A shorter timeout helps a lot with load balancer retries, so using something lower improves user experience.

Also included:
- reverseproxy: Make keepalive interval configurable via Caddyfile
- fastcgi: DialTimeout default for fastcgi transport too
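A sketch of overriding these transport settings in the Caddyfile (values are illustrative, not recommendations):

```
reverse_proxy backend:8080 {
	transport http {
		dial_timeout 3s          # matches the new default; shown explicitly here
		keepalive_interval 30s   # made configurable by the commit above
	}
}
```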
@francislavoie Was work ever started regarding retries for a given response code? I'm receiving periodic 502s from Google Cloud Storage, and their docs indicate that one should retry with exponential backoff.
So to recap here after #4756 is merged:
This is great, but there are still some areas where this is lacking:
I've given up on the idea of `lb_try_min_times`.
I think the remaining issue (retrying POST, etc. requests such that the request payload is re-used even if partially read) is already solved in 613d544 via #6259. The user will, of course, need to configure retries to be enabled, and the conditions on which to retry (a sketch below).
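Roughly, in today's Caddyfile that looks like this (a sketch; I'd double-check the exact option names against the current reverse_proxy docs):

```
reverse_proxy app1:8080 app2:8080 {
	lb_retries 2               # enable up to 2 retries
	lb_retry_match method GET  # only retry requests matching this condition
}
```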
There are a few different points to talk about here.
Firstly -- I've been playing around a bit with `lb_try_duration`, and I think it doesn't necessarily always have the desired effect, because of when the "duration" starts. The timer for `lb_try_duration` starts right when the `reverse_proxy` module gets the request. This sounds fine in isolation, but if you consider that the errors that might trigger retries can take a while to actually happen (e.g. a dial timeout after Caddy's default of 10s -- more on that later), then it becomes hard to configure Caddy to only retry once or twice after a failure. It's probably undesirable to set an `lb_try_duration` of 15 seconds just to make sure the 10 second timeouts are caught and trigger retries.

For example, try a config like this:
Assuming `10.0.0.255:80` does not exist, this should trigger a dial timeout (default 10s). Trying it with `curl localhost:7000`, you see it hang for 10 seconds, then in the logs we get `ERROR dial tcp 10.0.0.255:80: i/o timeout`, and no fallback to `:7001` happens. If we increase `lb_try_duration` to `15s` instead, then we get a hang for 10 seconds, then a successful fallback to `:7001` does happen, and no error appears in the logs (but a DEBUG-level log does appear for the initial error).

There are a few different ways I can think of that this could be made better:
1. Make the try duration timer start from the time the first attempt fails, instead of from the beginning of the request. This would make the number of retries easier to predict. It has some implications, though, because not all errors take the same amount of time: if some errors happen fast, you don't want to cause a stampede of stacked retries when you have a busy server, so it would be better to also be able to cap the number of retries (a 15s try duration with a 250ms interval potentially allows 15s / 0.25s = 60 retries).
2. Implement an `lb_try_min_times <amount>` option which can complement or replace `lb_try_duration`: a minimum number of times to try to pick an upstream (default 1). If specified alongside `lb_try_duration`, it should try again even if the duration has elapsed. This would let slow errors at least attempt a retry, while still capping fast errors with the duration timer. It could still lead to long waits for clients if each retry takes a long time (e.g. with a 10s dial timeout, `lb_try_min_times 3` would mean a 30s wait if all upstreams are down, at worst); so obviously, in these situations the dial timeout is too long, so... (a hypothetical sketch of this option follows the list)

3. Reduce the default `http` transport's dial timeout from 10s to maybe 5s or even 2s. I'm sure this might have some implications for some potentially high-latency scenarios, but you'd hope not. FWIW, Envoy uses a 5s default, Traefik uses a 30s default, and HAProxy has no default but recommends >3s, with most examples in its docs using 5s.
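For the second suggestion, the hypothetical option might look like this (note: `lb_try_min_times` does not exist in Caddy; this is only the proposal above):

```
reverse_proxy 10.0.0.255:80 :7001 {
	lb_try_duration 5s
	lb_try_min_times 2  # hypothetical: retry at least once even after 5s has elapsed
}
```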
I wouldn't call this a complete solution, but it's "a start".
As a note, Caddy currently retries all `DialError`s, or any error for GET requests by default (any error emitted by Caddy or the stdlib), and does not retry based on error status codes received from upstream (e.g. if the upstream returned a 502 response of its own). There is the `retry_match` mechanism (which is lacking Caddyfile support -- that could probably be easily implemented), which allows for matching requests, but it doesn't allow retrying on specific kinds of responses unless `handle_response` is used, which is kinda janky. It would be nice to have support for matching responses that should be retried.

HAProxy has some interesting stuff on the topic of retries that we could take inspiration from: https://cbonte.github.io/haproxy-dconv/2.4/configuration.html#4-retries
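For reference, `retry_match` lives in the JSON config under the proxy's load balancing options; a rough sketch (field placement per the JSON docs, values illustrative):

```json
{
	"handler": "reverse_proxy",
	"upstreams": [
		{ "dial": "10.0.0.1:8080" },
		{ "dial": "10.0.0.2:8080" }
	],
	"load_balancing": {
		"try_duration": "5s",
		"retry_match": [
			{ "method": ["GET", "HEAD"] }
		]
	}
}
```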
HAProxy also supports "connect-only" or "connect+tls" active checks -- this would be a nice-to-have for Caddy as well, since they're much lighter-weight than actual HTTP health checks. HAProxy defaults to a 2s interval for those.
As an aside, issues #1545 and #4174 are related; e.g. Caddy doesn't use more than the first SRV backend when retrying, because no selection state is kept for the lifetime of a request, which would make it possible to try "lower priority" results of a DNS lookup.
While investigating this, I noticed that keepalive defaults for the `http` transport only seem to apply if no transport was configured. This seems like a bug; the keepalives should have defaults regardless. It means that if you configure `dial_timeout` to a shorter value like `2s` to work around the issues above, then you lose the keepalives altogether. This should be fixed, but it didn't really warrant its own issue, so I'm mentioning it here for now.