
lb_retries apparently not working #6259

Closed
gottlike opened this issue Apr 21, 2024 · 32 comments
Labels: bug 🐞 Something isn't working
Milestone: v2.8.0

@gottlike

gottlike commented Apr 21, 2024

I'd like to be able to recreate/restart Docker services that are hooked into Caddy via reverse_proxy without downtime. Usually a (POST) request takes about 10 seconds to complete. My config is super simple:

example.com {
        log {
                format console
                output stdout
        }

        root * /srv

        file_server {
                precompressed br gzip
        }

        reverse_proxy /-/proxy1 http://docker_container1:9292 {
                lb_retries 3
                lb_try_interval 1s
        }
        reverse_proxy /-/proxy2 http://docker_container2:9292 {
                lb_retries 3
                lb_try_interval 1s
        }
}

When running a bunch of requests and restarting the container before the responses were returned, I always get 502 errors, despite the lb_retries directive.

The error messages look like this:

2024/04/21 12:35:27.404 ERROR   http.log.access.log0    handled request {"request": {"remote_ip": "62.153.218.40", "remote_port": "44362", "client_ip": "62.153.218.40", "proto": "HTTP/1.1", "method": "POST", "host": "example.com", "uri": "/-/proxy1", "headers": {"Accept-Encoding": ["gzip, compress, deflate, br"], "Connection": ["keep-alive"], "Accept": ["application/json, text/plain, */*"], "Content-Type": ["application/json"], "User-Agent": ["axios/1.6.7"], "Content-Length": ["67993"]}, "tls": {"resumed": true, "version": 772, "cipher_suite": 4865, "proto": "", "server_name": "example.com"}}, "bytes_read": 0, "user_id": "", "duration": 1.014601761, "size": 0, "status": 502, "resp_headers": {"Server": ["Caddy"], "Alt-Svc": ["h3=\":443\"; ma=2592000"]}}

{"level":"error","ts":1713702927.4045582,"logger":"http.log.error.log0","msg":"readfrom tcp 172.18.0.2:34872->172.18.0.3:9292: http: invalid Read on closed Body","request":{"remote_ip":"62.153.218.40","remote_port":"44362","client_ip":"62.153.218.40","proto":"HTTP/1.1","method":"POST","host":"example.com","uri":"/-/proxy1","headers":{"Accept-Encoding":["gzip, compress, deflate, br"],"Connection":["keep-alive"],"Accept":["application/json, text/plain, */*"],"Content-Type":["application/json"],"User-Agent":["axios/1.6.7"],"Content-Length":["67993"]},"tls":{"resumed":true,"version":772,"cipher_suite":4865,"proto":"","server_name":"example.com"}},"duration":1.014601761,"status":502,"err_id":"cr390dupf","err_trace":"reverseproxy.statusError (reverseproxy.go:1267)"}
@gottlike
Author

gottlike commented Apr 21, 2024

Never mind, it seems this works as expected (or doesn't, rather): https://caddy.community/t/retries-for-post-requests-with-body/23478

@gottlike
Author

Okay, I would really like to see this. How much would it cost to sponsor development of this feature, @francislavoie ?

@gottlike gottlike reopened this Apr 21, 2024
@francislavoie
Member

See https://caddyserver.com/sponsor, you can reach out that way to get a quote.

@mholt
Member

mholt commented Apr 22, 2024

I got the email. Thank you for reaching out!

While investigating this, I realized this should already work:

:1234 {
	reverse_proxy 127.0.0.1:1235 {
		# same as the OP's original config
		lb_retries 3
		lb_try_interval 1s

		# only retry request if the app will treat it as idempotent
		# (GETs are considered idempotent by default, POSTs aren't)
		lb_retry_match header Idempotency-Key *

		# enable buffering sufficient to fit the request body
		request_buffers 128K

		# optional, just to see what's going on
		verbose_logs
	}
}

I start the proxy, and then I fire off a request:

$ curl -v "http://localhost:1234" -H 'Idempotency-Key: asdf' -d 'foobar'

and then quickly, before the load balancer gives up, I bring up the simulated backend:

$ caddy respond --listen 127.0.0.1:1235 'Request body: {http.request.body}'

But I was getting readFrom errors, because the request body was being closed... but it was definitely being fully read into the buffer! So closing the request body shouldn't matter.

Anyway, all I had to do was patch half a line of code: 613d544

Seems to work now.

@mholt mholt closed this as completed Apr 22, 2024
@mholt mholt added the bug 🐞 Something isn't working label Apr 22, 2024
@mholt mholt added this to the v2.8.0 milestone Apr 22, 2024
@mholt
Member

mholt commented Apr 22, 2024

Actually, just following up with d93e027 which also preserves the body in case one of the backends partially (or totally) depletes it before having the connection severed.

@gottlike
Author

Awesome, thank you!

@gottlike
Author

@mholt I noticed two minor issues while playing around with this on 2.7.6 with GET requests:

  • After successfully retrying, Caddy responds with two "Server: Caddy" headers
  • Using the lb_retry_match directive doesn't seem to do anything with regard to GET requests? No matter what, the retries are always attempted.

@gottlike
Author

From the official documentation of lb_retry_match:

If the connection to the upstream failed, a retry is always allowed.

Doesn't this kind of defeat its purpose, at least for my use case?

@francislavoie
Member

francislavoie commented Apr 23, 2024

After successfully retrying, Caddy responds with two "Server: Caddy" headers

Are you sure this is because of the retry? Are you not just proxying to another Caddy server?

Using the lb_retry_match directive doesn't seem to do anything with regard to GET requests? No matter what, the retries are always attempted.

What do you mean? What did you try to set it to? What's your goal?

Doesn't this kind of defeat its purpose, at least for my use case?

In what way? Connection failures are always retryable, because obviously the request wasn't sent anywhere yet.

@gottlike
Author

gottlike commented Apr 23, 2024

Okay, here's a reproduction of the issue:

Caddyfile:

:1234 {
	reverse_proxy 127.0.0.1:1235 {
		lb_retries 5
		lb_try_interval 1s

		lb_retry_match header X-Idempotent-Retry true

		request_buffers 256K

		verbose_logs
	}
}

I run this with caddy run

Then I run curl -v "http://localhost:1234" in another tab.

And 1 second afterwards caddy respond --listen 127.0.0.1:1235 'Hi there!' in another tab.

I would expect to receive a 502 error with curl, even though the proxy went up after a second. That's because the request didn't have "X-Idempotent-Retry: true" as a header.

If I would run curl -v "http://localhost:1234" -H 'X-Idempotent-Retry: true' I would expect to get a 200 with "Hi there!" after 2 seconds.

Aaand you are right about the dual "Server: Caddy" response. That's because I use Caddy as a reverse proxy, obviously - sorry.

@gottlike
Author

gottlike commented Apr 23, 2024

My use case is very similar to this simple reproduction:

I have a bunch of Docker containers as proxies and I'd like to have a zero-downtime restart/deployment for requests that are idempotent. But only for those, so that we don't have a retry on a POST that was already partly processed (which could happen now, with this fix/addition).
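For that use case, a configuration along these lines — adapted from the example earlier in the thread; the Idempotency-Key header is just an illustration, any marker header works — would restrict retries after a broken connection to requests explicitly marked as safe to replay:

```
example.com {
	reverse_proxy http://docker_container1:9292 {
		lb_retries 3
		lb_try_interval 1s

		# retry only requests the client marked as safe to replay;
		# dial failures (request never sent) are always retried
		lb_retry_match header Idempotency-Key *

		# buffer bodies so they can be replayed on retry
		request_buffers 128K
	}
}
```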

@francislavoie
Member

francislavoie commented Apr 23, 2024

And 1 second afterwards caddy respond --listen 127.0.0.1:1235 'Hi there!' in another tab.

Aaand you are right about the dual "Server: Caddy" response. That's because I use Caddy as a reverse proxy, obviously - sorry.

Yeah, exactly, you're running Caddy behind Caddy, so of course it'll have two copies of the Server: Caddy header. That's working as intended. Each server adds its own Server header.

I would expect to receive a 502 error with curl, even though the proxy went up after a second. That's because the request didn't have "X-Idempotent-Retry: true" as a header.

So you want to not retry even connection problems? But why? I can't think of any reason why that would be helpful.

The logic for retries is here:

func (lb LoadBalancing) tryAgain(ctx caddy.Context, start time.Time, retries int, proxyErr error, req *http.Request) bool {
	// no retries are configured
	if lb.TryDuration == 0 && lb.Retries == 0 {
		return false
	}
	// if we've tried long enough, break
	if lb.TryDuration > 0 && time.Since(start) >= time.Duration(lb.TryDuration) {
		return false
	}
	// if we've reached the retry limit, break
	if lb.Retries > 0 && retries >= lb.Retries {
		return false
	}
	// if the error occurred while dialing (i.e. a connection
	// could not even be established to the upstream), then it
	// should be safe to retry, since without a connection, no
	// HTTP request can be transmitted; but if the error is not
	// specifically a dialer error, we need to be careful
	if proxyErr != nil {
		_, isDialError := proxyErr.(DialError)
		herr, isHandlerError := proxyErr.(caddyhttp.HandlerError)
		// if the error occurred after a connection was established,
		// we have to assume the upstream received the request, and
		// retries need to be carefully decided, because some requests
		// are not idempotent
		if !isDialError && !(isHandlerError && errors.Is(herr, errNoUpstream)) {
			if lb.RetryMatch == nil && req.Method != "GET" {
				// by default, don't retry requests if they aren't GET
				return false
			}
			if !lb.RetryMatch.AnyMatch(req) {
				return false
			}
		}
	}
	// fast path; if the interval is zero, we don't need to wait
	if lb.TryInterval == 0 {
		return true
	}
	// otherwise, wait and try the next available host
	timer := time.NewTimer(time.Duration(lb.TryInterval))
	select {
	case <-timer.C:
		return true
	case <-ctx.Done():
		if !timer.Stop() {
			// if the timer has been stopped then read from the channel
			<-timer.C
		}
		return false
	}
}

@gottlike
Author

gottlike commented Apr 23, 2024

So you want to not retry even connection problems? But why? I can't think of any reason why that would be helpful.

Okay, so the "always retry, no matter what lb_retry_match says"-behaviour only happens if Caddy was never able to initially connect to the upstream? That would be safe then, I guess. I figured it would also do that if it did connect initially and the upstream connection broke while processing the request. If that's not the case and the lb_retry_match is taken into account in those cases, that's fine.

@francislavoie
Member

Yeah, connection failures (before the HTTP request is written upstream) are always allowed to be retried, because we know that's always safe. The lb_retry_match is for enabling retries of methods other than GET when the failure happens after connecting.

@mholt
Member

mholt commented Apr 23, 2024

Thanks for the discussion here -- let me know if there's anything else needed!

@gottlike
Author

Actually, I do have another suggestion 😆: When the request_buffers size is not big enough for requests, Caddy responds with a 502 error. In my eyes that's not quite accurate and a 507 error would be more appropriate. But yeah.. not exactly high prio and open for debate.

@mohammed90
Member

Actually, I do have another suggestion 😆: When the request_buffers size is not big enough for requests, Caddy responds with a 502 error. In my eyes that's not quite accurate and a 507 error would be more appropriate. But yeah.. not exactly high prio and open for debate.

MDN says this status is limited to WebDAV usage only. It can also be argued that returning this status from a reverse proxy exposes more information than necessary to clients.

@gottlike
Author

gottlike commented Apr 24, 2024

Yeah, actually there's no perfect fit for this. But a 502 is wrong, too. That had me scratching my head for a while, since my backend never received the request and didn't have an error. Only found out by trial and error that it's because of the request buffer size.

@mohammed90
Member

mohammed90 commented Apr 24, 2024

Perhaps 413 Content Too Large?

@gottlike
Author

That's probably best. It's still not super obvious for the server admin, but much better than the 502 👍
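Incidentally, if the goal is a deliberate 413 for oversized bodies, Caddy's request_body directive can cap the body size before the proxy ever sees the request. A sketch (the 256KB value just mirrors the buffer size discussed earlier; Caddy should reply with 413 once the limit is exceeded):

```
example.com {
	request_body {
		max_size 256KB
	}
	reverse_proxy 127.0.0.1:1235
}
```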

@mholt
Member

mholt commented Apr 24, 2024

@gottlike

Actually, I do have another suggestion 😆: When the request_buffers size is not big enough for requests, Caddy responds with a 502 error.

Hmm, I'm not sure if that's expected. 🤔 That's not what happens for me. For me, it still reads the whole body and passes it upstream, it just only buffers up to the buffer size first.

@gottlike
Author

gottlike commented Apr 24, 2024

Okay, interesting! Here's how you can easily reproduce it (on MacOS at least):

:1234 {
        reverse_proxy 127.0.0.1:1235 {
                request_buffers 256K
                verbose_logs
        }
}

Create a big file with dummy data: echo '{"data":"'$(openssl rand -base64 $((512*1024)) | tr -d '\n')'"}' > file.json

Start the upstream: caddy respond --listen 127.0.0.1:1235 'Heyho!'

Then: curl -v -X POST -H "Content-Type: application/json" -d @file.json http://localhost:1234

With Caddy 2.7.6 I'm getting:

*   Trying [::1]:1234...
* Connected to localhost (::1) port 1234
> POST / HTTP/1.1
> Host: localhost:1234
> User-Agent: curl/8.4.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 699063
> 
* We are completely uploaded and fine
< HTTP/1.1 502 Bad Gateway
< Server: Caddy
< Date: Wed, 24 Apr 2024 18:14:51 GMT
< Content-Length: 0
< 
* Connection #0 to host localhost left intact
2024/04/24 18:14:43.388 ERROR   http.log.error  net/http: HTTP/1.x transport connection broken: http: ContentLength=256000 with Body length 699063      {"request": {"remote_ip": "::1", "remote_port": "58598", "client_ip": "::1", "proto": "HTTP/1.1", "method": "POST", "host": "localhost:1234", "uri": "/", "headers": {"Accept": ["*/*"], "Content-Type": ["application/json"], "Content-Length": ["699063"], "User-Agent": ["curl/8.4.0"]}}, "duration": 0.001649833, "status": 502, "err_id": "ekrqejxrx", "err_trace": "reverseproxy.statusError (reverseproxy.go:1267)"}

@mholt
Member

mholt commented Apr 24, 2024

@gottlike Interesting, works for me:

$ curl -v -X POST -H "Content-Type: application/json" -d @file.json http://localhost:1234
Note: Unnecessary use of -X or --request, POST is already inferred.
* Host localhost:1234 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:1234...
* Connected to localhost (::1) port 1234
> POST / HTTP/1.1
> Host: localhost:1234
> User-Agent: curl/8.7.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 699063
> 
* upload completely sent off: 699063 bytes
< HTTP/1.1 200 OK
< Content-Type: text/plain; charset=utf-8
< Date: Wed, 24 Apr 2024 18:21:55 GMT
< Server: Caddy
< Server: Caddy
< Transfer-Encoding: chunked
< 
...

(also, Daniel the author of curl would like to remind you that -X POST is not necessary 😇 )

@gottlike
Author

gottlike commented Apr 24, 2024

Damn, and I omitted the warning so nobody would notice 😋

How can we debug this further? I edited my post and added the Caddy logs, maybe that helps shed some light?

@mholt
Member

mholt commented Apr 24, 2024

@gottlike To clarify, this is separate from retries now? Because my attempt did not employ any retries (the backend was up before running curl, and there is nothing in the config to perform retries).

Let me try on a Mac. I'm on Linux currently...

@mholt
Member

mholt commented Apr 24, 2024

It's unfortunately working for me on Mac, too:

matt@MattBook-Air caddy % curl -v -H "Content-Type: application/json" -d @file.json http://localhost:1234
*   Trying [::1]:1234...
* Connected to localhost (::1) port 1234
> POST / HTTP/1.1
> Host: localhost:1234
> User-Agent: curl/8.4.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 699063
> 
* We are completely uploaded and fine
< HTTP/1.1 200 OK
< Content-Length: 6
< Content-Type: text/plain; charset=utf-8
< Date: Wed, 24 Apr 2024 18:38:00 GMT
< Server: Caddy
< Server: Caddy
< 
* Connection #0 to host localhost left intact
Heyho!%                 

@gottlike
Author

gottlike commented Apr 24, 2024

@mholt Yes, just the very simple Caddyfile I posted is enough. The errors are on Linux (x86) and MacOS (ARM) for me.

Are you also using 2.7.6, not master or something?

@mholt
Member

mholt commented Apr 24, 2024

@gottlike Oh, I am using master, good point.

We're hoping to tag 2.8 (beta) this week, can you try master? That should work!

@gottlike
Author

gottlike commented Apr 24, 2024

Ahh.. that might be it! I never built any Go project, so I would like to wait a few days and try again, once the beta release is available - if that's alright :)

@mohammed90
Member

You can grab the CI artifact from here

https://github.com/caddyserver/caddy/actions/runs/8808836194#artifacts

@mholt
Member

mholt commented Apr 24, 2024

@gottlike No need to wait, please feel free to download our pre-compiled binaries from CI artifacts: https://github.com/caddyserver/caddy/actions/runs/8808836194#artifacts 😁

Edit: Lightning Mohammed beat me to it again!

@gottlike
Author

Thanks guys! With master it works 🥳
