
lb_retries apparently not working #6259

Closed
gottlike opened this issue Apr 21, 2024 · 32 comments
Labels: bug 🐞 Something isn't working
Milestone: v2.8.0

@gottlike

gottlike commented Apr 21, 2024

I'd like to be able to recreate/restart Docker services that are hooked into Caddy via reverse_proxy without downtime. Usually a (POST) request takes about 10 seconds to complete. My config is super simple:

example.com {
        log {
                format console
                output stdout
        }

        root * /srv

        file_server {
                precompressed br gzip
        }

        reverse_proxy /-/proxy1 http://docker_container1:9292 {
                lb_retries 3
                lb_try_interval 1s
        }
        reverse_proxy /-/proxy2 http://docker_container2:9292 {
                lb_retries 3
                lb_try_interval 1s
        }
}

When running a bunch of requests and restarting the container before the responses were returned, I always get 502 errors, despite the lb_retries directive.

The error messages look like this:

2024/04/21 12:35:27.404 ERROR   http.log.access.log0    handled request {"request": {"remote_ip": "62.153.218.40", "remote_port": "44362", "client_ip": "62.153.218.40", "proto": "HTTP/1.1", "method": "POST", "host": "example.com", "uri": "/-/proxy1", "headers": {"Accept-Encoding": ["gzip, compress, deflate, br"], "Connection": ["keep-alive"], "Accept": ["application/json, text/plain, */*"], "Content-Type": ["application/json"], "User-Agent": ["axios/1.6.7"], "Content-Length": ["67993"]}, "tls": {"resumed": true, "version": 772, "cipher_suite": 4865, "proto": "", "server_name": "example.com"}}, "bytes_read": 0, "user_id": "", "duration": 1.014601761, "size": 0, "status": 502, "resp_headers": {"Server": ["Caddy"], "Alt-Svc": ["h3=\":443\"; ma=2592000"]}}

{"level":"error","ts":1713702927.4045582,"logger":"http.log.error.log0","msg":"readfrom tcp 172.18.0.2:34872->172.18.0.3:9292: http: invalid Read on closed Body","request":{"remote_ip":"62.153.218.40","remote_port":"44362","client_ip":"62.153.218.40","proto":"HTTP/1.1","method":"POST","host":"example.com","uri":"/-/proxy1","headers":{"Accept-Encoding":["gzip, compress, deflate, br"],"Connection":["keep-alive"],"Accept":["application/json, text/plain, */*"],"Content-Type":["application/json"],"User-Agent":["axios/1.6.7"],"Content-Length":["67993"]},"tls":{"resumed":true,"version":772,"cipher_suite":4865,"proto":"","server_name":"example.com"}},"duration":1.014601761,"status":502,"err_id":"cr390dupf","err_trace":"reverseproxy.statusError (reverseproxy.go:1267)"}
@gottlike
Author

gottlike commented Apr 21, 2024

Never mind, it seems this works as expected (or doesn't, rather): https://caddy.community/t/retries-for-post-requests-with-body/23478

@gottlike
Author

Okay, I would really like to see this. How much would it cost to sponsor development of this feature, @francislavoie ?

@gottlike gottlike reopened this Apr 21, 2024
@francislavoie
Member

See https://caddyserver.com/sponsor, you can reach out that way to get a quote.

@mholt
Member

mholt commented Apr 22, 2024

I got the email. Thank you for reaching out!

While investigating this, I realized this should already work:

:1234 {
	reverse_proxy 127.0.0.1:1235 {
		# same as the OP's original config
		lb_retries 3
		lb_try_interval 1s

		# only retry request if the app will treat it as idempotent
		# (GETs are considered idempotent by default, POSTs aren't)
		lb_retry_match header Idempotency-Key *

		# enable buffering sufficient to fit the request body
		request_buffers 128K

		# optional, just to see what's going on
		verbose_logs
	}
}

I start the proxy, and then I fire off a request:

$ curl -v "http://localhost:1234" -H 'Idempotency-Key: asdf' -d 'foobar'

and then quickly, before the load balancer gives up, I bring up the simulated backend:

$ caddy respond --listen 127.0.0.1:1235 'Request body: {http.request.body}'

But I was getting readFrom errors, because the request body was being closed... but it was definitely being fully read into the buffer! So closing the request body shouldn't matter.

Anyway, all I had to do was patch half a line of code: 613d544

Seems to work now.

@mholt mholt closed this as completed Apr 22, 2024
@mholt mholt added the bug 🐞 Something isn't working label Apr 22, 2024
@mholt mholt added this to the v2.8.0 milestone Apr 22, 2024
@mholt
Member

mholt commented Apr 22, 2024

Actually, just following up with d93e027 which also preserves the body in case one of the backends partially (or totally) depletes it before having the connection severed.

@gottlike
Author

Awesome, thank you!

@gottlike
Author

@mholt I noticed two minor issues while playing around with this on 2.7.6 with GET requests:

  • After successfully retrying, Caddy responds with two "Server: Caddy" headers
  • Using the lb_retry_match directive doesn't seem to do anything with regard to GET requests? No matter what, the retries are always attempted.

@gottlike
Author

From the official documentation of lb_retry_match:

If the connection to the upstream failed, a retry is always allowed.

Doesn't this kind of defeat its purpose, at least for my use case?

@francislavoie
Member

francislavoie commented Apr 23, 2024

After successfully retrying, Caddy responds with two "Server: Caddy" headers

Are you sure this is because of the retry? Are you not just proxying to another Caddy server?

Using the lb_retry_match directive doesn't seem to do anything with regard to GET requests? No matter what, the retries are always attempted.

What do you mean? What did you try to set it to? What's your goal?

Doesn't this kind of defeat its purpose, at least for my use case?

In what way? Connection failures are always retryable, because obviously the request wasn't sent anywhere yet.

@gottlike
Author

gottlike commented Apr 23, 2024

Okay, here's a reproduction of the issue:

Caddyfile:

:1234 {
	reverse_proxy 127.0.0.1:1235 {
		lb_retries 5
		lb_try_interval 1s

		lb_retry_match header X-Idempotent-Retry true

		request_buffers 256K

		verbose_logs
	}
}

I run this with caddy run

Then I run curl -v "http://localhost:1234" in another tab.

And 1 second afterwards caddy respond --listen 127.0.0.1:1235 'Hi there!' in another tab.

I would expect to receive a 502 error with curl, even though the proxy went up after a second. That's because the request didn't have "X-Idempotent-Retry: true" as a header.

If I would run curl -v "http://localhost:1234" -H 'X-Idempotent-Retry: true' I would expect to get a 200 with "Hi there!" after 2 seconds.

Aaand you are right about the dual "Server: Caddy" response. That's because I use Caddy as a reverse proxy, obviously - sorry.

@gottlike
Author

gottlike commented Apr 23, 2024

My use case is very similar to this simple reproduction:

I have a bunch of Docker containers as proxies and I'd like to have a zero-downtime restart/deployment for requests that are idempotent. But only for those, so that we don't have a retry on a POST that was already partly processed (which could happen now, with this fix/addition).
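For that use case, a configuration along these lines — adapted from the example earlier in the thread; the Idempotency-Key header is just an illustration, any marker header works — would restrict retries after a broken connection to requests explicitly marked as safe to replay:

```
example.com {
	reverse_proxy http://docker_container1:9292 {
		lb_retries 3
		lb_try_interval 1s

		# retry only requests the client marked as safe to replay;
		# dial failures (request never sent) are always retried
		lb_retry_match header Idempotency-Key *

		# buffer bodies so they can be replayed on retry
		request_buffers 128K
	}
}
```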

@francislavoie
Member

francislavoie commented Apr 23, 2024

And 1 second afterwards caddy respond --listen 127.0.0.1:1235 'Hi there!' in another tab.

Aaand you are right about the dual "Server: Caddy" response. That's because I use Caddy as a reverse proxy, obviously - sorry.

Yeah, exactly, you're running Caddy behind Caddy, so of course it'll have two copies of the Server: Caddy header. That's working as intended. Each server adds its own Server header.

I would expect to receive a 502 error with curl, even though the proxy went up after a second. That's because the request didn't have "X-Idempotent-Retry: true" as a header.

So you want to not retry even connection problems? But why? I can't think of any reason why that would be helpful.

The logic for retries is here:

func (lb LoadBalancing) tryAgain(ctx caddy.Context, start time.Time, retries int, proxyErr error, req *http.Request) bool {
	// no retries are configured
	if lb.TryDuration == 0 && lb.Retries == 0 {
		return false
	}
	// if we've tried long enough, break
	if lb.TryDuration > 0 && time.Since(start) >= time.Duration(lb.TryDuration) {
		return false
	}
	// if we've reached the retry limit, break
	if lb.Retries > 0 && retries >= lb.Retries {
		return false
	}
	// if the error occurred while dialing (i.e. a connection
	// could not even be established to the upstream), then it
	// should be safe to retry, since without a connection, no
	// HTTP request can be transmitted; but if the error is not
	// specifically a dialer error, we need to be careful
	if proxyErr != nil {
		_, isDialError := proxyErr.(DialError)
		herr, isHandlerError := proxyErr.(caddyhttp.HandlerError)
		// if the error occurred after a connection was established,
		// we have to assume the upstream received the request, and
		// retries need to be carefully decided, because some requests
		// are not idempotent
		if !isDialError && !(isHandlerError && errors.Is(herr, errNoUpstream)) {
			if lb.RetryMatch == nil && req.Method != "GET" {
				// by default, don't retry requests if they aren't GET
				return false
			}
			if !lb.RetryMatch.AnyMatch(req) {
				return false
			}
		}
	}
	// fast path; if the interval is zero, we don't need to wait
	if lb.TryInterval == 0 {
		return true
	}
	// otherwise, wait and try the next available host
	timer := time.NewTimer(time.Duration(lb.TryInterval))
	select {
	case <-timer.C:
		return true
	case <-ctx.Done():
		if !timer.Stop() {
			// if the timer has been stopped then read from the channel
			<-timer.C
		}
		return false
	}
}

@gottlike
Author

gottlike commented Apr 23, 2024

So you want to not retry even connection problems? But why? I can't think of any reason why that would be helpful.

Okay, so the "always retry, no matter what lb_retry_match says"-behaviour only happens if Caddy was never able to initially connect to the upstream? That would be safe then, I guess. I figured it would also do that if it did connect initially and the upstream connection broke while processing the request. If that's not the case and the lb_retry_match is taken into account in those cases, that's fine.

@francislavoie
Member

Yeah, connection failures (before the HTTP request is written upstream) are always allowed to be retried, because we know that's always safe. The lb_retry_match is for enabling retries of methods other than GET when the failure happens after connecting.

@mholt
Member

mholt commented Apr 23, 2024

Thanks for the discussion here -- let me know if there's anything else needed!

@gottlike
Author

Actually, I do have another suggestion 😆: When the request_buffers size is not big enough for requests, Caddy responds with a 502 error. In my eyes that's not quite accurate and a 507 error would be more appropriate. But yeah.. not exactly high prio and open for debate.

@mohammed90
Member

Actually, I do have another suggestion 😆: When the request_buffers size is not big enough for requests, Caddy responds with a 502 error. In my eyes that's not quite accurate and a 507 error would be more appropriate. But yeah.. not exactly high prio and open for debate.

MDN says this status is limited to WebDAV usage only. It can also be argued that returning this status from a reverse proxy exposes more information than necessary to clients.

@gottlike
Author

gottlike commented Apr 24, 2024

Yeah, actually there's no perfect fit for this. But a 502 is wrong, too. That had me scratching my head for a while, since my backend never received the request and didn't have an error. Only found out by trial and error that it's because of the request buffer size.

@mohammed90
Member

mohammed90 commented Apr 24, 2024

Perhaps 413 Content Too Large?

@gottlike
Author

That's probably best. It's still not super obvious for the server admin, but much better than the 502 👍
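Incidentally, if the goal is a deliberate 413 for oversized bodies, Caddy's request_body directive can cap the body size before the proxy ever sees the request. A sketch (the 256KB value just mirrors the buffer size discussed earlier; Caddy should reply with 413 once the limit is exceeded):

```
example.com {
	request_body {
		max_size 256KB
	}
	reverse_proxy 127.0.0.1:1235
}
```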

@mholt
Member

mholt commented Apr 24, 2024

@gottlike

Actually, I do have another suggestion 😆: When the request_buffers size is not big enough for requests, Caddy responds with a 502 error.

Hmm, I'm not sure if that's expected. 🤔 That's not what happens for me. For me, it still reads the whole body and passes it upstream, it just only buffers up to the buffer size first.

@gottlike
Author

gottlike commented Apr 24, 2024

Okay, interesting! Here's how you can easily reproduce it (on MacOS at least):

:1234 {
        reverse_proxy 127.0.0.1:1235 {
                request_buffers 256K
                verbose_logs
        }
}

Create a big file with dummy data: echo '{"data":"'$(openssl rand -base64 $((512*1024)) | tr -d '\n')'"}' > file.json

Start the upstream: caddy respond --listen 127.0.0.1:1235 'Heyho!'

Then: curl -v -X POST -H "Content-Type: application/json" -d @file.json http://localhost:1234

With Caddy 2.7.6 I'm getting:

*   Trying [::1]:1234...
* Connected to localhost (::1) port 1234
> POST / HTTP/1.1
> Host: localhost:1234
> User-Agent: curl/8.4.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 699063
> 
* We are completely uploaded and fine
< HTTP/1.1 502 Bad Gateway
< Server: Caddy
< Date: Wed, 24 Apr 2024 18:14:51 GMT
< Content-Length: 0
< 
* Connection #0 to host localhost left intact
2024/04/24 18:14:43.388 ERROR   http.log.error  net/http: HTTP/1.x transport connection broken: http: ContentLength=256000 with Body length 699063      {"request": {"remote_ip": "::1", "remote_port": "58598", "client_ip": "::1", "proto": "HTTP/1.1", "method": "POST", "host": "localhost:1234", "uri": "/", "headers": {"Accept": ["*/*"], "Content-Type": ["application/json"], "Content-Length": ["699063"], "User-Agent": ["curl/8.4.0"]}}, "duration": 0.001649833, "status": 502, "err_id": "ekrqejxrx", "err_trace": "reverseproxy.statusError (reverseproxy.go:1267)"}

@mholt
Member

mholt commented Apr 24, 2024

@gottlike Interesting, works for me:

$ curl -v -X POST -H "Content-Type: application/json" -d @file.json http://localhost:1234
Note: Unnecessary use of -X or --request, POST is already inferred.
* Host localhost:1234 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:1234...
* Connected to localhost (::1) port 1234
> POST / HTTP/1.1
> Host: localhost:1234
> User-Agent: curl/8.7.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 699063
> 
* upload completely sent off: 699063 bytes
< HTTP/1.1 200 OK
< Content-Type: text/plain; charset=utf-8
< Date: Wed, 24 Apr 2024 18:21:55 GMT
< Server: Caddy
< Server: Caddy
< Transfer-Encoding: chunked
< 
...

(also, Daniel the author of curl would like to remind you that -X POST is not necessary 😇 )

@gottlike
Author

gottlike commented Apr 24, 2024

Damn, and I omitted the warning so nobody would notice 😋

How can we debug this further? I edited my post and added the Caddy logs, maybe that helps shed some light?

@mholt
Member

mholt commented Apr 24, 2024

@gottlike To clarify, this is separate from retries now? Because my attempt did not employ any retries (the backend was up before running curl, and there is nothing in the config to perform retries).

Let me try on a Mac. I'm on Linux currently...

@mholt
Member

mholt commented Apr 24, 2024

It's unfortunately working for me on Mac, too:

matt@MattBook-Air caddy % curl -v -H "Content-Type: application/json" -d @file.json http://localhost:1234
*   Trying [::1]:1234...
* Connected to localhost (::1) port 1234
> POST / HTTP/1.1
> Host: localhost:1234
> User-Agent: curl/8.4.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 699063
> 
* We are completely uploaded and fine
< HTTP/1.1 200 OK
< Content-Length: 6
< Content-Type: text/plain; charset=utf-8
< Date: Wed, 24 Apr 2024 18:38:00 GMT
< Server: Caddy
< Server: Caddy
< 
* Connection #0 to host localhost left intact
Heyho!%                 

@gottlike
Author

gottlike commented Apr 24, 2024

@mholt Yes, just the very simple Caddyfile I posted is enough. The errors are on Linux (x86) and MacOS (ARM) for me.

Are you also using 2.7.6, not master or something?

@mholt
Member

mholt commented Apr 24, 2024

@gottlike Oh, I am using master, good point.

We're hoping to tag 2.8 (beta) this week, can you try master? That should work!

@gottlike
Author

gottlike commented Apr 24, 2024

Ahh.. that might be it! I never built any Go project, so I would like to wait a few days and try again, once the beta release is available - if that's alright :)

@mohammed90
Member

You can grab the CI artifact from here

https://github.com/caddyserver/caddy/actions/runs/8808836194#artifacts

@mholt
Member

mholt commented Apr 24, 2024

@gottlike No need to wait, please feel free to download our pre-compiled binaries from CI artifacts: https://github.com/caddyserver/caddy/actions/runs/8808836194#artifacts 😁

Edit: Lightning Mohammed beat me to it again!

@gottlike
Author

Thanks guys! With master it works 🥳
