
skipper times out before clients on slow connections can finish their requests #795

Closed
ptrf opened this issue Sep 18, 2018 · 16 comments

@ptrf

ptrf commented Sep 18, 2018

We have found a bug in skipper that affects connections where a client attempts to send a large request payload over a slow connection. Specifically, skipper errors out with an i/o timeout:

error while proxying, route [...] with backend http://[...]:8080, status code 504: dialing failed false: read tcp [...]:9999->[...]:28972: i/o timeout

Our setup contains a Kubernetes-deployed service that exposes a POST endpoint. This endpoint can accept payloads of several hundred MB.

When a client attempts to POST a large payload on a slow connection, skipper aborts the request after reaching the timeout specified by -read-timeout-server.

It seems that skipper times out because the backend has not responded, even though the client's request has yet to complete.

We attempted a workaround in which the service sends multiple 100 Continue informational responses, one after reading each 8 MB chunk, but there we in turn ran into golang/go#26089 and golang/go#26088.

Steps to reproduce:

  1. Set up a service that exposes a POST or PUT endpoint behind skipper; the endpoint may discard the request payload after reading it (a minimal sketch of such a service follows these steps).
  2. Set skipper's -read-timeout-server to, say, 10 seconds.
  3. Use curl(1) to POST a file with the --limit-rate option:
curl -X POST [skipper-endpoint-url] --data-binary @somelargefile --limit-rate 2k

If somelargefile is 50 KB, it should take 25+ seconds for curl(1) to finish the request, which exceeds the timeout specified by -read-timeout-server, and you should see the request aborted even though the client hasn't finished sending it.
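
For step 1, a minimal sketch of such a backend in Go, assuming a plain net/http service; the /upload path is just a placeholder, and port 8080 matches the backend port from the error above:

package main

import (
	"io"
	"log"
	"net/http"
)

// discardHandler reads the whole request body and throws it away,
// which is all the reproduction needs from the backend.
func discardHandler(w http.ResponseWriter, r *http.Request) {
	n, err := io.Copy(io.Discard, r.Body)
	if err != nil {
		log.Printf("body read aborted after %d bytes: %v", n, err)
		http.Error(w, "read failed", http.StatusBadRequest)
		return
	}
	log.Printf("received %d bytes", n)
	w.WriteHeader(http.StatusNoContent)
}

func main() {
	http.HandleFunc("/upload", discardHandler) // hypothetical path
	log.Fatal(http.ListenAndServe(":8080", nil))
}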

cc @roffe

@szuecs
Member

szuecs commented Sep 18, 2018

Thanks for creating this detailed issue!

Did you also try changing -read-timeout-server to 0 or to a higher number to see whether the problem stops happening?

@roffe

roffe commented Sep 18, 2018

If we raised the number, we could transfer the upload until the timeout happened; i.e., with the timeout set to 15 min and a transfer that takes 30 min, the transfer gets cut off at the timeout. I verified this with curl rate-limited to different transfer durations.

We never tried 0, because having no timeout at all doesn't sound too healthy; also, raising the timeout really high would affect other ingress rules on the same controller.

It might be worth adding a per-ingress option to configure this timeout, or maybe resetting the timeout while the client is still sending data.

@szuecs
Member

szuecs commented Sep 19, 2018

An option per ingress does not scale (it would mean a connection pool per ingress).
We would need to change a bit more, which we will do anyway in the future, but maybe not to fix this issue.
We used the Go default timeouts, and if you want you can set the read timeout to 0, because there are plenty of other timeouts, for example the ones catching TCP/IP timeouts.
I built the client/server timeouts based on https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/amp/ which explains them well.
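
For reference, a minimal sketch of the server-side timeouts that article covers, on a plain http.Server; the values are only illustrative, not skipper's defaults:

package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr:              ":9999",                // skipper's listener port from the error log above
		Handler:           http.NotFoundHandler(), // placeholder handler
		ReadHeaderTimeout: 30 * time.Second,       // max time to read the request headers
		ReadTimeout:       0,                      // 0 = no limit for reading the entire request, body included
		WriteTimeout:      60 * time.Second,       // max time to write the response
		IdleTimeout:       120 * time.Second,      // keep-alive idle time between requests
	}
	log.Fatal(srv.ListenAndServe())
}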

@tkrop
Member

tkrop commented Sep 19, 2018

Can you ensure that Skipper/Go closes the TCP connections on the other timeouts? Otherwise you might end up with the too-many-open-file-descriptors problem when setting the read timeout to 0.

@szuecs
Member

szuecs commented Sep 19, 2018

@tkrop please prove that it doesn't. I would appreciate it if you showed that the assumption is wrong, because that would be another issue in golang we could file upstream. I am pretty sure that my assumption from reading the docs is right.

ReadHeaderTimeout is HTTP, and if a timeout that operates at the TCP level doesn't close the connection, what would the purpose of a TCP timeout be?
I hope you read the article, and you might consider reading the godoc, which is more current than the article.

@szuecs
Member

szuecs commented Sep 19, 2018

@roffe a timeout of 15 minutes is already easy to DoS. Why bother with the timeout if you have to rate-limit anyway?

@roffe

roffe commented Sep 19, 2018

The rate limit is only there to simulate a slow connection.

@roffe

roffe commented Sep 19, 2018

99% of all uploads would come from reliable connections, but sometimes people are on a flaky cellphone connection and upload slowly, and then the total transfer time exceeds -read-timeout-server.

@szuecs
Member

szuecs commented Sep 19, 2018

I am not sure we are talking about the same problem.
I mean you can set the read timeout to 0, because the DoS protection you need against malicious or unhealthy clients can be provided by a client-side ratelimit.
For timeouts you can use the TCP (transport) timeouts, which don't care whether the HTTP payload is transferred in time.

Am I missing something?

@ptrf
Author

ptrf commented Sep 20, 2018

@szuecs If I'm not mistaken in reading the article above – "The complete guide to Go net/http timeouts" – the maximum amount of time that skipper will wait for a client to finish sending its request is controlled by http.Server.ReadTimeout, which in skipper is set via the parameter -read-timeout-server?

If I've understood that correctly, then the behaviour we're seeing is intentional.

@szuecs
Member

szuecs commented Sep 20, 2018

@ptrf yes, exactly. In the code you can find this at https://github.com/zalando/skipper/blob/master/skipper.go#L588-L596

You could also set http.Server.ReadHeaderTimeout via -read-header-timeout-server=30s to time out client connections, and use -read-timeout-server=0 to get the intended behaviour, I would expect.
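
For example, a skipper invocation with those suggested values (all other flags omitted):

skipper -read-header-timeout-server=30s -read-timeout-server=0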

@ptrf
Author

ptrf commented Sep 20, 2018

@roffe Well, I guess we should close this issue?

@ptrf
Author

ptrf commented Sep 20, 2018

@szuecs And thanks for looking at our inquiry :-)

@tkrop
Member

tkrop commented Sep 20, 2018

@szuecs you are probably right. I can't prove this. It is just based on the intuition that Go propagates the 0 down to the socket. This may be no problem if Go's http package terminates the receive operation by other means, e.g. by closing the socket. I'm not enough of a Go expert to say how this is handled, but it might be worth keeping in mind or testing before using this setup.

@roffe

roffe commented Sep 20, 2018

@ptrf yes, we can close the issue. Thanks for looking into this, @szuecs, and for elaborating on the behaviour.

@ptrf
Author

ptrf commented Sep 20, 2018

closing then, thanks again

ptrf closed this as completed Sep 20, 2018