cmd/cacheserver: too many timeouts #313
Comments
Not fully debugged yet, but here is what I have uncovered.

The actual error is produced in the Go runtime by epoll and is reported first in the Upspin code by the rpc/server code running in the storage server in Google Cloud. The error happens when the HTTP server, trying to read the payload for a store.Put method invocation, times out at

I was unable to find a timeout whose value affected the behavior, so I tried another approach and changed the value of writers in store/storecache. It is set to 20 originally, and at 20 I get timeouts every second or two. At 5 they appear a few times a minute, at 4 once every few minutes, and at 3 never.

Let's look at the 4 setting, as that almost never times out and keeps the line saturated as well as 20 (sic). My home line steadily delivers 1.5MB/s upstream, and with a writers setting of 4, this rate is maintained. A setting of 4 means about 4MB are outstanding on the wire, and that is right on the cusp of timing out. At 1.5MB/s, 4MB takes 3 or 4 seconds. Thus we would expect to see a timeout somewhere in the system in the neighborhood of 3-5 seconds, but I cannot find one.

The network code in the Go runtime is inscrutable to me. (The amazing thing about epoll is that it's better than its predecessor.) Someone who understands that code, or maybe the HTTP code, might know where the relevant timeout is, and may be able to adjust it.

Meanwhile, if I get a chance with a different throughput network, I'll see what the sweet value of writers is on that, and maybe find a way to set it dynamically. For now, I will hand-tune my value of writers.

This isn't over yet.
Without understanding the underlying, extraordinarily complex library code, we can't assume that there is any fairness among the different streams. Therefore, the applicable timeout may be much more than 3-5 seconds, i.e., one stream could be getting starved.
The relevant timeout is 15s and is in cloud/https. See https://upspin-review.googlesource.com/c/8285/

There are several free parameters that affect the timeout rate, and one fixed. The fixed one is the bandwidth available; the free ones are number of parallel writers, block size, and timeout. It should be possible to adjust one or more of the free parameters based on the observed bandwidth, although I realize this is not going to be easy.

The current settings will make it all but impossible for store.Put to succeed on a slow link.
We also have to keep in mind the DoS considerations from the Timeouts section
of the gopheracademy discussion referenced in #127. That's not to say
that we should hurt normal performance to defend against attacks, just that
a good solution will diagnose all these many issues and is probably worth
documenting for a wider audience.
…On Fri, Mar 10, 2017 at 8:21 AM, Rob Pike ***@***.***> wrote:
The relevant timeout is 15s and is in cloud/https See
https://upspin-review.googlesource.com/c/8285/
There are several free parameters that affect the timeout rate, and one
fixed. The fixed one is the bandwidth available; the free ones are number
of parallel writers, block size, and timeout. It should be possible to
adjust one or more of the free parameters based on the observed bandwidth,
although I realize this is not going to be easy.
The current settings will make it all but impossible for store.Put to
succeed on a slow link.
4 is still arbitrary but at least on my home line generates almost no timeouts while still keeping the uplink saturated.

Update #313

Change-Id: Ib641313ac7b98151d5fb80b1b95a987005fedb4b
Reviewed-on: https://upspin-review.googlesource.com/8320
Reviewed-by: David Presotto
This has been resolved, mostly.
When creating a large data set with many largish files (a music library), it all worked but took about twice as long as it should have, according to the bandwidth I was seeing and the size of the data, and was accompanied by a great many errors like this:
(Note too the dangling colon.)
I believe what's happening is that the client is doing parallel 1MB writes, and a significant fraction of them time out just before completion, in this case halving my bandwidth. I imagine the parameters of my push matter, but I could see this problem producing much worse slowdowns.
Understand and fix.