Errors and high latency when using GCS backend under (medium) load #5419
Comments
@armon this is the issue we talked about a few days ago, fyi.
This error is coming from GCS/the official GCS libs. Did
@sethvargo Any thoughts here?
@roboll When you said you tried
Hmm - we added the
You're correct that it uses the standard library, pretty much out of the box 😄. Since it only happens in bursts, I don't think we are leaking a handle or anything.
Something of note: that first failure is coming from a failed health check. That could be causing a cascading failure which is why you're seeing latency. If the server is trying to check its own leadership status, but that request fails, you could be forcing a leader election, which will basically cause requests to hang during that time. Do it frequently enough and it becomes a serious bottleneck. Do you have any metrics or correlations on leadership transitions during this time?
In any event, this seems like a duplicate of googleapis/google-cloud-go#701 and subsequently googleapis/google-cloud-go#753 (/cc @jba)
P.S. thanks for the thorough bug report 😄
Also, if you're issuing that many requests that frequently, you're probably in the price range on GCS where it makes sense to migrate to using Google Cloud Spanner 😄
@sethvargo Yes - we're definitely not using GCS for the cost. GCS was our first choice for the simplicity, we plan to migrate to an alternate backend. Whatever we can do to stabilize performance on GCS while we work on that migration would be great though. (This is a
Agree, seems unlikely. We plan to try
Either way, as far as I can tell from the spec, the max number of streams for a client connection is something the server dictates, so I'd not expect there to be much configurable from the client side besides what we have (artificially limiting concurrency). We'll post back once we have info on whether this is helpful. Thanks!
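A minimal Go sketch of what "artificially limiting concurrency" on the client side amounts to, assuming a hypothetical doRequest helper and an arbitrary cap of 32 (roughly what a max_parallel-style option does): each operation must take a slot from a buffered channel before it runs, so no more than 32 requests are in flight at once.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// doRequest is a hypothetical stand-in for a single storage operation.
func doRequest(ctx context.Context, key string) error {
	fmt.Println("fetching", key)
	return nil
}

func main() {
	ctx := context.Background()
	sem := make(chan struct{}, 32) // at most 32 operations in flight
	var wg sync.WaitGroup

	for i := 0; i < 100; i++ {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot; blocks once 32 are already running
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when done
			_ = doRequest(ctx, fmt.Sprintf("key-%d", i))
		}(i)
	}
	wg.Wait()
}
```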
Thank you for your feedback! We just did some testing with max_parallel. We first tested by setting it to 32 and triggered a rolling update of our pods:
We then tested by setting it to 256 and the behavior seemed similar to the default (128).
It really seems that the bottleneck is in the http2 connection created by the GCS client. To validate this, we are about to test a custom vault build which, instead of creating a single client, creates multiple ones and chooses a random one for each operation. We will update the issue after testing it.
I also think that it is surprising to see this low-level http2 error in the vault logs. It would make sense for it to be caught in the http2 layer or in the gcs client (which could return a clear error on operations when too many are already in flight).
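For reference, a rough sketch of the kind of multi-client experiment described above, assuming the standard cloud.google.com/go/storage package; the bucket and object names are placeholders, not Vault's actual layout. As reported a couple of comments below, this did not end up helping in practice.

```go
package main

import (
	"context"
	"log"
	"math/rand"

	"cloud.google.com/go/storage"
)

// clientPool holds several GCS clients and hands out a random one per
// operation, in an attempt to spread load over more than one connection.
type clientPool struct {
	clients []*storage.Client
}

func newClientPool(ctx context.Context, n int) (*clientPool, error) {
	p := &clientPool{}
	for i := 0; i < n; i++ {
		c, err := storage.NewClient(ctx)
		if err != nil {
			return nil, err
		}
		p.clients = append(p.clients, c)
	}
	return p, nil
}

func (p *clientPool) pick() *storage.Client {
	return p.clients[rand.Intn(len(p.clients))]
}

func main() {
	ctx := context.Background()
	pool, err := newClientPool(ctx, 16)
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical read: each call may land on a different client.
	r, err := pool.pick().Bucket("my-vault-bucket").Object("core/lock").NewReader(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()
}
```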
Let us know how the custom build works out. I don't have much control over the GCS go library, but I can ping the people who do.
It did not change anything. I looked at connections from vault to google api IP addresses and I only saw 2 (we tried creating 16 clients).
We're also getting this issue periodically, except the load in our cluster is actually pretty minuscule. The underlying cloud storage bucket is getting about 10 requests/sec, but our Vault primary is still periodically choking up with messages like
Eventually Vault fails over to the standby, which does just fine and doesn't have any of these issues. We're definitely not at a scale where the Cloud Spanner storage backend makes sense, so I'd really like to try and figure out what's going on here.
Interestingly, the GCS server seems to set
I'm experimenting with dropping
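For anyone wanting to try the same kind of mitigation outside of Vault's own code, a sketch of forcing a GCS client onto HTTP/1.1 by supplying a transport that never negotiates HTTP/2; this is illustrative only, not how Vault's GCS backend is actually constructed, and credential wiring is omitted. Setting GODEBUG=http2client=0 in the server's environment is the simpler, process-wide equivalent.

```go
package main

import (
	"context"
	"crypto/tls"
	"log"
	"net/http"

	"cloud.google.com/go/storage"
	"google.golang.org/api/option"
)

func main() {
	// A non-nil, empty TLSNextProto map tells net/http never to negotiate
	// HTTP/2, so requests use pooled HTTP/1.1 keep-alive connections instead
	// of multiplexed streams on a single connection.
	h1Transport := &http.Transport{
		TLSNextProto: map[string]func(string, *tls.Conn) http.RoundTripper{},
	}

	ctx := context.Background()
	// NOTE: passing a raw http.Client bypasses the library's default OAuth
	// transport; real code would still need to wire in credentials.
	client, err := storage.NewClient(ctx,
		option.WithHTTPClient(&http.Client{Transport: h1Transport}))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
}
```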
Let us know if things get better for you with this option. In our case the connection pattern changed a little, but we still saw a lot of REFUSED_STREAM errors.
So I spent a bit of time trying to figure out what's going on here. Basically, our Vault setup looks like this:
The problem we're experiencing is that Vault is continuously experiencing leadership transitions between the two servers. The loss of leadership on a host is accompanied with many messages like the following:
It seems like the Vault active node loses its ability to keep the HA lock because it receives REFUSED_STREAM errors from Google Cloud Storage, leading to a failover; this happens like clockwork every 30 minutes or so. One thing I noticed about our workload is that on one host, we have an instance of consul-template in a restart loop. The consul-template task authenticates to Vault with a GCE metadata token, provides consul-template with the Vault token, and uses consul-template to fetch a dynamic MongoDB username/password pair secret. consul-template then tries to render the template to disk but fails because it doesn't have write permissions to the directory. consul-template appears to renew the secret once immediately after it's fetched in the background. I see trace messages like
So, what I suspect is happening (without any conclusive evidence) is that somewhere in the HTTP2 transport there's a race in how context cancellation is handled, and streams that Vault thinks are closed, GCS thinks are open. This would eventually lead to exhausting the
Of course, the problem could be something else completely and the context cancelation errors are a total red herring too - this is just speculation for the moment :)
I had no luck changing
I've still got the broken consul-template workload set up for the moment, so if you have any idea for how this can be debugged further, I'd love to give it a swing. I was, perhaps unsurprisingly, unable to reproduce the issue on a local environment; I can trigger the
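A self-contained sketch of that style of experiment, not the actual reproduction: it fires POST requests at a slow HTTP/2 test server and cancels each one mid-flight, which is the traffic pattern the broken consul-template loop would generate. The timings, payload, and request count are arbitrary; on an unaffected Go version you would mostly just see context deadline errors rather than REFUSED_STREAM.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

func main() {
	// A slow HTTP/2 test server: each handler holds its stream open briefly.
	srv := httptest.NewUnstartedServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			time.Sleep(200 * time.Millisecond)
		}))
	srv.EnableHTTP2 = true
	srv.StartTLS()
	defer srv.Close()

	client := srv.Client() // preconfigured for the server's TLS cert and HTTP/2

	for i := 0; i < 500; i++ {
		// Cancel each request shortly after it starts, mimicking abandoned
		// renewals from a crash-looping client.
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
		req, _ := http.NewRequestWithContext(ctx, http.MethodPost, srv.URL,
			strings.NewReader("payload"))
		resp, err := client.Do(req)
		if err != nil {
			// Mostly context deadline errors; watch for stream-level errors
			// such as REFUSED_STREAM on affected Go versions.
			fmt.Println("request error:", err)
		} else {
			resp.Body.Close()
		}
		cancel()
	}
}
```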
So I think I'm beginning to zero in on the issue here. I did a packet capture of Vault's communication with the GCS server to try and understand the state of the HTTP2 connection. To do this, I built a version of Vault with this patch (16b2b83) which lets us dump out TLS session keys for analysis (and it also puts a custom
I probably shouldn't share the pcap file here (although the contents of the requests are protected by the Vault seal, there's stuff like OAuth bearer tokens for the communication to GCS). But, I did find some interesting things:
Looking at some of the HTTP2 stream IDs that were unclosed, they all appeared to be POST requests with multipart upload that never actually had the bodies sent - for example:
That's the only HTTP2 traffic for that stream ID; so the server is waiting for body frames that the client is never sending. This happens 100 times, and then the server starts refusing the creation of new streams because it thinks the client needs to deal with the ones it already has open. I'm still not sure if this is a bug in Vault, google-cloud-go, or the stdlib itself though.
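As an aside on the capture technique described above: the referenced patch isn't reproduced here, but the standard-library hook it presumably builds on is tls.Config.KeyLogWriter. A sketch, with a made-up log path and target URL, of writing NSS-format session keys that Wireshark can use to decrypt an HTTPS/HTTP2 capture:

```go
package main

import (
	"crypto/tls"
	"log"
	"net/http"
	"os"
)

func main() {
	// Session keys are appended in NSS key-log format; point Wireshark's
	// "(Pre)-Master-Secret log filename" setting at this file.
	keyLog, err := os.OpenFile("/tmp/tls-keys.log",
		os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0600)
	if err != nil {
		log.Fatal(err)
	}
	defer keyLog.Close()

	transport := &http.Transport{
		TLSClientConfig:   &tls.Config{KeyLogWriter: keyLog},
		ForceAttemptHTTP2: true, // a custom TLSClientConfig otherwise disables the automatic HTTP/2 upgrade
	}
	client := &http.Client{Transport: transport}

	resp, err := client.Get("https://storage.googleapis.com")
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
}
```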
And I think I have a reproduction of this outside of Vault now: https://github.com/KJTsanaktsidis/refused_stream_repro so I think from here I'm going to file an issue on google-cloud-go.
Hey @KJTsanaktsidis - thanks for putting this together, and sorry this is causing everyone such issues. I've pinged the team responsible for client libraries to take a look at this (and your reproduction). Even though this isn't an issue in Vault, I'd still vote to keep this issue open for tracking.
This looks like an http2 issue in Go. Thanks to @KJTsanaktsidis for going the extra mile. See golang/go#27208. Presently, it seems the fix will land in Go 1.12.
@KJTsanaktsidis many thanks for diving into this.
Also, in case someone stumbles on this issue before the fix gets into vault, we ran a typical deployment that used to trigger high latencies and errors with
I have been following this issue for a while, and I'm happy on the one hand that the culprit seems to have been found, but on the other hand it is not very transparent what the consequences for vault are. Does this mean that the vault project has to wait for Go 1.12 or the Go 1.11 backport, then be recompiled against that Go version, and only afterwards can we expect to use the GCS backend? What will the time window for such a process be? Go 1.12 is expected around 02/19; hopefully the backport lands earlier (I guess maybe 01/19). So will the vault rebuild happen in 1.0.0, 1.0.1, or even 1.1.0?
We use the x/net lib for some things, so the things that use it will be updated for 1.0 GA. Outside of that it's pretty much what you said... Wait for a build with the backport, or build yourself from a custom Go build.
@jefferai Thanks for your comment. It would help me if you could point out approximately when 1.0 GA will land. Would you recommend using another kind of backend, e.g. postgres, and doing a migration to gcs afterwards? Would such a temporary solution be feasible or even reasonable?
@mrahbar Note that I don't believe what we're doing in 1.0 will affect this, as I highly doubt the GCS lib is using x/net instead of the built-in stdlib http client -- unless they would want to switch to it, which also seems unlikely. And there isn't even a patch yet on the Go side that we could cherry-pick into our builds -- they have approved it for cherry-pick but not yet actually done the work. You could use a different backend. I can't really comment on any of them (or really much on the GCS one) as we don't support anything that isn't inmem/file/consul.
I’m curious what impact you’re concerned about from disabling HTTP2 in your vault deployment? Is it just a performance thing? FWIW I haven’t really measured anything in detail, but none of the Vault telemetry data really moved when I enabled this mitigation in our deployment.
☝️ I also am curious about this.
@KJTsanaktsidis I think I would also go with disabling HTTP2 for now. Having said that, I would prefer not to if I had an alternative. The vault setup I'm planning needs low latency and has to serve small chunks of stored secrets quite often. I have to admit that I didn't do any benchmarking after I stumbled upon this issue, and needless to say I'm also not aware of any smart caching logic in the go client for gcs (so @KJTsanaktsidis, if you have some benchmarks I would be more than thankful). My concerns were mainly motivated from a project management perspective, as I needed to know when and through which process the fix can be expected to land in vault. I'm aware that the gcs backend is not a vault-managed plugin, but in the end I have to wait for a vault release compiled against a Go version that includes the patch, or build it myself.
I don’t have anything that you would call a benchmark unfortunately and our workload is not really that high volume at the moment. The go http1 client I think keeps connections alive and caches them for reuse, so even with http2 disabled the gcs client shouldn’t be making/tearing down a ton of connections. So I don’t think disabling http2 should have a big impact on latency
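To make the connection-reuse point concrete, a sketch of the relevant http.Transport knobs (the numbers and target URL are arbitrary): with keep-alives on, which is the default, an HTTP/1.1 client pools idle connections per host and reuses them, so dropping HTTP/2 does not mean a new TCP/TLS handshake for every request.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	// Keep-alives are on by default; idle HTTP/1.1 connections are pooled
	// per host and reused by later requests.
	transport := &http.Transport{
		MaxIdleConnsPerHost: 16,
		IdleConnTimeout:     90 * time.Second,
	}
	client := &http.Client{Transport: transport, Timeout: 30 * time.Second}

	for i := 0; i < 2; i++ {
		resp, err := client.Get("https://storage.googleapis.com")
		if err != nil {
			log.Fatal(err)
		}
		// Closing the body returns the connection to the pool for reuse.
		resp.Body.Close()
	}
}
```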
Hey @jefferai - was Vault 1.0.2 built with Go 1.11.4? If yes, this should be fixed now.
I guess I'll close and if people still have an issue we'll reopen.
Sweet! All - please upgrade to Vault 1.0.2 or later. If the issue persists, please capture logs and metrics and we'll reopen the issue!
Describe the bug
When using the gcs backend we noticed high latency for vault authentication and reads when doing operations in parallel from several clients (sometimes the clients even receive errors). We first suspected throttling on the GCS API, but accessing the same bucket and keys using the gsutil CLI remained fast during these higher-latency events.
Average latency reported by the telemetry endpoint goes up to dozens of seconds during these events for most operations.
We looked at the vault logs and saw that when this was happening we were getting a lot of errors in the vault logs:
REFUSED_STREAM seems to indicate that the gcs client is trying to open too many streams on the http2 connection. I had a quick look at the code and the gcs backend seems to be using the standard gcs go library so I wonder if the issue could actually be there. We will try to work around this by decreasing the max_parallel option of the backend (I will update the issue after our tests), but I was wondering if you had other ideas.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Limited increase in latency and no errors
Environment:
Vault 0.11.1
consul-template 0.19.4
Ubuntu 18.04
Vault server configuration file(s):
GOOGLE_STORAGE_BUCKET defined with an environment variable
Additional context
N/A