
Errors and high latency when using GCS backend under (medium) load #5419

Closed
lbernail opened this issue Sep 27, 2018 · 31 comments

@lbernail

lbernail commented Sep 27, 2018

Describe the bug
When using the gcs backend, we noticed high latency for Vault authentication and reads when performing operations in parallel from several clients (sometimes the clients even receive errors). We first suspected throttling on the GCS API, but accessing the same bucket and keys with the gsutil CLI remained fast during these high-latency events.

Average latency reported by the telemetry endpoint goes up to dozens of seconds during these events for most operations.

Looking at the Vault logs, we saw a lot of errors while this was happening:

2018-09-25T17:39:09.236Z [ERROR] core: error checking health: error="failed to check for initialization: failed to read object: Get https://www.googleapis.com/storage/v1/b//o?alt=json&delimiter=%2F&pageToken=&prefix=core%2F&projection=full&versions=false: stream error: stream ID 68481; REFUSED_STREAM"

error closing connection: Post https://www.googleapis.com/upload/storage/v1/b//o?alt=json&projection=full&uploadType=multipart: http2: Transport: cannot retry err [stream error: stream ID 68483; REFUSED_STREAM] after Request.Body was written; define Request.GetBody to avoid this error"

REFUSED_STREAM seems to indicate that the gcs client is trying to open too many streams on the http2 connection. I had a quick look at the code and the gcs backend seems to be using the standard gcs go library so I wonder if the issue could actually be there.

We will try to work around this by decreasing the max_parallel option of the backend (I will update the issue after our tests), but I was wondering if you had other ideas.
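
For reference, the change we have in mind is just lowering max_parallel in the storage stanza, something like this (the value is illustrative and still to be tuned):

```hcl
storage "gcs" {
  # The bucket comes from the GOOGLE_STORAGE_BUCKET environment variable in our setup.
  # max_parallel caps the number of concurrent operations Vault issues to GCS
  # (default is 128); we plan to experiment with lower values.
  max_parallel = 64
}
```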

To Reproduce
Steps to reproduce the behavior:

  1. Run a vault server using gcs as backend
  2. Authenticate and read keys from many clients in parallel. In our case we have ~1000 clients performing authentication (kubernetes auth backend) and reading ~5 keys each over 15 minutes (but since this happens during a kubernetes rolling update, we have 3 batches of ~300 clients that perform these operations over 1-2 minutes); a rough load-generation sketch follows below
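
Something along these lines (a hedged sketch using the Vault Go API client; it skips the kubernetes auth step and reads a token from the environment, and the secret paths and counts are placeholders):

```go
// Hedged sketch: approximate the rolling-update burst with parallel reads through the
// Vault Go API client. Assumes VAULT_ADDR and VAULT_TOKEN are set; the secret paths
// and counts are placeholders, and real pods would log in via the kubernetes auth backend.
package main

import (
	"fmt"
	"log"
	"sync"

	"github.com/hashicorp/vault/api"
)

func main() {
	const clients = 300 // roughly one batch of pods in our rolling update
	var wg sync.WaitGroup

	for i := 0; i < clients; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			cfg := api.DefaultConfig()
			if err := cfg.ReadEnvironment(); err != nil { // picks up VAULT_ADDR etc.
				log.Printf("client %d: %v", id, err)
				return
			}
			c, err := api.NewClient(cfg) // token is taken from VAULT_TOKEN
			if err != nil {
				log.Printf("client %d: %v", id, err)
				return
			}
			for j := 0; j < 5; j++ { // ~5 keys per client
				path := fmt.Sprintf("secret/app-%d/key-%d", id%10, j) // placeholder paths
				if _, err := c.Logical().Read(path); err != nil {
					log.Printf("client %d read %s: %v", id, path, err)
				}
			}
		}(i)
	}
	wg.Wait()
}
```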

Expected behavior
Limited increase in latency and no errors

Environment:

  • Vault Server Version: 0.11.1
  • Vault CLI Version: consul-template 0.19.4
  • Server Operating System/Architecture: Ubuntu 18.04

Vault server configuration file(s):

plugin_directory = "/var/lib/vault/plugins"

storage gcs {}

disable_clustering = true

ha_storage consul {
  address = "127.0.0.1:8500"
  path    = "vault/"
}

ui = true

listener tcp {
  address         = ":443"
  cluster_address = ":8201"
}

telemetry {
  dogstatsd_addr = "localhost:8125"
}

The GCS bucket is defined via the GOOGLE_STORAGE_BUCKET environment variable.

Additional context
N/A

@roboll

roboll commented Sep 27, 2018

@armon this is the issue we talked about a few days ago, fyi.

@jefferai
Member

This error is coming from GCS/the official GCS libs. Did max_parallel help?

@jefferai
Member

@sethvargo Any thoughts here?

@jefferai
Member

@roboll When you said you tried gsutil was it with the same API key from the same box at the same time? We've seen customers on AWS run through burst IOPS on specific instances and get throttled, but from other nodes things look just fine. Not sure how GCS behaves here but it's reasonable to think you could be hitting limits defined per instance/IP/API key/etc.

@sethvargo
Contributor

Hmm - we added the max_parallel option specifically because we weren't sure what the "right" number is. It also depends heavily on your storage bucket class (are you coldline, nearline, regional, or multi-regional). Those each behave differently under load. I'm curious if reducing max_parallel helped at all.

You're correct that it uses the standard library, pretty much out of the box 😄. Since it only happens in bursts, I don't think we are leaking a handle or anything.

Something of note: that first failure is coming from a failed health check. That could be causing a cascading failure which is why you're seeing latency. If the server is trying to check its own leadership status, but that request fails, you could be forcing a leader election, which will basically cause requests to hang during that time. Do it frequently enough and it becomes a serious bottleneck. Do you have any metrics or correlations on leadership transitions during this time?

In any event, this seems like a duplicate of googleapis/google-cloud-go#701 and subsequently googleapis/google-cloud-go#753 (/cc @jba)

P.S. thanks for the thorough bug report 😄

@sethvargo
Contributor

Also, if you're issuing that many requests that frequently, you're probably in the price range on GCS where it makes sense to migrate to using Google Cloud Spanner 😄

@roboll

roboll commented Sep 28, 2018

@sethvargo Yes - we're definitely not using GCS for the cost. GCS was our first choice for the simplicity, we plan to migrate to an alternate backend. Whatever we can do to stabilize performance on GCS while we work on that migration would be great though. (This is a regional bucket.)

Since it only happens in bursts, I don't think we are leaking a handle or anything.

Agree, seems unlikely.

We plan to try max_parallel tomorrow, though it looks like that only indirectly impacts the number of streams per h2 connection by limiting concurrent operations from Vault. If my understanding of the max streams per connection is correct, this is a control setting that servers dictate to clients, and clients to servers. (ref. https://httpwg.org/specs/rfc7540.html#rfc.section.5.1.2) I would expect the GCS server to be sending a SETTINGS frame with some SETTINGS_MAX_CONCURRENT_STREAMS - I've yet to inspect the traffic to see if that is the case.

Either way, as far as I can tell from the spec, the max number of streams for a client connection is something the server dictates, so I wouldn't expect there to be much configurable from the client side besides what we have (artificially limiting concurrency). We'll post back once we have info on whether this is helpful.
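
For what it's worth, the server's advertised settings can be checked without a full packet capture with a rough sketch like the one below (untested, using golang.org/x/net/http2 directly; it just performs the h2 handshake and prints the first non-ACK SETTINGS frame):

```go
// Hedged sketch (untested): open an h2 connection to the GCS API endpoint and print
// the server's first SETTINGS frame, which is where SETTINGS_MAX_CONCURRENT_STREAMS
// would be advertised. No request is issued.
package main

import (
	"crypto/tls"
	"fmt"
	"log"

	"golang.org/x/net/http2"
)

func main() {
	conn, err := tls.Dial("tcp", "www.googleapis.com:443", &tls.Config{
		NextProtos: []string{"h2"}, // ask for HTTP/2 via ALPN
	})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	if conn.ConnectionState().NegotiatedProtocol != "h2" {
		log.Fatal("server did not negotiate HTTP/2")
	}

	// Client connection preface plus an empty SETTINGS frame, then read frames
	// until the server's (non-ACK) SETTINGS frame shows up.
	if _, err := conn.Write([]byte(http2.ClientPreface)); err != nil {
		log.Fatal(err)
	}
	fr := http2.NewFramer(conn, conn)
	if err := fr.WriteSettings(); err != nil {
		log.Fatal(err)
	}
	for {
		f, err := fr.ReadFrame()
		if err != nil {
			log.Fatal(err)
		}
		if sf, ok := f.(*http2.SettingsFrame); ok && !sf.IsAck() {
			_ = sf.ForeachSetting(func(s http2.Setting) error {
				fmt.Println(s) // e.g. [MAX_CONCURRENT_STREAMS = 100]
				return nil
			})
			return
		}
	}
}
```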

Thanks!

@lbernail
Author

lbernail commented Sep 28, 2018

Thank you for your feedback!

We just did some testing with max_parallel.

We first tested by setting it to 32 and triggered a rolling update of our pods:

  • average and maximum latencies from telemetry were much higher than with 128 (~2 minutes for some operations)
  • we also confirmed this by trying to access Vault with another Vault client
  • we noticed that the v1/sys/health route had the same latency issue, which triggered bad behavior on our load balancer (GCP ILB): the active Vault server was marked unhealthy (the ILB probes v1/sys/health with a 2s timeout) and requests were load-balanced across both Vault servers. This did not have an impact on load, but it generated "too many redirects" errors on the clients (the Vault standby node redirects to the load-balancer IP, which under standard conditions forwards traffic to the active node)
  • the number of REFUSED_STREAM errors did not decrease: they were spread over a longer period, but the total was actually higher
  • the active Vault server did not seem to recover (it probably would have, given enough time), so we triggered a failover. The rolling update then finished OK (pods were in CrashLoopBackOff, so clients took time to reconnect)

We then tested by setting it to 256 and the behavior seemed similar to the default (128)

It really seems that the bottleneck is the single HTTP/2 connection created by the GCS client. To validate this, we are about to test a custom Vault build which, instead of creating a single client, creates multiple ones and chooses a random one for each operation. We will update the issue after testing it.


I also think it is surprising to see this low-level HTTP/2 error in the Vault logs. It would make sense for it to be caught in the http2 layer or in the GCS client (which could return a clear error when too many operations are already in flight).

@sethvargo
Contributor

Let us know how the custom build works out. I don't have much control over the GCS go library, but I can ping the people who do.

@lbernail
Author

It did not change anything. I looked at connections from Vault to the Google API IP addresses and I only saw 2 (we tried creating 16 clients).
It looks like the GCS library uses a transport that reuses connections, and I did not find an easy way to increase the size of the connection pool in the NewClient call (it may be possible by tweaking the DialSettings via a ClientOption).
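
If anyone wants to dig further, one avenue might be to hand the library a pre-built *http.Client via option.WithHTTPClient, which puts the transport under our control. The sketch below is untested and assumes Application Default Credentials; it also forcibly disables HTTP/2 on the base transport, purely as an experiment:

```go
// Untested sketch: build the GCS client on top of a hand-rolled *http.Client so the
// transport is under our control. Assumes Application Default Credentials; disabling
// HTTP/2 here is an experiment, not a recommendation.
package main

import (
	"context"
	"crypto/tls"
	"log"
	"net/http"

	"cloud.google.com/go/storage"
	"golang.org/x/oauth2"
	"golang.org/x/oauth2/google"
	"google.golang.org/api/option"
)

func newStorageClient(ctx context.Context) (*storage.Client, error) {
	// A non-nil, empty TLSNextProto map is the documented way to opt out of HTTP/2
	// in net/http, which forces HTTP/1.1 and therefore a real connection pool
	// instead of a single multiplexed h2 connection.
	base := &http.Transport{
		Proxy:               http.ProxyFromEnvironment,
		MaxIdleConnsPerHost: 16,
		TLSNextProto:        map[string]func(string, *tls.Conn) http.RoundTripper{},
	}

	creds, err := google.FindDefaultCredentials(ctx, storage.ScopeReadWrite)
	if err != nil {
		return nil, err
	}
	// oauth2.NewClient takes its base *http.Client from the context, so wrap ours.
	ctx = context.WithValue(ctx, oauth2.HTTPClient, &http.Client{Transport: base})
	authed := oauth2.NewClient(ctx, creds.TokenSource)

	// With option.WithHTTPClient the library uses our client as-is; auth is then our
	// responsibility, which the oauth2 wrapper above already covers.
	return storage.NewClient(ctx, option.WithHTTPClient(authed))
}

func main() {
	client, err := newStorageClient(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	// use client...
}
```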

@KJTsanaktsidis

We're also getting this issue periodically, except the load in our cluster is actually pretty minuscule. The underlying Cloud Storage bucket is receiving about 10 requests/sec, but our Vault primary still periodically chokes up with messages like:

Nov 04 21:50:15 vaultgcs-0 vault[933]: 2018-11-04T21:50:15.809Z [ERROR] core: error checking health: error="failed to check for initialization: failed to read object: Get https://www.googleapis.com/storage/v1/b/vaultgcs_backend_us1_staging/o?alt=json&delimiter=%2F&pageToken=&prefix=core%2F&projection=full&versions=false: stream error: stream ID 10737; REFUSED_STREAM"
Nov 04 21:50:20 vaultgcs-0 vault[933]: 2018-11-04T21:50:20.905Z [ERROR] auth.approle.auth_approle_61097b9d.tidy: error tidying global secret IDs: error="failed to read object: Get https://www.googleapis.com/storage/v1/b/vaultgcs_backend_us1_staging/o?alt=json&delimiter=%2F&pageToken=&prefix=auth%2F0ea44b0c-9a82-8f14-bc4f-dea5e7768538%2Fsecret_id%2F&projection=full&versions=false: stream error: stream ID 10745; REFUSED_STREAM"

Eventually Vault fails over to the standby, which does just fine and doesn't have any of these issues. We're definitely not at a scale where the Cloud Spanner storage backend makes sense so I'd really like to try and figure out what's going on here.

@KJTsanaktsidis

Interestingly, the GCS server seems to set MAX_CONCURRENT_STREAMS=100 in its SETTINGS frame, and the default value of max_parallel is 128, so it seems like this is just asking for a bunch of REFUSED_STREAM errors?

I'm experimenting with dropping max_parallel below 100 - if that fixes it, perhaps my issue is not the same as @lbernail 's issue.

@lbernail
Author

lbernail commented Nov 5, 2018

Let us know if things get better for you with this option. In our case the connection pattern changed a little, but we still saw a lot of REFUSED_STREAM errors.
The best option would probably be to use multiple connections to GCS, but this cannot be changed easily with the GCS SDK (creating more clients does not increase the number of connections, from what we saw).

@KJTsanaktsidis

So I spent a bit of time trying to figure out what's going on here. Basically, our Vault setup looks like this:

  • We use the GCS backend in HA mode
  • There are two Vault servers
  • The GCS bucket is regional and the Vault servers are in different zones in the same region as the bucket
  • We use the GCP auth method extensively with short TTLs (1h)
  • We use dynamic secret backends (like MongoDB) with these tokens

The problem we're experiencing is that Vault is continuously going through leadership transitions between the two servers. The loss of leadership on a host is accompanied by many messages like the following:

Nov 5 08:24:48 vaultgcs-0.c.outbound-staging.internal vault[26331]: 2018-11-05T08:24:48.771Z [ERROR] expiration: failed to revoke lease: lease_id=auth/gcp/login/793c04b9d14cd3a2272cb0b86a48084cf11d7191 error="failed to revoke token: failed to scan for children: failed to read object: Get https://www.googleapis.com/storage/v1/b/vaultgcs_backend_us1_staging/o?alt=json&delimiter=%2F&pageToken=&prefix=sys%2Ftoken%2Fparent%2F793c04b9d14cd3a2272cb0b86a48084cf11d7191%2F&prettyPrint=false&projection=full&versions=false: stream error: stream ID 15391; REFUSED_STREAM"
Nov 5 08:24:54 vaultgcs-0.c.outbound-staging.internal vault[26331]: 2018-11-05T08:24:54.542Z [ERROR] core: barrier init check failed: error="failed to check for initialization: failed to read object: Get https://www.googleapis.com/storage/v1/b/vaultgcs_backend_us1_staging/o?alt=json&delimiter=%2F&pageToken=&prefix=core%2F&prettyPrint=false&projection=full&versions=false: stream error: stream ID 15397; REFUSED_STREAM"

It seems like the Vault active node loses its ability to keep the HA lock because it receives REFUSED_STREAM errors from Google Cloud Storage, leading to a failover; this happens like clockwork every 30 minutes or so.

One thing I noticed about our workload is that on one host, we have an instance of consul-template in a restart loop. The consul-template task authenticates to Vault with a GCE metadata token, provides consul-template with the Vault token, and uses consul-template to fetch a dynamic MongoDB username/password pair secret. consul-template then tries to render the template to disk but fails because it doesn't have write permissions to the directory.

consul-template appears to renew the secret once immediately after it's fetched in the background. I see trace messages like [TRACE] vault.read(db/mongodb/creds/mongodb-out00-instance-monitor): starting renewer from the consul-template process, but never see the corresponding successfully renewed log message. The process is crashing due to the failed template render before the renewal can complete in most cases. I think this is causing a lot of log messages in our Vault server like the following

Nov 5 08:20:23 vaultgcs-0.c.outbound-staging.internal vault[26331]: 2018-11-05T08:20:23.373Z [ERROR] secrets.system.system_663f7081: lease renewal failed: lease_id=db/mongodb/creds/mongodb-out00-instance-monitor/7EHeCWK1SFe9O0XybyvYm5Mh error="failed to read lease entry: failed to read value for "sys/expire/id/db/mongodb/creds/mongodb-out00-instance-monitor/7EHeCWK1SFe9O0XybyvYm5Mh": Get https://storage.googleapis.com/vaultgcs_backend_us1_staging/sys/expire/id/db/mongodb/creds/mongodb-out00-instance-monitor/7EHeCWK1SFe9O0XybyvYm5Mh: context canceled"

So, what I suspect is happening (without any conclusive evidence) is that somewhere in the HTTP2 transport there's a race in how context cancellation is handled, and streams that Vault thinks are closed, GCS thinks are open. This would eventually lead to exhausting the MAX_CONCURRENT_STREAMS limit and the REFUSED_STREAM errors we're seeing here (because under normal operations, we would never expect to see REFUSED_STREAM because client and server should be in agreement about whether or not a new stream can be opened).

Of course, the problem could be something else completely and the context cancelation errors are a total red herring too - this is just speculation for the moment :)

I had no luck changing max_parallel, but as a temporary workaround, setting the environment variable GODEBUG=http2client=0 to turn off the HTTP/2 transport for GCS made the leadership flapping go away.

I've still got the broken consul-template workload set up for the moment, so if you have any idea for how this can be debugged further, I'd love to give it a swing. I was, perhaps unsurprisingly, unable to reproduce the issue on a local environment; I can trigger the context canceled errors pretty easily with consul-template but it isn't causing any problems for the HTTP2 stream.

@KJTsanaktsidis

So I think I'm beginning to zero in on the issue here. I did a packet capture of Vault's communication with the GCS server to try and understand the state of the HTTP2 connection. To do this, I built a version of Vault with this patch (16b2b83) which lets us dump out TLS session keys for analysis (and it also puts a custom http.Transport into storage.NewClient(), if that interests you @lbernail ).

I probably shouldn't share the pcap file here (although the contents of the requests are protected by the Vault seal, there's stuff like OAuth bearer tokens for the communication to GCS). But, I did find some interesting things:

  • Two separate HTTP/2 connections were in fact made to GCS - not just one
  • One of the connections (TCP stream 4 in my packet capture) began receiving RST_STREAM frames with reason REFUSED_STREAM during my test, which was sufficient to cause the host to lose the leadership
  • I used the following Wireshark filter to find all the HTTP2 stream IDs that had either been closed by the server (because they're finished) or closed by some kind of client or server side reset: tcp.stream eq 4 && http2 && ((http2.flags.end_stream == 1 && ip.dst == 10.234.72.12) || (http2.rst_stream.error)). I did some really hacky analysis on the CSV export from Wireshark, and found that there were 120 stream IDs that appear to have never been closed!

Looking at some of the unclosed HTTP2 stream IDs, they all appeared to be multipart-upload POST requests that never actually had their bodies sent - for example:

Frame 56172: 177 bytes on wire (1416 bits), 177 bytes captured (1416 bits)
Ethernet II, Src: 42:01:0a:ea:48:0c (42:01:0a:ea:48:0c), Dst: 42:01:0a:ea:48:01 (42:01:0a:ea:48:01)
Internet Protocol Version 4, Src: 10.234.72.12, Dst: 172.217.8.10
Transmission Control Protocol, Src Port: 18628, Dst Port: 443, Seq: 1166186, Ack: 17392303, Len: 111
Secure Sockets Layer
HyperText Transfer Protocol 2
    Stream: HEADERS, Stream ID: 3731, Length 73, POST /upload/storage/v1/b/vaultgcs_backend_us1_staging/o?alt=json&prettyPrint=false&projection=full&uploadType=multipart
        Length: 73
        Type: HEADERS (1)
        Flags: 0x04
            .... ...0 = End Stream: False
            .... .1.. = End Headers: True
            .... 0... = Padded: False
            ..0. .... = Priority: False
            00.0 ..0. = Unused: 0x00
        0... .... .... .... .... .... .... .... = Reserved: 0x0
        .000 0000 0000 0000 0000 1110 1001 0011 = Stream Identifier: 3731
        [Pad Length: 0]
        Header Block Fragment: cb83ce87d2c55fbfa6da126ac76258b0b40d2593ed48cf6d...
        [Header Length: 632]
        [Header Count: 9]
        Header: :authority: www.googleapis.com
        Header: :method: POST
        Header: :path: /upload/storage/v1/b/vaultgcs_backend_us1_staging/o?alt=json&prettyPrint=false&projection=full&uploadType=multipart
        Header: :scheme: https
        Header: x-goog-api-client: gl-go/1.11.2 gccl/20180226
        Header: user-agent: Vault/1.0.0-beta1 (+https://www.vaultproject.io/; go1.11.2)
        Header: content-type: multipart/related; boundary=d130391dabef43b667ae61a86adbf5ae74571e22b29412b0d633c66cbe1d
        Header: authorization: Bearer REDACTED_GOOGLE_CLOUD_BEARER_TOKEN
        Header: accept-encoding: gzip

That's the only HTTP2 traffic for that stream ID; so the server is waiting for body frames that the client never sends. This happens 100 times, and then the server starts refusing new streams because it thinks the client needs to deal with the ones it already has open.

I'm still not sure if this is a bug in Vault, google-cloud-go, or the stdlib itself though.

@KJTsanaktsidis

And I think I have a reproduction of this outside of Vault now: https://github.com/KJTsanaktsidis/refused_stream_repro, so from here I'm going to file an issue on google-cloud-go.
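
The gist of the reproduction is roughly the following (a simplified, hedged sketch rather than the actual repro code; the bucket name is a placeholder and it assumes Application Default Credentials):

```go
// Rough gist of such a reproduction (hedged sketch): hammer GCS with uploads whose
// contexts are cancelled while the request body is still in flight, then check whether
// an ordinary request starts failing with REFUSED_STREAM once enough streams have been
// left dangling. Assumes a throwaway bucket and ADC credentials.
package main

import (
	"bytes"
	"context"
	"fmt"
	"log"
	"time"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	bkt := client.Bucket("some-throwaway-bucket") // placeholder name

	payload := bytes.Repeat([]byte("x"), 4<<20) // large enough that the body is still being written

	for i := 0; i < 150; i++ {
		// Cancel each upload while it is still in flight.
		reqCtx, cancel := context.WithTimeout(ctx, 20*time.Millisecond)
		w := bkt.Object(fmt.Sprintf("repro/%d", i)).NewWriter(reqCtx)
		_, _ = w.Write(payload)
		_ = w.Close() // usually fails with "context deadline exceeded"
		cancel()
	}

	// If the transport has leaked half-open streams, this plain read now tends to
	// fail with "stream error: ... REFUSED_STREAM".
	_, err = bkt.Object("repro/0").Attrs(ctx)
	log.Println("follow-up request:", err)
}
```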

@sethvargo
Contributor

Hey @KJTsanaktsidis - thanks for putting this together, and sorry this is causing everyone such issues. I've pinged the team responsible for client libraries to take a look at this (and your reproduction). Even though this isn't an issue in Vault, I'd still vote to keep this issue open for tracking.

@enocom

enocom commented Nov 7, 2018

This looks like an http2 issue in Go. Thanks to @KJTsanaktsidis for going the extra mile.

See golang/go#27208. Presently, it seems the fix will land in Go 1.12.

@lbernail
Author

lbernail commented Nov 8, 2018

@KJTsanaktsidis many thanks for diving into this

Also, in case someone stumbles on this issue before the fix gets into Vault: we ran a typical deployment that used to trigger high latencies and errors with GODEBUG=http2client=0, and the issue completely disappeared, so it's definitely a nice workaround until the http2 fix lands.

@mrahbar

mrahbar commented Nov 20, 2018

I have been following this issue for a while. On the one hand I'm happy that the culprit seems to have been found, but on the other hand it is not very clear what the consequences for Vault are.

Does this mean that the Vault project has to wait for Go 1.12 or the Go 1.11 backport, recompile against that Go version, and only then can we expect to use the GCS backend? What would the time window for such a process be? Go 1.12 is expected around 02/19, and hopefully the backport lands earlier (I guess maybe 01/19). So will the Vault rebuild happen in 1.0.0, 1.0.1, or even 1.1.0?
I'm planning a big installation of Vault and this issue is a real blocker (even with the GODEBUG=http2client=0 trick). It would be really helpful for us to have some dates to work with. Thanks, and keep up the great work 👍.

@jefferai
Member

We use the x/net lib for some things, so those parts will be updated for 1.0 GA. Outside of that it's pretty much what you said... wait for a build with the backport, or build Vault yourself with a patched Go.

@mrahbar

mrahbar commented Nov 20, 2018

@jefferai Thanks for your comment. It would help me if you could point out approximately when 1.0 GA will land. Would you recommend using another kind of backend, e.g. Postgres, and doing a migration to GCS afterwards? Would such a temporary solution be feasible or even reasonable?

@jefferai
Member

@mrahbar Note that I don't believe what we're doing in 1.0 will affect this, as I highly doubt the GCS lib is using x/net instead of the built-in stdlib HTTP client -- unless they wanted to switch to it, which also seems unlikely. And there isn't even a patch yet on the Go side that we could cherry-pick into our builds -- they have approved it for cherry-pick but not yet actually done the work.

You could use a different backend. I can't really comment on any of them (or really much on the GCS one) as we don't support anything that isn't inmem/file/consul.

@KJTsanaktsidis

I’m curious what impact you’re concerned about from disabling HTTP2 in your vault deployment? Is it just a performance thing? FWIW I haven’t really measured anything in detail but none of the Vault telemetry data really moved when I enabled this mitigation in our deployment.

@jefferai
Member

☝️ I also am curious about this.

@mrahbar

mrahbar commented Nov 20, 2018

@KJTsanaktsidis I think I would also go with disabling HTTP2 for now. Having said that, I would prefer not to if I had an alternative. The Vault setup I'm planning needs low latency and has to serve small chunks of stored secrets quite often. I have to admit that I didn't do any benchmarking after I stumbled upon this issue, and needless to say I'm also not aware of any smart caching logic in the Go client for GCS (so @KJTsanaktsidis, if you have some benchmarks I would be more than thankful). My concerns were mainly motivated by a project-management perspective, as I needed to know when and through which process the fix can be expected to land in Vault. I'm aware that the GCS backend is not a Vault-managed plugin, but in the end I have to wait for a Vault release compiled against a Go version that includes the patch, or build it myself.

@KJTsanaktsidis

I don't have anything you would call a benchmark, unfortunately, and our workload is not really that high-volume at the moment.

The Go HTTP/1 client keeps connections alive and caches them for reuse, I think, so even with HTTP/2 disabled the GCS client shouldn't be making/tearing down a ton of connections. So I don't think disabling HTTP/2 should have a big impact on latency.

@sethvargo
Contributor

Hey @jefferai - was Vault 1.0.2 built with Go 1.11.4? If yes, this should be fixed now.

@jefferai
Member

jefferai commented Feb 8, 2019

Looks like yes

@jefferai
Member

jefferai commented Feb 8, 2019

I guess I'll close and if people still have an issue we'll reopen.

@jefferai jefferai closed this as completed Feb 8, 2019
@sethvargo
Contributor

Sweet! All - please upgrade to Vault 1.0.2 or later. If the issue persists, please capture logs and metrics and we'll reopen the issue!
