
Potential for an open connection leak #8654

Closed · shashankmehra opened this issue Apr 1, 2020 · 18 comments
Labels: bug, core/client

@shashankmehra

I was trying to analyze memory usage of my application when I noticed the following pprof output:

list RawRequestWithContext
Total: 180.79MB
ROUTINE ======================== github.com/hashicorp/vault/api.(*Client).RawRequestWithContext in github.com/hashicorp/vault/api/client.go
    1.12MB    42.63MB (flat, cum) 23.58% of Total
         .          .    795:	}
         .          .    796:
         .          .    797:	var result *Response
         .          .    798:	resp, err := client.Do(req)
         .          .    799:	if resp != nil {
         .     4.45MB    800:		result = &Response{Response: resp}
         .          .    801:	}
         .          .    802:	if err != nil {
         .          .    803:		if strings.Contains(err.Error(), "tls: oversized") {
         .          .    804:			err = errwrap.Wrapf(
         .          .    805:				"{{err}}\n\n"+
         .          .    806:					"This error usually means that the server is running with TLS disabled\n"+
         .          .    807:					"but the client is configured to use TLS. Please either enable TLS\n"+
         .          .    808:					"on the server or run the client with -address set to an address\n"+
         .          .    809:					"that uses the http protocol:\n\n"+
         .          .    810:					"    vault <command> -address http://<address>\n\n"+
         .          .    811:					"You can also set the VAULT_ADDR environment variable:\n\n\n"+
         .          .    812:					"    VAULT_ADDR=http://<address> vault <command>\n\n"+
         .          .    813:					"where <address> is replaced by the actual address to the server.",
         .          .    814:				err)
         .          .    815:		}
         .          .    816:		return result, err
         .          .    817:	}
         .          .    818:
         .          .    819:	// Check for a redirect, only allowing for a single redirect
         .          .    820:	if (resp.StatusCode == 301 || resp.StatusCode == 302 || resp.StatusCode == 307) && redirectCount == 0 {
         .     4.32MB    821:		// Parse the updated location
         .          .    822:		respLoc, err := resp.Location()
         .     3.43MB    823:		if err != nil {
         .          .    824:			return result, err
         .          .    825:		}
         .          .    826:
         .          .    827:		// Ensure a protocol downgrade doesn't happen
         .          .    828:		if req.URL.Scheme == "https" && respLoc.Scheme != "https" {
         .          .    829:			return result, fmt.Errorf("redirect would cause protocol downgrade")
         .          .    830:		}
         .          .    831:
         .          .    832:		// Update the request
         .          .    833:		r.URL = respLoc
         .          .    834:
         .          .    835:		// Reset the request body if any
         .          .    836:		if err := r.ResetJSONBody(); err != nil {
         .          .    837:			return result, err
         .          .    838:		}
         .          .    839:
    1.12MB     1.12MB    840:		// Retry the request
         .          .    841:		redirectCount++
         .          .    842:		goto START
         .          .    843:	}
         .    29.31MB    844:
         .          .    845:	if err := result.Error(); err != nil {
         .          .    846:		return result, err
         .          .    847:	}
         .          .    848:
         .          .    849:	return result, nil

I am trying to understand why 29.31MB accumulated at line 844. My hypothesis is that at line 846 (https://github.com/hashicorp/vault/blob/master/api/client.go#L846) in RawRequestWithContext, a response object is created whose body is expected to be closed by the caller of this function.

However, consider the case of a redirect, where the goto START statement at line 890 is executed. Line 846 is then executed again, creating another response object. The previously created response object's body is never closed, causing a connection and memory leak on every redirect.
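
If that hypothesis holds, one possible mitigation is to drain and close the body of the response that is about to be abandoned before retrying. A minimal sketch (illustration only, not the actual Vault patch; drainAndClose is a hypothetical helper):

    import (
        "io"
        "io/ioutil"
        "net/http"
    )

    // drainAndClose drains and closes a response body so the underlying
    // connection can be returned to the transport's idle pool instead of
    // being leaked.
    func drainAndClose(resp *http.Response) {
        if resp == nil || resp.Body == nil {
            return
        }
        _, _ = io.Copy(ioutil.Discard, resp.Body) // consume any unread bytes
        _ = resp.Body.Close()
    }

In the redirect branch, this would run just before redirectCount++ and goto START, releasing the Response created on the previous iteration.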

@tyrannosaurus-becks added the bug and waiting-for-response labels on Apr 1, 2020
@tyrannosaurus-becks
Contributor

Hi! Thanks for the report! Can you provide further steps to reproduce this, such as the Vault version you're on, the API calls you're making, and any further information that might help us reproduce this locally on our own machines? Also, are you seeing ongoing growth in memory usage, or does memory tend to stabilize at this level?

@lawliet89
Contributor

lawliet89 commented May 4, 2020

I am observing the same memory leak on my Vault 1.4.0 server running on Kubernetes.

You can see the memory usage graph below:

[memory usage graph]

It might be useful to note that this replica became the active server sometime on April 22.

I have a bunch of secret engines mounted:

  • GCP
  • AWS
  • KV
  • Postgres

And the following auth mounts:

  • Kubernetes
  • LDAP

I also have Prometheus constantly scraping the Vault server for metrics.

Is there anything else I can do to provide more information?

The pod will probably OOM soon and be killed by Kubernetes.

@kalafut
Contributor

kalafut commented May 4, 2020

@lawliet89 You can run Vault debug (https://www.vaultproject.io/docs/commands/debug) which will, among other things, capture heap usage stats. We can help analyze these logs (they may be emailed or otherwise shared outside of GitHub).
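
For reference, an invocation along these lines captures the profiles (the duration and interval values are just examples; see the linked docs for the full set of flags):

    vault debug -duration=2m -interval=30s

The command writes a compressed archive containing periodic heap profiles among other diagnostics.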

@lawliet89
Contributor

Thanks @kalafut

May I know who I should email the logs to?

@kalafut
Contributor

kalafut commented May 4, 2020

@lawliet89 You can send them to me (email is in Vault's commits).

@kalafut
Contributor

kalafut commented May 4, 2020

@lawliet89 Thanks for sending those. There is almost 1GB in use by the write buffer for GCS storage (in a function in the GCS SDK). I'm not familiar enough with the GCS storage details to know why, however. I do wonder what effect (if any) https://www.vaultproject.io/docs/configuration/storage/google-cloud-storage#chunk_size would have.

[heap profile screenshot]

@lawliet89
Contributor

lawliet89 commented May 4, 2020

@kalafut

I haven't really looked into it, but maybe this can be fixed by bumping the Google Cloud SDK.

An update was recently reverted in ea1564b due to etcd-io/etcd#11154, for which the fix should be released shortly.

I will try halving the chunk size to see if it makes any difference.
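
For reference, that change amounts to a storage stanza along these lines (bucket name hypothetical; per the docs linked above, chunk_size is expressed in kilobytes and defaults to 8192):

    storage "gcs" {
      bucket     = "my-vault-bucket"  # hypothetical bucket name
      chunk_size = "4096"             # half the 8192 default
    }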

FWIW, I'm in the process of migrating to Raft storage. This might not be a fix for future readers, however.

@lawliet89
Contributor

lawliet89 commented May 6, 2020

[memory usage graph after halving chunk_size]

It may be too early to say, but halving the chunk size seems to have fixed the memory leak issue. I will report back next week if that turns out not to be the case.

EDIT: One whole week of data. Looks good.

[memory usage graph, one full week]

@kalafut
Contributor

kalafut commented May 6, 2020

Thanks, that is a useful datapoint.

@sidewinder12s

I may also be seeing similar memory usage growth, though we've been running into memory exhaustion issues generally, which makes the graphs not super clear. I am on version 1.1.0, though.

@lawliet89 I assume you halved it to 4096?

@lawliet89
Contributor

@sidewinder12s Yes.

@rokrokss

[screenshot of pprof output, 2020-08-13 10:01 PM]

https://github.com/hashicorp/vault/blob/master/api/logical.go#L132

I experienced the same issue, and it was caused by the write API calls. Reads didn't trigger it.
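
For context, that path is exercised by client calls like this (a sketch; the path and payload are hypothetical):

    // Logical().Write issues a PUT that flows through
    // RawRequestWithContext, the function profiled at the top
    // of this issue.
    secret, err := client.Logical().Write("secret/data/foo", map[string]interface{}{
        "value": "bar",
    })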

@sidewinder12s

For other folks, I suspect this issue may be compounding any performance issues I'm having with GCS: #8761

I've got thousands of new nodes doing GCP logins daily, which probably isn't helping my GCS performance.

@elsesiy
Contributor

elsesiy commented Nov 3, 2020

@kalafut I'm investigating a somewhat related memory leak and noticed the following on Vault 1.5.4:

(pprof) list wrapGenericHandler
Total: 30.41GB
ROUTINE ======================== github.com/hashicorp/vault/http.wrapGenericHandler.func1 in vault/http/handler.go
      27MB     9.16GB (flat, cum) 30.12% of Total
         .          .    267:		maxRequestDuration = props.ListenerConfig.MaxRequestDuration
         .          .    268:		maxRequestSize = props.ListenerConfig.MaxRequestSize
         .          .    269:	}
         .          .    270:	if maxRequestDuration == 0 {
         .          .    271:		maxRequestDuration = vault.DefaultMaxRequestDuration
         .   396.62MB    272:	}
         .          .    273:	if maxRequestSize == 0 {
         .          .    274:		maxRequestSize = DefaultMaxRequestSize
         .          .    275:	}
         .          .    276:	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
         .          .    277:		// Set the Cache-Control header for all the responses returned
         .          .    278:		// by Vault
         .          .    279:		w.Header().Set("Cache-Control", "no-store")
         .          .    280:
         .   184.52MB    281:		// Start with the request context
         .          .    282:		ctx := r.Context()
         .          .    283:		var cancelFunc context.CancelFunc
         .          .    284:		// Add our timeout, but not for the monitor endpoint, as it's streaming
       8MB       50MB    285:		if strings.HasSuffix(r.URL.Path, "sys/monitor") {
         .          .    286:			ctx, cancelFunc = context.WithCancel(ctx)
      19MB       60MB    287:		} else {
         .   512.12kB    288:			ctx, cancelFunc = context.WithTimeout(ctx, maxRequestDuration)
         .   407.57MB    289:		}
         .          .    290:		// Add a size limiter if desired
         .          .    291:		if maxRequestSize > 0 {
         .          .    292:			ctx = context.WithValue(ctx, "max_request_size", maxRequestSize)
         .          .    293:		}
         .          .    294:		ctx = context.WithValue(ctx, "original_request_path", r.URL.Path)
         .          .    295:		r = r.WithContext(ctx)
         .          .    296:		r = r.WithContext(namespace.ContextWithNamespace(r.Context(), namespace.RootNamespace))
         .          .    297:
         .          .    298:		switch {
         .          .    299:		case strings.HasPrefix(r.URL.Path, "/v1/"):
         .          .    300:			newR, status := adjustRequest(core, r)
         .          .    301:			if status != 0 {
         .          .    302:				respondError(w, status, nil)
         .          .    303:				cancelFunc()
         .          .    304:				return
         .          .    305:			}
         .          .    306:			r = newR
         .          .    307:
         .     8.09GB    308:		case strings.HasPrefix(r.URL.Path, "/ui"), r.URL.Path == "/robots.txt", r.URL.Path == "/":
         .          .    309:		default:
         .          .    310:			respondError(w, http.StatusNotFound, nil)
         .          .    311:			cancelFunc()
         .          .    312:			return
         .          .    313:		}

It seems like the string matching is very expensive here.

@calvn
Contributor

calvn commented Nov 4, 2020

     .   184.52MB    281:		// Start with the request context

The line references on memory usage like the one above seem to be off. Can you run pprof with the -source_path flag and point it at the local Vault repository location with the commit for 1.5.4 checked out (should be 1a730771ec70149293efe91e1d283b10d255c6d1)?
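
Something along these lines (paths hypothetical):

    go tool pprof -source_path=/path/to/vault -list=wrapGenericHandler heap.prof

so that the annotated -list output lines up with the matching source.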

@elsesiy
Contributor

elsesiy commented Nov 5, 2020

@calvn my bad, I was on master.

Total: 30.41GB
ROUTINE ======================== github.com/hashicorp/vault/http.wrapGenericHandler.func1 in vault/http/handler.go
      27MB     9.16GB (flat, cum) 30.12% of Total
         .          .    267:		maxRequestSize = DefaultMaxRequestSize
         .          .    268:	}
         .          .    269:	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
         .          .    270:		// Set the Cache-Control header for all the responses returned
         .          .    271:		// by Vault
         .   396.62MB    272:		w.Header().Set("Cache-Control", "no-store")
         .          .    273:
         .          .    274:		// Start with the request context
         .          .    275:		ctx := r.Context()
         .          .    276:		var cancelFunc context.CancelFunc
         .          .    277:		// Add our timeout, but not for the monitor endpoint, as it's streaming
         .          .    278:		if strings.HasSuffix(r.URL.Path, "sys/monitor") {
         .          .    279:			ctx, cancelFunc = context.WithCancel(ctx)
         .          .    280:		} else {
         .   184.52MB    281:			ctx, cancelFunc = context.WithTimeout(ctx, maxRequestDuration)
         .          .    282:		}
         .          .    283:		// Add a size limiter if desired
         .          .    284:		if maxRequestSize > 0 {
       8MB       50MB    285:			ctx = context.WithValue(ctx, "max_request_size", maxRequestSize)
         .          .    286:		}
      19MB       60MB    287:		ctx = context.WithValue(ctx, "original_request_path", r.URL.Path)
         .   512.12kB    288:		r = r.WithContext(ctx)
         .   407.57MB    289:		r = r.WithContext(namespace.ContextWithNamespace(r.Context(), namespace.RootNamespace))
         .          .    290:
         .          .    291:		switch {
         .          .    292:		case strings.HasPrefix(r.URL.Path, "/v1/"):
         .          .    293:			newR, status := adjustRequest(core, r)
         .          .    294:			if status != 0 {
         .          .    295:				respondError(w, status, nil)
         .          .    296:				cancelFunc()
         .          .    297:				return
         .          .    298:			}
         .          .    299:			r = newR
         .          .    300:
         .          .    301:		case strings.HasPrefix(r.URL.Path, "/ui"), r.URL.Path == "/robots.txt", r.URL.Path == "/":
         .          .    302:		default:
         .          .    303:			respondError(w, http.StatusNotFound, nil)
         .          .    304:			cancelFunc()
         .          .    305:			return
         .          .    306:		}
         .          .    307:
         .     8.09GB    308:		h.ServeHTTP(w, r)
         .          .    309:
         .          .    310:		cancelFunc()
         .          .    311:		return
         .          .    312:	})
         .          .    313:}

I'm happy to provide more debugging context and info on the issue. Can I email you directly?

@calvn
Contributor

calvn commented Nov 5, 2020

@elsesiy sure thing! You can send it to calvin{at}hashicorp{dot}com.

@ncabatoff
Collaborator

Looks like there were a couple of different investigations going on here. I'm not sure if there were any resolutions or not, but it's been a long time and I suspect people have moved on. If people think they're seeing connection or memory leaks in current Vault versions, please open a new issue.
