Potential memory leak from WASM NullPlugin onNetworkNewConnection #24720
Comments
Can you paste the stats that are collected by istio-proxy and related to the problematic service (query the Prometheus server for istio_requests_total & istio_tcp_{received,sent}_bytes_total, or exec into the proxy container and curl localhost:15000/stats/prometheus)? Also, when you say this is an HTTP app, do you mean plain HTTP or HTTPS? I am wondering how onNetworkNewConnection gets triggered for HTTP traffic. @gargnupur Is this expected?
Looks like this is HTTPS, because the stack trace shows a TLS client connection being created... From the stack trace, it looks like this is what is using the space: https://sourcegraph.com/github.com/envoyproxy/envoy@master/-/blob/source/extensions/transport_sockets/tls/context_manager_impl.cc#L35:13
Could it be related to envoyproxy/envoy#10030?
Hi @bianpengyuan, per your request, here is a link to the prometheus metrics on istio-proxy: https://gist.github.com/rmak-cpi/835570832707ca8a58e67c3947938a00. HTTPS is terminated at our cloud provider's load-balancer, but the application does serve mTLS traffic for in-cluster clients, which might be useful information. (gist link was bad, updated)
Hi @bianpengyuan, I took some time reading through the new WASM plugin code (it looks like I have both the metadata_exchange and stats plugins enabled). If I am reading the code correctly, the Context definition should be coming from the WASM plugin SDK (https://github.com/proxy-wasm/proxy-wasm-cpp-sdk/blob/master/proxy_wasm_api.h#L429). Given that the Context class has a virtual destructor defined, shouldn't the PluginContext class in the stats (and the metadata_exchange) plugin have a virtual destructor defined as well? Otherwise destructors for the members will not be called (in particular the request_info_ shared_ptr). This could explain the slow leak that I am observing, but I may be reading too much into this, so it would be great if you could confirm one way or another. Thanks!
Please ignore my previous comment about declaring a virtual destructor in the PluginContext class; it looks like the compiler should take care of generating the correct destructor for PluginContext in the vtable. That said, it means I don't really have anything to go on to help track down this problem :P
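A minimal standalone sketch of the point in the previous two comments (the class names here just mirror the discussion; this is not the SDK or plugin code): because the SDK's Context base class declares a virtual destructor, deleting a PluginContext through a base pointer still runs the implicitly generated derived destructor, which destroys members such as a request_info_ shared_ptr.

#include <iostream>
#include <memory>

// Stand-in for a member whose destruction we want to observe.
struct RequestInfo {
  ~RequestInfo() { std::cout << "request_info_ released\n"; }
};

// Stand-in for the SDK's Context base class (simplified on purpose):
// the virtual destructor is the only detail that matters here.
struct Context {
  virtual ~Context() = default;
};

// Stand-in for PluginContext: no user-declared destructor at all.
struct PluginContext : Context {
  std::shared_ptr<RequestInfo> request_info_ = std::make_shared<RequestInfo>();
};

int main() {
  std::unique_ptr<Context> ctx(new PluginContext());
  // Prints "request_info_ released": deleting through the base pointer
  // invokes the implicitly generated ~PluginContext() because ~Context()
  // is virtual, and that in turn destroys the shared_ptr member.
  ctx.reset();
  return 0;
}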
I have narrowed the issue down to the stats plugin (https://github.com/istio/proxy/blob/master/extensions/stats/plugin.h) by disabling metrics and observing that memory consumption by the istio proxy remained constant for long periods of time. Disabling metrics is clearly less than ideal, and from looking at the code, the leak can definitely occur if the onLog method is not called reliably on each request. That said, I am having a hard time tracing through the (envoy/istio-proxy/wasm-proxy) code to understand how onLog is called in the request lifecycle, so any pointers to the relevant code snippets would be very much appreciated.
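To make the suspected failure mode concrete, here is a hypothetical sketch (not the actual stats plugin code; the request_info_ map and the callback wiring are assumptions for illustration) of how per-connection state that is only erased in onLog can accumulate when onLog never fires for some connections:

#include <cstdint>
#include <memory>
#include <unordered_map>

// Hypothetical per-connection record; the real plugin keeps a richer
// RequestInfo carrying the labels used to increment metrics.
struct RequestInfo {
  uint64_t received_bytes = 0;
  uint64_t sent_bytes = 0;
};

// Hypothetical cache keyed by stream/connection context id.
class StatsLikePlugin {
 public:
  void onNewConnection(uint32_t context_id) {
    // An entry is created for every new TCP connection.
    request_info_[context_id] = std::make_shared<RequestInfo>();
  }

  void onLog(uint32_t context_id) {
    // Cleanup happens only here; if onLog is never invoked for a
    // connection, the entry created above is retained indefinitely.
    request_info_.erase(context_id);
  }

 private:
  std::unordered_map<uint32_t, std::shared_ptr<RequestInfo>> request_info_;
};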
Are there any labels in the metric that have unbounded values? Can you dump the stats page?
Hi @bianpengyuan, I have posted a new gist of the prometheus metrics dump at https://gist.github.com/rmak-cpi/fbf7c59eb16a10cc4a0e0dba1c241597. One thing that jumps out at me is that the total number of TCP connections opened is much higher than the total number of TCP connections closed:

# TYPE istio_tcp_connections_opened_total counter
istio_tcp_connections_opened_total{reporter="source",source_workload="xxxxxxxxxxxxxxx",source_workload_namespace="default",source_principal="unknown",source_app="unknown",source_version="unknown",destination_workload="unknown",destination_workload_namespace="unknown",destination_principal="unknown",destination_app="unknown",destination_version="unknown",destination_service="unknown",destination_service_name="PassthroughCluster",destination_service_namespace="unknown",request_protocol="tcp",response_flags="-",connection_security_policy="unknown",source_canonical_service="xxxxxxxxxxxxxxx",destination_canonical_service="unknown",source_canonical_revision="latest",destination_canonical_revision="latest"} 55938

# TYPE istio_tcp_connections_closed_total counter
istio_tcp_connections_closed_total{reporter="source",source_workload="xxxxxxxxxxxxxxx",source_workload_namespace="default",source_principal="unknown",source_app="unknown",source_version="unknown",destination_workload="unknown",destination_workload_namespace="unknown",destination_principal="unknown",destination_app="unknown",destination_version="unknown",destination_service="unknown",destination_service_name="PassthroughCluster",destination_service_namespace="unknown",request_protocol="tcp",response_flags="-",connection_security_policy="unknown",source_canonical_service="xxxxxxxxxxxxxxx",destination_canonical_service="unknown",source_canonical_revision="latest",destination_canonical_revision="latest"} 861
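As a rough consistency check (assuming the leaked per-connection state is what dominates the heap, and noting that this metrics dump and the heap profile in the issue description below were not taken at the same moment): 55938 − 861 ≈ 55,000 connections were opened but never recorded as closed, while the profile attributes roughly 32 MB of in-use heap to the onNetworkNewConnection path. That works out to on the order of 600 bytes of retained state per unclosed connection, a plausible size for a per-connection record plus map overhead.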
If TCP connections don't close, then RequestInfo will hang around in the plugin:
The service is taking HTTP traffic directly from a cloud-provider load-balancer (AWS ALB); I suspect this may have something to do with HTTP keepalive handling between Envoy and the load-balancer.
I am also not seeing any signs of TCP sockets leaking at the operating system level through node-exporter metrics.
Is it the case that the sidecar with increased memory only received HTTP requests? If so, there should be no TCP metrics generated. If possible, and to help us narrow down the issue, can you try to delete the 1.6 tcp stats filter?
Hi @bianpengyuan, let me try that later on, as I am currently running an experiment disabling HTTP keepalive; I will report back once I've obtained data from disabling the tcp-stats-filter.
Hi @bianpengyuan, just want to report back that Envoy proxy memory usage after disabling the tcp stats filter remained stable, but my own experiment disabling HTTP keepalive hasn't gone so well, since I have yet to find the magic config to actually disable HTTP keepalive.
Since onLog is only called when the connection is closed on both upstream and downstream, the following code snippet seems a little risky (https://github.com/istio/envoy/blob/master/source/extensions/common/wasm/context.cc#L1571):

if (end_stream) {

It sounds like an upstream connection close event can only be detected if there is explicit data flow terminating the stream, but I could be reading too much into this.
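Paraphrasing that concern as a self-contained sketch (purely illustrative; the callback names and flag-based bookkeeping are assumptions, not the actual Envoy wasm Context code): if the upstream side's close is only observed when a data callback arrives with end_stream set, a connection torn down without such a final data event never reaches the "both sides closed" state that would trigger onLog and the plugin's cleanup.

// Illustrative only: models a context that calls onLog() once both the
// downstream and upstream halves of a proxied TCP session are closed.
class TcpContextSketch {
 public:
  // Data callback: the concern is that upstream close is only observed
  // when a read happens to carry end_stream == true.
  void onUpstreamData(bool end_stream) {
    if (end_stream) {
      upstream_closed_ = true;
      maybeLog();
    }
  }

  void onDownstreamConnectionClose() {
    downstream_closed_ = true;
    maybeLog();
  }

 private:
  void maybeLog() {
    // onLog() (and therefore cleanup in the stats plugin) only fires if
    // both flags are eventually set; a missed upstream close leaks state.
    if (upstream_closed_ && downstream_closed_ && !logged_) {
      logged_ = true;
      onLog();
    }
  }

  void onLog() { /* plugin cleanup would happen here */ }

  bool upstream_closed_ = false;
  bool downstream_closed_ = false;
  bool logged_ = false;
};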
/cc @jplevyak
Thanks @gargnupur, I was hoping to disable HTTP keepalive through Envoy config only, without any (client/server) application code changes. I will play around with EnvoyFilter some more to see if I can disable HTTP keepalive there.
I managed to disable HTTP keepalive on the server side, but the leak is still occurring, with the tcp connections opened metric growing much bigger than the tcp connections closed metric over time. At this point I don't think it's productive for me to flail around guessing what the underlying problem is, although I will be happy to run an instrumented build (perhaps with additional metrics for upstreamClosed vs downstreamClosed events detected by the wasm context?). Thanks to @bianpengyuan's suggestion, I do have a decent workaround of disabling the tcp stats plugin, but I do hope to see this issue resolved, as this seems to be a fundamental mismatch in how connection lifecycle events are handled in the wasm plugin.
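For what it's worth, a very rough sketch of what such instrumentation could look like (the counter names and wiring here are hypothetical, not existing Envoy stats): bump one counter wherever the context records a downstream close and one wherever it records an upstream close, then compare the two against the opened/closed TCP metrics to see whether upstream closes are systematically missed.

#include <atomic>
#include <cstdio>

// Hypothetical process-wide counters; a real instrumented build would
// register proper Envoy stats instead of bare atomics.
std::atomic<uint64_t> downstream_close_seen{0};
std::atomic<uint64_t> upstream_close_seen{0};

void recordDownstreamClose() { ++downstream_close_seen; }
void recordUpstreamClose() { ++upstream_close_seen; }

// Periodically logging the two values would show whether upstream close
// events lag systematically behind downstream ones.
void dumpCloseCounters() {
  std::printf("downstream_close=%llu upstream_close=%llu\n",
              static_cast<unsigned long long>(downstream_close_seen.load()),
              static_cast<unsigned long long>(upstream_close_seen.load()));
}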
@rmak-cpi: Can you give us some pointers on your setup, so that we can try to repro this on our end?
There really isn't anything particularly special about my setup; it's basically an HTTP service exposed through the following chain of LBs/proxies: public AWS ALB -> Istio Envoy sidecar -> my Java application (my service is configured as a NodePort service so the ALB can use all cluster workers as target instances). Most gcloud/GKE setups are probably dominated by gRPC, which I think mostly uses persistent connections intended to last forever. The leakage that I observe is actually happening pretty slowly, given that HTTP keepalive is in place to keep TCP connections open for relatively long periods, but it's definitely noticeable after a week or so.
@rmak-cpi Sorry, I might be missing some context here: did you remove the tcp filter, and did Envoy memory usage stop increasing because of that? My interpretation of your comment #24720 (comment) was that you had not performed the test yet, but it seems like you did? If so, I think we can try to expose some Envoy stats to determine whether this is a memory leak or a connection leak.
Hi @bianpengyuan, to clarify my earlier comment, I did verify that the leak stopped once I removed the (and only the) tcp stats filters. I wanted to experiment with keepalive settings in the hope that I might be able to further isolate the problem.
We are hitting this with a https://github.com/DandyDeveloper/charts/tree/master/charts/redis-ha deployment with HAProxy. The Envoy sidecar process uses an increasing amount of memory over time, and the difference between opened and closed connections grows too. This is after two hours:
Memory consumption is up to about 100 MB RES, while in other deployments it is around 50-60 MB.
We are experiencing the same on Istio. At about 15:50 we removed the EnvoyFilter.
Fix is going to be shipped with 1.8.1. |
@bianpengyuan what about Istio 1.6?
@Kyslik 1.6 has just shipped its last release. @kyessenov do you think it is possible to backport to 1.7?
Yeah, it's a simple fix. @PiotrSikora to confirm.
@bianpengyuan You mean 1.6.14? https://istio.io/latest/news/releases/1.6.x/announcing-1.6.14/ does not mention this very GitHub issue; please correct me if I am wrong.
@Kyslik Sorry, I meant we did not catch the last 1.6 release. At best we can backport this to Istio 1.7.
@PiotrSikora are we OK backporting the fix to 1.7 istio proxy?
Is this fix shipped with 1.8.1?
FWIW, I deployed 1.8.1 to my environments and I no longer observe the original memory leak (with the TCP metrics envoy filter enabled). But I guess @kyessenov, @PiotrSikora, and/or @bianpengyuan can give a more official word on this.
Yes, this has been fixed in 1.8.1+. It will also be fixed in 1.7.next.
Hi, has this been backported to Istio 1.7?
Yes, it will be in the next 1.7 release.
The charts are updated for Istio 1.7.8. Here's a summary of the changes.

* cray-istio-operator
I updated the charts/istio-operator subchart from the 1.7.8 distribution. I got rid of the wait-jobs hook because it isn't necessary, since Helm takes care of making sure the CRD is ready. On a related note, I found that when I upgraded from Istio 1.6.13, Helm deleted the IstioOperator CRD. To work around this, there's an upgrade hook that recreates the IstioOperator CRD if it doesn't exist.

* cray-istio-deploy
Just changed the default image tags and removed things from the README that weren't accurate.

* cray-istio
I updated the charts/istio and charts/ingressgatewayhmn subcharts with the latest versions from the 1.7.8 distribution. I removed the transfer encoding workaround (CASMPET-3079); the upstream bug is fixed: envoyproxy/envoy#10041. I removed the tcp-stats-filter-1.6 memory leak workaround (CASMPET-4026); the upstream bug is fixed: istio/istio#24720. The istio-ingressgateway now needs `runAsRoot: true`, see https://istio.io/latest/news/releases/1.7.x/announcing-1.7/upgrade-notes/#gateways-run-as-non-root .
Bug description
Istio-proxy is consuming an increasing amount of memory for an application with moderate HTTP traffic (10-20 qps).
[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[ ] Networking
[x] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
Expected behavior
Istio-proxy memory should stay relatively constant over time.
Steps to reproduce the bug
Here is a heap profile flame graph after the service has been up for 12+ hours:
The flame graph seems to indicate that the heap has been accumulating blocks allocated from operator() calls through the NullPlugin, although I am at a bit of a loss trying to figure out a workaround from reading the relevant source code.
In case it's helpful, here is the text version of the heap profile:
File: envoy
Type: inuse_space
Showing nodes accounting for 33124.62kB, 98.34% of 33682.43kB total
Dropped 482 nodes (cum <= 168.41kB)
flat flat% sum% cum cum%
27546.82kB 81.78% 81.78% 33411.07kB 99.19% std::__1::__function::__func::operator()
1816.88kB 5.39% 87.18% 2421.37kB 7.19% std::__1::__hash_table::__emplace_unique_key_args
1449.59kB 4.30% 91.48% 2052.53kB 6.09% std::__1::__hash_table::__emplace_unique_impl
1206.61kB 3.58% 95.06% 1206.61kB 3.58% std::__1::__hash_table::__rehash
819.49kB 2.43% 97.50% 819.49kB 2.43% OPENSSL_malloc
246.34kB 0.73% 98.23% 246.34kB 0.73% std::__1::basic_string::basic_string
28.98kB 0.086% 98.31% 1025.07kB 3.04% Envoy::Extensions::TransportSockets::Tls::ContextManagerImpl::createSslClientContext
7.83kB 0.023% 98.34% 32049.05kB 95.15% Envoy::Server::ConnectionHandlerImpl::ActiveTcpListener::newConnection
2.06kB 0.0061% 98.34% 32053.88kB 95.16% Envoy::Network::ListenerImpl::listenCallback
0 0% 98.34% 613.34kB 1.82% ASN1_item_d2i
0 0% 98.34% 1439.37kB 4.27% Envoy::Config::GrpcMuxImpl::onDiscoveryResponse
0 0% 98.34% 1359.13kB 4.04% Envoy::Config::GrpcSubscriptionImpl::onConfigUpdate
0 0% 98.34% 1144.74kB 3.40% Envoy::Event::FileEventImpl::assignEvents(unsigned int, event_base*)::$_0::__invoke
0 0% 98.34% 29594.81kB 87.86% Envoy::Extensions::Common::Wasm::Context::onCreate
0 0% 98.34% 32009.40kB 95.03% Envoy::Extensions::Common::Wasm::Context::onNetworkNewConnection
0 0% 98.34% 29594.81kB 87.86% Envoy::Extensions::Common::Wasm::Null::NullPlugin::ensureContext
0 0% 98.34% 2414.76kB 7.17% Envoy::Extensions::Common::Wasm::Null::Plugin::Stats::PluginContext::onNewConnection
0 0% 98.34% 992.35kB 2.95% Envoy::Extensions::TransportSockets::Tls::ClientContextImpl::ClientContextImpl
0 0% 98.34% 1025.07kB 3.04% Envoy::Extensions::TransportSockets::Tls::ClientSslSocketFactory::onAddOrUpdateSecret
0 0% 98.34% 1081.84kB 3.21% Envoy::Extensions::TransportSockets::Tls::ContextImpl::ContextImpl
0 0% 98.34% 1433.78kB 4.26% Envoy::Grpc::AsyncStreamCallbacks::onReceiveMessageRaw
0 0% 98.34% 1430.18kB 4.25% Envoy::Grpc::AsyncStreamImpl::onData
0 0% 98.34% 1427.35kB 4.24% Envoy::Http::AsyncStreamImpl::encodeData
0 0% 98.34% 1223kB 3.63% Envoy::Http::CodecClient::CodecReadFilter::onData
0 0% 98.34% 1277.10kB 3.79% Envoy::Http::CodecClient::onData
0 0% 98.34% 1293.62kB 3.84% Envoy::Http::Http2::ConnectionImpl::dispatch
0 0% 98.34% 1402.62kB 4.16% Envoy::Http::Http2::ConnectionImpl::onFrameReceived
0 0% 98.34% 1409.21kB 4.18% Envoy::Http::ResponseDecoderWrapper::decodeData
0 0% 98.34% 753.30kB 2.24% Envoy::MainCommonBase::run
0 0% 98.34% 1178.06kB 3.50% Envoy::Network::ConnectionImpl::onFileEvent
0 0% 98.34% 1219.07kB 3.62% Envoy::Network::ConnectionImpl::onReadReady
0 0% 98.34% 32014.61kB 95.05% Envoy::Network::FilterManagerImpl::initializeReadFilters
0 0% 98.34% 33233.24kB 98.67% Envoy::Network::FilterManagerImpl::onContinueReading
0 0% 98.34% 1420.46kB 4.22% Envoy::Router::UpstreamRequest::decodeData
0 0% 98.34% 1359.39kB 4.04% Envoy::Secret::SdsApi::onConfigUpdate
0 0% 98.34% 32050.53kB 95.16% Envoy::Server::ConnectionHandlerImpl::ActiveTcpListener::onAcceptWorker
0 0% 98.34% 32050.53kB 95.16% Envoy::Server::ConnectionHandlerImpl::ActiveTcpSocket::continueFilterChain
0 0% 98.34% 886.31kB 2.63% Envoy::Server::InstanceImpl::run
0 0% 98.34% 32102.04kB 95.31% Envoy::Server::WorkerImpl::threadRoutine
0 0% 98.34% 220.22kB 0.65% Envoy::Ssl::TlsCertificateConfigImpl::TlsCertificateConfigImpl
0 0% 98.34% 32101.35kB 95.31% Envoy::Thread::ThreadImplPosix::ThreadImplPosix(std::__1::function)::$_0::__invoke
0 0% 98.34% 403.05kB 1.20% PEM_ASN1_read_bio
0 0% 98.34% 195.30kB 0.58% PEM_X509_INFO_read_bio
0 0% 98.34% 571.81kB 1.70% [libc-2.27.so]
0 0% 98.34% 223.61kB 0.66% asn1_d2i_ex_primitive
0 0% 98.34% 223.61kB 0.66% asn1_ex_c2i
0 0% 98.34% 175.63kB 0.52% asn1_item_ex_combine_new
0 0% 98.34% 613.34kB 1.82% asn1_item_ex_d2i
0 0% 98.34% 454.71kB 1.35% asn1_template_ex_d2i
0 0% 98.34% 454.71kB 1.35% asn1_template_noexp_d2i
0 0% 98.34% 207.76kB 0.62% d2i_X509_AUX
0 0% 98.34% 33099.96kB 98.27% event_base_loop
0 0% 98.34% 33183.10kB 98.52% event_process_active_single_queue
0 0% 98.34% 32053.88kB 95.16% listener_read_cb
0 0% 98.34% 675.96kB 2.01% main
0 0% 98.34% 1326.59kB 3.94% nghttp2_session_mem_recv
0 0% 98.34% 1401.21kB 4.16% nghttp2_session_on_data_received
0 0% 98.34% 195.30kB 0.58% parse_x509
0 0% 98.34% 32100.56kB 95.30% start_thread
0 0% 98.34% 602.94kB 1.79% std::__1::__hash_table::__node_insert_unique_prepare
Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)

$ istioctl version --remote
client version: 1.6.2
control plane version: 1.6.2
data plane version: 1.6.2 (20 proxies)
While I was only able to dig into this after the upgrade to 1.6.2, I have observed the increased memory consumption since 1.5.2.
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:52:00Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.8-eks-e16311", GitCommit:"e163110a04dcb2f39c3325af96d019b4925419eb", GitTreeState:"clean", BuildDate:"2020-03-27T22:37:12Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
How was Istio installed?
istioctl manifest apply -f / istioctl upgrade -f
Environment where bug was observed (cloud vendor, OS, etc)
AWS EKS 1.6 control plane and worker nodes