Potential memory leak from WASM NullPlugin onNetworkNewConnection #24720

Closed

rmak-cpi opened this issue Jun 16, 2020 · 41 comments

@rmak-cpi commented Jun 16, 2020

Bug description
Istio-proxy is consuming an increasing amount of memory for an application with moderate HTTP traffic (10-20 qps).

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[ ] Networking
[x] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior
Istio-proxy memory should stay relatively constant over time.

Steps to reproduce the bug
Here is a heap profile flame graph after the service has been up for 12+ hours:

[Screenshot: heap profile flame graph, taken 2020-06-16 11:49 AM]

The flame graph seems to indicate that the heap has been accumulating blocks allocated from operator() calls through the NullPlugin, although I am at a bit of a loss to figure out a workaround from reading the relevant source code.

In case it's helpful, here is the text version of the heap profile:

File: envoy
Type: inuse_space
Showing nodes accounting for 33124.62kB, 98.34% of 33682.43kB total
Dropped 482 nodes (cum <= 168.41kB)
flat flat% sum% cum cum%
27546.82kB 81.78% 81.78% 33411.07kB 99.19% std::__1::__function::__func::operator()
1816.88kB 5.39% 87.18% 2421.37kB 7.19% std::__1::__hash_table::__emplace_unique_key_args
1449.59kB 4.30% 91.48% 2052.53kB 6.09% std::__1::__hash_table::__emplace_unique_impl
1206.61kB 3.58% 95.06% 1206.61kB 3.58% std::__1::__hash_table::__rehash
819.49kB 2.43% 97.50% 819.49kB 2.43% OPENSSL_malloc
246.34kB 0.73% 98.23% 246.34kB 0.73% std::__1::basic_string::basic_string
28.98kB 0.086% 98.31% 1025.07kB 3.04% Envoy::Extensions::TransportSockets::Tls::ContextManagerImpl::createSslClientContext
7.83kB 0.023% 98.34% 32049.05kB 95.15% Envoy::Server::ConnectionHandlerImpl::ActiveTcpListener::newConnection
2.06kB 0.0061% 98.34% 32053.88kB 95.16% Envoy::Network::ListenerImpl::listenCallback
0 0% 98.34% 613.34kB 1.82% ASN1_item_d2i
0 0% 98.34% 1439.37kB 4.27% Envoy::Config::GrpcMuxImpl::onDiscoveryResponse
0 0% 98.34% 1359.13kB 4.04% Envoy::Config::GrpcSubscriptionImpl::onConfigUpdate
0 0% 98.34% 1144.74kB 3.40% Envoy::Event::FileEventImpl::assignEvents(unsigned int, event_base*)::$_0::__invoke
0 0% 98.34% 29594.81kB 87.86% Envoy::Extensions::Common::Wasm::Context::onCreate
0 0% 98.34% 32009.40kB 95.03% Envoy::Extensions::Common::Wasm::Context::onNetworkNewConnection
0 0% 98.34% 29594.81kB 87.86% Envoy::Extensions::Common::Wasm::Null::NullPlugin::ensureContext
0 0% 98.34% 2414.76kB 7.17% Envoy::Extensions::Common::Wasm::Null::Plugin::Stats::PluginContext::onNewConnection
0 0% 98.34% 992.35kB 2.95% Envoy::Extensions::TransportSockets::Tls::ClientContextImpl::ClientContextImpl
0 0% 98.34% 1025.07kB 3.04% Envoy::Extensions::TransportSockets::Tls::ClientSslSocketFactory::onAddOrUpdateSecret
0 0% 98.34% 1081.84kB 3.21% Envoy::Extensions::TransportSockets::Tls::ContextImpl::ContextImpl
0 0% 98.34% 1433.78kB 4.26% Envoy::Grpc::AsyncStreamCallbacks::onReceiveMessageRaw
0 0% 98.34% 1430.18kB 4.25% Envoy::Grpc::AsyncStreamImpl::onData
0 0% 98.34% 1427.35kB 4.24% Envoy::Http::AsyncStreamImpl::encodeData
0 0% 98.34% 1223kB 3.63% Envoy::Http::CodecClient::CodecReadFilter::onData
0 0% 98.34% 1277.10kB 3.79% Envoy::Http::CodecClient::onData
0 0% 98.34% 1293.62kB 3.84% Envoy::Http::Http2::ConnectionImpl::dispatch
0 0% 98.34% 1402.62kB 4.16% Envoy::Http::Http2::ConnectionImpl::onFrameReceived
0 0% 98.34% 1409.21kB 4.18% Envoy::Http::ResponseDecoderWrapper::decodeData
0 0% 98.34% 753.30kB 2.24% Envoy::MainCommonBase::run
0 0% 98.34% 1178.06kB 3.50% Envoy::Network::ConnectionImpl::onFileEvent
0 0% 98.34% 1219.07kB 3.62% Envoy::Network::ConnectionImpl::onReadReady
0 0% 98.34% 32014.61kB 95.05% Envoy::Network::FilterManagerImpl::initializeReadFilters
0 0% 98.34% 33233.24kB 98.67% Envoy::Network::FilterManagerImpl::onContinueReading
0 0% 98.34% 1420.46kB 4.22% Envoy::Router::UpstreamRequest::decodeData
0 0% 98.34% 1359.39kB 4.04% Envoy::Secret::SdsApi::onConfigUpdate
0 0% 98.34% 32050.53kB 95.16% Envoy::Server::ConnectionHandlerImpl::ActiveTcpListener::onAcceptWorker
0 0% 98.34% 32050.53kB 95.16% Envoy::Server::ConnectionHandlerImpl::ActiveTcpSocket::continueFilterChain
0 0% 98.34% 886.31kB 2.63% Envoy::Server::InstanceImpl::run
0 0% 98.34% 32102.04kB 95.31% Envoy::Server::WorkerImpl::threadRoutine
0 0% 98.34% 220.22kB 0.65% Envoy::Ssl::TlsCertificateConfigImpl::TlsCertificateConfigImpl
0 0% 98.34% 32101.35kB 95.31% Envoy::Thread::ThreadImplPosix::ThreadImplPosix(std::__1::function)::$_0::__invoke
0 0% 98.34% 403.05kB 1.20% PEM_ASN1_read_bio
0 0% 98.34% 195.30kB 0.58% PEM_X509_INFO_read_bio
0 0% 98.34% 571.81kB 1.70% [libc-2.27.so]
0 0% 98.34% 223.61kB 0.66% asn1_d2i_ex_primitive
0 0% 98.34% 223.61kB 0.66% asn1_ex_c2i
0 0% 98.34% 175.63kB 0.52% asn1_item_ex_combine_new
0 0% 98.34% 613.34kB 1.82% asn1_item_ex_d2i
0 0% 98.34% 454.71kB 1.35% asn1_template_ex_d2i
0 0% 98.34% 454.71kB 1.35% asn1_template_noexp_d2i
0 0% 98.34% 207.76kB 0.62% d2i_X509_AUX
0 0% 98.34% 33099.96kB 98.27% event_base_loop
0 0% 98.34% 33183.10kB 98.52% event_process_active_single_queue
0 0% 98.34% 32053.88kB 95.16% listener_read_cb
0 0% 98.34% 675.96kB 2.01% main
0 0% 98.34% 1326.59kB 3.94% nghttp2_session_mem_recv
0 0% 98.34% 1401.21kB 4.16% nghttp2_session_on_data_received
0 0% 98.34% 195.30kB 0.58% parse_x509
0 0% 98.34% 32100.56kB 95.30% start_thread
0 0% 98.34% 602.94kB 1.79% std::__1::__hash_table::__node_insert_unique_prepare

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)

$istioctl version --remote
client version: 1.6.2
control plane version: 1.6.2
data plane version: 1.6.2 (20 proxies)

While I was only able to dig into this after the upgrade to 1.6.2, I have observed the increased memory consumption since 1.5.2.

$kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:52:00Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.8-eks-e16311", GitCommit:"e163110a04dcb2f39c3325af96d019b4925419eb", GitTreeState:"clean", BuildDate:"2020-03-27T22:37:12Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}

How was Istio installed?

istioctl manifest apply -f/istioctl upgrade -f

Environment where bug was observed (cloud vendor, OS, etc)

AWS EKS 1.6 control plane and worker nodes

@bianpengyuan (Contributor) commented Jun 16, 2020

Can you paste the stats that are collected by istio-proxy and are related to the problematic service (query the Prometheus server for istio_requests_total & istio_tcp_{received,sent}_bytes_total, or exec into the proxy container and curl localhost:15000/stats/prometheus)?

Also, when you say this is an HTTP app, do you mean plain HTTP or HTTPS? I am wondering how onNetworkNewConnection gets triggered for HTTP traffic. @gargnupur Is this expected?

@gargnupur (Contributor)

Looks like this is HTTPS, because it's creating a TLS client connection in the stack trace... I am not sure if this should happen; I will have to test it out.

From the stack trace, it looks like this is what is using the space: https://sourcegraph.com/github.com/envoyproxy/envoy@master/-/blob/source/extensions/transport_sockets/tls/context_manager_impl.cc#L35:13

@gargnupur (Contributor)

Could it be related to envoyproxy/envoy#10030?

@rmak-cpi (Author) commented Jun 17, 2020

Can you paste the stats that are collected by istio-proxy and are related to the problematic service (query the Prometheus server for istio_requests_total & istio_tcp_{received,sent}_bytes_total, or exec into the proxy container and curl localhost:15000/stats/prometheus)?

Also, when you say this is an HTTP app, do you mean plain HTTP or HTTPS? I am wondering how onNetworkNewConnection gets triggered for HTTP traffic. @gargnupur Is this expected?

Hi @bianpengyuan, per your request, here is a link to the Prometheus metrics on istio-proxy: https://gist.github.com/rmak-cpi/835570832707ca8a58e67c3947938a00.

HTTPS is terminated at our cloud provider's load balancer, but the application does serve mTLS traffic for in-cluster clients, which might be useful information.

(gist link was bad, updated)

@rmak-cpi (Author) commented Jul 7, 2020

Hi @bianpengyuan, I took some time reading through the new WASM plugin code (it looks like I have both the metadata_exchange and stats plugins enabled). If I am reading the code correctly, the Context definition should be coming from the WASM plugin SDK (https://github.com/proxy-wasm/proxy-wasm-cpp-sdk/blob/master/proxy_wasm_api.h#L429), and given that the Context class has a virtual destructor defined, shouldn't the PluginContext class in the stats (and metadata_exchange) plugins have a virtual destructor defined as well? Otherwise the destructors for its members will not be called (in particular the request_info_ shared_ptr). This could explain the slow leak that I am observing, but I may be reading too much into this, so it would be great if you could confirm one way or another. Thanks!

@rmak-cpi (Author) commented Jul 7, 2020

Please ignore my previous comment about declaring a virtual destructor in the PluginContext class; it looks like the compiler should take care of generating the correct destructor for PluginContext in the vtable. That said, it means I don't really have anything to go on to help track down this problem :P
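A minimal standalone sketch (stand-in types only, not the actual SDK or plugin classes) of why this retraction holds: when the base class declares a virtual destructor, deleting a derived object through a base pointer still runs the compiler-generated derived destructor, which in turn destroys members such as a shared_ptr:

#include <iostream>
#include <memory>

struct RequestInfo {
  ~RequestInfo() { std::cout << "RequestInfo destroyed\n"; }
};

struct ContextBase {
  virtual ~ContextBase() = default;  // virtual destructor, as in the SDK's Context
};

struct PluginContextSketch : public ContextBase {
  // No user-declared destructor: the implicitly generated one is still
  // virtual (inherited) and destroys all members, including the shared_ptr.
  std::shared_ptr<RequestInfo> request_info_ = std::make_shared<RequestInfo>();
};

int main() {
  std::unique_ptr<ContextBase> ctx = std::make_unique<PluginContextSketch>();
  ctx.reset();  // prints "RequestInfo destroyed"
  return 0;
}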

@rmak-cpi (Author)

I have narrowed the issue down to the stats plugin (https://github.com/istio/proxy/blob/master/extensions/stats/plugin.h) by disabling metrics and observing that memory consumption by the Istio proxy remained constant for long periods of time. Disabling metrics is clearly less than ideal, and from looking at the code, the leak can definitely occur if the onLog method is not called reliably for each request. That said, I am having a hard time tracing through the (envoy/istio-proxy/wasm-proxy) code to understand how onLog is called in the request lifecycle, so any pointers to the relevant code snippets would be very much appreciated.

@bianpengyuan (Contributor)

Are there any labels in the metric that have unbounded values? Can you dump the stats page (k exec <pod> -c istio-proxy curl localhost:15000/stats/prometheus) and check whether too many time series are being generated?

@rmak-cpi (Author) commented Aug 14, 2020

Are there any labels in the metric that have unbounded values? Can you dump the stats page (k exec <pod> -c istio-proxy curl localhost:15000/stats/prometheus) and check whether too many time series are being generated?

Hi @bianpengyuan, I have posted a new gist of the Prometheus metrics dump at https://gist.github.com/rmak-cpi/fbf7c59eb16a10cc4a0e0dba1c241597; one thing that jumps out at me is that the total number of TCP connections opened is much higher than the total number of TCP connections closed:

# TYPE istio_tcp_connections_opened_total counter

istio_tcp_connections_opened_total{reporter="source",source_workload="xxxxxxxxxxxxxxx",source_workload_namespace="default",source_principal="unknown",source_app="unknown",source_version="unknown",destination_workload="unknown",destination_workload_namespace="unknown",destination_principal="unknown",destination_app="unknown",destination_version="unknown",destination_service="unknown",destination_service_name="PassthroughCluster",destination_service_namespace="unknown",request_protocol="tcp",response_flags="-",connection_security_policy="unknown",source_canonical_service="xxxxxxxxxxxxxxx",destination_canonical_service="unknown",source_canonical_revision="latest",destination_canonical_revision="latest"} 55938

# TYPE istio_tcp_connections_closed_total counter

istio_tcp_connections_closed_total{reporter="source",source_workload="xxxxxxxxxxxxxxx",source_workload_namespace="default",source_principal="unknown",source_app="unknown",source_version="unknown",destination_workload="unknown",destination_workload_namespace="unknown",destination_principal="unknown",destination_app="unknown",destination_version="unknown",destination_service="unknown",destination_service_name="PassthroughCluster",destination_service_namespace="unknown",request_protocol="tcp",response_flags="-",connection_security_policy="unknown",source_canonical_service="xxxxxxxxxxxxxxx",destination_canonical_service="unknown",source_canonical_revision="latest",destination_canonical_revision="latest"} 861

@gargnupur (Contributor)

If TCP connections don't close, then RequestInfo will hang around in the plugin:

Map<uint32_t, std::shared_ptr<::Wasm::Common::RequestInfo>> tcp_request_queue_;

This could be the reason for the leak... do you know why requests are not closing?
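A small self-contained sketch (simplified stand-in types, not the actual stats plugin code) of the accumulation being described: entries keyed by connection are inserted into a map like tcp_request_queue_ when a connection opens and are only erased on the close path, so connections that never report a close leave their RequestInfo behind:

#include <cstdint>
#include <iostream>
#include <memory>
#include <unordered_map>

struct RequestInfo {
  uint64_t received_bytes = 0;
  uint64_t sent_bytes = 0;
};

class PluginRootContextSketch {
 public:
  void onNewTcpConnection(uint32_t context_id) {
    // One RequestInfo is created and held per TCP connection.
    tcp_request_queue_[context_id] = std::make_shared<RequestInfo>();
  }

  void onTcpConnectionClose(uint32_t context_id) {
    // The entry is only released on the close path; if this never runs for
    // a connection, its RequestInfo stays in the map indefinitely.
    tcp_request_queue_.erase(context_id);
  }

  size_t liveEntries() const { return tcp_request_queue_.size(); }

 private:
  std::unordered_map<uint32_t, std::shared_ptr<RequestInfo>> tcp_request_queue_;
};

int main() {
  PluginRootContextSketch root;
  // Roughly mirrors the reported metrics: far more opens than closes.
  for (uint32_t id = 0; id < 55938; ++id) root.onNewTcpConnection(id);
  for (uint32_t id = 0; id < 861; ++id) root.onTcpConnectionClose(id);
  std::cout << "entries never freed: " << root.liveEntries() << "\n";  // 55077
  return 0;
}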

@rmak-cpi (Author)

If TCP connections don't close, then RequestInfo will hang around in the plugin:

Map<uint32_t, std::shared_ptr<::Wasm::Common::RequestInfo>> tcp_request_queue_;

This could be the reason for the leak... do you know why requests are not closing?

The service is taking HTTP traffic directly from a cloud-provider load balancer (AWS ALB); I suspect this may have something to do with HTTP keep-alive handling between Envoy and the load balancer.

@rmak-cpi (Author)

I am also not seeing any signs of TCP sockets leaking at the operating system level through node-exporter metrics.

@bianpengyuan (Contributor)

Is it the case that the sidecar with increased memory only receives HTTP requests? If so, there should be no TCP metrics generated.

If it is possible, and to help us narrow down the issue, can you try deleting the 1.6 TCP stats filter (k delete envoyfilter tcp-stats-filter-1.6 -n istio-system) and see if that solves your issue? Without this filter, HTTP metrics should still be generated.

@rmak-cpi (Author)

Hi @bianpengyuan, let me try that later on, as I am currently running an experiment disabling HTTP keep-alive; I will report back once I've obtained data from disabling the tcp-stats-filter.

@rmak-cpi (Author) commented Aug 15, 2020

Hi @bianpengyuan, just want to report back that Envoy proxy memory usage remained stable after disabling the TCP stats filter, but my own experiment disabling HTTP keep-alive hasn't gone so well, since I have yet to find the magic config to actually disable HTTP keep-alive.

@rmak-cpi (Author)

Since onLog is only called when the connection is closed on both the upstream and downstream sides, the following code snippet seems a little risky (https://github.com/istio/envoy/blob/master/source/extensions/common/wasm/context.cc#L1571):

if (end_stream) {
  // This is called when seeing end_stream=true and not on an upstream connection event,
  // because registering for latter requires replicating the whole TCP proxy extension.
  onUpstreamConnectionClose(PeerType::Unknown);
}

It sounds like an upstream connection close event can only be detected if there is explicit data flow terminating the data stream, but I could be reading too much into this.
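A toy model of that concern (assumed behaviour for illustration, not Envoy's actual code paths): if per-connection cleanup is keyed solely on observing end_stream=true on the data path, a connection torn down abruptly (for example by a reset, with no final data event) never triggers the upstream-close handler and its record is never released:

#include <cstdint>
#include <iostream>
#include <memory>
#include <unordered_map>

struct ConnRecord {
  // Stand-in for per-connection state held by the filter context.
};

class TcpContextSketch {
 public:
  void onNewConnection(uint32_t id) { records_[id] = std::make_unique<ConnRecord>(); }

  void onDownstreamData(uint32_t id, bool end_stream) {
    if (end_stream) {
      // Mirrors the quoted snippet: upstream close is only inferred from
      // seeing end_stream on the data path.
      onUpstreamConnectionClose(id);
    }
  }

  void onUpstreamConnectionClose(uint32_t id) { records_.erase(id); }

  size_t live() const { return records_.size(); }

 private:
  std::unordered_map<uint32_t, std::unique_ptr<ConnRecord>> records_;
};

int main() {
  TcpContextSketch ctx;
  ctx.onNewConnection(1);
  ctx.onDownstreamData(1, /*end_stream=*/false);  // data flows, no FIN observed
  // The peer resets the connection here; no further data events arrive, so
  // onUpstreamConnectionClose is never called for id 1.
  std::cout << "records still held: " << ctx.live() << "\n";  // prints 1
  return 0;
}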

@gargnupur (Contributor)

/cc @jplevyak
@rmak-cpi : how about something like this? https://stackoverflow.com/questions/31052911/how-to-prevent-keep-alive-in-a-http-1-1-request

@rmak-cpi (Author)

Thanks @gargnupur, I was hoping to disable HTTP keep-alive through Envoy config only, without any (client/server) application code changes. I will try to play around with EnvoyFilter some more to see if I can disable HTTP keep-alive there.

@rmak-cpi (Author)

I managed to disable HTTP keep-alive on the server side, but the leak is still occurring, with the TCP connections opened metric growing much bigger than the TCP connections closed metric over time. At this point, I don't think it's productive for me to flail around guessing what the underlying problem is, although I would be happy to run an instrumented build (perhaps with additional metrics for upstreamClosed vs downstreamClosed events detected by the WASM context?).
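For reference, a rough sketch (illustrative names only, not an existing Envoy or plugin API) of the kind of counters such an instrumented build might bump from the context's close handlers, so the ratio of observed downstream vs. upstream close events could be compared with the opened/closed connection metrics:

#include <atomic>
#include <cstdint>
#include <iostream>

// Counters an instrumented build could export as stats; plain globals here
// purely for illustration.
struct CloseEventCounters {
  std::atomic<uint64_t> downstream_closed{0};
  std::atomic<uint64_t> upstream_closed{0};
};

CloseEventCounters g_close_counters;

// In an instrumented build these would be invoked from the context's
// downstream/upstream connection close callbacks.
void recordDownstreamClose() { g_close_counters.downstream_closed++; }
void recordUpstreamClose() { g_close_counters.upstream_closed++; }

int main() {
  recordDownstreamClose();
  std::cout << "downstream_closed=" << g_close_counters.downstream_closed
            << " upstream_closed=" << g_close_counters.upstream_closed << "\n";
  return 0;
}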

Thanks to @bianpengyuan's suggestion, I do have a decent workaround of disabling the TCP stats plugin, but I do hope to see this issue resolved, as there seems to be a fundamental mismatch in how connection lifecycle events are handled in the WASM plugin.

@gargnupur (Contributor)

@rmak-cpi : Can you give us some pointers on your setup, so that we can try to repro this on our end?

@rmak-cpi (Author)

@rmak-cpi : Can you give us some pointers on your setup, so that we can try to repro this on our end?

There really isn't anything particularly special about my setup; it's basically an HTTP service exposed through the following chain of LB/proxies:

public AWS ALB -> Istio Envoy sidecar -> my Java application (my service is configured as a NodePort service so the ALB can use all cluster workers as target instances)

Most gcloud/GKE setups are probably dominated by gRPC, which I think mostly uses persistent connections intended to last forever. The leakage that I observe is actually happening pretty slowly, given that HTTP keep-alive is in place to keep TCP connections open for relatively long periods, but it's definitely noticeable after a week or so.

@bianpengyuan (Contributor)

Thanks to @bianpengyuan's suggestion, I do have a decent workaround of disabling the TCP stats plugin

@rmak-cpi Sorry, I might be missing some context here: did you remove the TCP filter, and Envoy memory usage did not increase any more because of that? My interpretation of your comment #24720 (comment) was that you had not performed that test, but it seems like you did? If so, I think we can try to expose some Envoy stats and determine whether this is a memory leak or a connection leak.

@rmak-cpi (Author)

Hi @bianpengyuan, to clarify my earlier comment, I did verify that the leak stopped once I removed the TCP stats filter (and only that filter). I wanted to experiment with keep-alive settings in the hope that I might be able to further isolate the problem.

@kppullin commented Sep 9, 2020

I believe I have run into the same leak. On our most severely affected pod, the heap pprof flame graph looks more or less identical to the image in the initial report above. Deleting the stats filter as suggested releases the "leaked" memory, as shown in this screenshot (it's only over a 5-minute range, so you can't see the growth of the leak too well).
[Screenshot: memory usage after deleting the stats filter]

It is interesting to note that we only have outbound stats enabled, and sure enough, the services running in our cluster that frequently open many outbound connections (e.g. non-pooled connections to the k8s API servers and various database flavors) leak much, much faster than those using HTTP keep-alives and DB connection pools. This aligns with the heap flame graph showing high allocations in the newConnection path. Those with high inbound connection open rates do not appear to leak. I think we've lucked out here a bit since our inbound open rate is much higher than our outbound rate :).

The memory leak is slow, about 2-5 MB/hour, varying across our dev, test, and production clusters. A quick, rough look via tcpdump shows a connection open rate of ~2 connections/sec.

Happy to provide further traces, tests, etc. to get to the root cause.

@bianpengyuan added this to the 1.8 milestone on Oct 30, 2020
@andrascz commented Nov 3, 2020

We are hitting this with a https://github.com/DandyDeveloper/charts/tree/master/charts/redis-ha deployment with HAProxy. The Envoy sidecar process uses an increasing amount of memory over time, and the difference between opened and closed connections grows too. This is after two hours:

reporter=.=source;.;source_workload=.=redis-haproxy;.;source_workload_namespace=.=<redacted>;.;source_principal=.=spiffe://cluster.local/ns/<redacted>/sa/redis-haproxy;.;source_app=.=redis-ha-haproxy;.;source_version=.=unknown;.;source_canonical_service=.=redis-ha-haproxy;.;source_canonical_revision=.=latest;.;destination_workload=.=redis-server;.;destination_workload_namespace=.=<redacted>;.;destination_principal=.=spiffe://cluster.local/ns/<redacted>/sa/redis;.;destination_app=.=redis-ha;.;destination_version=.=unknown;.;destination_service=.=redis-announce-0.<redacted>.svc.cluster.local;.;destination_service_name=.=redis-announce-0;.;destination_service_namespace=.=<redacted>;.;destination_canonical_service=.=redis-ha;.;destination_canonical_revision=.=latest;.;request_protocol=.=tcp;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_tcp_connections_closed_total: 14492
reporter=.=source;.;source_workload=.=redis-haproxy;.;source_workload_namespace=.=<redacted>;.;source_principal=.=spiffe://cluster.local/ns/<redacted>/sa/redis-haproxy;.;source_app=.=redis-ha-haproxy;.;source_version=.=unknown;.;source_canonical_service=.=redis-ha-haproxy;.;source_canonical_revision=.=latest;.;destination_workload=.=redis-server;.;destination_workload_namespace=.=<redacted>;.;destination_principal=.=spiffe://cluster.local/ns/<redacted>/sa/redis;.;destination_app=.=redis-ha;.;destination_version=.=unknown;.;destination_service=.=redis-announce-0.<redacted>.svc.cluster.local;.;destination_service_name=.=redis-announce-0;.;destination_service_namespace=.=<redacted>;.;destination_canonical_service=.=redis-ha;.;destination_canonical_revision=.=latest;.;request_protocol=.=tcp;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_tcp_connections_opened_total: 21674
reporter=.=source;.;source_workload=.=redis-haproxy;.;source_workload_namespace=.=<redacted>;.;source_principal=.=spiffe://cluster.local/ns/<redacted>/sa/redis-haproxy;.;source_app=.=redis-ha-haproxy;.;source_version=.=unknown;.;source_canonical_service=.=redis-ha-haproxy;.;source_canonical_revision=.=latest;.;destination_workload=.=redis-server;.;destination_workload_namespace=.=<redacted>;.;destination_principal=.=spiffe://cluster.local/ns/<redacted>/sa/redis;.;destination_app=.=redis-ha;.;destination_version=.=unknown;.;destination_service=.=redis-announce-1.<redacted>.svc.cluster.local;.;destination_service_name=.=redis-announce-1;.;destination_service_namespace=.=<redacted>;.;destination_canonical_service=.=redis-ha;.;destination_canonical_revision=.=latest;.;request_protocol=.=tcp;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_tcp_connections_closed_total: 7323
reporter=.=source;.;source_workload=.=redis-haproxy;.;source_workload_namespace=.=<redacted>;.;source_principal=.=spiffe://cluster.local/ns/<redacted>/sa/redis-haproxy;.;source_app=.=redis-ha-haproxy;.;source_version=.=unknown;.;source_canonical_service=.=redis-ha-haproxy;.;source_canonical_revision=.=latest;.;destination_workload=.=redis-server;.;destination_workload_namespace=.=<redacted>;.;destination_principal=.=spiffe://cluster.local/ns/<redacted>/sa/redis;.;destination_app=.=redis-ha;.;destination_version=.=unknown;.;destination_service=.=redis-announce-1.<redacted>.svc.cluster.local;.;destination_service_name=.=redis-announce-1;.;destination_service_namespace=.=<redacted>;.;destination_canonical_service=.=redis-ha;.;destination_canonical_revision=.=latest;.;request_protocol=.=tcp;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_tcp_connections_opened_total: 19849
reporter=.=source;.;source_workload=.=redis-haproxy;.;source_workload_namespace=.=<redacted>;.;source_principal=.=spiffe://cluster.local/ns/<redacted>/sa/redis-haproxy;.;source_app=.=redis-ha-haproxy;.;source_version=.=unknown;.;source_canonical_service=.=redis-ha-haproxy;.;source_canonical_revision=.=latest;.;destination_workload=.=redis-server;.;destination_workload_namespace=.=<redacted>;.;destination_principal=.=spiffe://cluster.local/ns/<redacted>/sa/redis;.;destination_app=.=redis-ha;.;destination_version=.=unknown;.;destination_service=.=redis-announce-2.<redacted>.svc.cluster.local;.;destination_service_name=.=redis-announce-2;.;destination_service_namespace=.=<redacted>;.;destination_canonical_service=.=redis-ha;.;destination_canonical_revision=.=latest;.;request_protocol=.=tcp;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_tcp_connections_closed_total: 7163
reporter=.=source;.;source_workload=.=redis-haproxy;.;source_workload_namespace=.=<redacted>;.;source_principal=.=spiffe://cluster.local/ns/<redacted>/sa/redis-haproxy;.;source_app=.=redis-ha-haproxy;.;source_version=.=unknown;.;source_canonical_service=.=redis-ha-haproxy;.;source_canonical_revision=.=latest;.;destination_workload=.=redis-server;.;destination_workload_namespace=.=<redacted>;.;destination_principal=.=spiffe://cluster.local/ns/<redacted>/sa/redis;.;destination_app=.=redis-ha;.;destination_version=.=unknown;.;destination_service=.=redis-announce-2.<redacted>.svc.cluster.local;.;destination_service_name=.=redis-announce-2;.;destination_service_namespace=.=<redacted>;.;destination_canonical_service=.=redis-ha;.;destination_canonical_revision=.=latest;.;request_protocol=.=tcp;.;response_flags=.=-;.;connection_security_policy=.=unknown;.;_istio_tcp_connections_opened_total: 17249

Memory consumption is up to about 100 MB RES, while in other deployments it is around 50-60 MB.
There are lots of "connection reset by peer" messages in the Envoy debug log.

@Kyslik (Contributor) commented Nov 20, 2020

We are experiencing the same on Istio 1.6.13; we've tested the band-aid fix mentioned by @bianpengyuan in #24720 (comment).

At about 15:50, the EnvoyFilter tcp-stats-filter-1.6 was removed:

[Screenshot: istio-1-6-mem-leak memory usage graph]

@bianpengyuan (Contributor)

The fix is going to be shipped with 1.8.1.

@Kyslik (Contributor) commented Nov 20, 2020

@bianpengyuan what about Istio 1.6?

@bianpengyuan (Contributor)

@Kyslik 1.6 has just shipped its last release. @kyessenov do you think it is possible to backport to 1.7?

@kyessenov (Contributor)

Yeah, it's a simple fix. @PiotrSikora to confirm.

@bianpengyuan modified the milestones: 1.8, Backlog on Nov 30, 2020
@Kyslik (Contributor) commented Dec 1, 2020

@bianpengyuan You mean 1.6.14? https://istio.io/latest/news/releases/1.6.x/announcing-1.6.14/ does not mention this very GitHub issue; please correct me if I am wrong.

@bianpengyuan (Contributor)

@Kyslik Sorry, I meant we did not catch the last 1.6 release. At best we can backport this to Istio 1.7.

@kyessenov (Contributor)

@PiotrSikora are we OK backporting the fix to 1.7 istio proxy?

@SCLogo commented Jan 18, 2021

Is this fix shipped with 1.8.1?

@rmak-cpi (Author)

Is this fix shipped with 1.8.1?

FWIW, I deployed 1.8.1 to my environments and I no longer observe the original memory leak (with the TCP metrics EnvoyFilter enabled). But I guess @kyessenov, @PiotrSikora, and/or @bianpengyuan can give a more official word on this.

@bianpengyuan (Contributor)

Yes, this has been fixed in 1.8.1+. It will also be fixed in the next 1.7 release.

@tariq1890 (Contributor)

Hi, has this been backported to Istio 1.7?

@bianpengyuan (Contributor)

Yes, it will be in the next 1.7 release.

jenkins-stash-sync bot pushed a commit to Cray-HPE/cray-istio that referenced this issue Apr 30, 2021
The charts are updated for Istio 1.7.8. Here's a summary of the
changes.

* cray-istio-operator

I updated the charts/istio-operator subchart from the 1.7.8
distribution.

I got rid of the wait-jobs hook because this isn't necessary since Helm
takes care of making sure the CRD is ready.

On a related note, I found that when I upgraded from Istio 1.6.13, Helm deleted the IstioOperator CRD. To work around this, there's an upgrade hook that recreates the IstioOperator CRD if it doesn't exist.

* cray-istio-deploy

Just changed the default image tags and removed things from the README
that weren't accurate.

* cray-istio

I updated the charts/istio and charts/ingressgatewayhmn subcharts with
the latest version from the 1.7.8 distribution.

I removed the transfer encoding workaround ( CASMPET-3079 ). The
upstream bug is fixed: envoyproxy/envoy#10041

I removed the tcp-stats-filter-1.6 memory leak workaround ( CASMPET-4026 ).
The upstream bug is fixed: istio/istio#24720

The istio-ingressgateway now needs `runAsRoot: true`, see
https://istio.io/latest/news/releases/1.7.x/announcing-1.7/upgrade-notes/#gateways-run-as-non-root .
@howardjohn moved this to Done in Prioritization on Jul 18, 2024