Potential memory leak from WASM NullPlugin onNetworkNewConnection #24720
Comments
Can you paste the stats that are collected by istio-proxy and related to the problematic service (query the Prometheus server for istio_requests_total & istio_tcp_{received,sent}_bytes_total, or exec into the proxy container and curl localhost:15000/stats/prometheus)? Also, when you say this is an HTTP app, do you mean plain HTTP or HTTPS? I am wondering how onNetworkNewConnection gets triggered for HTTP traffic. @gargnupur Is this expected?
Looks like this is HTTPS, because the stack trace shows a TLS client connection being created... From the stack trace, it looks like this is what is using the space: https://sourcegraph.com/github.com/envoyproxy/envoy@master/-/blob/source/extensions/transport_sockets/tls/context_manager_impl.cc#L35:13
Could it be related to envoyproxy/envoy#10030?
Hi @bianpengyuan, per your request, here is a link to the prometheus metrics on istio-proxy: https://gist.github.com/rmak-cpi/835570832707ca8a58e67c3947938a00. HTTPS is terminated at our cloud provider's load-balancer, but the application does serve mTLS traffic for in-cluster clients, which might be useful information. (gist link was bad, updated)
Hi @bianpengyuan, I took some time reading through the new WASM plugin code (it looks like I have both the metadata_exchange and stats plugins enabled). If I am reading the code correctly, the Context definition should be coming from the WASM plugin SDK (https://github.com/proxy-wasm/proxy-wasm-cpp-sdk/blob/master/proxy_wasm_api.h#L429). Given that the Context class has a virtual destructor defined, shouldn't the PluginContext class in the stats (and the metadata_exchange) plugin have a virtual destructor defined as well? Otherwise destructors for the members will not be called (in particular the request_info_ shared_ptr). This could explain the slow leak that I am observing, but I may be reading too much into this, so it would be great if you could confirm one way or another. Thanks!
Please ignore my previous comment about declaring a virtual destructor in the PluginContext class; it looks like the compiler should take care of generating the correct destructor for PluginContext in the vtable. That said, it means I don't really have anything to go on to help track down this problem :P
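A minimal standalone sketch of the point in the previous two comments (the class names here just mirror the discussion; this is not the SDK or plugin code): because the SDK's Context base class declares a virtual destructor, deleting a PluginContext through a base pointer still runs the implicitly generated derived destructor, which destroys members such as a request_info_ shared_ptr.

#include <iostream>
#include <memory>

// Stand-in for a member whose destruction we want to observe.
struct RequestInfo {
  ~RequestInfo() { std::cout << "request_info_ released\n"; }
};

// Stand-in for the SDK's Context base class (simplified on purpose):
// the virtual destructor is the only detail that matters here.
struct Context {
  virtual ~Context() = default;
};

// Stand-in for PluginContext: no user-declared destructor at all.
struct PluginContext : Context {
  std::shared_ptr<RequestInfo> request_info_ = std::make_shared<RequestInfo>();
};

int main() {
  std::unique_ptr<Context> ctx(new PluginContext());
  // Prints "request_info_ released": deleting through the base pointer
  // invokes the implicitly generated ~PluginContext() because ~Context()
  // is virtual, and that in turn destroys the shared_ptr member.
  ctx.reset();
  return 0;
}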
I have narrowed the issue down to the stats plugin (https://github.com/istio/proxy/blob/master/extensions/stats/plugin.h) by disabling metrics and observing that memory consumption by the istio proxy remained constant for long periods of time. Disabling metrics is clearly less than ideal, and from looking at the code, the leak can definitely occur if the onLog method is not called reliably on each request. That said, I am having a hard time tracing through the (envoy/istio-proxy/wasm-proxy) code to understand how onLog is called in the request lifecycle, so any pointers to the relevant code snippets would be very much appreciated.
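To make the suspected failure mode concrete, here is a hypothetical sketch (not the actual stats plugin code; the request_info_ map and the callback wiring are assumptions for illustration) of how per-connection state that is only erased in onLog can accumulate when onLog never fires for some connections:

#include <cstdint>
#include <memory>
#include <unordered_map>

// Hypothetical per-connection record; the real plugin keeps a richer
// RequestInfo carrying the labels used to increment metrics.
struct RequestInfo {
  uint64_t received_bytes = 0;
  uint64_t sent_bytes = 0;
};

// Hypothetical cache keyed by stream/connection context id.
class StatsLikePlugin {
 public:
  void onNewConnection(uint32_t context_id) {
    // An entry is created for every new TCP connection.
    request_info_[context_id] = std::make_shared<RequestInfo>();
  }

  void onLog(uint32_t context_id) {
    // Cleanup happens only here; if onLog is never invoked for a
    // connection, the entry created above is retained indefinitely.
    request_info_.erase(context_id);
  }

 private:
  std::unordered_map<uint32_t, std::shared_ptr<RequestInfo>> request_info_;
};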
Are there any labels in the metric that have unbounded values? Can you dump the stats page?
Hi @bianpengyuan, I have posted a new gist of the prometheus metrics dump at https://gist.github.com/rmak-cpi/fbf7c59eb16a10cc4a0e0dba1c241597. One thing that jumps out at me is that the total number of TCP connections opened is much higher than the total number of TCP connections closed:

# TYPE istio_tcp_connections_opened_total counter
istio_tcp_connections_opened_total{reporter="source",source_workload="xxxxxxxxxxxxxxx",source_workload_namespace="default",source_principal="unknown",source_app="unknown",source_version="unknown",destination_workload="unknown",destination_workload_namespace="unknown",destination_principal="unknown",destination_app="unknown",destination_version="unknown",destination_service="unknown",destination_service_name="PassthroughCluster",destination_service_namespace="unknown",request_protocol="tcp",response_flags="-",connection_security_policy="unknown",source_canonical_service="xxxxxxxxxxxxxxx",destination_canonical_service="unknown",source_canonical_revision="latest",destination_canonical_revision="latest"} 55938

# TYPE istio_tcp_connections_closed_total counter
istio_tcp_connections_closed_total{reporter="source",source_workload="xxxxxxxxxxxxxxx",source_workload_namespace="default",source_principal="unknown",source_app="unknown",source_version="unknown",destination_workload="unknown",destination_workload_namespace="unknown",destination_principal="unknown",destination_app="unknown",destination_version="unknown",destination_service="unknown",destination_service_name="PassthroughCluster",destination_service_namespace="unknown",request_protocol="tcp",response_flags="-",connection_security_policy="unknown",source_canonical_service="xxxxxxxxxxxxxxx",destination_canonical_service="unknown",source_canonical_revision="latest",destination_canonical_revision="latest"} 861
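As a rough consistency check (assuming the leaked per-connection state is what dominates the heap, and noting that this metrics dump and the heap profile in the issue description below were not taken at the same moment): 55938 − 861 ≈ 55,000 connections were opened but never recorded as closed, while the profile attributes roughly 32 MB of in-use heap to the onNetworkNewConnection path. That works out to on the order of 600 bytes of retained state per unclosed connection, a plausible size for a per-connection record plus map overhead.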
If TCP connections don't close, then RequestInfo will hang around in the plugin:
The service is taking HTTP traffic directly from a cloud-provider load-balancer (AWS ALB); I suspect this may have something to do with HTTP keepalive handling between Envoy and the load-balancer.
I am also not seeing any signs of TCP sockets leaking at the operating system level through node-exporter metrics.
Is it the case that the sidecar with increased memory only received HTTP requests? If so, there should be no TCP metrics generated. If possible, and to help us narrow down the issue, can you try to delete the 1.6 tcp stats filter?
Hi @bianpengyuan, let me try that later on, as I am currently running an experiment disabling HTTP keepalive; I will report back once I've obtained data from disabling the tcp-stats-filter.
Hi @bianpengyuan, just want to report back that Envoy proxy memory usage after disabling the tcp stats filter remained stable, but my own experiment disabling HTTP keepalive hasn't gone so well, since I have yet to find the magic config to actually disable HTTP keepalive.
Since onLog is only called when the connection is closed on both upstream and downstream, the following code snippet seems a little risky (https://github.com/istio/envoy/blob/master/source/extensions/common/wasm/context.cc#L1571):

if (end_stream) {

It sounds like an upstream connection close event can only be detected if there is explicit data flow terminating the stream, but I could be reading too much into this.
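Paraphrasing that concern as a self-contained sketch (purely illustrative; the callback names and flag-based bookkeeping are assumptions, not the actual Envoy wasm Context code): if the upstream side's close is only observed when a data callback arrives with end_stream set, a connection torn down without such a final data event never reaches the "both sides closed" state that would trigger onLog and the plugin's cleanup.

// Illustrative only: models a context that calls onLog() once both the
// downstream and upstream halves of a proxied TCP session are closed.
class TcpContextSketch {
 public:
  // Data callback: the concern is that upstream close is only observed
  // when a read happens to carry end_stream == true.
  void onUpstreamData(bool end_stream) {
    if (end_stream) {
      upstream_closed_ = true;
      maybeLog();
    }
  }

  void onDownstreamConnectionClose() {
    downstream_closed_ = true;
    maybeLog();
  }

 private:
  void maybeLog() {
    // onLog() (and therefore cleanup in the stats plugin) only fires if
    // both flags are eventually set; a missed upstream close leaks state.
    if (upstream_closed_ && downstream_closed_ && !logged_) {
      logged_ = true;
      onLog();
    }
  }

  void onLog() { /* plugin cleanup would happen here */ }

  bool upstream_closed_ = false;
  bool downstream_closed_ = false;
  bool logged_ = false;
};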
/cc @jplevyak
Thanks @gargnupur, I was hoping to disable HTTP keepalive through Envoy config only, without any (client/server) application code changes. I will play around with EnvoyFilter some more to see if I can disable HTTP keepalive there.
I managed to disable HTTP keepalive on the server side, but the leak is still occurring, with the tcp connections opened metric growing much bigger than the tcp connections closed metric over time. At this point I don't think it's productive for me to flail around guessing what the underlying problem is, although I will be happy to run an instrumented build (perhaps with additional metrics for upstreamClosed vs downstreamClosed events detected by the wasm context?). Thanks to @bianpengyuan's suggestion, I do have a decent workaround of disabling the tcp stats plugin, but I do hope to see this issue resolved, as this seems to be a fundamental mismatch in how connection lifecycle events are handled in the wasm plugin.
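For what it's worth, a very rough sketch of what such instrumentation could look like (the counter names and wiring here are hypothetical, not existing Envoy stats): bump one counter wherever the context records a downstream close and one wherever it records an upstream close, then compare the two against the opened/closed TCP metrics to see whether upstream closes are systematically missed.

#include <atomic>
#include <cstdio>

// Hypothetical process-wide counters; a real instrumented build would
// register proper Envoy stats instead of bare atomics.
std::atomic<uint64_t> downstream_close_seen{0};
std::atomic<uint64_t> upstream_close_seen{0};

void recordDownstreamClose() { ++downstream_close_seen; }
void recordUpstreamClose() { ++upstream_close_seen; }

// Periodically logging the two values would show whether upstream close
// events lag systematically behind downstream ones.
void dumpCloseCounters() {
  std::printf("downstream_close=%llu upstream_close=%llu\n",
              static_cast<unsigned long long>(downstream_close_seen.load()),
              static_cast<unsigned long long>(upstream_close_seen.load()));
}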
@rmak-cpi: Can you give us some pointers on your setup, so that we can try to repro this on our end?
There really isn't anything particularly special about my setup; it's basically an HTTP service exposed through the following chain of LBs/proxies: public AWS ALB -> Istio Envoy sidecar -> my Java application (my service is configured as a NodePort service so the ALB can use all cluster workers as target instances). Most gcloud/GKE setups are probably dominated by gRPC, which I think mostly uses persistent connections intended to last forever. The leakage that I observe is actually happening pretty slowly, given that HTTP keepalive is in place to keep TCP connections open for relatively long periods, but it's definitely noticeable after a week or so.
@rmak-cpi Sorry, I might be missing some context here: did you remove the tcp filter, and did Envoy memory usage stop increasing because of that? My interpretation of your comment #24720 (comment) was that you had not performed the test yet, but it seems like you did? If so, I think we can try to expose some Envoy stats to determine whether this is a memory leak or a connection leak.
Hi @bianpengyuan, to clarify my earlier comment, I did verify that the leak stopped once I removed the (and only the) tcp stats filters. I wanted to experiment with keepalive settings in the hope that I might be able to further isolate the problem.
We are hitting this with a https://github.com/DandyDeveloper/charts/tree/master/charts/redis-ha deployment with HAProxy. The Envoy sidecar process uses an increasing amount of memory over time, and the difference between opened and closed connections grows too. This is after two hours:
Memory consumption is up to about 100 MB RES, while in other deployments it is around 50-60 MB.
We are experiencing the same on Istio. At about 15:50 we removed the EnvoyFilter.
Fix is going to be shipped with 1.8.1. |
@bianpengyuan what about Istio 1.6?
@Kyslik 1.6 has just shipped its last release. @kyessenov do you think it is possible to backport to 1.7?
Yeah, it's a simple fix. @PiotrSikora to confirm.
@bianpengyuan You mean 1.6.14? https://istio.io/latest/news/releases/1.6.x/announcing-1.6.14/ does not mention this very GitHub issue; please correct me if I am wrong.
@Kyslik Sorry, I meant we did not catch the last 1.6 release. At best we can backport this to Istio 1.7.
@PiotrSikora are we OK backporting the fix to 1.7 istio proxy?
Is this fix shipped with 1.8.1?
FWIW, I deployed 1.8.1 to my environments and I no longer observe the original memory leak (with the TCP metrics envoy filter enabled). But I guess @kyessenov, @PiotrSikora, and/or @bianpengyuan can give a more official word on this.
Yes, this has been fixed in 1.8.1+. It will also be fixed in 1.7.next.
Hi, has this been backported to Istio 1.7?
Yes, it will be in the next 1.7 release.
The charts are updated for Istio 1.7.8. Here's a summary of the changes.

* cray-istio-operator
I updated the charts/istio-operator subchart from the 1.7.8 distribution. I got rid of the wait-jobs hook because it isn't necessary, since Helm takes care of making sure the CRD is ready. On a related note, I found that when I upgraded from Istio 1.6.13, Helm deleted the IstioOperator CRD. To work around this, there's an upgrade hook that recreates the IstioOperator CRD if it doesn't exist.

* cray-istio-deploy
Just changed the default image tags and removed things from the README that weren't accurate.

* cray-istio
I updated the charts/istio and charts/ingressgatewayhmn subcharts with the latest versions from the 1.7.8 distribution. I removed the transfer encoding workaround (CASMPET-3079); the upstream bug is fixed: envoyproxy/envoy#10041. I removed the tcp-stats-filter-1.6 memory leak workaround (CASMPET-4026); the upstream bug is fixed: istio/istio#24720. The istio-ingressgateway now needs `runAsRoot: true`, see https://istio.io/latest/news/releases/1.7.x/announcing-1.7/upgrade-notes/#gateways-run-as-non-root .
Bug description
Istio-proxy is consuming an increasing amount of memory for an application with moderate HTTP traffic (10-20 qps).
[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[ ] Networking
[x] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
Expected behavior
Istio-proxy memory should stay relatively constant over time.
Steps to reproduce the bug
Here is a heap profile flame graph after the service has been up for 12+ hours:
The flame graph seems to indicate that the heap has been accumulating blocks allocated from operator() calls through the NullPlugin, although I am at a bit of a loss trying to figure out a workaround from reading the relevant source code.
In case it's helpful, here is the text version of the heap profile:
File: envoy
Type: inuse_space
Showing nodes accounting for 33124.62kB, 98.34% of 33682.43kB total
Dropped 482 nodes (cum <= 168.41kB)
flat flat% sum% cum cum%
27546.82kB 81.78% 81.78% 33411.07kB 99.19% std::__1::__function::__func::operator()
1816.88kB 5.39% 87.18% 2421.37kB 7.19% std::__1::__hash_table::__emplace_unique_key_args
1449.59kB 4.30% 91.48% 2052.53kB 6.09% std::__1::__hash_table::__emplace_unique_impl
1206.61kB 3.58% 95.06% 1206.61kB 3.58% std::__1::__hash_table::__rehash
819.49kB 2.43% 97.50% 819.49kB 2.43% OPENSSL_malloc
246.34kB 0.73% 98.23% 246.34kB 0.73% std::__1::basic_string::basic_string
28.98kB 0.086% 98.31% 1025.07kB 3.04% Envoy::Extensions::TransportSockets::Tls::ContextManagerImpl::createSslClientContext
7.83kB 0.023% 98.34% 32049.05kB 95.15% Envoy::Server::ConnectionHandlerImpl::ActiveTcpListener::newConnection
2.06kB 0.0061% 98.34% 32053.88kB 95.16% Envoy::Network::ListenerImpl::listenCallback
0 0% 98.34% 613.34kB 1.82% ASN1_item_d2i
0 0% 98.34% 1439.37kB 4.27% Envoy::Config::GrpcMuxImpl::onDiscoveryResponse
0 0% 98.34% 1359.13kB 4.04% Envoy::Config::GrpcSubscriptionImpl::onConfigUpdate
0 0% 98.34% 1144.74kB 3.40% Envoy::Event::FileEventImpl::assignEvents(unsigned int, event_base*)::$_0::__invoke
0 0% 98.34% 29594.81kB 87.86% Envoy::Extensions::Common::Wasm::Context::onCreate
0 0% 98.34% 32009.40kB 95.03% Envoy::Extensions::Common::Wasm::Context::onNetworkNewConnection
0 0% 98.34% 29594.81kB 87.86% Envoy::Extensions::Common::Wasm::Null::NullPlugin::ensureContext
0 0% 98.34% 2414.76kB 7.17% Envoy::Extensions::Common::Wasm::Null::Plugin::Stats::PluginContext::onNewConnection
0 0% 98.34% 992.35kB 2.95% Envoy::Extensions::TransportSockets::Tls::ClientContextImpl::ClientContextImpl
0 0% 98.34% 1025.07kB 3.04% Envoy::Extensions::TransportSockets::Tls::ClientSslSocketFactory::onAddOrUpdateSecret
0 0% 98.34% 1081.84kB 3.21% Envoy::Extensions::TransportSockets::Tls::ContextImpl::ContextImpl
0 0% 98.34% 1433.78kB 4.26% Envoy::Grpc::AsyncStreamCallbacks::onReceiveMessageRaw
0 0% 98.34% 1430.18kB 4.25% Envoy::Grpc::AsyncStreamImpl::onData
0 0% 98.34% 1427.35kB 4.24% Envoy::Http::AsyncStreamImpl::encodeData
0 0% 98.34% 1223kB 3.63% Envoy::Http::CodecClient::CodecReadFilter::onData
0 0% 98.34% 1277.10kB 3.79% Envoy::Http::CodecClient::onData
0 0% 98.34% 1293.62kB 3.84% Envoy::Http::Http2::ConnectionImpl::dispatch
0 0% 98.34% 1402.62kB 4.16% Envoy::Http::Http2::ConnectionImpl::onFrameReceived
0 0% 98.34% 1409.21kB 4.18% Envoy::Http::ResponseDecoderWrapper::decodeData
0 0% 98.34% 753.30kB 2.24% Envoy::MainCommonBase::run
0 0% 98.34% 1178.06kB 3.50% Envoy::Network::ConnectionImpl::onFileEvent
0 0% 98.34% 1219.07kB 3.62% Envoy::Network::ConnectionImpl::onReadReady
0 0% 98.34% 32014.61kB 95.05% Envoy::Network::FilterManagerImpl::initializeReadFilters
0 0% 98.34% 33233.24kB 98.67% Envoy::Network::FilterManagerImpl::onContinueReading
0 0% 98.34% 1420.46kB 4.22% Envoy::Router::UpstreamRequest::decodeData
0 0% 98.34% 1359.39kB 4.04% Envoy::Secret::SdsApi::onConfigUpdate
0 0% 98.34% 32050.53kB 95.16% Envoy::Server::ConnectionHandlerImpl::ActiveTcpListener::onAcceptWorker
0 0% 98.34% 32050.53kB 95.16% Envoy::Server::ConnectionHandlerImpl::ActiveTcpSocket::continueFilterChain
0 0% 98.34% 886.31kB 2.63% Envoy::Server::InstanceImpl::run
0 0% 98.34% 32102.04kB 95.31% Envoy::Server::WorkerImpl::threadRoutine
0 0% 98.34% 220.22kB 0.65% Envoy::Ssl::TlsCertificateConfigImpl::TlsCertificateConfigImpl
0 0% 98.34% 32101.35kB 95.31% Envoy::Thread::ThreadImplPosix::ThreadImplPosix(std::__1::function)::$_0::__invoke
0 0% 98.34% 403.05kB 1.20% PEM_ASN1_read_bio
0 0% 98.34% 195.30kB 0.58% PEM_X509_INFO_read_bio
0 0% 98.34% 571.81kB 1.70% [libc-2.27.so]
0 0% 98.34% 223.61kB 0.66% asn1_d2i_ex_primitive
0 0% 98.34% 223.61kB 0.66% asn1_ex_c2i
0 0% 98.34% 175.63kB 0.52% asn1_item_ex_combine_new
0 0% 98.34% 613.34kB 1.82% asn1_item_ex_d2i
0 0% 98.34% 454.71kB 1.35% asn1_template_ex_d2i
0 0% 98.34% 454.71kB 1.35% asn1_template_noexp_d2i
0 0% 98.34% 207.76kB 0.62% d2i_X509_AUX
0 0% 98.34% 33099.96kB 98.27% event_base_loop
0 0% 98.34% 33183.10kB 98.52% event_process_active_single_queue
0 0% 98.34% 32053.88kB 95.16% listener_read_cb
0 0% 98.34% 675.96kB 2.01% main
0 0% 98.34% 1326.59kB 3.94% nghttp2_session_mem_recv
0 0% 98.34% 1401.21kB 4.16% nghttp2_session_on_data_received
0 0% 98.34% 195.30kB 0.58% parse_x509
0 0% 98.34% 32100.56kB 95.30% start_thread
0 0% 98.34% 602.94kB 1.79% std::__1::__hash_table::__node_insert_unique_prepare
Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)

$ istioctl version --remote
client version: 1.6.2
control plane version: 1.6.2
data plane version: 1.6.2 (20 proxies)
While I was only able to dig into this after the upgrade to 1.6.2, I have observed the increased memory consumption since 1.5.2.
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:52:00Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.8-eks-e16311", GitCommit:"e163110a04dcb2f39c3325af96d019b4925419eb", GitTreeState:"clean", BuildDate:"2020-03-27T22:37:12Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
How was Istio installed?
istioctl manifest apply -f / istioctl upgrade -f
Environment where bug was observed (cloud vendor, OS, etc)
AWS EKS 1.6 control plane and worker nodes