Memory leak on Interceptor #1078

Leonardo-Ferreira · 2024-06-28T23:54:38Z

Report

I think we have a memory leak.
Here is the memory usage over 2 days:

The memory drops is the pod getting OOM killed. At first we thought 64mb is too little, so we decided to experiment higher values in the last 24hrs, but the symptom persisted:

Expected Behavior

the pods should release memory and as traffic fluctuates down and not get OOM killed

Actual Behavior

Memory usage never decreases

Steps to Reproduce the Problem

Install version 0.8
Put a lot of load on apis controlled by this guy

Logs from KEDA HTTP operator

example

HTTP Add-on Version

0.8.0

Kubernetes Version

< 1.28

Platform

Microsoft Azure

Anything else?

AKS v 1.27.9

The text was updated successfully, but these errors were encountered:

tomkerkhove · 2024-07-01T07:03:30Z

@JorTurFer / @wozniakjan do you have bandwidth to check this one?

wozniakjan · 2024-07-01T07:21:27Z

thank you for reporting this, from the graphs this indeed looks like a memory leak, I can investigate later today.

wozniakjan · 2024-07-01T10:44:12Z

after a quick investigation, this might actually be related to HTTP connections overhead, GC and request caching rather than an actual memory leak. I am able to reproduce quick memory growth with concurrent requests. When idle on a tiny cluster with a single HTTPScaledObject, the interceptor uses around 21Mi

$ kubectl top pods -nkeda --selector app.kubernetes.io/component=interceptor
NAME                                            CPU(cores)   MEMORY(bytes)
keda-add-ons-http-interceptor-6f6d78bd9-rkn2r   4m           21Mi

with requests sent sequentially, the memory usage barely increases and then goes down to idle levels. With simple benchmark using wrk with 100 concurrent threads and 100 connections

$ wrk -t100 -c100 -d30s -H "Host: myhost2.com" http://172.19.0.2

the memory usage does increase, peaking at 54Mi:

$ kubectl top pods -nkeda --selector app.kubernetes.io/component=interceptor
NAME                                            CPU(cores)   MEMORY(bytes)
keda-add-ons-http-interceptor-6f6d78bd9-rkn2r   494m         54Mi

but after wrk benchmark terminates, the CPU usage drops down quickly and while the memory usage decreases rather slowly, both eventually reach similar levels to the initial idle. With 100 threads and 500 connections, interceptor OOMKills pretty fast with a default limit of 64Mi, after bumping the memory limit to 1Gi, the memory again seems well flexible - growing as well as shrinking with benchmark executions.

Attached is also trace output from net/http/pprof during the execution of the wrk benchmark. The memory growth is apparent but so is also the memory decrease. There doesn't appear to be any place in code where the memory just grows. I will perform some longer running benchmarks with profiling traces to try to spot something more sneaky.

trace-wrk.txt

wozniakjan · 2024-07-01T12:55:08Z

I may have noticed something, with higher error rate in the benchmark, the memory consumption doesn't go down. I just bumped the memory limit for interceptor to 1Gi and executed benchmark that may have been too much, 5000 connections in 200 threads for 10 minutes

$ wrk -t200 -c5000 -d10m -H "Host: myhost2.com" http://172.19.0.2
Running 10m test @ http://172.19.0.2
  200 threads and 5000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   688.19ms  506.07ms   2.00s    69.41%
    Req/Sec     6.19      5.99   101.00     83.53%
  351661 requests in 10.00m, 277.64MB read
  Socket errors: connect 0, read 0, write 0, timeout 323052
  Non-2xx or 3xx responses: 16769
Requests/sec:    586.00
Transfer/sec:    473.75KB

out of ~350000 requests, 91% timeouts and 4.7% errors. The memory usage is still at 531Mi which is close to its ~600Mi peak during the benchmark even though it's been idle for 10 minutes, and it doesn't seem to go down.

Leonardo-Ferreira · 2024-07-01T13:27:19Z

@wozniakjan my current number is around 15k requests every minute. I currently have 3 interceptor pods, at 640 MB each... in about one hour and half that number is expected to peak at 25k... during the late night, between 2am-5am, the number drops to 5k...

wozniakjan · 2024-07-01T14:13:51Z

@Leonardo-Ferreira I played around with benchmarks a bit more and afaict interceptor seems to be very CPU intensive, for wrk simulated load that would match the request rate of 25k/min I am able to saturate 12 CPU cores easily. With this amount of CPU, it's able to maintain a relatively low error rate 0.42%.

With a low error rate, the memory usage appears to go down to idle levels eventually, but when the error rate is high, even when later the interceptor is off traffic, memory usage doesn't seem to go down.

Go's pprof heap analysis doesn't point to any code that would accumulate memory and not release it, but also heap as counted by pprof appears to be just a fraction of memory used compared to cgroup accounting. The pprof recorded 44MB while cgroup show 318MB used.

Leonardo-Ferreira · 2024-07-01T14:16:34Z

@wozniakjan after looking deeply at my data, I could not correlate the memory usage/increase-rate to the number of errors. Luckily (or not) we had an issue this morning where a dependency was flaky so there was a significant spike in errors related to timeouts. on the memory consumption graph there was no significant change during the 1 hour long event...

Leonardo-Ferreira · 2024-07-01T14:19:00Z

simulated load that would match the request rate of 25k/min I am able to saturate 12 CPU cores easily

well, that's weird... our pods have a limit of 0.7

this is the last 24hrs looking at the 95th percentile of cpu usage

wozniakjan · 2024-07-01T14:26:15Z

well, that's weird... our pods have a limit of 0.7

good to hear that, perhaps my artificial benchmark setup just ends up being CPU-intensive.

my current hypothesis I am trying to either prove or disprove is that maybe when the interceptor itself fails to process the requests and it has a higher error rate, maybe then it leaves some resources dangling, e.g. open connections or entries in routing tables caches..

Leonardo-Ferreira · 2024-07-01T14:28:40Z

would that output a specific log message that I could query for?

wozniakjan · 2024-07-01T14:46:44Z

I can see plenty of these access logs in the interceptor

2024-07-01T14:26:38Z    INFO    LoggingMiddleware   10.42.0.8:35634 - - [01/Jul/2024:14:26:36 +0000] "GET / HTTP/1.1" 502 11 "" ""

despite the fact the scaled app returns always OK 200.

Also, I can observe these errors in the interceptor logs

2024-07-01T14:27:13Z    ERROR   LoggingMiddleware.RoutingMiddleware.CountingMiddleware.UpstreamHandler.StaticHandler    Bad Gateway {"routingKey": "//myhost2.com/", "namespacedName": {"name":"demo","namespace":"keda"}, "stream": "http://demo.keda:8080/", "error": "context canceled"}
github.com/kedacore/http-add-on/interceptor/handler.(*Static).ServeHTTP
    github.com/kedacore/http-add-on/interceptor/handler/static.go:36
github.com/kedacore/http-add-on/interceptor/handler.(*Upstream).ServeHTTP.func2
    github.com/kedacore/http-add-on/interceptor/handler/upstream.go:54
net/http/httputil.(*ReverseProxy).ServeHTTP
    net/http/httputil/reverseproxy.go:472
github.com/kedacore/http-add-on/interceptor/handler.(*Upstream).ServeHTTP
    github.com/kedacore/http-add-on/interceptor/handler/upstream.go:57
main.runProxyServer.newForwardingHandler.func2
    github.com/kedacore/http-add-on/interceptor/proxy_handlers.go:87
net/http.HandlerFunc.ServeHTTP
    net/http/server.go:2166
github.com/kedacore/http-add-on/interceptor/middleware.(*Counting).ServeHTTP
    github.com/kedacore/http-add-on/interceptor/middleware/counting.go:39
github.com/kedacore/http-add-on/interceptor/middleware.(*Routing).ServeHTTP
    github.com/kedacore/http-add-on/interceptor/middleware/routing.go:93
github.com/kedacore/http-add-on/interceptor/middleware.(*Logging).ServeHTTP
    github.com/kedacore/http-add-on/interceptor/middleware/logging.go:42
github.com/kedacore/http-add-on/interceptor/middleware.(*Metrics).ServeHTTP
    github.com/kedacore/http-add-on/interceptor/middleware/metrics.go:24
net/http.serverHandler.ServeHTTP
    net/http/server.go:3137
net/http.(*conn).serve
    net/http/server.go:2039

Leonardo-Ferreira · 2024-07-01T17:14:50Z

should we log some extra info and release a 0.8.1 version? unfortunately my security dpto does not allow me to use "unofficial" images in production

wozniakjan · 2024-07-02T18:10:26Z

I'm still in the process of trying to figure out the root cause but as soon as I have some valid hypothesis to test, I can distribute a build for your testing

Leonardo-Ferreira · 2024-07-11T23:40:56Z

any news @wozniakjan, can I help somehow?

wozniakjan · 2024-07-12T06:19:59Z

I didn't get very far and now I had to put it on hold temporarily, but I will get back to this soon

Leonardo-Ferreira · 2024-07-25T18:38:37Z

hey @wozniakjan, I'd like to contribute here. would you be willing to connect for like 30-45min so you can "boost" me? this way I could make significant contributions faster

wozniakjan · 2024-07-29T09:03:50Z

sure, you can ping me on KEDA slack https://keda.sh/community/
https://kubernetes.slack.com/team/UF5R8HHB7

stale · 2024-09-28T05:23:57Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale · 2024-10-05T07:03:10Z

This issue has been automatically closed due to inactivity.

stale · 2024-10-28T09:06:21Z

This issue has been automatically closed due to inactivity.

Leonardo-Ferreira · 2024-10-29T13:51:35Z

@zroubalik reopen?

Leonardo-Ferreira added the bug Something isn't working label Jun 28, 2024

keda-automation added this to Roadmap - KEDA HTTP Add-On Jun 28, 2024

github-project-automation bot moved this to To Triage in Roadmap - KEDA HTTP Add-On Jun 28, 2024

Leonardo-Ferreira changed the title ~~Memory leak~~ Memory leak on Interceptor Jun 29, 2024

stale bot added the stale All issues that are marked as stale due to inactivity label Sep 28, 2024

stale bot closed this as completed Oct 5, 2024

github-project-automation bot moved this from To Triage to Done in Roadmap - KEDA HTTP Add-On Oct 5, 2024

zroubalik reopened this Oct 18, 2024

stale bot closed this as completed Oct 28, 2024

wozniakjan reopened this Oct 29, 2024

stale bot removed the stale All issues that are marked as stale due to inactivity label Oct 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memory leak on Interceptor #1078

Memory leak on Interceptor #1078

Leonardo-Ferreira commented Jun 28, 2024

tomkerkhove commented Jul 1, 2024

wozniakjan commented Jul 1, 2024 •

edited

Loading

wozniakjan commented Jul 1, 2024

wozniakjan commented Jul 1, 2024 •

edited

Loading

Leonardo-Ferreira commented Jul 1, 2024

wozniakjan commented Jul 1, 2024

Leonardo-Ferreira commented Jul 1, 2024

Leonardo-Ferreira commented Jul 1, 2024 •

edited

Loading

wozniakjan commented Jul 1, 2024 •

edited

Loading

Leonardo-Ferreira commented Jul 1, 2024

wozniakjan commented Jul 1, 2024 •

edited

Loading

Leonardo-Ferreira commented Jul 1, 2024

wozniakjan commented Jul 2, 2024

Leonardo-Ferreira commented Jul 11, 2024

wozniakjan commented Jul 12, 2024

Leonardo-Ferreira commented Jul 25, 2024

wozniakjan commented Jul 29, 2024

stale bot commented Sep 28, 2024

stale bot commented Oct 5, 2024

stale bot commented Oct 28, 2024

Leonardo-Ferreira commented Oct 29, 2024

Memory leak on Interceptor #1078

Memory leak on Interceptor #1078

Comments

Leonardo-Ferreira commented Jun 28, 2024

Report

Expected Behavior

Actual Behavior

Steps to Reproduce the Problem

Logs from KEDA HTTP operator

HTTP Add-on Version

Kubernetes Version

Platform

Anything else?

tomkerkhove commented Jul 1, 2024

wozniakjan commented Jul 1, 2024 • edited Loading

wozniakjan commented Jul 1, 2024

wozniakjan commented Jul 1, 2024 • edited Loading

Leonardo-Ferreira commented Jul 1, 2024

wozniakjan commented Jul 1, 2024

Leonardo-Ferreira commented Jul 1, 2024

Leonardo-Ferreira commented Jul 1, 2024 • edited Loading

wozniakjan commented Jul 1, 2024 • edited Loading

Leonardo-Ferreira commented Jul 1, 2024

wozniakjan commented Jul 1, 2024 • edited Loading

Leonardo-Ferreira commented Jul 1, 2024

wozniakjan commented Jul 2, 2024

Leonardo-Ferreira commented Jul 11, 2024

wozniakjan commented Jul 12, 2024

Leonardo-Ferreira commented Jul 25, 2024

wozniakjan commented Jul 29, 2024

stale bot commented Sep 28, 2024

stale bot commented Oct 5, 2024

stale bot commented Oct 28, 2024

Leonardo-Ferreira commented Oct 29, 2024

wozniakjan commented Jul 1, 2024 •

edited

Loading

wozniakjan commented Jul 1, 2024 •

edited

Loading

Leonardo-Ferreira commented Jul 1, 2024 •

edited

Loading

wozniakjan commented Jul 1, 2024 •

edited

Loading

wozniakjan commented Jul 1, 2024 •

edited

Loading