keda-operator memory leak when prometheus scaler having errors #5248

Closed
Tracked by #5275
GoaMind opened this issue Dec 4, 2023 · 29 comments · Fixed by #5293

@GoaMind

GoaMind commented Dec 4, 2023

Report

When the prometheus scaler hits errors while fetching metrics, memory on keda-operator starts to grow until it gets OOMKilled.

Behaviour as follows:
Grafana graph

Installation of Keda is done via plain manifest: https://github.com/kedacore/keda/releases/download/v2.11.2/keda-2.11.2.yaml

Expected Behavior

Memory does not grow when any of the scalers has errors

Actual Behavior

Memory grows when the prometheus scaler has errors (for example, when it fails to fetch metrics from prometheus)

Steps to Reproduce the Problem

  1. Deploy a service with a prometheus scaler whose serverAddress does not exist:
          - type: prometheus
            metadata:
              query: sum(rate(rabbitmq_client_messages_published_total{service_name=~'kafka-api-events-to-rabbitmq'}[2m]))
              threshold: '200'
              serverAddress: https://non-existing-prometheus-url # that returns 404
  2. keda-operator starts logging errors to stderr
  3. Memory usage starts to grow

Logs from KEDA operator

2023-12-04T14:38:18Z	ERROR	prometheus_scaler	error executing prometheus query	{"type": "ScaledObject", "namespace": "tooling", "name": "debug-service", "error": "prometheus query api returned error. status: 404 response: "}
github.com/kedacore/keda/v2/pkg/scalers.(*prometheusScaler).GetMetricsAndActivity
	/workspace/pkg/scalers/prometheus_scaler.go:359
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetMetricsAndActivityForScaler
	/workspace/pkg/scaling/cache/scalers_cache.go:139
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).GetScaledObjectMetrics
	/workspace/pkg/scaling/scale_handler.go:508
github.com/kedacore/keda/v2/pkg/metricsservice.(*GrpcServer).GetMetrics
	/workspace/pkg/metricsservice/server.go:47
github.com/kedacore/keda/v2/pkg/metricsservice/api._MetricsService_GetMetrics_Handler
	/workspace/pkg/metricsservice/api/metrics_grpc.pb.go:99
google.golang.org/grpc.(*Server).processUnaryRPC
	/workspace/vendor/google.golang.org/grpc/server.go:1343
google.golang.org/grpc.(*Server).handleStream
	/workspace/vendor/google.golang.org/grpc/server.go:1737
google.golang.org/grpc.(*Server).serveStreams.func1.1
	/workspace/vendor/google.golang.org/grpc/server.go:986

KEDA Version

2.12.1

Kubernetes Version

1.26

Platform

Amazon Web Services

Scaler Details

prometheus

Anything else?

No response

GoaMind added the bug label Dec 4, 2023
@JorTurFer
Member

JorTurFer commented Dec 4, 2023

Hello,
Are you using default values for memory & cpu?
I'm trying v2.12.1 and I can't reproduce the issue (maybe it's solved or default values need to be updated). How many ScaledObjects do you have in cluster? How many are failing?

I have deployed 10 ScaledObjects with pollingInterval: 1 and using prometheus scaler (all of them return 404). The memory looks stable:
image

I'm going to keep it running all night just in case; it could need some hours. Is there any other step that I can take to reproduce it?

I've been profiling the pod, looking for something weird, but I haven't seen anything

@JorTurFer
Member

After 8 hours, nothing has changed:
image

The memory is stable even though I added 10 more failing ScaledObjects.

Now I'm going to downgrade KEDA to v2.11.2 to check if I can reproduce the issue to double-check if it has been solved

@GoaMind
Author

GoaMind commented Dec 5, 2023

Good day Jorge,

This happens in all 10 clusters that we have, and the number of ScaledObjects varies from 12 to 110.
Memory starts to grow even if only 1 failing ScaledObject is added.
We also observed this behaviour with KEDA versions 2.9.2 and 2.11.2.
For keda-operator we have increased the requested memory from 100Mi to 200Mi; the other params remain the same as in the original manifest:

        resources:
          limits:
            cpu: 1000m
            memory: 1000Mi
          requests:
            cpu: 100m
            memory: 200Mi

As an experiment I took one cluster that already had 12 ScaledObjects and added one more (with a failing prometheus trigger):

NAME                       SCALETARGETKIND      SCALETARGETNAME            MIN   MAX   TRIGGERS   AUTHENTICATION   READY   ACTIVE   FALLBACK   PAUSED    AGE
onboarding-debug-service   apps/v1.Deployment   onboarding-debug-service   3     10    cpu                         True    False    False      Unknown   10m

Full manifest for this ScaledObject

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  annotations:
    meta.helm.sh/release-name: onboarding-debug-service
    meta.helm.sh/release-namespace: tooling
  creationTimestamp: "2023-12-05T09:29:56Z"
  finalizers:
  - finalizer.keda.sh
  generation: 2
  labels:
    app: onboarding-debug-service
    app.kubernetes.io/managed-by: Helm
    repository: onboarding-debug-service
    scaledobject.keda.sh/name: onboarding-debug-service
  name: onboarding-debug-service
  namespace: tooling
  resourceVersion: "400763468"
  uid: 0809533e-9fa5-40a4-b310-80fa69e9df4d
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 10
    scalingModifiers: {}
  cooldownPeriod: 300
  maxReplicaCount: 10
  minReplicaCount: 3
  pollingInterval: 30
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: onboarding-debug-service
  triggers:
  - metadata:
      type: Utilization
      value: "70"
    type: cpu
  - metadata:
      type: Utilization
      value: "10"
    type: memory
  - metadata:
      query: sum(rate(rabbitmq_client_messages_published_total{service_name=~'kafka-api-events-to-rabbitmq'}[2m]))
      serverAddress: https://prometheus-404-endpoint
      threshold: "200"
    type: prometheus
status:
  conditions:
  - message: ScaledObject is defined correctly and is ready for scaling
    reason: ScaledObjectReady
    status: "True"
    type: Ready
  - message: Scaling is not performed because triggers are not active
    reason: ScalerNotActive
    status: "False"
    type: Active
  - message: No fallbacks are active on this scaled object
    reason: NoFallbackFound
    status: "False"
    type: Fallback
  - status: Unknown
    type: Paused
  externalMetricNames:
  - s2-prometheus
  health:
    s2-prometheus:
      numberOfFailures: 48
      status: Failing
  hpaName: keda-hpa-onboarding-debug-service
  originalReplicaCount: 10
  resourceMetricNames:
  - cpu
  - memory
  scaleTargetGVKR:
    group: apps
    kind: Deployment
    resource: deployments
    version: v1
  scaleTargetKind: apps/v1.Deployment

And right after adding this one single scaler, memory utilisation started to grow steadily (from 35% to 43% in 15m)
2023-12-05 11-45-28 - KEDA Operator monitoring - K8s Resource Management - Dashboards - Grafana

@JorTurFer
Member

JorTurFer commented Dec 6, 2023

Thanks for the info,
I kept KEDA deployed with the same scenario and after 36h it looks exactly the same. It uses a stable ~180Mi (16 ScaledObjects, 8 of them with a 404 on the endpoint and the other 8 with an invalid URL):
image

> And right after adding this one single scaler, memory utilisation started to grow steadily (from 35% to 43% in 15m)

Is this over the requests or over the limits? The original 40% can be 40Mi or 400Mi and the current 7% can be 14Mi or 70Mi xD
I mean, a 14Mi memory increase could be just because it tries to reconnect, increasing the memory usage due to allocating resources for regenerating the internal cache.
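To make the ambiguity concrete, here is a small sketch of the arithmetic. It assumes the 200Mi request and 1000Mi limit from the manifest quoted earlier in the thread and uses the reported 35% to 43% figures; the function name is illustrative only.

```python
def percent_to_mib(percent: float, base_mib: float) -> float:
    """Convert a utilisation percentage into MiB for a given base (request or limit)."""
    return base_mib * percent / 100.0


REQUEST_MIB = 200   # resources.requests.memory quoted in the thread
LIMIT_MIB = 1000    # resources.limits.memory quoted in the thread

# Reported growth: from 35% to 43% in 15 minutes.
growth_percent = 43 - 35

# Growth if the dashboard percentage is relative to requests.
print(percent_to_mib(growth_percent, REQUEST_MIB))
# Growth if the dashboard percentage is relative to limits.
print(percent_to_mib(growth_percent, LIMIT_MIB))
```

The same percentage differs by a factor of five depending on the base, which is why the question about requests versus limits matters here.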

I'm going to test the same scenario with KEDA v2.9.2 because we introduced several performance improvements and maybe 2.9.2 was affected by something.

@JorTurFer
Member

After 2 hours with v2.9.2, it's almost the same:
image

@JorTurFer
Member

Latest KEDA version has the option for enabling the profile port.
You can do it by setting an extra arg --profiling-bind-address=:PORT
If you could enable the profiler, export the heap and send it to us, we can go deeper on your case. Don't enable the profiler on huge clusters because profiling can have a performance impact.
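As a rough sketch of the export step: assuming the operator was started with `--profiling-bind-address=:8082` (the port is an arbitrary choice), that the port has been forwarded to localhost, and that the profiler serves the standard Go pprof paths under `/debug/pprof/`, the heap profile could be saved like this. The helper names and the output file name are hypothetical.

```python
import urllib.request


def heap_profile_url(base_url: str) -> str:
    """Build the heap profile URL, assuming the standard Go pprof path layout."""
    return base_url.rstrip("/") + "/debug/pprof/heap"


def save_heap_profile(base_url: str, dest_path: str, timeout: float = 30.0) -> int:
    """Download the heap profile and write it to dest_path; returns the bytes written."""
    with urllib.request.urlopen(heap_profile_url(base_url), timeout=timeout) as resp:
        data = resp.read()
    with open(dest_path, "wb") as fh:
        fh.write(data)
    return len(data)


# Example call, assuming the profiling port is reachable on localhost:8082:
#   save_heap_profile("http://localhost:8082", "keda-operator-heap.pb.gz")
```

Taking one profile shortly after start and another after memory has grown makes it easier to compare the two.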

@JorTurFer
Member

Same behavior using KEDA v2.9.2, we'd need the memory dump to check the root cause

@zroubalik
Member

@GoaMind Thanks for reporting! Could you please also share an example Deployment for the workload that you are scaling?

@JorTurFer do you have the same configuration of ScaledObjects?

@JorTurFer
Member

IDK, I hope so,
This is an example:

spec:
  cooldownPeriod: 1
  maxReplicaCount: 2
  minReplicaCount: 0
  pollingInterval: 1
  scaleTargetRef:
    name: prometheus-test-deployment
  triggers:
    - metadata:
        activationThreshold: '20'
        metricName: http_requests_total
        query: >-
          sum(rate(http_requests_total{app="prometheus-test-monitored-app"}[2m]))
        serverAddress: http://20.238.174.237
        threshold: '20'
      type: prometheus

@GoaMind , Could you confirm that this is similar to yours? The IP is public (and mine) so you can try the ScaledObject in your cluster if you want

@zroubalik
Member

> @GoaMind , Could you confirm that this is similar to yours? The IP is public (and mine) so you can try the ScaledObject in your cluster if you want

That would be great!

@GoaMind
Author

GoaMind commented Dec 8, 2023

Hey,

Here is the behaviour, with the memory consumption points marked:
2023-12-08 14-29-07 - KEDA Operator monitoring (memory leak investigation) - K8s Resource Management - Dashboards - Grafana

To verify that the graph is correct, I also checked on the K8s side at some point:

NAME                                      CPU(cores)   MEMORY(bytes)
keda-admission-7744888c69-2sphk           1m           40Mi
keda-metrics-apiserver-599b5f957c-pfmxq   3m           31Mi
keda-operator-6d5686ff7c-94xnb            7m           967Mi

And after OOMKill:

NAME                                      CPU(cores)   MEMORY(bytes)
keda-admission-7744888c69-2sphk           2m           40Mi
keda-metrics-apiserver-599b5f957c-pfmxq   3m           31Mi
keda-operator-6d5686ff7c-94xnb            8m           71Mi

If describing the pod:

    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 05 Dec 2023 17:14:47 +0200
      Finished:     Fri, 08 Dec 2023 11:12:28 +0200
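
For a rough sense of scale, the timestamps in the `Last State` block can be turned into an average growth rate. This is only a sketch under two assumptions that the thread does not state: that the pod started near the post-restart figure (71Mi) and was killed near the 1000Mi limit. Under those assumptions the pod ran for about 66 hours and grew by roughly 14Mi per hour.

```python
from email.utils import parsedate_to_datetime

# Timestamps copied from the `Last State` block of the pod description.
started = parsedate_to_datetime("Tue, 05 Dec 2023 17:14:47 +0200")
finished = parsedate_to_datetime("Fri, 08 Dec 2023 11:12:28 +0200")

uptime_hours = (finished - started).total_seconds() / 3600

# Assumptions (not stated in the thread): usage started near 71Mi
# and the container was killed near its 1000Mi limit.
START_MIB = 71
LIMIT_MIB = 1000

growth_mib_per_hour = (LIMIT_MIB - START_MIB) / uptime_hours
print(f"uptime: {uptime_hours:.1f} h, average growth: {growth_mib_per_hour:.1f} Mi/h")
```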

I will deploy the scaler provided by @JorTurFer to monitor whether KEDA behaves the same,
and I will play with profiling at the beginning of next week.

@zroubalik
Member

@GoaMind great, thanks for the update!

@GoaMind
Author

GoaMind commented Dec 8, 2023

I have deployed the trigger proposed by @JorTurFer and there was no memory leak visible.

But after changing serverAddress from the IP (20.238.174.237) to a random DNS name that returns 404, the leak appeared:
2023-12-08 17-16-26 - KEDA Operator monitoring (memory leak investigation) - K8s Resource Management - Dashboards - Grafana

@zroubalik
Member

@GoaMind what happens if you put there random IP that returns 404s?

@JorTurFer
Member

Could you share an example of the random DNS name? I'd like to replicate your case as closely as possible

@JorTurFer
Member

I've added a ScaledObject like this:

    - metadata:
        activationThreshold: '20'
        metricName: http_requests_total
        query: >-
          sum(rate(http_requests_total{app="prometheus-test-monitored-app"}[2m]))
        serverAddress: https://google.es
        threshold: '20'
      type: prometheus

and the memory has increased a bit but it's still stable:
image

Could you share with us a ScaledObject that I can use to replicate the scenario please?

@JorTurFer JorTurFer mentioned this issue Dec 11, 2023
25 tasks
@GoaMind
Author

GoaMind commented Dec 12, 2023

I was a bit wrong in saying that it is reproducible with a random DNS name.

In fact, I was able to reproduce it only when calling our internal prometheus server, which is not available from outside.

I have tried to configure my personal server https://kedacore-test.hdo.ee/test to respond the same way. However, even though I was able to align all the headers and set wildcard SSL certs, I still cannot reproduce it on this server. I have compared ciphers and other configs, but still cannot figure out what the difference is.

Here is what I have checked so far (without luck).

The full response from the internal prometheus server that causes the memory leak in keda-operator:

*   Trying xxx.xxx.xxx.xxx:443...
* Connected to XXXX (xxx.xxx.xxx.xxx) port 443
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=*.XXXX.XXX
*  start date: Nov  6 06:57:35 2023 GMT
*  expire date: Feb  4 06:57:34 2024 GMT
*  subjectAltName: host "XXXX" matched cert's "*.XXXX.XXX"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://XXXX/test
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: XXXX]
* [HTTP/2] [1] [:path: /test]
* [HTTP/2] [1] [user-agent: curl/8.4.0]
* [HTTP/2] [1] [accept: */*]
> GET /test HTTP/2
> Host: XXXX
> User-Agent: curl/8.4.0
> Accept: */*
>
< HTTP/2 404
HTTP/2 404
< content-length: 0
content-length: 0
< date: Tue, 12 Dec 2023 19:05:26 GMT
date: Tue, 12 Dec 2023 19:05:26 GMT

<
* Connection #0 to host XXXX left intact

Full specs for SO that I used:

spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 10
    scalingModifiers: {}
  cooldownPeriod: 1
  maxReplicaCount: 2
  minReplicaCount: 0
  pollingInterval: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: onboarding-debug-service
  triggers:
  - metadata:
      activationThreshold: "20"
      query: sum(rate(http_requests_total{app="prometheus-test-monitored-app"}[2m]))
      serverAddress: https://XXXXXXX # With https://kedacore-test.hdo.ee/test it is not reproducible for now
      threshold: "20"
    type: prometheus

To replicate the internal server response with https://kedacore-test.hdo.ee/test I have configured Nginx (nginx-extras must be installed for this to work):
nginx.conf, in addition to the defaults:

http {
        server_tokens off;
        more_clear_headers Server;

        ssl_protocols TLSv1.2 TLSv1.3;
        ssl_prefer_server_ciphers off;
        ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES128-GCM-SHA256;
}

/etc/nginx/sites-enabled/test full conf:

server {
   listen 443 ssl http2;
   server_name kedacore-test.hdo.ee;
   ssl_certificate /etc/letsencrypt/live/kedacore-test.hdo.ee/fullchain.pem;
   ssl_certificate_key /etc/letsencrypt/live/kedacore-test.hdo.ee/privkey.pem;

   # Strip the default headers and answer with an empty 404
   location / {
     more_clear_headers 'Content-Type';
     more_clear_headers 'last-modified';
     more_clear_headers 'etag';
     more_clear_headers 'accept-ranges';
     more_set_headers "Content-Length: 0";
     return 404;
   }
   location /api {
     more_clear_headers 'Content-Type';
     more_set_headers "Content-Length: 0";
     return 404;
   }
}

But no luck, even though the responses differ only in hosts, IPs and cipher.
I'm out of ideas for now.
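
One way to make such a comparison less error-prone is to diff the response headers mechanically. The sketch below is a hypothetical helper, not something used in this thread, and the sample header sets are made up for illustration. It only covers headers; differences at the TLS or HTTP/2 connection level would not show up here.

```python
def diff_headers(a: dict, b: dict, ignore=("date",)) -> dict:
    """Return {name: (value_in_a, value_in_b)} for headers that differ.

    Header names are compared case-insensitively; a missing header shows as None.
    """
    norm_a = {k.lower(): v for k, v in a.items()}
    norm_b = {k.lower(): v for k, v in b.items()}
    skipped = {name.lower() for name in ignore}
    diff = {}
    for name in sorted(set(norm_a) | set(norm_b)):
        if name in skipped:
            continue
        if norm_a.get(name) != norm_b.get(name):
            diff[name] = (norm_a.get(name), norm_b.get(name))
    return diff


# Hypothetical header sets, e.g. transcribed by hand from two `curl -v` runs.
internal = {"content-length": "0", "date": "Tue, 12 Dec 2023 19:05:26 GMT"}
test_server = {"Content-Length": "0", "X-Example": "1", "date": "Tue, 12 Dec 2023 19:07:00 GMT"}

print(diff_headers(internal, test_server))
```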

I will try to play with profiling in next few days and will drop an update once anything is figured out.

@JorTurFer
Member

We'd appreciate a profile so that we can go deeper on the issue.
The easiest way is to use the main branch and enable the profiler as an argument on the pod: https://github.com/kedacore/keda/blob/main/cmd/operator/main.go#L93
If this isn't possible (because you use another KEDA version, for example), you can take a look at this post: https://dev.to/tsuyoshiushio/enabling-memory-profiling-on-keda-v2-157g

@GoaMind
Author

GoaMind commented Dec 15, 2023

@JorTurFer thank you for the information, I will check it shortly.

Could you please check if you can reproduce it with this trigger:

  - metadata:
      activationThreshold: "20"
      customHeaders: Host=abc.pipedrive.tools
      query: sum(rate(http_requests_total{app="prometheus-test-monitored-app"}[2m]))
      serverAddress: https://pimp.pipedrive.tools
      threshold: "20"
    type: prometheus

The key thing here is that you need to get a 404 without a body; that's why we use customHeaders here
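
A quick way to check whether an endpoint really answers with a body-less 404 is a small probe like the sketch below. The `probe` helper and the local stub server are hypothetical and only demonstrate the check over plain HTTP. The thread shows that a hand-built endpoint returning the same response did not trigger the growth, so a body-less 404 alone should not be assumed to reproduce the leak.

```python
import http.server
import threading
import urllib.error
import urllib.request

# Bypass any proxy settings from the environment so local probes go direct.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class Empty404Handler(http.server.BaseHTTPRequestHandler):
    """Answer every GET with a 404 and an explicitly empty body."""

    def do_GET(self):
        self.send_response(404)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the console quiet


def start_stub(port: int = 0) -> http.server.HTTPServer:
    """Start the stub on localhost in a background thread; port 0 picks a free port."""
    server = http.server.HTTPServer(("127.0.0.1", port), Empty404Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def probe(url: str, headers=None):
    """Return (status, body length) for a GET, treating HTTP errors as normal responses."""
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with _opener.open(request, timeout=5) as resp:
            return resp.status, len(resp.read())
    except urllib.error.HTTPError as err:
        return err.code, len(err.read())


server = start_stub()
status, body_len = probe(f"http://127.0.0.1:{server.server_address[1]}/api/v1/query")
print(status, body_len)
server.shutdown()
server.server_close()
```

The `headers` argument plays the role of `customHeaders` in the trigger, for example `probe(url, headers={"Host": "abc.pipedrive.tools"})`.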

@JorTurFer
Member

Let me check your trigger :)

@JorTurFer
Member

JorTurFer commented Dec 15, 2023

I'm not 100% sure... maybe it's not related... xD
image

Now seriously, thanks for the report and reproduction path. I can confirm that I am able to reproduce the issue; I'll check it in depth later on

@GoaMind
Author

GoaMind commented Dec 15, 2023

Thank you for checking so promptly. 🙇
I'm still curious why this trigger produces such behaviour, while the manually prepared address with a 404 error (https://kedacore-test.hdo.ee/test) does not leave such a footprint.

@JorTurFer
Member

It looks like the consumption becomes stable after some time
image

But it definitely looks weird and I'll profile the workload to detect the cause

@JorTurFer JorTurFer self-assigned this Dec 15, 2023
@JorTurFer
Member

I think that I've found a possible problem. I'll draft a PR later on, but before merging it, would you be willing to test the fix if I build an image that contains it @GoaMind ?

@GoaMind
Author

GoaMind commented Dec 15, 2023

Hey, sure, I can test it once the docker image is available, if it is possible to go this way.

@JorTurFer
Member

Yeah, I'm preparing the PR and once I open it, I'll give you the docker tag 😄
thanks!

@JorTurFer
Member

Hey @GoaMind
This tag ghcr.io/kedacore/keda-test:pr-5293-fe19d3a3233bef79ac7f53ba4f967a58b569f5f8 has been created based on this PR (so basically it's main + my PR).
Could you give it a try and tell us if the problem is solved?

@GoaMind
Author

GoaMind commented Dec 19, 2023

@JorTurFer apologies for the late reply.

I can confirm that I do not observe memory leaking with your provided image.
Thank you so much for digging into this problem 🙇

@JorTurFer
Member

Nice!
The fix is already merged so it'll be included as part of next release 😁
