Increasing memory usage #665
Comments
@abhijitawachar thanks for bringing this to our attention. I noticed that your metrics show a similar pattern of memory usage in the period between 5/4/22 and 5/12/22. We released v1.16.3 on 5/11/22, so it looks like this memory behavior may have existed prior to v1.16.3.
@AustinSiu thanks for looking into this.
@snay2 / @AustinSiu did you get a chance to look at this?
@abhijitawachar Yes, I am actively investigating this week. So far, I have been able to reproduce this behavior in a sandboxed environment, and I am currently digging into possible causes with a memory profiler.
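For anyone who wants to follow along with this kind of investigation, below is a minimal sketch of exposing heap profiles from a Go service via net/http/pprof. This is a generic Go pattern, not a description of how NTH itself is instrumented, and the port is illustrative only.

```go
// Minimal sketch: expose pprof endpoints so heap profiles can be
// captured while memory grows. Generic Go pattern; not NTH's actual
// instrumentation.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Illustrative port; choose one that does not clash with the
	// application's existing listeners (e.g. 9092 and 8080 in the logs below).
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the application runs here ...
	select {}
}
```

A heap snapshot can then be pulled with `go tool pprof http://localhost:6060/debug/pprof/heap` and compared across points in time to see which allocation sites keep growing.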
Hey @snay2, did you get a chance to look into this?
My initial research from 2022-08-29 was inconclusive, and so far I haven't found a root cause. For now, I'm pausing active work on this issue. Sorry I don't have better news at this time. :(

However, we have long-term plans to build a more robust simulation system in our CI/CD process that will help us monitor for memory leaks. To help with that, we want to write test drivers that simulate real-world workloads. Could you tell me a bit more about how your cluster is set up (Queue Processor or IMDS; how frequently does NTH process messages, and what kinds; etc.)?

From your graphs, it looks like every few weeks, the NTH pod gets OOM killed and restarted; is that correct? Your graph is measuring the memory usage of the NTH pod, yes? Is this behavior causing any negative performance impact to your cluster? That kind of information will help us design a realistic simulation system and hopefully track down this leak.
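As a rough illustration of the kind of test driver described above, the sketch below pushes synthetic EC2 state-change events onto an SQS queue at a fixed rate using the AWS SDK for Go. The queue URL, region, and event payload are placeholders and the payload only loosely mirrors an EventBridge notification; this is not the maintainers' planned simulation tooling.

```go
// Rough sketch of a load driver that feeds synthetic EC2 state-change
// events to a Queue Processor SQS queue. Queue URL and payload are
// placeholders for illustration only.
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-west-2")}))
	client := sqs.New(sess)
	queueURL := "https://sqs.us-west-2.amazonaws.com/123456789012/nth-test-queue" // placeholder

	for i := 0; ; i++ {
		// Shape loosely mirrors an EC2 Instance State-change Notification.
		body := fmt.Sprintf(`{"detail-type":"EC2 Instance State-change Notification","detail":{"instance-id":"i-%012d","state":"shutting-down"}}`, i)
		if _, err := client.SendMessage(&sqs.SendMessageInput{
			QueueUrl:    aws.String(queueURL),
			MessageBody: aws.String(body),
		}); err != nil {
			log.Printf("send failed: %v", err)
		}
		time.Sleep(10 * time.Second) // adjust to mimic the real message rate
	}
}
```

Running a driver like this against a test cluster while watching the NTH pod's memory over days or weeks is one way to reproduce a slow leak under controlled, repeatable load.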
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.
This issue was closed because it has become stale with no activity. |
Describe the bug
We have deployed v1.16.3 of the node termination handler. I noticed that its memory usage increases over a period of time until it eventually reaches the pod memory limit and the pod is OOMKilled. Is there a memory leak somewhere?
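One way to narrow down whether the growth is in the Go heap or elsewhere in the container is to log Go runtime statistics and compare them against the pod-level memory graph. The sketch below is a generic Go pattern, not something NTH exposes today as far as I know, and is only meant to illustrate that comparison.

```go
// Generic sketch: periodically log Go heap statistics so they can be
// compared against the pod's memory graph. Not part of NTH.
package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	for range time.Tick(1 * time.Minute) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		// HeapAlloc: live heap bytes; Sys: total bytes obtained from the OS.
		log.Printf("heap_alloc=%d heap_objects=%d sys=%d", m.HeapAlloc, m.HeapObjects, m.Sys)
	}
}
```

If the heap numbers climb in step with the pod's memory, the leak is likely in allocated objects; if the pod's memory grows while the heap stays flat, the growth is somewhere else in the container.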
Application Logs
The following are the logs from the aws-node-termination-handler pod. The logs do not show anything erroneous:
2022/07/28 07:15:01 INF Starting to serve handler /metrics, port 9092
2022/07/28 07:15:01 INF Starting to serve handler /healthz, port 8080
2022/07/28 07:15:01 INF Startup Metadata Retrieved metadata={"accountId":"xxxx","availabilityZone":"us-west-2b","instanceId":"i-xxxx","instanceLifeCycle":"on-demand","instanceType":"c6i.4xlarge","localHostname":"xxxx.us-west-2.compute.internal","privateIp":"x.x.x.x","publicHostname":"","publicIp":"","region":"us-west-2"}
2022/07/28 07:15:01 INF aws-node-termination-handler arguments:
dry-run: false,
node-name: xxxx.us-west-2.compute.internal,
pod-name: aws-node-termination-handler-b56bf578b-79x5m,
metadata-url: http://abcd,
kubernetes-service-host: x.x.x.X,
kubernetes-service-port: 443,
delete-local-data: true,
ignore-daemon-sets: true,
pod-termination-grace-period: -1,
node-termination-grace-period: 120,
enable-scheduled-event-draining: false,
enable-spot-interruption-draining: false,
enable-sqs-termination-draining: true,
enable-rebalance-monitoring: false,
enable-rebalance-draining: false,
metadata-tries: 3,
cordon-only: false,
taint-node: true,
taint-effect: NoSchedule,
exclude-from-load-balancers: false,
json-logging: false,
log-level: info,
webhook-proxy: ,
webhook-headers: ,
webhook-url: ,
webhook-template: ,
uptime-from-file: ,
enable-prometheus-server: true,
prometheus-server-port: 9092,
emit-kubernetes-events: true,
kubernetes-events-extra-annotations: ,
aws-region: us-west-2,
queue-url: https://xxxx,
check-asg-tag-before-draining: false,
managed-asg-tag: aws-node-termination-handler/managed,
use-provider-id: false,
aws-endpoint: ,
2022/07/28 07:15:01 INF Started watching for interruption events
2022/07/28 07:15:01 INF Kubernetes AWS Node Termination Handler has started successfully!
2022/07/28 07:15:01 INF Started watching for event cancellations
2022/07/28 07:15:01 INF Started monitoring for events event_type=SQS_TERMINATE
2022/07/28 07:20:12 INF Adding new event to the event store event={"AutoScalingGroupName":"xxxx","Description":"EC2 State Change event received. Instance i-xxxx went into shutting-down at 2022-07-28 07:20:11 +0000 UTC \n","EndTime":"0001-01-01T00:00:00Z","EventID":"ec2-state-change-event-xxxx","InProgress":false,"InstanceID":"i-xxxx","IsManaged":true,"Kind":"SQS_TERMINATE","NodeLabels":null,"NodeName":"xxxx.us-west-2.compute.internal","NodeProcessed":false,"Pods":null,"ProviderID":"aws:///us-west-2c/xxxx","StartTime":"2022-07-28T07:20:11Z","State":""}