Cluster Autoscaler on AWS is OOM killed on startup in GenerateEC2InstanceTypes #3506
Comments
Experienced the same OOM issue when disabling IMDSv1 and switching purely to IRSA, but the deployment was missing the AWS_REGION environment variable, which leads Cluster Autoscaler to query pricing information for all available regions. With these JSON document sizes, OOMKills are likely to happen. With AWS_REGION specified, only the matching region's pricing data is retrieved.
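For illustration, a minimal sketch of how AWS_REGION might be set on the cluster-autoscaler container (a Deployment pod-spec fragment; the image tag and region value are assumptions, not from this thread):

```yaml
# Fragment of a cluster-autoscaler Deployment pod spec (illustrative).
containers:
  - name: cluster-autoscaler
    image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.20.0  # example tag
    env:
      # Restrict pricing lookups to a single region so the autoscaler does
      # not download the pricing JSON for every AWS region on startup.
      - name: AWS_REGION
        value: us-east-1  # example region
```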
I am seeing this error even with 6Gi of memory limits... something is wrong.
We've seen similar issues with our AWS autoscaler. We didn't have 1.18 to take a pprof, but it was using more than 5GB of RAM. Maybe we should default to the static list? I don't think it's been updated recently.
I cross-commented on this related issue: #3044 (comment)
I actually don't believe issue #3044 and this one are due to the same problem. This one is pretty clearly the result of the dynamic instance type generation pulling down 100+MB JSON files on startup. In #3044, however, you and another poster point out that using the static instance type list does not solve the memory leak issues, so I believe the root cause there is different.
If adding [...]
I'm trying to reproduce. How big a cluster are you trying out? I'm running 1.18.2 on EKS 1.18 with a 100-node cluster and 400 pods, and it's sitting stable at 300MB of memory.
Update after deep-diving this: @seamusabshere's OOM was due to listwatch caches filling up on startup because of a large number of Job objects in the API server. @timothyb89, is there any chance your cluster is suffering a similar fate?
😆 I thought I was safe because we were using [...]. So, I had thousands of months-old jobs.
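Not something raised in this thread, but one way to keep finished Jobs from piling up in the API server (and inflating the autoscaler's listwatch caches) is the Job TTL field; a minimal sketch, assuming the workloads can tolerate automatic cleanup and with illustrative names:

```yaml
# Sketch: a Job that Kubernetes deletes automatically after it finishes,
# so old Job objects don't accumulate in the API server.
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job            # illustrative name
spec:
  ttlSecondsAfterFinished: 86400  # remove the Job object ~1 day after completion
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox
          command: ["sh", "-c", "echo done"]
```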
Our largest cluster has 250 Job objects at the moment, which I'd hope isn't nearly large enough to cause any trouble. For what it's worth, we've been using [...]
I can reproduce the issue, although contrary to the initial ticket, the default limit is now set at 300Mi rather than 250Mi. Sometimes the [...] Increasing this limit accordingly solves the issue on my side.
Is there any chance we could grab this list from the local filesystem, in combination with using an [...]? We have to configure [...]
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so. Send feedback to sig-contributor-experience at kubernetes/community.
/remove-lifecycle stale
CA has a known bug (kubernetes/autoscaler#3506): the container consumes more memory than it is limited to. This fix will prevent OOMKill errors with the cluster-autoscaler container.
Fixed by incrementing the memory limit to 800Mi.
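For reference, a sketch of what that workaround might look like on the cluster-autoscaler container; only the 800Mi limit comes from this thread, the request and CPU figures are placeholders:

```yaml
# Container-level resources fragment for the cluster-autoscaler Deployment.
resources:
  requests:
    cpu: 100m        # placeholder
    memory: 300Mi    # placeholder
  limits:
    cpu: 100m        # placeholder
    memory: 800Mi    # raised limit reported above to avoid the startup OOMKill
```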
That option was mentioned and referenced above several times: #3506 (comment). The idea of having the list provided as input is IMHO one of the better options.
Having the same issue with 1.21.0; downgrading to 1.20.0 fixes the problem.
I am suffering from this problem as well on AWS EKS... It started with 1.19.1; I've upgraded to 1.20.0 and now I'm looking at upgrading the memory.
This is also affecting me, although for some reason only on a single cluster, even using the exact same configuration and limits. Upping the memory limits to 500Mi has been the workaround I used successfully.
We also have this issue; even with the limit up to 1G it's still dying. Somewhere there is a leak, and it's related to CA and AWS. @gjtempleton, do you have a date for the #4199 milestone? In which release can we expect this?
I've deployed v1.22.1 into a cluster which was previously seeing an OOMKill with a memory limit of 300Mi. It's fixed the problem for us.
Awesome, thanks for letting us know; all credit to @aidy for doing the hard work. I'll leave this issue open for a bit longer to see if anyone's still seeing these issues with the new patch releases that include the streaming change, but if not I'll close it off in a week.
Given the lack of any new reports of this issue, I'm going to close this as resolved for now; please let us know if you see any recurrence of this behaviour though. /close
@gjtempleton: Closing this issue.
Cluster Autoscaler internally downloads a JSON file, as @hhamalai mentioned. In my case it was 114MB. Giving the cluster-autoscaler pod some additional memory fixed the issue. Check out the log.
This worked for me, thanks.
We noticed our cluster autoscaler occasionally getting OOM killed on startup or when elected as leader. The memory usage spike on startup is fairly consistent even when not OOM killed, sitting just below the default limits at 250Mi or so. When it doesn't OOM, this memory is eventually garbage collected and the autoscaler stabilizes at well under 100Mi used:
After a pprof trace (requiring an ad-hoc upgrade to cluster-autoscaler v1.18.2 to get the --profiling flag), we noticed a large chunk of memory allocated in the GenerateEC2InstanceTypes function. We were able to trace this back to PR #2249, which fetches an updated list of EC2 instance types from an AWS-hosted JSON file. Surprisingly, this file is 94 MiB, the entirety of which is fetched onto the heap before parsing. The data extracted is fairly small (under 43KiB per ec2_instance_types.go), but unfortunately the allocations sometimes live long enough to push the autoscaler over the (default) memory limit.
Additionally, with the --aws-use-static-instance-list=true flag set, the memory spike disappears.
Is there some solution that could fetch the updated list without requiring an otherwise unnecessary memory limit increase? Given the autoscaler's special priority class, raising the limit well beyond what it actually needs at runtime feels a bit wrong.
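For readers looking for the interim workaround, a sketch of how the flags discussed above might appear in the container command; the surrounding flags are assumptions standing in for an existing cluster-autoscaler command line, not taken from this issue:

```yaml
# Fragment of the cluster-autoscaler container spec (illustrative).
command:
  - ./cluster-autoscaler
  - --cloud-provider=aws                   # assumed existing flag
  - --aws-use-static-instance-list=true    # skip the large remote pricing JSON fetch
  - --profiling                            # expose pprof endpoints for heap traces
```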
Additional information:
autoscaler image:
k8s.gcr.io/autoscaling/cluster-autoscaler:v1.16.6
Kubernetes version:
pprof svg: cluster-autoscaler-pprof.tar.gz (.svg in a tarball to satisfy GitHub)
kubectl describe pod output: