Cluster Autoscaler on AWS is OOM killed on startup in GenerateEC2InstanceTypes #3506

Closed
timothyb89 opened this issue Sep 11, 2020 · 31 comments

@timothyb89

We noticed our cluster autoscaler occasionally getting OOM killed on startup or when elected as leader. The memory usage spike on startup is fairly consistent even when not OOM killed, sitting just below the default limits at 250Mi or so. When it doesn't OOM, this memory is eventually garbage collected and the autoscaler stabilizes at well under 100Mi used:

[screenshot: pod memory usage over time, showing the startup spike near the limit and later stabilization well under 100Mi]

After a pprof trace (requiring an ad-hoc upgrade to cluster-autoscaler v1.18.2 to get the --profiling flag) we noticed a large chunk of memory allocated in the GenerateEC2InstanceTypes function. We were able to trace this back to PR #2249 which fetches an updated list of EC2 instance types from an AWS-hosted JSON file. Surprisingly, this file is 94 MiB, the entirety of which is fetched onto the heap before parsing. The data extracted is fairly small (under 43KiB per ec2_instance_types.go) but unfortunately the allocations sometimes live long enough to push the autoscaler over the (default) memory limit.
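
To make the allocation pattern concrete, here is a minimal, self-contained Go sketch of the "fetch everything, then parse" approach described above. This is not the autoscaler's actual code; the URL constant and names are illustrative only.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// Illustrative endpoint; the real per-region pricing index is referenced later in this thread.
const pricingURL = "https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/us-west-2/index.json"

func main() {
	resp, err := http.Get(pricingURL)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The entire ~100 MiB document is buffered on the heap here...
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}

	// ...and unmarshalling allocates a second, similarly sized representation
	// before the small amount of instance-type data is actually extracted.
	var offers map[string]json.RawMessage
	if err := json.Unmarshal(body, &offers); err != nil {
		panic(err)
	}
	fmt.Println("top-level sections:", len(offers))
}
```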

Additionally, with the --aws-use-static-instance-list=true flag set, the memory spike disappears:
[screenshot: pod memory usage with --aws-use-static-instance-list=true, showing no startup spike]

Is there some solution that could fetch the updated list without requiring an otherwise unnecessary memory limit increase? Given the autoscaler's special priority class, raising the limit well beyond what it actually needs at runtime feels a bit wrong.

Additional information:

  • autoscaler image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.16.6

  • Kubernetes version:

    Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-25T14:58:59Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.13-eks-2ba888", GitCommit:"2ba888155c7f8093a1bc06e3336333fbdb27b3da", GitTreeState:"clean", BuildDate:"2020-07-17T18:48:53Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
    
  • pprof svg: cluster-autoscaler-pprof.tar.gz (.svg in a tarball to satisfy GitHub)

  • kubectl describe pod output:

    Name:                 cluster-autoscaler-7b9c56647d-9v8pr
    Namespace:            kube-system
    Priority:             2000000000
    Priority Class Name:  system-cluster-critical
    Node:                 ip-192-168-125-18.us-west-2.compute.internal/192.168.125.18
    Start Time:           Tue, 08 Sep 2020 05:18:08 -0600
    Labels:               app=cluster-autoscaler
                          app.kubernetes.io/instance=cluster-autoscaler
                          app.kubernetes.io/name=cluster-autoscaler
                          pod-template-hash=7b9c56647d
    Annotations:          cluster-autoscaler.kubernetes.io/safe-to-evict: false
                          kubernetes.io/psp: psp.privileged
                          prometheus.io/path: /metrics
                          prometheus.io/port: 8085
                          prometheus.io/scrape: true
    Status:               Running
    IP:                   192.168.121.122
    IPs:
      IP:           192.168.121.122
    Controlled By:  ReplicaSet/cluster-autoscaler-7b9c56647d
    Containers:
      cluster-autoscaler:
        Container ID:  docker://cbdbb11a7c20b042d79744edbb5dd0c6fde71303be697a1a773307c9d5ac442c
        Image:         k8s.gcr.io/autoscaling/cluster-autoscaler:v1.16.6
        Image ID:      docker-pullable://k8s.gcr.io/autoscaling/cluster-autoscaler@sha256:cbbe98dd8f325bef54557bc2854e48983cfc706aba126bedb0c52d593e869072
        Port:          8085/TCP
        Host Port:     0/TCP
        Command:
          ./cluster-autoscaler
          --v=4
          --stderrthreshold=info
          --cloud-provider=aws
          --skip-nodes-with-local-storage=false
          --expander=least-waste
          --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<snip>1
          --balance-similar-node-groups
          --skip-nodes-with-system-pods=false
        State:          Running
          Started:      Wed, 09 Sep 2020 16:56:37 -0600
        Last State:     Terminated
          Reason:       OOMKilled
          Exit Code:    137
          Started:      Wed, 09 Sep 2020 16:52:52 -0600
          Finished:     Wed, 09 Sep 2020 16:56:21 -0600
        Ready:          True
        Restart Count:  2
        Limits:
          cpu:     100m
          memory:  300Mi
        Requests:
          cpu:        100m
          memory:     300Mi
        Environment:  <none>
        Mounts:
          /etc/ssl/certs/ca-certificates.crt from ssl-certs (ro)
          /var/run/secrets/kubernetes.io/serviceaccount from cluster-autoscaler-token-bwpc6 (ro)
    Conditions:
      Type              Status
      Initialized       True 
      Ready             True 
      ContainersReady   True 
      PodScheduled      True 
    Volumes:
      ssl-certs:
        Type:          HostPath (bare host directory volume)
        Path:          /etc/ssl/certs/ca-bundle.crt
        HostPathType:  
      cluster-autoscaler-token-bwpc6:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  cluster-autoscaler-token-bwpc6
        Optional:    false
    QoS Class:       Guaranteed
    Node-Selectors:  <none>
    Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                     node.kubernetes.io/unreachable:NoExecute for 300s
    Events:          <none>
    

@hhamalai

Experienced the same OOM issue when disabling IMDSv1 and switching purely to IRSA, but the deployment was missing the AWS_REGION environment variable, which leads the Cluster Autoscaler to query the pricing information for every available region. With these JSON document sizes, OOMKills are likely to happen. With AWS_REGION specified, only the matching region's pricing data is retrieved:

if region != "" && region != r.ID() {
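
For reference, a rough, self-contained sketch of the loop that check lives in, using the aws-sdk-go endpoints package (an approximation for illustration, not the autoscaler's exact code):

```go
package main

import (
	"fmt"
	"os"

	"github.com/aws/aws-sdk-go/aws/endpoints"
)

func main() {
	region := os.Getenv("AWS_REGION")
	for _, p := range endpoints.DefaultPartitions() {
		for _, r := range p.Regions() {
			// With AWS_REGION set, every non-matching region is skipped, so only
			// a single pricing document is downloaded. Without it, the large
			// pricing document is fetched for every region.
			if region != "" && region != r.ID() {
				continue
			}
			fmt.Printf("would fetch pricing for %s: https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/%s/index.json\n", r.ID(), r.ID())
		}
	}
}
```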

@seamusabshere

I am seeing this error even with 6Gi of memory limits... something is wrong.

@mcristina422
Contributor

We've seen similar issues with our AWS autoscaler. We didn't have 1.18 to take a pprof, but it was using > 5GB of RAM. Maybe we should default to the static list? I don't think it's been updated recently.

@seamusabshere

I cross-commented on this related issue: #3044 (comment)

@jaypipes
Contributor

jaypipes commented Jan 13, 2021

I cross-commented on this related issue: #3044 (comment)

I actually don't believe issue #3044 and this one are due to the same problem. This one is pretty clearly the result of the dynamic instance type generation pulling down 100+MB JSON files on startup. In #3044, however, you and another poster point out that using the static instance type list does not solve the memory leak issues. I believe the root causes of these issues are different.

@jqmichael

If adding AWS_REGION to the container fixes pulling down the huge instance type file: it was added to the latest https://github.com/aws/amazon-eks-pod-identity-webhook and is also enabled in 1.18 EKS clusters.

@seamusabshere

[ @jaypipes means #3044 not #3004 above ]

@jaypipes
Contributor

[ @jaypipes means #3044 not #3004 above ]

doh, yep, sorry about that! fixed :)

@ellistarn
Contributor

ellistarn commented Jan 22, 2021

I'm trying to reproduce. How big a cluster are you trying this on? I'm running 1.18.2 on EKS 1.18 with a 100-node cluster and 400 pods, and it's sitting stable at 300MB of memory.

@ellistarn
Contributor

Update after deep-diving this: @seamusabshere's OOM was due to listwatch caches filling up on startup because of a large number of Job objects in the API server.

@timothyb89, is there any chance your cluster is suffering a similar fate?

@seamusabshere

# courtesy https://stackoverflow.com/a/61231027/310192
kubectl delete jobs --field-selector status.successful=1 

😆

I thought I was safe because we were using ttlSecondsAfterFinished... but that's an alpha feature, and per @ellistarn, "[EKS runs] feature gates that are in Beta."

So, I had thousands of months-old jobs.

@timothyb89
Author

@timothyb89, is there any chance your cluster is suffering a similar fate?

Our largest cluster has 250 job objects at the moment, which I'd hope isn't nearly large enough to cause any trouble.

For what it's worth, we've been using --aws-use-static-instance-list=true since September and have not seen any unexpected restarts.

@e-nalepa

e-nalepa commented Feb 15, 2021

I can reproduce the issue, even though, contrary to the initial ticket, the default limit is now set at 300Mi rather than 250Mi.

Sometimes the cluster-autoscaler pod needs a lot more, as below (572Mi):

NAME                                        CPU(cores)   MEMORY(bytes)
...
aws-node-m548b                              4m           40Mi
cluster-autoscaler-6478668dc5-j6gql         93m          572Mi
coredns-6d97dc4b59-dc72r                    3m           7Mi
...

Increasing this limit accordingly solves the issue on my side.

@carlosjgp

Is there any chance we could load this list from the local filesystem, so that in combination with an initContainer or a static ConfigMap we could keep the "controller" memory limits closer to the requests? (A rough sketch of the idea follows.)

We have to configure requests: 96Mi and limits: 512Mi... I bet that list is going to keep growing and eventually crash the pods 😓
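
A minimal sketch of what the local-file idea could look like, assuming the pricing document were mounted into the pod; the path and names here are hypothetical, not an existing cluster-autoscaler flag or convention:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Hypothetical mount point, e.g. populated by an initContainer or a volume;
// this is not an existing cluster-autoscaler path or option.
const localPricingPath = "/etc/cluster-autoscaler/ec2-pricing/index.json"

func main() {
	f, err := os.Open(localPricingPath)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Reading from a local file keeps the download (and its failure modes)
	// out of the controller process; parsing cost still depends on how the
	// document is decoded.
	var offers map[string]json.RawMessage
	if err := json.NewDecoder(f).Decode(&offers); err != nil {
		panic(err)
	}
	fmt.Println("top-level sections:", len(offers))
}
```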

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 14, 2021
@carlosjgp

/remove-lifecycle stale

georgio-sd added a commit to georgio-sd/amazon-eks-user-guide that referenced this issue Jul 23, 2021
CA has a known bug kubernetes/autoscaler#3506
The container consumes more memory than it is limited to.

This fix will prevent issues with OOMKill errors with cluster-autoscaler container
@jolestar

jolestar commented Aug 9, 2021

Fixed by incrementing the memory limit to 800Mi.

@mindw

mindw commented Aug 9, 2021

Fixed by incrementing the memory limit to 800Mi.

That option was mentioned and referenced above several times (#3506 (comment)). The idea of having the list provided as an input is IMHO one of the better options.

@zerthimon

zerthimon commented Aug 10, 2021

Having the same issue with 1.21.0; downgrading to 1.20.0 fixes the problem.
Edit: increasing the memory limit of the pod also fixes the issue.
The process takes ~500MB of resident memory. Does it really need that much memory to work?

@Zauxst

Zauxst commented Aug 10, 2021

I am suffering from this problem as well on AWS EKS... It started with 1.19.1; I've upgraded to 1.20.0 and am now looking at raising the memory limit.
We are running the default configuration.

@watkinsmike

This is also affecting me, although for some reason only on a single cluster, even with the exact same configuration and limits. Upping the memory limit to 500Mi has been the workaround I used successfully.

@nakamasato

This issue seems to be mitigated by #4199, but for now the only workaround seems to be increasing the memory requests and limits, as in #4207.

@fblgit

fblgit commented Sep 25, 2021

We also have this issue; even with the limit up to 1Gi it still dies... somewhere there is a leak, and it is related to CA and AWS.

@gjtempleton do you have a date for the #4199 milestone? In which release can we expect this?

@gjtempleton
Member

Currently #4199 has made it into the default branch and has been cherry-picked back to the 1.19, 1.20, 1.21 and 1.22 release branches, so it will make it into the next patch releases of all those versions. #4251 is the issue to watch for the cutting of those releases.
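
For anyone curious about the general shape of the streaming fix, here is a sketch; it is not the exact #4199 implementation, and the URL is just the per-region pricing index referenced elsewhere in this thread. Instead of buffering the whole response and unmarshalling it in one call, a json.Decoder walks the document token by token, so only a small amount of it is resident at any time:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Illustrative per-region pricing index, as referenced elsewhere in this thread.
const pricingURL = "https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/us-west-2/index.json"

func main() {
	resp, err := http.Get(pricingURL)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Stream tokens straight from the response body instead of reading
	// the entire ~100 MiB document into memory first.
	dec := json.NewDecoder(resp.Body)
	count := 0
	for {
		tok, err := dec.Token()
		if err != nil {
			break // io.EOF (or any error) ends the loop
		}
		// A real implementation would descend into the "products" section and
		// decode one product object at a time; here we just count occurrences
		// of the "instanceType" key as a rough illustration.
		if s, ok := tok.(string); ok && s == "instanceType" {
			count++
		}
	}
	fmt.Println("instanceType keys seen:", count)
}
```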

@kragniz
Member

kragniz commented Oct 12, 2021

I've deployed v1.22.1 into a cluster which was previously seeing OOMKills with a memory limit of 300Mi. It's fixed the problem for us.

@gjtempleton
Member

Awesome, thanks for letting us know, all credit to @aidy for doing the hard work.

I'll leave this issue open for a bit longer to see if anyone's still seeing these issues with the new patch releases that include the streaming change; if not, I'll close it off in a week.

@gjtempleton
Member

Given the lack of any new reports of this issue, I'm going to close this as resolved for now, please let us know if you see any recurrence of this behaviour though.

/close

@k8s-ci-robot
Contributor

@gjtempleton: Closing this issue.

In response to this:

Given the lack of any new reports of this issue, I'm going to close this as resolved for now, please let us know if you see any recurrence of this behaviour though.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@zerthimon

v1.22.1 fixed the issue for me.

@tarun4279

As @hhamalai mentioned, Cluster Autoscaler internally downloads a JSON file; in my case it was 114MB. Giving the cluster-autoscaler pod some additional memory fixed the issue.

Check the logs for it fetching https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/ap-south-1/index.json

@vishal563

vishal563 commented Feb 7, 2022

I've deployed v1.22.1 into a cluster which was previously seeing OOMKills with a memory limit of 300Mi. It's fixed the problem for us.

This worked for me, thanks.
