Tried to log into https://aws-uswest2.pangeo.io today but was not getting a server. It turns out our core node (running on spot) was shut down, and when the cluster-autoscaler pod was recreated it kept crashing due to kubernetes/autoscaler#3506.

Just wanted to document another case of things running on k8s eventually breaking due to version incompatibilities or other issues. The fix was pretty easy, and the configuration files are in #968.
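For anyone hitting something similar later, here's a rough sketch of how to spot this kind of crash loop. The pod name suffix is illustrative, and I'm assuming the autoscaler runs in `kube-system` as it commonly does:

```sh
# Find the autoscaler pod (the name suffix varies per deployment)
kubectl -n kube-system get pods | grep cluster-autoscaler

# Check restart count and last state -- an OOMKilled status points at the memory limit
kubectl -n kube-system describe pod cluster-autoscaler-xxxxx

# Pull logs from the previous (crashed) container instance
kubectl -n kube-system logs cluster-autoscaler-xxxxx --previous
```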
Just had to give the autoscaler pod more memory (and bump the image to v1.16.7), so we went from:
```yaml
- image: us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.16.5
  name: cluster-autoscaler
  resources:
    limits:
      cpu: 100m
      memory: 300Mi
    requests:
      cpu: 100m
      memory: 300Mi
```
to:
```yaml
- image: us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.16.7
  name: cluster-autoscaler
  resources:
    limits:
      cpu: 100m
      memory: 600Mi
    requests:
      cpu: 100m
      memory: 600Mi
```
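Rolling out the change is then just an apply plus a rollout check. This assumes the autoscaler is managed as a Deployment in `kube-system` (as in the linked config), and the manifest path is illustrative:

```sh
# Apply the updated manifest with the new image tag and memory limits
kubectl apply -f cluster-autoscaler-deployment.yaml

# Watch the rollout to confirm the new pod comes up healthy
kubectl -n kube-system rollout status deployment/cluster-autoscaler
```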