
CPUThrottlingHigh on metrics server (Prometheus alert)

Natan Yellin edited this page Dec 28, 2021 · 10 revisions

Disambiguation

This page covers a special case of the general CPUThrottlingHigh alert: CPU throttling on the metrics-server deployment. When the alert occurs elsewhere, see the general case.

Alert explanation

The default CPU limits for metrics-server are too low, resulting in CPU starvation. When possible, fix this so that your cluster runs more smoothly. There are two important caveats:

  1. metrics-server dynamically updates its CPU limits using the Kubernetes addon-resizer, so you cannot update the CPU limits in the normal way. See the instructions below for how to update the limits correctly.
  2. You cannot fix this issue on GKE. Any changes you make to the metrics-server deployment on GKE are reverted by GCP.

Recommended Remediation (non-GKE clusters)

metrics-server does not respect normal CPU limits, because the addon-resizer running in the metrics-server-nanny container overwrites them. To fix this issue, edit the metrics-server Deployment and increase the --cpu parameter of the metrics-server-nanny container (the '--cpu=40m' line in the manifest below). A good value for most clusters is 100m.


apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-server-v0.3.6
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        ...
        - command:
            - /pod_nanny
            - '--config-dir=/etc/config'
            - '--cpu=40m'
            - '--extra-cpu=0.5m'
            - '--memory=35Mi'
            - '--extra-memory=4Mi'
            - '--threshold=5'
            - '--deployment=metrics-server-v0.3.6'
            - '--container=metrics-server'
            - '--poll-period=300000'
            - '--estimator=exponential'
            - '--scale-down-delay=24h'
            - '--minClusterSize=5'
            - '--use-metrics=true'
          image: 'gke.gcr.io/addon-resizer:1.8.11-gke.0'
          name: metrics-server-nanny
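
As a minimal sketch of the change described above, the fragment below shows only the part of the nanny container that needs editing, with --cpu raised from 40m to the suggested 100m. It assumes the same Deployment layout as the manifest above. The Deployment name and the other flags vary between clusters and are left out here, so treat this as an illustration rather than a complete manifest.

```yaml
# Sketch only: the relevant fragment of the metrics-server Deployment.
# All flags not shown here should stay exactly as they are in your cluster.
spec:
  template:
    spec:
      containers:
        - name: metrics-server-nanny
          command:
            - /pod_nanny
            - '--cpu=100m'  # raised from 40m; 100m is the value suggested above
```

One way to apply this is to open the Deployment with kubectl edit in the kube-system namespace and change the --cpu line in place.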

Recommended Remediation for GKE clusters

On GKE, this issue can be fixed only by upgrading your clusters to a more recent Kubernetes version. The real-world impact of this issue is often negligible. To ignore it, you can change the CPUThrottlingHigh alert in your Prometheus rules to exclude metrics-server.
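
A hedged sketch of such an exclusion follows. It assumes the alert comes from the kubernetes-mixin / kube-prometheus rule set, where CPUThrottlingHigh is computed from the cAdvisor CFS throttling counters. The group name, the 25% threshold, the `for` duration, and the severity label are assumptions and may differ in your installed rules. The only change that matters is the added `pod!~"metrics-server.*"` matcher.

```yaml
# Sketch only: CPUThrottlingHigh with metrics-server pods excluded.
# Group name, threshold, `for`, and severity are assumptions; copy the
# values from your existing rule and add only the pod!~ matcher.
groups:
  - name: kubernetes-resources
    rules:
      - alert: CPUThrottlingHigh
        expr: |
          sum(increase(container_cpu_cfs_throttled_periods_total{container!="", pod!~"metrics-server.*"}[5m])) by (container, pod, namespace)
            /
          sum(increase(container_cpu_cfs_periods_total{pod!~"metrics-server.*"}[5m])) by (container, pod, namespace)
            > (25 / 100)
        for: 15m
        labels:
          severity: info
```

Matching on the pod name prefix rather than the exact Deployment name keeps the exclusion working when the versioned Deployment name (for example metrics-server-v0.3.6) changes.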

If you use Robusta, no configuration is necessary. This alert is automatically silenced for GKE metrics-server pods.
