
CA does not work properly while using AWS EC2 IMDSv2 only in EKS #3592

Closed
hans72118 opened this issue Oct 8, 2020 · 14 comments · Fixed by #4127
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@hans72118

AWS EKS recently added support for EC2 Instance Metadata Service v2 (IMDSv2).

In my test environment, I created a worker node with IMDSv2 only, which requires token-backed sessions to access IMDS.

Under this condition, however, CA cannot unmarshal the instance identity document:

I1008 18:57:01.160950       1 aws_util.go:150] fetching http://169.254.169.254/latest/dynamic/instance-identity/document
..........
W1008 18:57:01.760556       1 aws_util.go:166] Error unmarshalling http://169.254.169.254/latest/dynamic/instance-identity/document, skip...

Checking the CA pod, it keeps getting OOMKilled and ends up in CrashLoopBackOff.

# kubectl get pod -n kube-system
NAME                                  READY   STATUS             RESTARTS   AGE
cluster-autoscaler-5b5489859f-2pkdt   0/1     CrashLoopBackOff   6          13m
# kubectl describe pod cluster-autoscaler-5b5489859f-2pkdt -n kube-system
Name:           cluster-autoscaler-5b5489859f-2pkdt
Namespace:      kube-system
Priority:       0
Node:           ip-172-31-23-13.ap-northeast-1.compute.internal/172.31.23.13
Start Time:     Thu, 08 Oct 2020 19:22:15 +0000
Labels:         app=cluster-autoscaler
                pod-template-hash=5b5489859f
Annotations:    kubernetes.io/psp: eks.privileged
                prometheus.io/port: 8085
                prometheus.io/scrape: true
Status:         Running
IP:             172.31.20.73
IPs:            <none>
Controlled By:  ReplicaSet/cluster-autoscaler-5b5489859f
Containers:
  cluster-autoscaler:
    Container ID:  docker://8cea864df872af960650f9f01061ca52e62855f680306238f75a12cbc798f8a5
    Image:         k8s.gcr.io/autoscaling/cluster-autoscaler:v1.15.7
    Image ID:      docker-pullable://k8s.gcr.io/autoscaling/cluster-autoscaler@sha256:6641a69b4ea5f911ccbb11b75b2675261d90bf169f612c9e960f60036336d664
    Port:          <none>
    Host Port:     <none>
    Command:
      ./cluster-autoscaler
      --v=4
      --stderrthreshold=info
      --cloud-provider=aws
      --skip-nodes-with-local-storage=false
      --expander=least-waste
      --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/LAB-EKS-15
      --balance-similar-node-groups
      --skip-nodes-with-system-pods=false
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 08 Oct 2020 19:32:27 +0000
      Finished:     Thu, 08 Oct 2020 19:33:06 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 08 Oct 2020 19:29:07 +0000
      Finished:     Thu, 08 Oct 2020 19:29:46 +0000

Switching back to IMDSv1, it works without issue:

I1008 19:05:20.256839       1 aws_util.go:150] fetching http://169.254.169.254/latest/dynamic/instance-identity/document
I1008 19:05:38.256216       1 aws_cloud_provider.go:380] Successfully load 354 EC2 Instance Types [u-9tb1 m5n.8xlarge z1d.12xlarge m5dn.12xlarge m5.12xlarge c5d.4xlarge c5d.xlarge r6g.2xlarge m4.4xlarge c5.24xlarge r3.8xlarge i3en.24xlarge i3.4xlarge a1.xlarge r5ad.large r5dn.metal x1e u-9tb1.metal m5dn.16xlarge r5n.4xlarge t3.small c5n.2xlarge m5ad.large t3.micro c5d.2xlarge c1.xlarge r5a.24xlarge t3.large r6g.metal r5a.xlarge c6g.xlarge i3en.metal g4dn.xlarge r6g.16xlarge c3.large i2.4xlarge r5d.xlarge t4g.small t3a.xlarge c3.8xlarge m5d.4xlarge r5ad.xlarge h1 c5d.18xlarge u-6tb1.metal p2.8xlarge m6g.2xlarge c5d.metal i3en.2xlarge 
........
I1008 19:05:44.609556       1 auto_scaling_groups.go:354] Regenerating instance to ASG map for ASGs: []
I1008 19:05:44.609579       1 aws_manager.go:266] Refreshed ASG list, next refresh after 2020-10-08 19:06:44.609574794 +0000 UTC m=+102.445561263
I1008 19:05:44.609801       1 main.go:271] Registered cleanup signal handler
I1008 19:05:44.610023       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I1008 19:05:44.610039       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 1.791µs
I1008 19:05:54.610015       1 static_autoscaler.go:187] Starting main loop
I1008 19:05:54.610119       1 utils.go:622] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I1008 19:05:54.610130       1 filter_out_schedulable.go:63] Filtering out schedulables
I1008 19:05:54.610168       1 filter_out_schedulable.go:80] No schedulable pods
I1008 19:05:54.610188       1 static_autoscaler.go:334] No unschedulable pods
I1008 19:05:54.610203       1 static_autoscaler.go:381] Calculating unneeded nodes

I suspect CA does not use token-backed sessions when accessing IMDS.
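For reference, IMDSv2 requires obtaining a session token before any metadata read. A minimal shell sketch of the documented flow against the same endpoint that CA fetches in aws_util.go:

# IMDSv2: first obtain a session token via PUT with a TTL header
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

# ...then pass the token on every metadata request
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/dynamic/instance-identity/document"

With IMDSv2 enforced, a plain GET without the token header is rejected, which would explain the unmarshalling failure above.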

@brenwhyte

Got hit with this too, EKS 1.17

@MattiasPernhult

MattiasPernhult commented Nov 6, 2020

We worked around this issue by injecting the AWS_REGION environment variable into the cluster-autoscaler container. Obviously not an ideal solution (the real fix would be proper IMDSv2 support), but it works.
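One way to inject it (deployment name, namespace, and region here are examples; adjust for your cluster):

kubectl -n kube-system set env deployment/cluster-autoscaler AWS_REGION=eu-west-1

The same variable can of course also be set directly in the Deployment manifest or through the Helm chart's values.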

@bryankaraffa

bryankaraffa commented Nov 30, 2020

We worked around this issue by injecting the AWS_REGION environment variable into the cluster-autoscaler container. Obviously not an ideal solution (the real fix would be proper IMDSv2 support), but it works.

I was not able to work around this issue by injecting AWS_REGION or AWS_DEFAULT_REGION into the aws-cluster-autoscaler container. With the v1 metadata service [token optional], cluster-autoscaler does not error and has no issues.

Error log / behavior with IMDSv2 [token required]:

I1130 21:13:10.946968       1 aws_cloud_provider.go:371] Successfully load 392 EC2 Instance Types [...truncated...]
E1130 21:13:14.176281       1 aws_manager.go:262] Failed to regenerate ASG cache: cannot autodiscover ASGs: NoCredentialProviders: no valid providers in chain. Deprecated.
        For verbose messaging see aws.Config.CredentialsChainVerboseErrors
F1130 21:13:14.176302       1 aws_cloud_provider.go:376] Failed to create AWS Manager: cannot autodiscover ASGs: NoCredentialProviders: no valid providers in chain. Deprecated.
        For verbose messaging see aws.Config.CredentialsChainVerboseErrors

Here's our cluster-autoscaler Helm release [chart v9.1.0, setting awsRegion and autoDiscovery.clusterName], as well as our attempt to set the env variable:

resource "helm_release" "cluster_autoscaler" {
  depends_on = [
    module.eks, # Wait for cluster to be ready
  ]

  repository       = "https://kubernetes.github.io/autoscaler"
  chart            = "cluster-autoscaler"
  version          = "9.1.0"
  name             = "cluster-autoscaler"
  namespace        = "kube-system"

  values = [
    # Values set from terraform outputs
    <<EOL
awsRegion: ${module.eks.cluster_region}
autoDiscovery:
  clusterName: ${module.eks.cluster_name}
EOL
    ,
    # Workaround issue with IMDSv2
    # Inject AWS_DEFAULT_REGION into environment
    # https://github.com/kubernetes/autoscaler/issues/3592
    <<EOL
extraEnv:
  AWS_DEFAULT_REGION: ${module.eks.cluster_region}
EOL
    ,
  ] # End helm_release.values[]
}

and the resulting pod description (AWS_REGION is already set by the chart):

Name:         cluster-autoscaler-aws-cluster-autoscaler-c4b7bdd58-cm2d2
Namespace:    kube-system
Priority:     0
Node:         ip-10-100-1-57.us-west-2.compute.internal/10.100.1.57
Start Time:   Mon, 30 Nov 2020 13:06:38 -0800
Labels:       app.kubernetes.io/instance=cluster-autoscaler
              app.kubernetes.io/name=aws-cluster-autoscaler
              pod-template-hash=c4b7bdd58
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Running
IP:           10.100.0.110
IPs:
  IP:           10.100.0.110
Controlled By:  ReplicaSet/cluster-autoscaler-aws-cluster-autoscaler-c4b7bdd58
Containers:
  aws-cluster-autoscaler:
    Container ID:  docker://f91c44b21712ebcf385dfd687c5631dd44ceeb76d25afb765e6b9a5cfc43f96c
    Image:         us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.18.1
    Image ID:      docker-pullable://us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler@sha256:1f5b11617389b8e4ce15eb45fdbbfd4321daeb63c234d46533449ab780b6ca9a
    Port:          8085/TCP
    Host Port:     0/TCP
    Command:
      ./cluster-autoscaler
      --cloud-provider=aws
      --namespace=kube-system
      --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/kg-cet-917-staging-us-west-2
      --logtostderr=true
      --stderrthreshold=info
      --v=4
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Mon, 30 Nov 2020 13:10:10 -0800
      Finished:     Mon, 30 Nov 2020 13:10:16 -0800
    Ready:          False
    Restart Count:  5
    Liveness:       http-get http://:8085/health-check delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      AWS_REGION:  us-west-2
      AWS_DEFAULT_REGION:  us-west-2
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from cluster-autoscaler-aws-cluster-autoscaler-token-dlxmc (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  cluster-autoscaler-aws-cluster-autoscaler-token-dlxmc:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cluster-autoscaler-aws-cluster-autoscaler-token-dlxmc
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  4m43s                  default-scheduler  Successfully assigned kube-system/cluster-autoscaler-aws-cluster-autoscaler-c4b7bdd58-cm2d2 to ip-10-100-1-57.us-west-2.compute.internal
  Normal   Pulling    4m42s                  kubelet            Pulling image "us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.18.1"
  Normal   Pulled     4m40s                  kubelet            Successfully pulled image "us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.18.1"
  Warning  BackOff    2m52s (x9 over 4m10s)  kubelet            Back-off restarting failed container
  Normal   Created    2m38s (x5 over 4m40s)  kubelet            Created container aws-cluster-autoscaler
  Normal   Started    2m38s (x5 over 4m39s)  kubelet            Started container aws-cluster-autoscaler
  Normal   Pulled     2m38s (x4 over 4m16s)  kubelet            Container image "us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.18.1" already present on machine

kubectl version:

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.4", GitCommit:"d360454c9bcd1634cf4cc52d1867af5491dc9c5f", GitTreeState:"clean", BuildDate:"2020-11-12T01:09:16Z", GoVersion:"go1.15.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.12-eks-7684af", GitCommit:"7684af4ac41370dd109ac13817023cb8063e3d45", GitTreeState:"clean", BuildDate:"2020-10-20T22:57:40Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

helm version:

version.BuildInfo{Version:"v3.4.1", GitCommit:"c4e74854886b2efe3321e185578e6db9be0a6e29", GitTreeState:"dirty", GoVersion:"go1.15.4"}

@focaaby

focaaby commented Jan 8, 2021

I was not able to work around this issue by injecting the AWS_REGION or AWS_DEFAULT_REGION environment variables into aws-cluster-autoscaler either.

Also, there are other issues (#3276, #3216) related to loading the instance type list from the pricing API. I therefore upgraded to the latest version, 1.20, and added the --aws-use-static-instance-list=true flag. However, the pod still keeps terminating with exit code 255 and ends up in CrashLoopBackOff.

Here are the error log messages with IMDSv2 [token required]:

$ kubectl -n kube-system logs deployment.apps/cluster-autoscaler
...
...
I0108 07:20:04.590454       1 reflector.go:255] Listing and watching *v1.Node from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:246
I0108 07:20:04.944164       1 cloud_provider_builder.go:29] Building aws cloud provider.
W0108 07:20:04.944198       1 aws_cloud_provider.go:349] Use static EC2 Instance Types and list could be outdated. Last update time: 2019-10-14
I0108 07:20:04.945035       1 reflector.go:219] Starting reflector *v1.PersistentVolumeClaim (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.945051       1 reflector.go:255] Listing and watching *v1.PersistentVolumeClaim from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.945402       1 reflector.go:219] Starting reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.945415       1 reflector.go:255] Listing and watching *v1.Pod from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.945683       1 reflector.go:219] Starting reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.945695       1 reflector.go:255] Listing and watching *v1.Node from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.945952       1 reflector.go:219] Starting reflector *v1.StatefulSet (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.945964       1 reflector.go:255] Listing and watching *v1.StatefulSet from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.946231       1 reflector.go:219] Starting reflector *v1.PersistentVolume (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.946242       1 reflector.go:255] Listing and watching *v1.PersistentVolume from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.946531       1 reflector.go:219] Starting reflector *v1.StorageClass (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.946542       1 reflector.go:255] Listing and watching *v1.StorageClass from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.946838       1 reflector.go:219] Starting reflector *v1.CSINode (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.946850       1 reflector.go:255] Listing and watching *v1.CSINode from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.039201       1 reflector.go:219] Starting reflector *v1beta1.PodDisruptionBudget (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.039225       1 reflector.go:255] Listing and watching *v1beta1.PodDisruptionBudget from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.539276       1 reflector.go:219] Starting reflector *v1.Service (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.539475       1 reflector.go:255] Listing and watching *v1.Service from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.543333       1 reflector.go:219] Starting reflector *v1.ReplicationController (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.543349       1 reflector.go:255] Listing and watching *v1.ReplicationController from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.543835       1 reflector.go:219] Starting reflector *v1.ReplicaSet (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.543850       1 reflector.go:255] Listing and watching *v1.ReplicaSet from k8s.io/client-go/informers/factory.go:134
$ kubectl get po -A -w | grep "cluster"
kube-system         cluster-autoscaler-bcbc77bc7-lcsf5              1/1     Running   0          2m7s
kube-system         cluster-autoscaler-bcbc77bc7-lcsf5              0/1     Error     0          2m21s
kube-system         cluster-autoscaler-bcbc77bc7-lcsf5              1/1     Running   1          2m23s
$ kubectl -n kube-system describe po cluster-autoscaler-bcbc77bc7-lcsf5
Name:         cluster-autoscaler-bcbc77bc7-lcsf5
Namespace:    kube-system
Priority:     0
Node:         ip-192-168-33-189.ap-northeast-1.compute.internal/192.168.33.189
Start Time:   Fri, 08 Jan 2021 07:19:44 +0000
Labels:       app=cluster-autoscaler
              pod-template-hash=bcbc77bc7
Annotations:  kubectl.kubernetes.io/restartedAt: 2021-01-08T05:40:22Z
              kubernetes.io/psp: eks.privileged
              prometheus.io/port: 8085
              prometheus.io/scrape: true
Status:       Running
IP:           192.168.43.50
IPs:
  IP:           192.168.43.50
Controlled By:  ReplicaSet/cluster-autoscaler-bcbc77bc7
Containers:
  cluster-autoscaler:
    Container ID:  docker://2f0a7f6f1f514c0c75c75499020e788886da125fe1c865cebd0647bb3bf95a64
    Image:         k8s.gcr.io/autoscaling/cluster-autoscaler:v1.20.0
    Image ID:      docker-pullable://k8s.gcr.io/autoscaling/cluster-autoscaler@sha256:1c19fa17b29db548d0304e9444adf84e8a6f38ee4c0a12d2ecaf262cb10c0e50
    Port:          <none>
    Host Port:     <none>
    Command:
      ./cluster-autoscaler
      --v=4
      --stderrthreshold=info
      --cloud-provider=aws
      --skip-nodes-with-local-storage=false
      --expander=least-waste
      --aws-use-static-instance-list=true
      --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/EKS-LAB
    State:          Running
      Started:      Fri, 08 Jan 2021 07:22:07 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Fri, 08 Jan 2021 07:19:46 +0000
      Finished:     Fri, 08 Jan 2021 07:22:05 +0000
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     100m
      memory:  300Mi
    Requests:
      cpu:     100m
      memory:  300Mi
    Environment:
      AWS_REGION:                   ap-northeast-1
      AWS_DEFAULT_REGION:           ap-northeast-1
      AWS_ROLE_ARN:                 arn:aws:iam::561333300361:role/eksctl-EKS-LAB-addon-iamserviceaccount-kube-Role1-ZKVBFVVOBNUX
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /etc/ssl/certs/ca-certificates.crt from ssl-certs (ro)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from cluster-autoscaler-token-vkd8b (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  ssl-certs:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/ssl/certs/ca-bundle.crt
    HostPathType:
  cluster-autoscaler-token-vkd8b:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cluster-autoscaler-token-vkd8b
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  ng=console
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age                  From               Message
  ----    ------     ----                 ----               -------
  Normal  Scheduled  3m37s                default-scheduler  Successfully assigned kube-system/cluster-autoscaler-bcbc77bc7-lcsf5 to ip-192-168-33-189.ap-northeast-1.compute.internal
  Normal  Pulling    77s (x2 over 3m37s)  kubelet            Pulling image "k8s.gcr.io/autoscaling/cluster-autoscaler:v1.20.0"
  Normal  Pulled     76s (x2 over 3m36s)  kubelet            Successfully pulled image "k8s.gcr.io/autoscaling/cluster-autoscaler:v1.20.0"
  Normal  Created    76s (x2 over 3m36s)  kubelet            Created container cluster-autoscaler
  Normal  Started    75s (x2 over 3m36s)  kubelet            Started container cluster-autoscaler

Rolling back to a worker node with IMDSv1:

$ kubectl -n kube-system logs deployment.apps/cluster-autoscaler
...
...
I0108 07:15:03.847604       1 reflector.go:255] Listing and watching *v1beta1.PodDisruptionBudget from k8s.io/client-go/informers/factory.go:134
I0108 07:15:03.847633       1 reflector.go:219] Starting reflector *v1.StatefulSet (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:15:03.847640       1 reflector.go:255] Listing and watching *v1.StatefulSet from k8s.io/client-go/informers/factory.go:134
I0108 07:15:03.943862       1 request.go:591] Throttling request took 96.568619ms, request: GET:https://10.100.0.1:443/api/v1/persistentvolumes?limit=500&resourceVersion=0
I0108 07:15:04.243872       1 request.go:591] Throttling request took 396.383321ms, request: GET:https://10.100.0.1:443/api/v1/pods?limit=500&resourceVersion=0
I0108 07:15:07.069368       1 auto_scaling_groups.go:351] Regenerating instance to ASG map for ASGs: [eks-96bb7009-0e0a-3450-075d-3c7ed43c94e6]
I0108 07:15:07.180416       1 auto_scaling.go:199] 1 launch configurations already in cache
I0108 07:15:07.180443       1 auto_scaling_groups.go:136] Registering ASG eks-96bb7009-0e0a-3450-075d-3c7ed43c94e6
I0108 07:15:07.180456       1 aws_manager.go:269] Refreshed ASG list, next refresh after 2021-01-08 07:16:07.180451669 +0000 UTC m=+81.757019680
I0108 07:15:07.180599       1 main.go:279] Registered cleanup signal handler
I0108 07:15:07.180643       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0108 07:15:07.180654       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 4.43µs
I0108 07:15:17.180736       1 static_autoscaler.go:229] Starting main loop
W0108 07:15:17.181232       1 clusterstate.go:436] AcceptableRanges have not been populated yet. Skip checking
I0108 07:15:17.181367       1 filter_out_schedulable.go:65] Filtering out schedulables
I0108 07:15:17.181381       1 filter_out_schedulable.go:132] Filtered out 0 pods using hints
I0108 07:15:17.181390       1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0108 07:15:17.181397       1 filter_out_schedulable.go:171] 0 pods marked as unschedulable can be scheduled.
I0108 07:15:17.181464       1 filter_out_schedulable.go:82] No schedulable pods
I0108 07:15:17.181490       1 static_autoscaler.go:402] No unschedulable pods
I0108 07:15:17.181509       1 static_autoscaler.go:449] Calculating unneeded nodes

@gagvirk

gagvirk commented Jan 19, 2021

Hi contributors @mwielgus @losipiuk @aleksandra-malinowska @bskiba. Since this is blocking EKS clusters from being upgraded to IMDSv2, can this issue be prioritized? I suspect CA does not use token-backed sessions to access IMDS; the CA pod keeps getting OOMKilled and ends up in CrashLoopBackOff. Thank you.

@ellistarn
Contributor

It appears there are multiple symptoms here.

  1. OOMKill
  2. CrashLoop NoCredentialProviders: no valid providers in chain.

My guess is that (1) is a spurious error, but it's difficult to tell. @hans72118, can you follow up with your memory settings? I'll take a look at how IMDSv2 works and what the path forward is here to make sure CAS can use these tokens.

@ellistarn
Contributor

ellistarn commented Mar 22, 2021

It should be possible to skip this logic by using --aws-use-static-instance-list=true

awsUseStaticInstanceList = flag.Bool("aws-use-static-instance-list", false, "Should CA fetch instance types in runtime or use a static list. AWS only")

Alternatively, it should be possible to skip by including the AWS_REGION environment variable:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_util.go#L155
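
For those using the Helm chart from earlier in the thread, a rough sketch of applying both suggestions (placeholders in angle brackets; extraArgs is assumed to be the chart's map of additional CLI flags, so verify against your chart version):

helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=<cluster-name> \
  --set awsRegion=<aws-region> \
  --set extraArgs.aws-use-static-instance-list=true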

@focaaby, it's not clear from your logs or pod description that this wasn't working for you. It looks like CA started up normally and populated all the listers/watchers?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Jun 21, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jul 21, 2021
@ivankovnatsky

ivankovnatsky commented Jan 17, 2022

We are still experiencing this error:

F0117 13:40:50.048753       1 aws_cloud_provider.go:369] Failed to generate AWS EC2 Instance Types: NoCredentialProviders: no valid providers in chain. Deprecated.
        For verbose messaging see aws.Config.CredentialsChainVerboseErrors
goroutine 59 [running]:
k8s.io/klog/v2.stacks(0x1)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1038 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x611e4e0, 0x3, 0x0, 0xc0001e45b0, 0x0, {0x4d2e584, 0x1}, 0xc000ddd9a0, 0x0)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:987 +0x5fd

eks: v1.19.13-eks-8df270
ca: 1.21.1 (also tried 1.22.2 and 1.23.0)

Environment:
  AWS_REGION:          eu-central-1
  AWS_DEFAULT_REGION:  eu-central-1

Update: using this Terraform snippet in the module (https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/modules/self-managed-node-group/main.tf#L186) works fine:

      # Enforce IMDSv2 only (token required)
      metadata_http_tokens                 = "required"
      # Allow one extra network hop so containers can still reach IMDS
      metadata_http_put_response_hop_limit = 2

@tudormi

tudormi commented May 30, 2022

We found this issue after setting HttpTokens to required on the EC2 instances for our k8s nodes.
Suddenly cluster-autoscaler was not running on the new nodes with that metadata setting, but it was still running on the old ones without it.

We found this note here:

If you want to run it on instances with IMDSv1 disabled make sure your EC2 launch configuration has the setting Metadata response hop limit set to 2. Otherwise, the /latest/api/token call will timeout and result in an error. See AWS docs here for further information.

So we updated HttpPutResponseHopLimit to 2 and it is working now.
Useful link to do this through aws cli: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-options.html
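
For reference, the equivalent AWS CLI call looks roughly like this (the instance ID is a placeholder):

# Require IMDSv2 tokens and allow one extra hop so pods can reach IMDS
aws ec2 modify-instance-metadata-options \
    --instance-id i-0123456789abcdef0 \
    --http-tokens required \
    --http-put-response-hop-limit 2 \
    --http-endpoint enabled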

@bitva77

bitva77 commented Sep 22, 2022

We just ran into this after disabling IMDSv1, but we had to set HttpPutResponseHopLimit to 3 for whatever reason.

@dgard1981

This doesn't seem to work with HttpPutResponseHopLimit set to 2 or 3. I even set it to 64 (the max), just to be sure.

F0118 16:40:34.650058       1 aws_cloud_provider.go:369] Failed to generate AWS EC2 Instance Types: NoCredentialProviders: no valid providers in chain. Deprecated.
	For verbose messaging see aws.Config.CredentialsChainVerboseErrors
