
CA does not work properly while using AWS EC2 IMDSv2 only in EKS #3592

Closed
hans72118 opened this issue Oct 8, 2020 · 14 comments · Fixed by #4127
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@hans72118

AWS EKS recently added support for EC2 Instance Metadata Service v2 (IMDSv2).

In my test environment, I created a worker node with IMDSv2 only, which requires token-backed sessions to access IMDS.

Under this condition, however, CA cannot unmarshal the instance identity document:

I1008 18:57:01.160950       1 aws_util.go:150] fetching http://169.254.169.254/latest/dynamic/instance-identity/document
..........
W1008 18:57:01.760556       1 aws_util.go:166] Error unmarshalling http://169.254.169.254/latest/dynamic/instance-identity/document, skip...

Checking the CA pod, it keeps getting OOMKilled and ends up in CrashLoopBackOff.

# kubectl get pod -n kube-system
NAME                                  READY   STATUS             RESTARTS   AGE
cluster-autoscaler-5b5489859f-2pkdt   0/1     CrashLoopBackOff   6          13m
# kubectl describe pod cluster-autoscaler-5b5489859f-2pkdt -n kube-system
Name:           cluster-autoscaler-5b5489859f-2pkdt
Namespace:      kube-system
Priority:       0
Node:           ip-172-31-23-13.ap-northeast-1.compute.internal/172.31.23.13
Start Time:     Thu, 08 Oct 2020 19:22:15 +0000
Labels:         app=cluster-autoscaler
                pod-template-hash=5b5489859f
Annotations:    kubernetes.io/psp: eks.privileged
                prometheus.io/port: 8085
                prometheus.io/scrape: true
Status:         Running
IP:             172.31.20.73
IPs:            <none>
Controlled By:  ReplicaSet/cluster-autoscaler-5b5489859f
Containers:
  cluster-autoscaler:
    Container ID:  docker://8cea864df872af960650f9f01061ca52e62855f680306238f75a12cbc798f8a5
    Image:         k8s.gcr.io/autoscaling/cluster-autoscaler:v1.15.7
    Image ID:      docker-pullable://k8s.gcr.io/autoscaling/cluster-autoscaler@sha256:6641a69b4ea5f911ccbb11b75b2675261d90bf169f612c9e960f60036336d664
    Port:          <none>
    Host Port:     <none>
    Command:
      ./cluster-autoscaler
      --v=4
      --stderrthreshold=info
      --cloud-provider=aws
      --skip-nodes-with-local-storage=false
      --expander=least-waste
      --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/LAB-EKS-15
      --balance-similar-node-groups
      --skip-nodes-with-system-pods=false
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 08 Oct 2020 19:32:27 +0000
      Finished:     Thu, 08 Oct 2020 19:33:06 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 08 Oct 2020 19:29:07 +0000
      Finished:     Thu, 08 Oct 2020 19:29:46 +0000

Switching back to IMDSv1, it works without issue:

I1008 19:05:20.256839       1 aws_util.go:150] fetching http://169.254.169.254/latest/dynamic/instance-identity/document
I1008 19:05:38.256216       1 aws_cloud_provider.go:380] Successfully load 354 EC2 Instance Types [u-9tb1 m5n.8xlarge z1d.12xlarge m5dn.12xlarge m5.12xlarge c5d.4xlarge c5d.xlarge r6g.2xlarge m4.4xlarge c5.24xlarge r3.8xlarge i3en.24xlarge i3.4xlarge a1.xlarge r5ad.large r5dn.metal x1e u-9tb1.metal m5dn.16xlarge r5n.4xlarge t3.small c5n.2xlarge m5ad.large t3.micro c5d.2xlarge c1.xlarge r5a.24xlarge t3.large r6g.metal r5a.xlarge c6g.xlarge i3en.metal g4dn.xlarge r6g.16xlarge c3.large i2.4xlarge r5d.xlarge t4g.small t3a.xlarge c3.8xlarge m5d.4xlarge r5ad.xlarge h1 c5d.18xlarge u-6tb1.metal p2.8xlarge m6g.2xlarge c5d.metal i3en.2xlarge 
........
I1008 19:05:44.609556       1 auto_scaling_groups.go:354] Regenerating instance to ASG map for ASGs: []
I1008 19:05:44.609579       1 aws_manager.go:266] Refreshed ASG list, next refresh after 2020-10-08 19:06:44.609574794 +0000 UTC m=+102.445561263
I1008 19:05:44.609801       1 main.go:271] Registered cleanup signal handler
I1008 19:05:44.610023       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I1008 19:05:44.610039       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 1.791µs
I1008 19:05:54.610015       1 static_autoscaler.go:187] Starting main loop
I1008 19:05:54.610119       1 utils.go:622] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I1008 19:05:54.610130       1 filter_out_schedulable.go:63] Filtering out schedulables
I1008 19:05:54.610168       1 filter_out_schedulable.go:80] No schedulable pods
I1008 19:05:54.610188       1 static_autoscaler.go:334] No unschedulable pods
I1008 19:05:54.610203       1 static_autoscaler.go:381] Calculating unneeded nodes

I suspect CA does not use token-backed sessions when accessing IMDS.
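For reference, IMDSv2 requires obtaining a session token before any metadata read. A minimal shell sketch of the documented flow against the same endpoint that CA fetches in aws_util.go:

# IMDSv2: first obtain a session token via PUT with a TTL header
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

# ...then pass the token on every metadata request
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/dynamic/instance-identity/document"

With IMDSv2 enforced, a plain GET without the token header is rejected, which would explain the unmarshalling failure above.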

@brenwhyte

Got hit with this too, EKS 1.17

@MattiasPernhult

MattiasPernhult commented Nov 6, 2020

We worked around this issue by injecting the AWS_REGION environment variable into the cluster-autoscaler container. Obviously not an ideal solution (the real fix would be proper IMDSv2 support), but it works.
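One way to inject it (deployment name, namespace, and region here are examples; adjust for your cluster):

kubectl -n kube-system set env deployment/cluster-autoscaler AWS_REGION=eu-west-1

The same variable can of course also be set directly in the Deployment manifest or through the Helm chart's values.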

@bryankaraffa

bryankaraffa commented Nov 30, 2020

We worked around this issue by injecting the AWS_REGION environment variable into the cluster-autoscaler container. Obviously not an ideal solution (the real fix would be proper IMDSv2 support), but it works.

I was not able to work around this issue by injecting AWS_REGION or AWS_DEFAULT_REGION into the aws-cluster-autoscaler container. With the v1 metadata service [token optional], cluster-autoscaler does not error and has no issues.

Error log / behavior with IMDSv2 [token required]:

I1130 21:13:10.946968       1 aws_cloud_provider.go:371] Successfully load 392 EC2 Instance Types [...truncated...]
E1130 21:13:14.176281       1 aws_manager.go:262] Failed to regenerate ASG cache: cannot autodiscover ASGs: NoCredentialProviders: no valid providers in chain. Deprecated.
        For verbose messaging see aws.Config.CredentialsChainVerboseErrors
F1130 21:13:14.176302       1 aws_cloud_provider.go:376] Failed to create AWS Manager: cannot autodiscover ASGs: NoCredentialProviders: no valid providers in chain. Deprecated.
        For verbose messaging see aws.Config.CredentialsChainVerboseErrors

Here's our cluster-autoscaler Helm release [chart v9.1.0, setting awsRegion and autoDiscovery.clusterName], as well as our attempt to set the env variable:

resource "helm_release" "cluster_autoscaler" {
  depends_on = [
    module.eks, # Wait for cluster to be ready
  ]

  repository       = "https://kubernetes.github.io/autoscaler"
  chart            = "cluster-autoscaler"
  version          = "9.1.0"
  name             = "cluster-autoscaler"
  namespace        = "kube-system"

  values = [
    # Values set from terraform outputs
    <<EOL
awsRegion: ${module.eks.cluster_region}
autoDiscovery:
  clusterName: ${module.eks.cluster_name}
EOL
    ,
    # Workaround issue with IMDSv2
    # Inject AWS_DEFAULT_REGION into environment
    # https://github.com/kubernetes/autoscaler/issues/3592
    <<EOL
extraEnv:
  AWS_DEFAULT_REGION: ${module.eks.cluster_region}
EOL
    ,
  ] # End helm_release.values[]
}

and the resulting pod description (AWS_REGION is already set by the chart):

Name:         cluster-autoscaler-aws-cluster-autoscaler-c4b7bdd58-cm2d2
Namespace:    kube-system
Priority:     0
Node:         ip-10-100-1-57.us-west-2.compute.internal/10.100.1.57
Start Time:   Mon, 30 Nov 2020 13:06:38 -0800
Labels:       app.kubernetes.io/instance=cluster-autoscaler
              app.kubernetes.io/name=aws-cluster-autoscaler
              pod-template-hash=c4b7bdd58
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Running
IP:           10.100.0.110
IPs:
  IP:           10.100.0.110
Controlled By:  ReplicaSet/cluster-autoscaler-aws-cluster-autoscaler-c4b7bdd58
Containers:
  aws-cluster-autoscaler:
    Container ID:  docker://f91c44b21712ebcf385dfd687c5631dd44ceeb76d25afb765e6b9a5cfc43f96c
    Image:         us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.18.1
    Image ID:      docker-pullable://us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler@sha256:1f5b11617389b8e4ce15eb45fdbbfd4321daeb63c234d46533449ab780b6ca9a
    Port:          8085/TCP
    Host Port:     0/TCP
    Command:
      ./cluster-autoscaler
      --cloud-provider=aws
      --namespace=kube-system
      --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/kg-cet-917-staging-us-west-2
      --logtostderr=true
      --stderrthreshold=info
      --v=4
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Mon, 30 Nov 2020 13:10:10 -0800
      Finished:     Mon, 30 Nov 2020 13:10:16 -0800
    Ready:          False
    Restart Count:  5
    Liveness:       http-get http://:8085/health-check delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      AWS_REGION:  us-west-2
      AWS_DEFAULT_REGION:  us-west-2
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from cluster-autoscaler-aws-cluster-autoscaler-token-dlxmc (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  cluster-autoscaler-aws-cluster-autoscaler-token-dlxmc:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cluster-autoscaler-aws-cluster-autoscaler-token-dlxmc
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  4m43s                  default-scheduler  Successfully assigned kube-system/cluster-autoscaler-aws-cluster-autoscaler-c4b7bdd58-cm2d2 to ip-10-100-1-57.us-west-2.compute.internal
  Normal   Pulling    4m42s                  kubelet            Pulling image "us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.18.1"
  Normal   Pulled     4m40s                  kubelet            Successfully pulled image "us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.18.1"
  Warning  BackOff    2m52s (x9 over 4m10s)  kubelet            Back-off restarting failed container
  Normal   Created    2m38s (x5 over 4m40s)  kubelet            Created container aws-cluster-autoscaler
  Normal   Started    2m38s (x5 over 4m39s)  kubelet            Started container aws-cluster-autoscaler
  Normal   Pulled     2m38s (x4 over 4m16s)  kubelet            Container image "us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.18.1" already present on machine

kubectl version:

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.4", GitCommit:"d360454c9bcd1634cf4cc52d1867af5491dc9c5f", GitTreeState:"clean", BuildDate:"2020-11-12T01:09:16Z", GoVersion:"go1.15.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.12-eks-7684af", GitCommit:"7684af4ac41370dd109ac13817023cb8063e3d45", GitTreeState:"clean", BuildDate:"2020-10-20T22:57:40Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

helm version:

version.BuildInfo{Version:"v3.4.1", GitCommit:"c4e74854886b2efe3321e185578e6db9be0a6e29", GitTreeState:"dirty", GoVersion:"go1.15.4"}

@focaaby

focaaby commented Jan 8, 2021

I was not able to work around this issue by injecting the AWS_REGION or AWS_DEFAULT_REGION environment variables into aws-cluster-autoscaler either.

Also, there are other issues (#3276, #3216) related to loading the instance type list from the pricing API. I therefore upgraded to the latest version, 1.20, and added the --aws-use-static-instance-list=true flag. However, the pod still keeps terminating with exit code 255 and ends up in CrashLoopBackOff.

Here are the error log messages with IMDSv2 [token required]:

$ kubectl -n kube-system logs deployment.apps/cluster-autoscaler
...
...
I0108 07:20:04.590454       1 reflector.go:255] Listing and watching *v1.Node from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:246
I0108 07:20:04.944164       1 cloud_provider_builder.go:29] Building aws cloud provider.
W0108 07:20:04.944198       1 aws_cloud_provider.go:349] Use static EC2 Instance Types and list could be outdated. Last update time: 2019-10-14
I0108 07:20:04.945035       1 reflector.go:219] Starting reflector *v1.PersistentVolumeClaim (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.945051       1 reflector.go:255] Listing and watching *v1.PersistentVolumeClaim from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.945402       1 reflector.go:219] Starting reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.945415       1 reflector.go:255] Listing and watching *v1.Pod from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.945683       1 reflector.go:219] Starting reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.945695       1 reflector.go:255] Listing and watching *v1.Node from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.945952       1 reflector.go:219] Starting reflector *v1.StatefulSet (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.945964       1 reflector.go:255] Listing and watching *v1.StatefulSet from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.946231       1 reflector.go:219] Starting reflector *v1.PersistentVolume (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.946242       1 reflector.go:255] Listing and watching *v1.PersistentVolume from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.946531       1 reflector.go:219] Starting reflector *v1.StorageClass (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.946542       1 reflector.go:255] Listing and watching *v1.StorageClass from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.946838       1 reflector.go:219] Starting reflector *v1.CSINode (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:04.946850       1 reflector.go:255] Listing and watching *v1.CSINode from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.039201       1 reflector.go:219] Starting reflector *v1beta1.PodDisruptionBudget (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.039225       1 reflector.go:255] Listing and watching *v1beta1.PodDisruptionBudget from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.539276       1 reflector.go:219] Starting reflector *v1.Service (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.539475       1 reflector.go:255] Listing and watching *v1.Service from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.543333       1 reflector.go:219] Starting reflector *v1.ReplicationController (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.543349       1 reflector.go:255] Listing and watching *v1.ReplicationController from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.543835       1 reflector.go:219] Starting reflector *v1.ReplicaSet (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:20:05.543850       1 reflector.go:255] Listing and watching *v1.ReplicaSet from k8s.io/client-go/informers/factory.go:134
$ kubectl get po -A -w | grep "cluster"
kube-system         cluster-autoscaler-bcbc77bc7-lcsf5              1/1     Running   0          2m7s
kube-system         cluster-autoscaler-bcbc77bc7-lcsf5              0/1     Error     0          2m21s
kube-system         cluster-autoscaler-bcbc77bc7-lcsf5              1/1     Running   1          2m23s
$ kubectl -n kube-system describe po cluster-autoscaler-bcbc77bc7-lcsf5
Name:         cluster-autoscaler-bcbc77bc7-lcsf5
Namespace:    kube-system
Priority:     0
Node:         ip-192-168-33-189.ap-northeast-1.compute.internal/192.168.33.189
Start Time:   Fri, 08 Jan 2021 07:19:44 +0000
Labels:       app=cluster-autoscaler
              pod-template-hash=bcbc77bc7
Annotations:  kubectl.kubernetes.io/restartedAt: 2021-01-08T05:40:22Z
              kubernetes.io/psp: eks.privileged
              prometheus.io/port: 8085
              prometheus.io/scrape: true
Status:       Running
IP:           192.168.43.50
IPs:
  IP:           192.168.43.50
Controlled By:  ReplicaSet/cluster-autoscaler-bcbc77bc7
Containers:
  cluster-autoscaler:
    Container ID:  docker://2f0a7f6f1f514c0c75c75499020e788886da125fe1c865cebd0647bb3bf95a64
    Image:         k8s.gcr.io/autoscaling/cluster-autoscaler:v1.20.0
    Image ID:      docker-pullable://k8s.gcr.io/autoscaling/cluster-autoscaler@sha256:1c19fa17b29db548d0304e9444adf84e8a6f38ee4c0a12d2ecaf262cb10c0e50
    Port:          <none>
    Host Port:     <none>
    Command:
      ./cluster-autoscaler
      --v=4
      --stderrthreshold=info
      --cloud-provider=aws
      --skip-nodes-with-local-storage=false
      --expander=least-waste
      --aws-use-static-instance-list=true
      --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/EKS-LAB
    State:          Running
      Started:      Fri, 08 Jan 2021 07:22:07 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Fri, 08 Jan 2021 07:19:46 +0000
      Finished:     Fri, 08 Jan 2021 07:22:05 +0000
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     100m
      memory:  300Mi
    Requests:
      cpu:     100m
      memory:  300Mi
    Environment:
      AWS_REGION:                   ap-northeast-1
      AWS_DEFAULT_REGION:           ap-northeast-1
      AWS_ROLE_ARN:                 arn:aws:iam::561333300361:role/eksctl-EKS-LAB-addon-iamserviceaccount-kube-Role1-ZKVBFVVOBNUX
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /etc/ssl/certs/ca-certificates.crt from ssl-certs (ro)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from cluster-autoscaler-token-vkd8b (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  ssl-certs:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/ssl/certs/ca-bundle.crt
    HostPathType:
  cluster-autoscaler-token-vkd8b:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cluster-autoscaler-token-vkd8b
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  ng=console
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age                  From               Message
  ----    ------     ----                 ----               -------
  Normal  Scheduled  3m37s                default-scheduler  Successfully assigned kube-system/cluster-autoscaler-bcbc77bc7-lcsf5 to ip-192-168-33-189.ap-northeast-1.compute.internal
  Normal  Pulling    77s (x2 over 3m37s)  kubelet            Pulling image "k8s.gcr.io/autoscaling/cluster-autoscaler:v1.20.0"
  Normal  Pulled     76s (x2 over 3m36s)  kubelet            Successfully pulled image "k8s.gcr.io/autoscaling/cluster-autoscaler:v1.20.0"
  Normal  Created    76s (x2 over 3m36s)  kubelet            Created container cluster-autoscaler
  Normal  Started    75s (x2 over 3m36s)  kubelet            Started container cluster-autoscaler

Rolling back to a worker node with IMDSv1:

$ kubectl -n kube-system logs deployment.apps/cluster-autoscaler
...
...
I0108 07:15:03.847604       1 reflector.go:255] Listing and watching *v1beta1.PodDisruptionBudget from k8s.io/client-go/informers/factory.go:134
I0108 07:15:03.847633       1 reflector.go:219] Starting reflector *v1.StatefulSet (0s) from k8s.io/client-go/informers/factory.go:134
I0108 07:15:03.847640       1 reflector.go:255] Listing and watching *v1.StatefulSet from k8s.io/client-go/informers/factory.go:134
I0108 07:15:03.943862       1 request.go:591] Throttling request took 96.568619ms, request: GET:https://10.100.0.1:443/api/v1/persistentvolumes?limit=500&resourceVersion=0
I0108 07:15:04.243872       1 request.go:591] Throttling request took 396.383321ms, request: GET:https://10.100.0.1:443/api/v1/pods?limit=500&resourceVersion=0
I0108 07:15:07.069368       1 auto_scaling_groups.go:351] Regenerating instance to ASG map for ASGs: [eks-96bb7009-0e0a-3450-075d-3c7ed43c94e6]
I0108 07:15:07.180416       1 auto_scaling.go:199] 1 launch configurations already in cache
I0108 07:15:07.180443       1 auto_scaling_groups.go:136] Registering ASG eks-96bb7009-0e0a-3450-075d-3c7ed43c94e6
I0108 07:15:07.180456       1 aws_manager.go:269] Refreshed ASG list, next refresh after 2021-01-08 07:16:07.180451669 +0000 UTC m=+81.757019680
I0108 07:15:07.180599       1 main.go:279] Registered cleanup signal handler
I0108 07:15:07.180643       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0108 07:15:07.180654       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 4.43µs
I0108 07:15:17.180736       1 static_autoscaler.go:229] Starting main loop
W0108 07:15:17.181232       1 clusterstate.go:436] AcceptableRanges have not been populated yet. Skip checking
I0108 07:15:17.181367       1 filter_out_schedulable.go:65] Filtering out schedulables
I0108 07:15:17.181381       1 filter_out_schedulable.go:132] Filtered out 0 pods using hints
I0108 07:15:17.181390       1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0108 07:15:17.181397       1 filter_out_schedulable.go:171] 0 pods marked as unschedulable can be scheduled.
I0108 07:15:17.181464       1 filter_out_schedulable.go:82] No schedulable pods
I0108 07:15:17.181490       1 static_autoscaler.go:402] No unschedulable pods
I0108 07:15:17.181509       1 static_autoscaler.go:449] Calculating unneeded nodes

@gagvirk

gagvirk commented Jan 19, 2021

Hi contributors @mwielgus @losipiuk @aleksandra-malinowska @bskiba. Since this is blocking EKS clusters from being upgraded to IMDSv2, can this issue be prioritized? I suspect CA does not use token-backed sessions to access IMDS; the CA pod keeps getting OOMKilled and ends up in CrashLoopBackOff. Thank you.

@ellistarn
Contributor

It appears there are multiple symptoms here.

  1. OOMKill
  2. CrashLoop NoCredentialProviders: no valid providers in chain.

My guess is that (1) is a spurious error, but it's difficult to tell. @hans72118, can you follow up with your memory settings? I'll take a look at how IMDSv2 works and what the path forward is here to make sure CAS can use these tokens.

@ellistarn
Contributor

ellistarn commented Mar 22, 2021

It should be possible to skip this logic by using --aws-use-static-instance-list=true

awsUseStaticInstanceList = flag.Bool("aws-use-static-instance-list", false, "Should CA fetch instance types in runtime or use a static list. AWS only")

Alternatively, it should be possible to skip by including the AWS_REGION environment variable:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_util.go#L155
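
For those using the Helm chart from earlier in the thread, a rough sketch of applying both suggestions (placeholders in angle brackets; extraArgs is assumed to be the chart's map of additional CLI flags, so verify against your chart version):

helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=<cluster-name> \
  --set awsRegion=<aws-region> \
  --set extraArgs.aws-use-static-instance-list=true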

@focaaby, it's not clear from your logs or pod description that this wasn't working for you. It looks like CA started up normally and populated all the listers/watchers?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Jun 21, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jul 21, 2021
@ivankovnatsky

ivankovnatsky commented Jan 17, 2022

We are still experiencing this error:

F0117 13:40:50.048753       1 aws_cloud_provider.go:369] Failed to generate AWS EC2 Instance Types: NoCredentialProviders: no valid providers in chain. Deprecated.
        For verbose messaging see aws.Config.CredentialsChainVerboseErrors
goroutine 59 [running]:
k8s.io/klog/v2.stacks(0x1)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1038 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x611e4e0, 0x3, 0x0, 0xc0001e45b0, 0x0, {0x4d2e584, 0x1}, 0xc000ddd9a0, 0x0)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:987 +0x5fd

eks: v1.19.13-eks-8df270
ca: 1.21.1 (also tried 1.22.2 and 1.23.0)

Environment:
  AWS_REGION:          eu-central-1
  AWS_DEFAULT_REGION:  eu-central-1

Update: using this Terraform snippet in the module (https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/modules/self-managed-node-group/main.tf#L186) works fine:

      # Enforce IMDSv2 only (token required)
      metadata_http_tokens                 = "required"
      # Allow one extra network hop so containers can still reach IMDS
      metadata_http_put_response_hop_limit = 2

@tudormi

tudormi commented May 30, 2022

We found this issue after setting HttpTokens to required on the EC2 instances for our k8s nodes.
Suddenly cluster-autoscaler was not running on the new nodes with that metadata setting, but it was still running on the old ones without it.

We found this note here:

If you want to run it on instances with IMDSv1 disabled make sure your EC2 launch configuration has the setting Metadata response hop limit set to 2. Otherwise, the /latest/api/token call will timeout and result in an error. See AWS docs here for further information.

So we updated HttpPutResponseHopLimit to 2 and it is working now.
Useful link to do this through aws cli: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-options.html
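
For reference, the equivalent AWS CLI call looks roughly like this (the instance ID is a placeholder):

# Require IMDSv2 tokens and allow one extra hop so pods can reach IMDS
aws ec2 modify-instance-metadata-options \
    --instance-id i-0123456789abcdef0 \
    --http-tokens required \
    --http-put-response-hop-limit 2 \
    --http-endpoint enabled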

@bitva77

bitva77 commented Sep 22, 2022

We just ran into this after disabling IMDSv1, but we had to set HttpPutResponseHopLimit to 3 for whatever reason.

@dgard1981

This doesn't seem to work with HttpPutResponseHopLimit set to 2 or 3. I even set it to 64 (the max), just to be sure.

F0118 16:40:34.650058       1 aws_cloud_provider.go:369] Failed to generate AWS EC2 Instance Types: NoCredentialProviders: no valid providers in chain. Deprecated.
	For verbose messaging see aws.Config.CredentialsChainVerboseErrors
