
Issue with cdk blueprint version 1.4 when running ClusterAutoScalerAddOn on Kubernetes 1.23 #531

Closed
bnaydenov opened this issue Nov 3, 2022 · 6 comments
Labels
bug Something isn't working

Comments

@bnaydenov
Contributor

bnaydenov commented Nov 3, 2022

Describe the bug

When cdk blueprint version 1.4 is used and ClusterAutoScalerAddOn is installed on EKS Kubernetes 1.23, the pod for blueprints-addon-cluster-autoscaler-aws-cluster-autoscaler fails to start.

The same setup works without problems on EKS Kubernetes 1.21 and 1.22.

Expected Behavior

When installing ClusterAutoScalerAddOn on EKS Kubernetes 1.23, the pod for blueprints-addon-cluster-autoscaler-aws-cluster-autoscaler is supposed to start without errors.

Current Behavior

The pod blueprints-addon-cluster-autoscaler-aws-cluster-autoscaler starts, and during the startup phase, which takes about 15-20 seconds, it crashes with the following errors:

W1103 20:50:48.673793       1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
W1103 20:50:48.691943       1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
I1103 20:50:48.697788       1 cloud_provider_builder.go:29] Building aws cloud provider.
I1103 20:50:48.697953       1 reflector.go:219] Starting reflector *v1.CSIDriver (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698079       1 reflector.go:255] Listing and watching *v1.CSIDriver from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698147       1 reflector.go:219] Starting reflector *v1.CSINode (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698298       1 reflector.go:255] Listing and watching *v1.CSINode from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698382       1 reflector.go:219] Starting reflector *v1.Namespace (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698397       1 reflector.go:255] Listing and watching *v1.Namespace from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698482       1 reflector.go:219] Starting reflector *v1.StatefulSet (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698490       1 reflector.go:255] Listing and watching *v1.StatefulSet from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698691       1 reflector.go:219] Starting reflector *v1beta1.CSIStorageCapacity (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698707       1 reflector.go:219] Starting reflector *v1.PersistentVolumeClaim (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698193       1 reflector.go:219] Starting reflector *v1.Service (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698713       1 reflector.go:255] Listing and watching *v1.PersistentVolumeClaim from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698717       1 reflector.go:255] Listing and watching *v1.Service from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698090       1 reflector.go:219] Starting reflector *v1.StorageClass (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698939       1 reflector.go:255] Listing and watching *v1.StorageClass from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.699055       1 reflector.go:219] Starting reflector *v1.PodDisruptionBudget (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.699069       1 reflector.go:255] Listing and watching *v1.PodDisruptionBudget from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.699107       1 reflector.go:219] Starting reflector *v1.PersistentVolume (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.699113       1 reflector.go:255] Listing and watching *v1.PersistentVolume from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698707       1 reflector.go:255] Listing and watching *v1beta1.CSIStorageCapacity from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698280       1 reflector.go:219] Starting reflector *v1.ReplicationController (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.699200       1 reflector.go:255] Listing and watching *v1.ReplicationController from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698877       1 reflector.go:219] Starting reflector *v1.Node (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.697981       1 reflector.go:219] Starting reflector *v1.Pod (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.699249       1 reflector.go:255] Listing and watching *v1.Node from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.698994       1 reflector.go:219] Starting reflector *v1.ReplicaSet (0s) from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.699267       1 reflector.go:255] Listing and watching *v1.ReplicaSet from k8s.io/client-go/informers/factory.go:134
I1103 20:50:48.699254       1 reflector.go:255] Listing and watching *v1.Pod from k8s.io/client-go/informers/factory.go:134
F1103 20:50:48.774746       1 aws_cloud_provider.go:369] Failed to generate AWS EC2 Instance Types: UnauthorizedOperation: You are not authorized to perform this operation.
        status code: 403, request id: daf5f899-2e44-4ffe-b66a-232afba2e473
goroutine 48 [running]:
k8s.io/klog/v2.stacks(0x1)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1038 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x611e4e0, 0x3, 0x0, 0xc00039b6c0, 0x0, {0x4d2e584, 0x1}, 0xc000f042a0, 0x0)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:987 +0x5fd
k8s.io/klog/v2.(*loggingT).printf(0x203000, 0x203000, 0x0, {0x0, 0x0}, {0x3cd1b88, 0x2d}, {0xc000f042a0, 0x1, 0x1})
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:753 +0x1c5
k8s.io/klog/v2.Fatalf(...)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1532
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/aws.BuildAWS({{0x3fe0000000000000, 0x3fe0000000000000, 0x8bb2c97000, 0x1176592e000}, 0xa, 0x0, 0x4e200, 0x0, 0x186a0000000000, 0x0, ...}, ...)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go:369 +0x3f7
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/builder.buildCloudProvider({{0x3fe0000000000000, 0x3fe0000000000000, 0x8bb2c97000, 0x1176592e000}, 0xa, 0x0, 0x4e200, 0x0, 0x186a0000000000, 0x0, ...}, ...)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/builder/builder_all.go:77 +0xea

The main error is "Failed to generate AWS EC2 Instance Types: UnauthorizedOperation: You are not authorized to perform this operation."

Reproduction Steps

Just use cdk blueprint 1.4 to spin up a brand new EKS Kubernetes 1.23 cluster with ClusterAutoScalerAddOn.
The cdk deploy step will succeed, but after that the pod blueprints-addon-cluster-autoscaler-aws-cluster-autoscaler will crash and cannot be started.
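
For context, a minimal blueprint along these lines is enough to reproduce it. The account, region, and stack name below are placeholders, and the builder calls follow the documented cdk-eks-blueprints API; this is a sketch, not the exact stack from this report.

import * as cdk from 'aws-cdk-lib';
import { KubernetesVersion } from 'aws-cdk-lib/aws-eks';
import * as blueprints from '@aws-quickstart/eks-blueprints';

const app = new cdk.App();

// Sketch: brand new EKS 1.23 cluster with ClusterAutoScalerAddOn, as in the steps above.
blueprints.EksBlueprint.builder()
    .account('111122223333')              // placeholder account id
    .region('us-east-1')                  // placeholder region
    .version(KubernetesVersion.V1_23)     // the affected Kubernetes version
    .addOns(new blueprints.addons.ClusterAutoScalerAddOn())
    .build(app, 'eks-123-autoscaler-repro');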

Possible Solution

I have found what is wrong and will prepare a PR to fix it.

TL;DR: We need to add the missing ec2:DescribeInstanceTypes action to the cluster-autoscaler IAM statements here:

For more info check here:
kubernetes/autoscaler#3216
particuleio/terraform-kubernetes-addons#1320

I have created a local monkey patch of addon/cluster-autoscaler in

adding the missing policy action "ec2:DescribeInstanceTypes", and everything is working as expected.
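
To make the needed change concrete, here is a rough sketch of the kind of IAM statement involved. The action list mirrors the standard cluster-autoscaler autoscaling permissions plus the missing EC2 action; the variable name and the direct use of aws-cdk-lib/aws-iam are illustrative, not the exact code path inside the add-on.

import * as iam from 'aws-cdk-lib/aws-iam';

// Sketch: permissions granted for cluster-autoscaler, with the previously
// missing "ec2:DescribeInstanceTypes" action included.
const autoscalerStatement = new iam.PolicyStatement({
    effect: iam.Effect.ALLOW,
    actions: [
        'autoscaling:DescribeAutoScalingGroups',
        'autoscaling:DescribeAutoScalingInstances',
        'autoscaling:DescribeLaunchConfigurations',
        'autoscaling:DescribeTags',
        'autoscaling:SetDesiredCapacity',
        'autoscaling:TerminateInstanceInAutoScalingGroup',
        'ec2:DescribeInstanceTypes', // missing action that causes the UnauthorizedOperation 403
    ],
    resources: ['*'],
});

The actual fix in the add-on only needs that one extra action added to the statements it already builds.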

Additional Information/Context

No response

CDK CLI Version

2.50.0 (build 4c11af6)

EKS Blueprints Version

1.4.0

Node.js Version

v16.17.0

Environment details (OS name and version, etc.)

Mac OS Monterey - Version 12.6

Other information

No response

@bnaydenov bnaydenov added the bug Something isn't working label Nov 3, 2022
bnaydenov added a commit to bnaydenov/cdk-eks-blueprints that referenced this issue Nov 3, 2022
shapirov103 added a commit that referenced this issue Nov 3, 2022
@bnaydenov
Contributor Author

@shapirov103 you can close this one

@softmates

@bnaydenov how is it working on 1.22 without the policy "ec2:DescribeInstanceTypes"?

@bnaydenov
Contributor Author

bnaydenov commented Nov 4, 2022

@softmates it is most likely due to the different Helm chart version which the autoscaler add-on uses for different EKS Kubernetes versions. Check this file:

EKS Kubernetes 1.21 and 1.22 use the same Helm chart version, but 1.23 uses a different one:

const versionMap = new Map([
    [KubernetesVersion.V1_23, "9.21.0"],
    [KubernetesVersion.V1_22, "9.13.1"],
    [KubernetesVersion.V1_21, "9.13.1"],
    [KubernetesVersion.V1_20, "9.9.2"],
    [KubernetesVersion.V1_19, "9.4.0"],
    [KubernetesVersion.V1_18, "9.4.0"],
]);
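
As a side note, if someone wants to confirm or override which chart version the add-on deploys, the add-on takes the usual Helm add-on props. This is a sketch, assuming the standard version prop is honored by ClusterAutoScalerAddOn:

// Sketch, assuming the common Helm add-on "version" prop is supported here:
new blueprints.addons.ClusterAutoScalerAddOn({
    version: '9.21.0', // the chart version the map above selects for Kubernetes 1.23
});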

@softmates

Yep, it makes sense. Looking at the spec, I don't see a need for the additional policy "ec2:DescribeInstanceTypes"; refer to the comparison:

https://artifacthub.io/packages/helm/cluster-autoscaler/cluster-autoscaler/9.13.0

spec:
  additionalPolicies:
    node: |
      [
        {"Effect":"Allow","Action":["autoscaling:DescribeAutoScalingGroups","autoscaling:DescribeAutoScalingInstances","autoscaling:DescribeLaunchConfigurations","autoscaling:DescribeTags","autoscaling:SetDesiredCapacity","autoscaling:TerminateInstanceInAutoScalingGroup"],"Resource":"*"}
      ]
...

https://artifacthub.io/packages/helm/cluster-autoscaler/cluster-autoscaler/9.21.0

spec:
  additionalPolicies:
    node: |
      [
        {"Effect":"Allow","Action":["autoscaling:DescribeAutoScalingGroups","autoscaling:DescribeAutoScalingInstances","autoscaling:DescribeLaunchConfigurations","autoscaling:DescribeTags","autoscaling:SetDesiredCapacity","autoscaling:TerminateInstanceInAutoScalingGroup"],"Resource":"*"}
      ]
...


@bnaydenov
Contributor Author

These changes are released in 1.4.1, so I will close the issue.
