
EbsCsiDriverAddon: Waiter has timed out #894

Open
dedrone-fb opened this issue Dec 21, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@dedrone-fb

Describe the bug

We are trying to deploy an EKS Blueprint with the EBS CSI AddOn. We reproducibly run into this error message:

10:56:04 AM | CREATE_FAILED        | Custom::AWSCDK-EKS-KubernetesResource | eks-stack/ebs-csi-...e/Resource/Default
Received response status [FAILED] from custom resource. Message returned: TimeoutError: {"state":"TIMEOUT","reason":"Waiter has timed out"}
at checkExceptions (/var/runtime/node_modules/@aws-sdk/util-waiter/dist-cjs/waiter.js:26:30)
at waitUntilFunctionActiveV2 (/var/runtime/node_modules/@aws-sdk/client-lambda/dist-cjs/waiters/waitForFunctionActiveV2.js:52:46)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async defaultInvokeFunction (/var/task/outbound.js:1:875)
at async invokeUserFunction (/var/task/framework.js:1:2192)
at async onEvent (/var/task/framework.js:1:369)
at async Runtime.handler (/var/task/cfn-response.js:1:1573) (RequestId: 3b206e15-a3df-4a4e-b222-b58893c77dd5)
10:56:06 AM | UPDATE_ROLLBACK_IN_P | AWS::CloudFormation::Stack            | eks-stack
The following resource(s) failed to create: [eksstackAwsAuthmanifest65E07027, eksstackebscsicontrollersasamanifestebscsicontrollersasaServiceAccountResource71971128].
10:56:14 AM | UPDATE_ROLLBACK_COMP | AWS::CloudFormation::Stack            | eks-stack

Expected Behavior

The EBS CSI AddOn is successfully added to the newly created cluster.

Current Behavior

Rollback initiated

Reproduction Steps

        const addOns = [
            new eksblueprints.addons.CalicoOperatorAddOn(),
            new eksblueprints.addons.MetricsServerAddOn(),
            new eksblueprints.addons.ClusterAutoScalerAddOn(),
            new eksblueprints.addons.AwsLoadBalancerControllerAddOn(),
            new eksblueprints.addons.VpcCniAddOn(),
            new eksblueprints.addons.CoreDnsAddOn(),
            new eksblueprints.addons.KubeProxyAddOn(),
            new eksblueprints.addons.EbsCsiDriverAddOn()
        ];

        const clusterProvider = new eksblueprints.MngClusterProvider({
            version: props.version,
            minSize: props.minSize,
            maxSize: props.maxSize,
            instanceTypes: props.instanceTypes.map(s => new InstanceType(s)),
        });

        const eksBlueprint = eksblueprints.EksBlueprint.builder()
            .account(props.env!.account!)
            .region(props.env!.region!)
            .addOns(...addOns)
            .version(props.version)
            .useDefaultSecretEncryption(props.useDefaultSecretEncryption)
            .clusterProvider(clusterProvider)
            .name(props.clusterName)
            .build(app, id);

        this.blueprint = eksBlueprint;
        this.cluster = eksBlueprint.getClusterInfo().cluster;

Possible Solution

No response

Additional Information/Context

Looked at and tried aws-samples/stable-diffusion-on-eks#5, but no luck.

CDK CLI Version

2.115.0 (build 58027ee)

EKS Blueprints Version

1.13.1

Node.js Version

v18.16.0

Environment details (OS name and version, etc.)

Ubuntu Linux 22.04

Other information

No response

@dedrone-fb dedrone-fb added the bug Something isn't working label Dec 21, 2023
@dedrone-fb dedrone-fb changed the title (module name): (short issue description) EbsCsiDriverAddon: Waiter has timed out Dec 21, 2023
@shapirov103
Collaborator

shapirov103 commented Dec 21, 2023

@dedrone-fb Do you have worker nodes running? The reason I ask is that it's unclear what EC2 instance types you configured for your cluster and whether they were provisioned.

You can run cdk deploy <your-blueprint-name> --no-rollback to check the cluster state if provisioning fails; it prevents rollback and cleanup of resources.

Another possible reason is insufficient capacity. I assume cluster autoscaler should address it (it is in your list) but it may take longer than expected to roll out a new node and hence result in the timeout.

Please also share your props object: minSize, cluster version.

@shapirov103
Collaborator

The following blueprint provisioned fine:

const addOns = [
    new blueprints.addons.CalicoOperatorAddOn(),
    new blueprints.addons.MetricsServerAddOn(),
    new blueprints.addons.ClusterAutoScalerAddOn(),
    new blueprints.addons.AwsLoadBalancerControllerAddOn(),
    new blueprints.addons.VpcCniAddOn(),
    new blueprints.addons.CoreDnsAddOn(),
    new blueprints.addons.KubeProxyAddOn(),
    new blueprints.addons.EbsCsiDriverAddOn()
];

const clusterProvider = new blueprints.MngClusterProvider();

const eksBlueprint = blueprints.EksBlueprint.builder()
    .addOns(...addOns)
    .region("us-east-1")
    .version("auto")
    .useDefaultSecretEncryption(true)
    .clusterProvider(clusterProvider)
    .name("reprod-case-ebs")
    .build(app, "reprod-case-ebs");

@dedrone-fb
Author

I'd like to put this on hold. We currently suspect some kind of permission or quota problems. Removing any two addons seems to fix the problem (we tried with EBS CSI but without Calico and Metrics and it worked).

Will report back

@hshepherd

hshepherd commented Jan 29, 2024

I am seeing a similar issue with the following config

        const addOns: Array<blueprints.ClusterAddOn> = [
            new blueprints.addons.SecretsStoreAddOn({
                rotationPollInterval: '120s',
                syncSecrets: true
            }),
            argoAddon,
            new blueprints.addons.CalicoOperatorAddOn(),
            new blueprints.addons.MetricsServerAddOn(),
            new blueprints.addons.ClusterAutoScalerAddOn(),
            new blueprints.addons.AwsLoadBalancerControllerAddOn(),
            new blueprints.addons.VpcCniAddOn(),
            new blueprints.addons.CoreDnsAddOn(),
            new blueprints.addons.KubeProxyAddOn(),
            new blueprints.addons.OpaGatekeeperAddOn(),
        ];

        const stack = blueprints.EksBlueprint.builder()
            .account(account)
            .region(region)
            .version('auto')
            .addOns(...addOns)
            .useDefaultSecretEncryption(true)
            .enableControlPlaneLogTypes(blueprints.ControlPlaneLogType.AUDIT)
            .enableGitOps(blueprints.GitOpsMode.APPLICATION)
            .teams(new TeamPlatform(props.gitops.platformTeamUserRoleArn), new TeamDeveloper(props.gitops.developerTeamUserRoleArn))
            .build(app, id + '-eks-bps', { env: props.env });

Is this possibly related to aws/aws-cdk#26838?

Update: Also tried without GitOps enabled and seeing the same issue.

Update: I can see the following error in CloudTrail around the time of the cdk deploy failure:

    "eventTime": "2024-01-30T16:19:34Z",
    "eventSource": "iam.amazonaws.com",
    "eventName": "GetRolePolicy",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "cloudformation.amazonaws.com",
    "userAgent": "cloudformation.amazonaws.com",
    "errorCode": "NoSuchEntityException",
    "errorMessage": "The role policy with name ProviderframeworkonEventServiceRoleDefaultPolicy48CD2133 cannot be found.",
    "requestParameters": {
        "roleName": "workloadsdevelopmentworkl-ProviderframeworkonEventS-ERHAR0IF0eVi",
        "policyName": "ProviderframeworkonEventServiceRoleDefaultPolicy48CD2133"
    },

@hshepherd

hshepherd commented Feb 1, 2024

Updating as I've found the root cause for our timeout:

For us at least, this appears to be caused by Lambda Concurrency Limits in a new AWS account. The underlying EKS construct spins up many Lambdas as part of the KubectlProvider implementation. As CDK does the deploy, it waits for these lambdas to apply kubectl commands in the new cluster.

In our case, a new AWS account had a Concurrent Executions limit of 10 -- which is not high enough for the blueprint deploy and resulted in these Lambda requests being throttled (i.e. canceled with no error).

This problem is probably exacerbated if you are installing multiple Addons.
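For anyone hitting the same symptom, the account-level limit described above can be inspected with the AWS CLI before deploying (a sketch, not part of the blueprint; requires configured credentials and must be run in the deployment region):

```shell
# Check the account-level Lambda concurrency limit and the unreserved capacity
# available to the kubectl-provider lambdas. A brand-new account may show a
# limit as low as 10, as described above.
aws lambda get-account-settings \
  --query 'AccountLimit.[ConcurrentExecutions,UnreservedConcurrentExecutions]'

# A limit increase is requested through Service Quotas (console or CLI);
# the Lambda quota to raise is "Concurrent executions".
```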

This does not appear to be an issue with cdk-eks-blueprints, but I am posting here for awareness.
FYI @shapirov103

@shapirov103
Copy link
Collaborator

@hshepherd thank you for this insight; it would have been very hard for us to reproduce.
The custom resource lambda is created to use all unreserved capacity. Hypothetically, if all addons are executed serially, the issue should be mitigated as long as some concurrency is available (kubectl commands will go one at a time, though other lambda functions may still interfere).
You can try defining strictly ordered behavior for all addons, e.g.

import "reflect-metadata";

Reflect.defineMetadata("ordered", true, addons.EbsCsiDriverAddOn); // repeat for all addons

This is more of an experimental feature tbh.
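The intuition behind serial ("ordered") execution can be sketched abstractly: serial application only ever holds one kubectl invocation at a time, while parallel application can demand one concurrency slot per addon simultaneously. The numbers below are illustrative, not measured blueprint behavior:

```typescript
// Illustrative sketch (not blueprint internals): peak Lambda concurrency
// demanded by applying N addons serially vs. in parallel.
function peakConcurrency(addonCount: number, ordered: boolean): number {
  // Ordered application runs one kubectl invocation at a time;
  // parallel application can hold one invocation per addon at once.
  return ordered ? 1 : addonCount;
}

console.log(peakConcurrency(8, false)); // 8 slots needed -- close to a 10-slot new-account limit
console.log(peakConcurrency(8, true));  // 1 slot needed
```

This is why, with a 10-execution cap, trimming addons or forcing ordered execution can both make the deploy fit under the limit.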
