
EbsCsiDriverAddon: Waiter has timed out #894

Open
dedrone-fb opened this issue Dec 21, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@dedrone-fb

Describe the bug

We are trying to deploy an EKS Blueprint with the EBS CSI AddOn. We reproducibly run into this error message:

10:56:04 AM | CREATE_FAILED        | Custom::AWSCDK-EKS-KubernetesResource | eks-stack/ebs-csi-...e/Resource/Default
Received response status [FAILED] from custom resource. Message returned: TimeoutError: {"state":"TIMEOUT","reason":"Waiter has timed out"}
at checkExceptions (/var/runtime/node_modules/@aws-sdk/util-waiter/dist-cjs/waiter.js:26:30)
at waitUntilFunctionActiveV2 (/var/runtime/node_modules/@aws-sdk/client-lambda/dist-cjs/waiters/waitForFunctionActiveV2.js:52:46)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async defaultInvokeFunction (/var/task/outbound.js:1:875)
at async invokeUserFunction (/var/task/framework.js:1:2192)
at async onEvent (/var/task/framework.js:1:369)
at async Runtime.handler (/var/task/cfn-response.js:1:1573) (RequestId: 3b206e15-a3df-4a4e-b222-b58893c77dd5)
10:56:06 AM | UPDATE_ROLLBACK_IN_P | AWS::CloudFormation::Stack            | eks-stack
The following resource(s) failed to create: [eksstackAwsAuthmanifest65E07027, eksstackebscsicontrollersasamanifestebscsicontrollersasaServiceAccountResource71971128].
10:56:14 AM | UPDATE_ROLLBACK_COMP | AWS::CloudFormation::Stack            | eks-stack

Expected Behavior

The EBS CSI AddOn is successfully added to the newly created cluster.

Current Behavior

Rollback initiated

Reproduction Steps

        const addOns = [
            new eksblueprints.addons.CalicoOperatorAddOn(),
            new eksblueprints.addons.MetricsServerAddOn(),
            new eksblueprints.addons.ClusterAutoScalerAddOn(),
            new eksblueprints.addons.AwsLoadBalancerControllerAddOn(),
            new eksblueprints.addons.VpcCniAddOn(),
            new eksblueprints.addons.CoreDnsAddOn(),
            new eksblueprints.addons.KubeProxyAddOn(),
            new eksblueprints.addons.EbsCsiDriverAddOn()
        ];

        const clusterProvider = new eksblueprints.MngClusterProvider({
            version: props.version,
            minSize: props.minSize,
            maxSize: props.maxSize,
            instanceTypes: props.instanceTypes.map(s => new InstanceType(s)),
        });

        const eksBlueprint = eksblueprints.EksBlueprint.builder()
            .account(props.env!.account!)
            .region(props.env!.region!)
            .addOns(...addOns)
            .version(props.version)
            .useDefaultSecretEncryption(props.useDefaultSecretEncryption)
            .clusterProvider(clusterProvider)
            .name(props.clusterName)
            .build(app, id);

        this.blueprint = eksBlueprint;
        this.cluster = eksBlueprint.getClusterInfo().cluster;

Possible Solution

No response

Additional Information/Context

Looked at and tried aws-samples/stable-diffusion-on-eks#5, but no luck.

CDK CLI Version

2.115.0 (build 58027ee)

EKS Blueprints Version

1.13.1

Node.js Version

v18.16.0

Environment details (OS name and version, etc.)

Ubuntu Linux 22.04

Other information

No response

@dedrone-fb dedrone-fb added the bug Something isn't working label Dec 21, 2023
@dedrone-fb dedrone-fb changed the title (module name): (short issue description) EbsCsiDriverAddon: Waiter has timed out Dec 21, 2023
@shapirov103
Collaborator

shapirov103 commented Dec 21, 2023

@dedrone-fb Do you have worker nodes running? The reason I ask is that it's unclear what EC2 instance types you configured for your cluster and whether they were provisioned.

You can run cdk deploy <your-blueprint-name> --no-rollback to check the cluster state if provisioning fails; it prevents rollback and cleanup of resources.

Another possible reason is insufficient capacity. I assume cluster autoscaler should address it (it is in your list) but it may take longer than expected to roll out a new node and hence result in the timeout.

Please also share your props object: minSize, cluster version.

@shapirov103
Collaborator

The following blueprint provisioned fine:

const addOns = [
    new blueprints.addons.CalicoOperatorAddOn(),
    new blueprints.addons.MetricsServerAddOn(),
    new blueprints.addons.ClusterAutoScalerAddOn(),
    new blueprints.addons.AwsLoadBalancerControllerAddOn(),
    new blueprints.addons.VpcCniAddOn(),
    new blueprints.addons.CoreDnsAddOn(),
    new blueprints.addons.KubeProxyAddOn(),
    new blueprints.addons.EbsCsiDriverAddOn()
];

const clusterProvider = new blueprints.MngClusterProvider();

const eksBlueprint = blueprints.EksBlueprint.builder()
    .addOns(...addOns)
    .region("us-east-1")
    .version("auto")
    .useDefaultSecretEncryption(true)
    .clusterProvider(clusterProvider)
    .name("reprod-case-ebs")
    .build(app, "reprod-case-ebs");

@dedrone-fb
Author

I'd like to put this on hold. We currently suspect some kind of permission or quota problems. Removing any two addons seems to fix the problem (we tried with EBS CSI but without Calico and Metrics and it worked).

Will report back

@hshepherd

hshepherd commented Jan 29, 2024

I am seeing a similar issue with the following config

        const addOns: Array<blueprints.ClusterAddOn> = [
            new blueprints.addons.SecretsStoreAddOn({
                rotationPollInterval: '120s',
                syncSecrets: true
            }),
            argoAddon,
            new blueprints.addons.CalicoOperatorAddOn(),
            new blueprints.addons.MetricsServerAddOn(),
            new blueprints.addons.ClusterAutoScalerAddOn(),
            new blueprints.addons.AwsLoadBalancerControllerAddOn(),
            new blueprints.addons.VpcCniAddOn(),
            new blueprints.addons.CoreDnsAddOn(),
            new blueprints.addons.KubeProxyAddOn(),
            new blueprints.addons.OpaGatekeeperAddOn(),
        ];

        const stack = blueprints.EksBlueprint.builder()
            .account(account)
            .region(region)
            .version('auto')
            .addOns(...addOns)
            .useDefaultSecretEncryption(true)
            .enableControlPlaneLogTypes(blueprints.ControlPlaneLogType.AUDIT)
            .enableGitOps(blueprints.GitOpsMode.APPLICATION)
            .teams(new TeamPlatform(props.gitops.platformTeamUserRoleArn), new TeamDeveloper(props.gitops.developerTeamUserRoleArn))
            .build(app, id + '-eks-bps', { env: props.env });

Is this possibly related to aws/aws-cdk#26838?

Update: Also tried without GitOps enabled and seeing the same issue.

Update: I can see the following error in CloudTrail around the time of the cdk deploy failure:

    "eventTime": "2024-01-30T16:19:34Z",
    "eventSource": "iam.amazonaws.com",
    "eventName": "GetRolePolicy",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "cloudformation.amazonaws.com",
    "userAgent": "cloudformation.amazonaws.com",
    "errorCode": "NoSuchEntityException",
    "errorMessage": "The role policy with name ProviderframeworkonEventServiceRoleDefaultPolicy48CD2133 cannot be found.",
    "requestParameters": {
        "roleName": "workloadsdevelopmentworkl-ProviderframeworkonEventS-ERHAR0IF0eVi",
        "policyName": "ProviderframeworkonEventServiceRoleDefaultPolicy48CD2133"
    },

@hshepherd

hshepherd commented Feb 1, 2024

Updating as I've found the root cause for our timeout:

For us at least, this appears to be caused by Lambda Concurrency Limits in a new AWS account. The underlying EKS construct spins up many Lambdas as part of the KubectlProvider implementation. As CDK does the deploy, it waits for these lambdas to apply kubectl commands in the new cluster.

In our case, a new AWS account had a Concurrent Executions limit of 10 -- which is not high enough for the blueprint deploy and resulted in these Lambda requests being throttled (i.e. canceled with no error).

This problem is probably exacerbated if you are installing multiple Addons.
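For anyone hitting the same symptom, the account-level limit described above can be inspected with the AWS CLI before deploying (a sketch, not part of the blueprint; requires configured credentials and must be run in the deployment region):

```shell
# Check the account-level Lambda concurrency limit and the unreserved capacity
# available to the kubectl-provider lambdas. A brand-new account may show a
# limit as low as 10, as described above.
aws lambda get-account-settings \
  --query 'AccountLimit.[ConcurrentExecutions,UnreservedConcurrentExecutions]'

# A limit increase is requested through Service Quotas (console or CLI);
# the Lambda quota to raise is "Concurrent executions".
```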

This does not appear to be an issue with cdk-eks-blueprints, but I am posting here for awareness.
FYI @shapirov103

@shapirov103
Copy link
Collaborator

@hshepherd thank you for this insight; it would have been very hard for us to reproduce.
The custom resource lambda is created to use all unreserved capacity. Hypothetically, if all addons are executed serially, the issue should be mitigated as long as some concurrency is available (kubectl commands will go one at a time, though other lambda functions may still interfere).
You can try defining strictly ordered behavior for all addons, e.g.

import "reflect-metadata";

Reflect.defineMetadata("ordered", true, addons.EbsCsiDriverAddOn); // repeat for all addons

This is more of an experimental feature tbh.
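The intuition behind serial ("ordered") execution can be sketched abstractly: serial application only ever holds one kubectl invocation at a time, while parallel application can demand one concurrency slot per addon simultaneously. The numbers below are illustrative, not measured blueprint behavior:

```typescript
// Illustrative sketch (not blueprint internals): peak Lambda concurrency
// demanded by applying N addons serially vs. in parallel.
function peakConcurrency(addonCount: number, ordered: boolean): number {
  // Ordered application runs one kubectl invocation at a time;
  // parallel application can hold one invocation per addon at once.
  return ordered ? 1 : addonCount;
}

console.log(peakConcurrency(8, false)); // 8 slots needed -- close to a 10-slot new-account limit
console.log(peakConcurrency(8, true));  // 1 slot needed
```

This is why, with a 10-execution cap, trimming addons or forcing ordered execution can both make the deploy fit under the limit.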
