
custom_resources: log retention rate limit error during deploy #24485

Closed
jsauter opened this issue Mar 6, 2023 · 14 comments · Fixed by #26995
Assignees
Labels
@aws-cdk/custom-resources (Related to AWS CDK Custom Resources), bug (This issue is a bug.), p2, sdk-v3-upgrade (Tag issues that are associated to SDK V3 upgrade. Not limited to CR usage of SDK only.)

Comments

@jsauter

jsauter commented Mar 6, 2023

Describe the bug

When deploying a stack containing a custom resource, we received several rate limit exceptions while the CDK code deployed the logRetention infrastructure. This seems to happen intermittently, and we have not seen the issue recently, but we figured you would like to know.

Expected Behavior

The application should deploy without receiving a rate limit exception.

Current Behavior

While deploying a CDK application that contained a custom resource and a Lambda function, the logRetention infrastructure generated a "Rate exceeded" exception.

Reproduction Steps

It has not happened recently, but to reproduce: create a CDK application with a custom resource whose configuration generates a logRetention policy.
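For illustration, a minimal sketch of a stack that exercises this path (all names and counts here are assumptions, not taken from the original report): each Function with logRetention set adds a Custom::LogRetention resource, so deploying many of them at once can hit the CloudWatch Logs API rate limit.

import { Stack } from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as logs from 'aws-cdk-lib/aws-logs';
import { Construct } from 'constructs';

// Hypothetical stack: every Function below gets its own Custom::LogRetention
// resource, and the corresponding API calls can be throttled during deploy.
export class ReproStack extends Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);
    for (let i = 0; i < 20; i++) {
      new lambda.Function(this, `Fn${i}`, {
        runtime: lambda.Runtime.NODEJS_18_X,
        handler: 'index.handler',
        code: lambda.Code.fromInline('exports.handler = async () => {};'),
        logRetention: logs.RetentionDays.ONE_WEEK,
      });
    }
  }
}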

Possible Solution

No response

Additional Information/Context

We are curious if there is a way to manage these rate limit exceptions if they are outside of our code.

CDK CLI Version

2.29.1

Framework Version

No response

Node.js Version

14.18.3

OS

macos

Language

Typescript

Language Version

No response

Other information

No response

@jsauter jsauter added bug, needs-triage labels Mar 6, 2023
@github-actions github-actions bot added the @aws-cdk/custom-resources label Mar 6, 2023
@jsauter
Author

jsauter commented Mar 6, 2023

Console output during deploy.

[Screenshot: console output showing the "Rate exceeded" error during deploy]

@khushail khushail self-assigned this Mar 7, 2023
@khushail khushail added p2, needs-reproduction, investigating and removed needs-triage labels Mar 7, 2023
@cgarvis
Contributor

cgarvis commented Mar 7, 2023

thanks @jsauter for the report.

@khushail khushail removed investigating, needs-reproduction labels Mar 7, 2023
@khushail khushail removed their assignment Mar 7, 2023
@khushail
Contributor

khushail commented Mar 7, 2023

Hey @jsauter, thanks for reporting this. Since we can't tell what the rate limit is about from the error alone, the screenshot is very helpful.

@pahud
Contributor

pahud commented Mar 7, 2023

Hi @jsauter

Can you tell us how many custom resources or log groups with logRetention enabled you are creating in this CDK app? I am just curious where the limit comes from.

@jsauter
Author

jsauter commented Mar 7, 2023

In the case of this stack, we are creating one custom resource and configuring it for log retention. So one in our code, and I guess two from the CDK: the 'AWS CDK resource provider framework - onEvent' and the logRetention CR.

@idwagner

idwagner commented Mar 9, 2023

I'm also running into an issue where the log retention lambda is timing out at three seconds, causing the stack to time out and fail. Some additional tuning options would be helpful to resolve this without having to use escape hatches.

INIT_START Runtime Version: nodejs:14.v29	Runtime Version ARN: arn:aws:lambda:us-east-1::runtime:XXXXXX
START RequestId: 8fe6a7e5-a521-48dd-8f73-07b97e0feff2 Version: $LATEST
2023-03-09T15:29:10.583Z	8fe6a7e5-a521-48dd-8f73-07b97e0feff2	INFO	
{
    "RequestType": "Create",
    "ServiceToken": "arn:aws:lambda:us-east-1:XXXX:function:XXXXX-LogRetentionaae0aa3c5b4d4f87b0-m6gZos1arCB1",
    "ResponseURL": "...",
    "StackId": "arn:aws:cloudformation:us-east-1:XXXXX:stack/XXXXX/f4e26fa0-be8d-11ed-8fb1-0ecbd2482a13",
    "RequestId": "1b1b5d99-72d0-43ba-936d-49ab668ffe0b",
    "LogicalResourceId": "XXXXXXLogRetentionED515AF7",
    "ResourceType": "Custom::LogRetention",
    "ResourceProperties": {
        "ServiceToken": "arn:aws:lambda:us-east-1:XXXXX:function:XXXXX-LogRetentionaae0aa3c5b4d4f87b0-m6gZos1arCB1",
        "RetentionInDays": "7",
        "LogGroupName": "/aws/lambda/XXXXX-c1quvH5k3SfR"
    }
}

2023-03-09T15:29:13.587Z 8fe6a7e5-a521-48dd-8f73-07b97e0feff2 Task timed out after 3.01 seconds

END RequestId: 8fe6a7e5-a521-48dd-8f73-07b97e0feff2
REPORT RequestId: 8fe6a7e5-a521-48dd-8f73-07b97e0feff2	Duration: 3006.29 ms	Billed Duration: 3000 ms	Memory Size: 128 MB	Max Memory Used: 26 MB	

@dylan-westbury

dylan-westbury commented Aug 7, 2023

Also receiving "Rate exceeded" for log retention in a CDK stack.

[Screenshot: "Rate exceeded" error reported for the log retention resource during deploy]

Adding logRetentionRetryOptions seems to have resolved it:

      logRetention: RetentionDays.ONE_MONTH,
      logRetentionRetryOptions: {
        base: Duration.millis(200),
        maxRetries: 10
      },
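For context, a fuller sketch (illustrative names and values, not from the original comment) of where these properties sit on a lambda.Function; note that base is deprecated in newer CDK versions after the SDK v3 migration:

import { Duration } from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as logs from 'aws-cdk-lib/aws-logs';
import { Construct } from 'constructs';

declare const scope: Construct; // stands in for the enclosing stack

// Illustrative: spread out the LogRetention API calls with more retries
// and a longer backoff base so a large deploy stays under the rate limit.
new lambda.Function(scope, 'Worker', {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda'),
  logRetention: logs.RetentionDays.ONE_MONTH,
  logRetentionRetryOptions: {
    base: Duration.millis(200), // deprecated after the SDK v3 migration
    maxRetries: 10,
  },
});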

@jusdino
Contributor

jusdino commented Aug 15, 2023

I'm also hitting this issue. Here's a stack trace for reference. I think the LogRetention custom resource lambda code just needs to have a reasonable retry policy set:

    at Request.extractError (/var/runtime/node_modules/aws-sdk/lib/protocol/json.js:52:27)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/request.js:686:14)
    at Request.transition (/var/runtime/node_modules/aws-sdk/lib/request.js:22:10)
    at AcceptorStateMachine.runTo (/var/runtime/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /var/runtime/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:38:9)
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:688:12)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:116:18) {
  code: 'ThrottlingException',
  time: 2023-08-15T22:50:17.215Z,
  requestId: '59263e11-a284-42cb-8b4d-097c4cfe7c40',
  statusCode: 400,
  retryable: true
}

We've got devs running into this - all you have to do is try to deploy an app with a bunch of lambdas, all with logRetention set.

One work-around a team found: They manually reduced the number of lambdas in their stack then deployed several times, introducing only a few new ones each time. It's tedious, but it did get them past the rate limit issue.

@jusdino
Contributor

jusdino commented Aug 16, 2023

Doing a little digging, in case it helps somebody: it looks like the LogRetention function code just pulls retry options from its resource properties, which the user can also specify on aws-lambda.Function via the logRetentionRetryOptions property. However, you're only given a maxRetries option, which apparently defaults to 5.

I'm not familiar with the JavaScript AWS SDK, so it's unclear to me whether these rate exceeded errors are retried and, if so, what the SDK v3 retry strategy looks like. I don't see any logs in CloudWatch indicating that any retries were attempted.
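For reference, a hedged sketch of how retries are generally configured on an AWS SDK v3 client; this is not the actual LogRetention handler code, and the log group name is made up:

import {
  CloudWatchLogsClient,
  PutRetentionPolicyCommand,
} from '@aws-sdk/client-cloudwatch-logs';

// SDK v3 clients retry retryable errors (including throttling) with
// backoff; maxAttempts caps the total number of attempts per call.
const client = new CloudWatchLogsClient({ maxAttempts: 10 });

async function setRetention(): Promise<void> {
  await client.send(
    new PutRetentionPolicyCommand({
      logGroupName: '/aws/lambda/my-function', // hypothetical log group
      retentionInDays: 7,
    }),
  );
}

setRetention().catch(console.error);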

@aestebance

If you use "base" option in logRetentionRetryOptions (I use 200 millis) you can deploy without getting "rate exceeded". But, I don´t know why "base" is deprecated. If anyone can explain the reason to deprecate "base" if it is still working and there is no other solution.

@jusdino
Contributor

jusdino commented Aug 17, 2023

I think it was because they migrated from SDK v2 to SDK v3 in the lambda and retry strategies work differently in v3.

@jaapvanblaaderen
Contributor

This issue has gotten worse since CDK 2.90: #26837

@cgarvis cgarvis added the node18-upgrade label Sep 1, 2023
@mrgrain
Contributor

mrgrain commented Sep 1, 2023

Ha, that is a good catch! You can totally configure the retry mechanism to exceed the Lambda timeout. I'll get on fixing that!

If you use "base" option in logRetentionRetryOptions (I use 200 millis) you can deploy without getting "rate exceeded". But, I don´t know why "base" is deprecated. If anyone can explain the reason to deprecate "base" if it is still working and there is no other solution.

Hi @aestebance, base has been deprecated because we migrated the code from AWS SDK v2 to AWS SDK v3.
SDK v3 has a number of different retry mechanisms; the default one should be better because it is more intelligent and takes retry budgets into account. We decided that this is a better experience than simply re-implementing the old retry mechanism. I believe that maxRetries is enough to make any case work, but I'd love to hear about it if that's not the case!

> I'm also hitting this issue. Here's a stack trace for reference. I think the LogRetention custom resource lambda code just needs to have a reasonable retry policy set:

> We've got devs running into this - all you have to do is try to deploy an app with a bunch of lambdas, all with logRetention set.

> One work-around a team found: They manually reduced the number of lambdas in their stack then deployed several times, introducing only a few new ones each time. It's tedious, but it did get them past the rate limit issue.

@jusdino Thanks for the report, and sorry about this. It has been resolved in #26858 and the release for it should be out any time now; you are waiting for v2.94.0. Please let me know if that resolves it for you.

The reason for this regression was also the migration to SDK v3. Basically, I missed that we now need to explicitly check for ThrottlingException.
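For anyone curious, the shape of the check that changed is roughly this (a sketch, not the exact handler code): SDK v2 surfaced the error code on error.code, while SDK v3 exposes it on error.name.

// Sketch only: recognize a throttling error from either SDK v2 (`code`)
// or SDK v3 (`name`).
function isThrottlingError(err: unknown): boolean {
  const e = err as { name?: string; code?: string } | undefined;
  return e?.name === 'ThrottlingException' || e?.code === 'ThrottlingException';
}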

@mrgrain mrgrain self-assigned this Sep 1, 2023
@udaypant udaypant removed the node18-upgrade label Sep 1, 2023
@mrgrain mrgrain added the sdk-v3-upgrade label Sep 4, 2023
@mergify mergify bot closed this as completed in #26995 Sep 6, 2023
mergify bot pushed a commit that referenced this issue Sep 6, 2023

We use a custom resource to set the log retention for log groups created by the Lambda service.
This custom resource handler code has a built-in retry mechanism to avoid throttling when executing many LogRetention CRs.
Users can customize the number of possible retries, potentially retrying for a long time.
This can lead to a situation where further retries should be attempted, but the Lambda function timeout is exceeded.

The change sets the lambda execution timeout to its maximum value to allow for up to 15 minutes of retries.
If the retry budget is exhausted, the handler will throw an error and exit early.

Closes #24485

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
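To illustrate the idea described above (a sketch under assumptions, not the actual handler implementation): keep retrying with backoff only while enough of the Lambda invocation time remains, and surface the error once the budget is spent.

// Sketch: retry an operation with exponential backoff, but give up early
// once the remaining Lambda execution time is too short for another attempt.
async function withRetries<T>(
  op: () => Promise<T>,
  remainingTimeMs: () => number, // e.g. context.getRemainingTimeInMillis
  maxRetries = 10,
): Promise<T> {
  let delayMs = 200;
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      const throttled = (err as { name?: string }).name === 'ThrottlingException';
      if (!throttled || attempt >= maxRetries || remainingTimeMs() < delayMs + 1000) {
        throw err; // out of retries or out of time: fail fast
      }
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      delayMs *= 2;
    }
  }
}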
@github-actions

github-actions bot commented Sep 6, 2023

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

mikewrighton pushed a commit that referenced this issue Sep 14, 2023