aws-lambda: Log retention gives rate exceeded error #31338

Closed
1 task done
Exter-dg opened this issue Sep 6, 2024 · 4 comments · Fixed by #31340 or softwaremill/tapir#4137 · May be fixed by NOUIY/aws-solutions-constructs#135 or NOUIY/aws-solutions-constructs#136
Labels
@aws-cdk/aws-lambda Related to AWS Lambda bug This issue is a bug. effort/medium Medium work item – several days of effort p2

Comments


Exter-dg commented Sep 6, 2024

Describe the bug

Legacy log retention in Lambda gives a rate limit exceeded error.

We are in the process of upgrading our app from CDK v1 to v2. To test this, we created a new env in a new account and redeployed the configuration using CDK v1.

We are creating 70-80 Lambdas with log retention enabled. The legacy log retention feature creates a custom resource Lambda that creates the log group and sets its retention policy. CDK v1 used to create Node.js 14 Lambdas for this purpose (whose creation is now blocked in AWS), so we disabled log retention, upgraded the stack to 2.151.0, and then re-enabled log retention.
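
For context, log retention is enabled per function roughly like this (illustrative snippet, not our exact code; the function name, runtime, asset path, and retention period are placeholders, and the construct is assumed to live inside a Stack):

import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as logs from 'aws-cdk-lib/aws-logs';

// Setting logRetention adds a shared custom resource Lambda to the stack,
// which calls CreateLogGroup and PutRetentionPolicy for each function's
// log group at deploy time.
new lambda.Function(this, 'MyFunction', {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda'),
  logRetention: logs.RetentionDays.ONE_WEEK,
});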

While doing so, our stack is failing with the error:

Received response status [FAILED] from custom resource. Message returned: Out of attempts to change log group

Initially we thought this was an issue with the “CreateLogGroup throttle limit in transactions per second” quota. We increased it from 10 to 80, but the issue still exists.

On exploring the CloudWatch logs for the custom resource Lambda, we found:

2024-09-06T05:23:33.260Z	06a9833f-0ad3-4faf-8f94-aa78dd49d0ec	ERROR	{
  clientName: 'CloudWatchLogsClient',
  commandName: 'PutRetentionPolicyCommand',
  input: {
    logGroupName: '/aws/lambda/LogRetentionaae0aa3c5b4d-mE6Tt6xks1CB',
    retentionInDays: 1
  },
  error: ThrottlingException: Rate exceeded
      at de_ThrottlingExceptionRes (/var/runtime/node_modules/@aws-sdk/client-cloudwatch-logs/dist-cjs/index.js:2321:21)
      at de_CommandError (/var/runtime/node_modules/@aws-sdk/client-cloudwatch-logs/dist-cjs/index.js:2167:19)
      at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
      at async /var/runtime/node_modules/@aws-sdk/node_modules/@smithy/middleware-serde/dist-cjs/index.js:35:20
      at async /var/runtime/node_modules/@aws-sdk/node_modules/@smithy/core/dist-cjs/index.js:165:18
      at async /var/runtime/node_modules/@aws-sdk/node_modules/@smithy/middleware-retry/dist-cjs/index.js:320:38
      at async /var/runtime/node_modules/@aws-sdk/middleware-logger/dist-cjs/index.js:34:22
      at async /var/task/index.js:1:1148
      at async /var/task/index.js:1:2728
      at async y (/var/task/index.js:1:1046) {
    '$fault': 'client',
    '$metadata': {
      httpStatusCode: 400,
      requestId: 'e247739c-8ebb-40d3-b85e-293802a87e24',
      extendedRequestId: undefined,
      cfId: undefined,
      attempts: 3,
      totalRetryDelay: 466
    },
    __type: 'ThrottlingException'
  },
  metadata: {
    httpStatusCode: 400,
    requestId: 'e247739c-8ebb-40d3-b85e-293802a87e24',
    extendedRequestId: undefined,
    cfId: undefined,
    attempts: 3,
    totalRetryDelay: 466
  }
}

This looks like an issue with the rate limit for PutRetentionPolicyCommand, and that service quota cannot be changed. Our earlier implementation had one difference in how log retention was configured:
the base property was set to apply an exponential backoff (probably to handle such cases). That property is now deprecated, so we removed it during our upgrade from CDK v1 to v2. The documentation for LogRetentionRetryOptions says it was deprecated because retries are handled differently in AWS SDK v3. Is this what is causing the issue? Shouldn't the CDK/SDK handle the backoff in this case?
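
For reference, the options interface in question looks roughly like this (a paraphrased sketch, not the verbatim aws-cdk-lib source; the comments are my reading of the docs):

import { Duration } from 'aws-cdk-lib';

// Paraphrased shape of aws-cdk-lib's LogRetentionRetryOptions:
interface LogRetentionRetryOptionsSketch {
  // Deprecated: documented as unused since the upgrade to AWS SDK v3,
  // which applies its own retry strategy.
  readonly base?: Duration;
  // Maximum number of retries for the CloudWatch Logs calls.
  readonly maxRetries?: number;
}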

Regression Issue

  • Select this option if this issue appears to be a regression.

Last Known Working CDK Version

1.204.0

Expected Behavior

Log retention backoff should be handled internally by the CDK/SDK.

Current Behavior

Creating legacy log retention for multiple Lambdas at the same time gives a rate limit exceeded error.

Reproduction Steps

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.108.1

Framework Version

No response

Node.js Version

v22.1.0

OS

macOS

Language

TypeScript

Language Version

No response

Other information

No response

@Exter-dg Exter-dg added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Sep 6, 2024
@github-actions github-actions bot added @aws-cdk/aws-lambda Related to AWS Lambda potential-regression Marking this issue as a potential regression to be checked by team member labels Sep 6, 2024
@rix0rrr rix0rrr removed the potential-regression Marking this issue as a potential regression to be checked by team member label Sep 6, 2024
rix0rrr added a commit that referenced this issue Sep 6, 2024
When the Log Retention Lambda runs massively in parallel (on 70+ Lambdas
at the same time), it can run into throttling problems and fail.

Raise the retry count and delays:

- Raise the default number of retries from 5 to 10.
- Raise the sleep base from 100 ms to 1 s.
- Change the sleep calculation to apply the 10 s limit *after* jitter instead
  of before (previously, we would take a fraction of 10 s; now we take a
  fraction of the accumulated wait time and, after calculating that, limit
  it to 10 s).

Fixes #31338.
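
A minimal sketch of the delay calculation that commit message describes (illustration only, not the actual custom resource source): exponential backoff with full jitter, with the 10 s cap applied after the jitter is drawn.

// Values taken from the commit message; the function itself is illustrative.
const BASE_MS = 1_000;        // sleep base raised from 100 ms to 1 s
const MAX_DELAY_MS = 10_000;  // 10 s ceiling on any single sleep
const MAX_RETRIES = 10;       // default retries raised from 5 to 10

function retryDelayMs(attempt: number): number {
  const accumulated = BASE_MS * 2 ** attempt;   // accumulated wait time grows exponentially
  const jittered = Math.random() * accumulated; // take a fraction of that accumulated time
  return Math.min(jittered, MAX_DELAY_MS);      // apply the 10 s limit *after* jitter
}

// e.g. attempts 0..MAX_RETRIES - 1 would each sleep retryDelayMs(attempt) milliseconds.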

Exter-dg commented Sep 6, 2024

@rix0rrr Is this related?
#24485

@khushail khushail added p2 effort/medium Medium work item – several days of effort and removed needs-triage This issue or PR still needs to be triaged. labels Sep 6, 2024
@Exter-dg

We fixed it by increasing the value of maxRetries from 7 to 20.
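
For anyone hitting the same error, the change amounts to something like this (illustrative snippet; the function props other than the retry options are placeholders):

import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as logs from 'aws-cdk-lib/aws-logs';

// Give the log-retention custom resource more retries before it gives up
// on the throttled PutRetentionPolicy calls.
new lambda.Function(this, 'MyFunction', {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda'),
  logRetention: logs.RetentionDays.ONE_WEEK,
  logRetentionRetryOptions: {
    maxRetries: 20, // raised from the previous value of 7
  },
});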

@mergify mergify bot closed this as completed in #31340 Sep 25, 2024
@mergify mergify bot closed this as completed in a2d42d2 Sep 25, 2024

Comments on closed issues and PRs are hard for our team to see.
If you need help, please open a new issue that references this one.


@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 25, 2024