
fix(logs): LogRetention resources fail with rate exceeded errors #26858

Merged
merged 4 commits into main from mrgrain/fix/log-retention-failing-retries
Aug 24, 2023

Conversation

mrgrain
Contributor

@mrgrain mrgrain commented Aug 23, 2023

The LogRetention Custom Resource used to be able to handle the server-side throttling that occurs when many requests to the CloudWatch Logs service are made at the same time.
Handling of this error case was lost during the migration to SDK v3.

If a single Stack contains many (read: a lot of) LogRetention Custom Resources, CloudFormation apparently applies some internal brakes to the amount of parallelism. For example, it appears that resources are batched into smaller groups that need to complete before the next group is provisioned, and within each group there appears to be an ever so slight delay between individual resources. Together this is enough to avoid rate limiting in most circumstances.

Therefore, in practice this issue only occurs when multiple stacks are deployed in parallel.

To test this scenario, I have added support for integ-runner to deploy all stacks of a test case concurrently.
Support for arbitrary command args already existed, but the concurrency option needed to be explicitly included.

I then created an integration test that deploys 3 stacks with 25 LogRetention resources each (see the sketch below).
This triggers the error cases described in #26837.
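
For reference, a minimal sketch of what such an integration test could look like, assuming the IntegTest construct from @aws-cdk/integ-tests-alpha and a concurrency deploy argument; the stack helper and resource names here are illustrative, not copied from the PR.

```ts
// Sketch only: three stacks, each with 25 LogRetention custom resources,
// deployed concurrently so that CloudWatch Logs throttling is actually hit.
import { App, Stack } from 'aws-cdk-lib';
import { LogRetention, RetentionDays } from 'aws-cdk-lib/aws-logs';
import { IntegTest } from '@aws-cdk/integ-tests-alpha';

const app = new App();

// Hypothetical helper: a stack packed with LogRetention resources.
function logRetentionStack(id: string): Stack {
  const stack = new Stack(app, id);
  for (let i = 0; i < 25; i++) {
    new LogRetention(stack, `LogRetention${i}`, {
      logGroupName: `/aws/integ/${id}/group-${i}`,
      retention: RetentionDays.ONE_DAY,
    });
  }
  return stack;
}

new IntegTest(app, 'LogRetentionThrottling', {
  testCases: [
    logRetentionStack('Stack1'),
    logRetentionStack('Stack2'),
    logRetentionStack('Stack3'),
  ],
  // Deploy all test-case stacks in parallel; without this, the error from
  // #26837 does not reproduce reliably.
  cdkCommandOptions: {
    deploy: {
      args: { concurrency: 3 },
    },
  },
});
```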

The fix itself is twofold (a sketch of both changes follows the list):

  • Pass the maxRetries prop value to the SDK client to increase the number of attempts made by the SDK's internal throttling handling. A minimum is also enforced for these retries, since they may catch additional retryable failures that our custom outer loop does not account for.
  • Explicitly catch ThrottlingException errors in the outer retry loop.
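
As a rough illustration of both parts, here is a minimal sketch in the spirit of the custom resource handler; the function names, the minimum attempt count, and the delay value are made up for the example and the real handler wires this up differently.

```ts
import { CloudWatchLogsClient, PutRetentionPolicyCommand } from '@aws-sdk/client-cloudwatch-logs';

// Hypothetical minimum: never hand the SDK fewer attempts than a sensible
// floor, so other retryable failures are still covered by its retry strategy.
const MIN_SDK_ATTEMPTS = 5;

// Part 1: forward the handler's maxRetries to the SDK client.
function createClient(maxRetries: number): CloudWatchLogsClient {
  return new CloudWatchLogsClient({
    maxAttempts: Math.max(maxRetries, MIN_SDK_ATTEMPTS),
  });
}

// Part 2: an outer retry loop that explicitly treats ThrottlingException
// (surfaced once the SDK itself has given up) as retryable.
async function withDelayedRetries<T>(maxRetries: number, delayMs: number, op: () => Promise<T>): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (error: any) {
      const throttled = error?.name === 'ThrottlingException';
      if (!throttled || attempt >= maxRetries) {
        throw error; // not throttling, or out of attempts
      }
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}

// Example usage: apply a retention policy with both retry layers in place.
export async function setRetentionPolicy(logGroupName: string, retentionInDays: number, maxRetries = 10) {
  const client = createClient(maxRetries);
  await withDelayedRetries(maxRetries, 100, () =>
    client.send(new PutRetentionPolicyCommand({ logGroupName, retentionInDays })),
  );
}
```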

Closes #26837


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@aws-cdk-automation aws-cdk-automation requested a review from a team August 23, 2023 15:17
@github-actions github-actions bot added bug This issue is a bug. effort/medium Medium work item – several days of effort p0 labels Aug 23, 2023
@mergify mergify bot added the contribution/core This is a PR that came from AWS. label Aug 23, 2023
@mrgrain mrgrain marked this pull request as ready for review August 24, 2023 11:55
@aws-cdk-automation aws-cdk-automation added the pr/needs-maintainer-review This PR needs a review from a Core Team Member label Aug 24, 2023
@aws-cdk-automation
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildv2Project1C6BFA3F-wQm2hXv2jqQv
  • Commit ID: e73545c
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@mergify
Contributor

mergify bot commented Aug 24, 2023

Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).

@mergify mergify bot merged commit b60e6ef into main Aug 24, 2023
9 checks passed
@mergify mergify bot deleted the mrgrain/fix/log-retention-failing-retries branch August 24, 2023 13:07
@aws-cdk-automation aws-cdk-automation removed the pr/needs-maintainer-review This PR needs a review from a Core Team Member label Aug 24, 2023
Labels
  • bug – This issue is a bug.
  • contribution/core – This is a PR that came from AWS.
  • effort/medium – Medium work item, several days of effort
  • p0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CDK deploy: Lambda LogRetention resources fail with rate exceeded errors
3 participants