fix(logs): LogRetention resources fail with rate exceeded errors #26858
The `LogRetention` Custom Resource used to be able to handle server-side throttling that occurs when a lot of requests are made to the CloudWatch Logs service at the same time. Handling of this error case got lost during the migration to SDK v3.
If we have a lot of `LogRetention` Custom Resources in a single Stack, CloudFormation apparently applies some internal brakes to the amount of parallelism: resources appear to be batched in smaller groups that need to complete before the next group is provisioned, and within each group there appears to be an ever so slight delay between individual resources. Together this is enough to avoid rate limiting in most circumstances. Therefore, in practice this issue only occurs when multiple stacks are deployed in parallel.
To test this scenario, I have added support to `integ-runner` for deploying all stacks of a test case concurrently. Support for arbitrary command args already existed, but needed to explicitly include the `concurrency` option. I then created an integration test that deploys 3 stacks à 25 `LogRetention` resources. This triggers the error cases described in #26837.
The fix itself is twofold:

- Pass a `maxRetries` prop value to the SDK client to increase the number of attempts of the SDK-internal throttling handling, but also enforce a minimum for these retries, since they might catch additional retryable failures that our custom outer loop does not account for.
- Explicitly retry `ThrottlingException` errors in the outer retry loop.

Closes #26837
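For illustration, the outer retry loop could be sketched roughly as below. The helper names (`withRetries`, `isThrottlingError`) and the backoff parameters are assumptions for this sketch, not the actual CDK implementation; the SDK-internal retry budget mentioned above is configured separately on the v3 client (via its `maxAttempts` client configuration).

```typescript
// Illustrative sketch only: an outer retry loop that treats server-side
// throttling as retryable. Names and defaults are assumptions, not CDK code.

function isThrottlingError(err: unknown): boolean {
  // SDK v3 surfaces CloudWatch Logs throttling as errors named 'ThrottlingException'.
  return err instanceof Error && err.name === 'ThrottlingException';
}

async function withRetries<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 100,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts || !isThrottlingError(err)) {
        throw err; // out of attempts, or not a retryable throttling error
      }
      // Exponential backoff with jitter before the next attempt.
      const delayMs = baseDelayMs * 2 ** (attempt - 1) * (0.5 + Math.random() / 2);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

Retrying in an outer loop on top of the SDK's own retries means a burst of `ThrottlingException` errors from parallel stacks is absorbed by backoff rather than failing the Custom Resource outright.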
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license