Add a way to handle the API rate limit on AWS #1051

Closed
netors opened this issue Feb 25, 2015 · 28 comments · Fixed by #1787

Comments

@netors

netors commented Feb 25, 2015

If you have a big infrastructure, you will hit the AWS API limit when trying to plan your infrastructure.

It would help to have a way to work around this limitation in the provider itself, in some global way.

@pearkes
Contributor

pearkes commented Feb 25, 2015

We currently implement parallelism that helps throttle connections with a semaphore. This could potentially be a configurable size.

@mitchellh
Contributor

@pearkes I think we should introduce a global provider-wide semaphore in the provider as well, to artificially rate limit those resources and avoid this. The global parallelism semaphore helps but isn't meant to solve this problem.
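
A minimal sketch of what such a provider-wide semaphore could look like, using a buffered channel. The `Limiter` type, `NewLimiter`, and `maxInFlight` are hypothetical names for illustration only and are not Terraform's actual implementation.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Limiter caps how many API calls may be in flight at once for one provider.
// Hypothetical type for illustration; not taken from the Terraform source.
type Limiter struct {
	slots chan struct{}
}

// NewLimiter returns a limiter that allows at most n concurrent calls.
func NewLimiter(n int) *Limiter {
	return &Limiter{slots: make(chan struct{}, n)}
}

// Do blocks until a slot is free, runs call, then releases the slot.
func (l *Limiter) Do(call func() error) error {
	l.slots <- struct{}{}
	defer func() { <-l.slots }()
	return call()
}

// maxInFlight pushes `calls` fake API calls through a limiter of the given
// size and reports the highest concurrency that was observed.
func maxInFlight(size, calls int) int64 {
	lim := NewLimiter(size)
	var cur, peak int64
	var wg sync.WaitGroup
	for i := 0; i < calls; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = lim.Do(func() error {
				n := atomic.AddInt64(&cur, 1)
				// Record a new peak if this call raised the concurrency.
				for {
					p := atomic.LoadInt64(&peak)
					if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
						break
					}
				}
				atomic.AddInt64(&cur, -1)
				return nil
			})
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	fmt.Println("peak concurrency:", maxInFlight(4, 100))
}
```

Note that a semaphore like this bounds concurrency, not requests per unit of time. A limit expressed as requests per minute would need a ticker or token bucket on top of it.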

@pearkes
Contributor

pearkes commented Feb 26, 2015

@mitchellh Yea, that's a good point. Provider A may not mind, but provider B wants 30 req/minute max.

@radeksimko
Member

This should also be configurable per provider config. In AWS, for example, each account may have different limits, which aren't published anywhere (not even via the API), but you may ask AWS to increase them. So we should allow customers with higher limits to bootstrap their infrastructure faster if the API allows it.

@mitchellh
Contributor

@radeksimko Sounds fair!

@willmcg

willmcg commented Mar 11, 2015

I'm running into these API throttling limits deploying a configuration using master right now on a reasonably sized configuration.

Some kind of rate limiting of API calls is definitely required, and a rate limit that could be set a priori as part of the configuration would definitely help. However, Terraform needs to handle the corner case of hitting this limit by automatically retrying and backing off its requests. Terraform cannot assume it is the only consumer of the provider API request budget because for AWS the request rate limit is account-wide and other applications may be depleting the budget independently. I would be happy if it retried failed requests and issued warnings that would prompt me to adjust a rate limit in the config. Extra bonus points if it could automatically modulate the request rate up/down when limits are hit.

Here is an example apply that I needed to run twice to get it to complete:

$ terraform apply meta/terraform/test/
aws_vpc.vpc: Creating...
  cidr_block:                "" => "10.0.0.0/16"
  default_network_acl_id:    "" => "<computed>"
  default_security_group_id: "" => "<computed>"
  enable_dns_hostnames:      "" => "1"
  enable_dns_support:        "" => "1"
  main_route_table_id:       "" => "<computed>"
  tags.#:                    "" => "2"
  tags.Deployment:           "" => "blah"
  tags.Name:                 "" => "vpc"
aws_vpc.vpc: Creation complete
aws_internet_gateway.igw: Creating...
  tags.#:          "0" => "2"
  tags.Deployment: "" => "blah"
.
.
.
aws_elb.front: Creation complete
aws_security_group.compute: Error: 1 error(s) occurred:

* Request limit exceeded.
aws_network_acl.public.2: Creation complete
aws_network_acl.public.1: Error: 1 error(s) occurred:

*
aws_network_acl.public.0: Error: 1 error(s) occurred:

*
Error applying plan:

4 error(s) occurred:

* 1 error(s) occurred:

* 1 error(s) occurred:

* Request limit exceeded.
* 1 error(s) occurred:

* Resource 'aws_launch_configuration.compute' not found for variable 'aws_launch_configuration.compute.name'
* 1 error(s) occurred:

* Resource 'aws_security_group.nat' not found for variable 'aws_security_group.nat.id'
* 2 error(s) occurred:

* 1 error(s) occurred:

*
* 1 error(s) occurred:

*

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.

@radeksimko
Member

Terraform cannot assume it is the only consumer of the provider API request budget because for AWS the request rate limit is account-wide and other applications may be depleting the budget independently.

True, but AWS is making us blind here... as they don't provide any useful stats like many other APIs from Github, Twitter, Google etc.

X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4999
X-RateLimit-Reset: 1372700873

If we would have such things, we could make Terraform actually clever. :)

@willmcg

willmcg commented Mar 11, 2015

Right... quite annoying and they really only tell you that you need to implement exponential back-off when you receive the RequestLimitExceeded error as their answer to questions on the matter.

I've reached a point now where I cannot deploy my configuration anymore with 0.3.7 or master due to hitting this API throttling error constantly. It seems to depend on where the apply run errors out with the limit exceeded error, and now more often than not it leaves the partially deployed configuration in an unrecoverable state that requires me to manually destroy the VPC from the AWS console and start over. Other times I can run apply 3-4 times and it eventually succeeds after deploying more resources incrementally on each run.

This is a show stopper for my automated deployments so my next stop will be with AWS support to see if I can get my API throttling limit bumped up for my account until Terraform can gracefully handle backing off its requests.

@CpuID
Contributor

CpuID commented Mar 12, 2015

The request limits definitely differ per API call. If you have any way to set some sane attempt limits and retry thresholds on a per API call basis, this would go a long way. Might be difficult with the move to aws-sdk-go unless they're doing something already.

Example - RunInstances tends to baulk around 20-30/sec but it will throttle you quite hard for 5-10 seconds. Whereas DescribeInstances will allow a lot more.

AWS does define 3 categories in their documentation, could list the medium/high complexity calls and treat them differently to the default maybe?

In my experience AWS don't want to reveal thresholds, to avoid people abusing them.

Nathan Sullivan
Sent from a mobile device


@willmcg

willmcg commented Mar 12, 2015

I talked with our AWS support people today and they basically said that the EC2 API rate limits are already the highest of all services. They will not raise them for an account even if you have a very expensive support agreement. An application not implementing back-off was definitely not regarded as anything even close to sufficient reason for them to even consider raising limits.

Because my configuration would basically not ever deploy without hitting the API limits on my account and crapping out... sometimes leaving everything in a bad state... I did some hackery in the ec2 request code in aws-sdk-go to add an exponential back-off for requests that hit the rate limit and now I have it working reliably without ever failing on API rate limit errors. Never written a line of golang in my life until today so it was a gross hack and only deals with the particular case that was blocking me.

A proper implementation would put a more general retry mechanism in place around the request code that is aware of the different kinds of request error responses that should be retried in EC2. The AWS docs have a table that lists the different errors that need to be retried (5xx server errors, RequestLimitExceeded, ConcurrentTagAccess, some DependencyViolation cases due to eventual consistency, etc.).

http://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html

The retry logic differs across the different services so you need to deal with each one individually.
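
As a rough sketch of that kind of error-aware classification for EC2, the following decides whether a failed request is worth retrying. `ShouldRetry` and `retryableCodes` are hypothetical names, and the table only covers the codes named in this comment; it is not the full list from the AWS documentation.

```go
package main

import "fmt"

// retryableCodes lists the EC2 error codes mentioned above as retryable.
// DependencyViolation is left out on purpose: only some of its cases are
// transient, so it would need resource-specific handling.
var retryableCodes = map[string]bool{
	"RequestLimitExceeded": true,
	"ConcurrentTagAccess":  true,
}

// ShouldRetry reports whether a failed request should be retried, based on
// the HTTP status and the error code returned by the service.
func ShouldRetry(httpStatus int, code string) bool {
	if httpStatus >= 500 && httpStatus <= 599 {
		return true // server-side errors are treated as transient
	}
	return retryableCodes[code]
}

func main() {
	fmt.Println(ShouldRetry(503, "RequestLimitExceeded"))
	fmt.Println(ShouldRetry(400, "ConcurrentTagAccess"))
	fmt.Println(ShouldRetry(400, "InvalidParameterValue"))
}
```

Since the retry rules differ per service, a real implementation would presumably keep one such table per service rather than a single global one.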

@ocxo

ocxo commented Mar 24, 2015

@willmcg can you share more details on how you implemented the back-off?

@willmcg

willmcg commented Mar 24, 2015

It was some simple, horrible hackery in the EC2Client Do() method in aws-sdk-go/aws/ec2.go that put a retry loop around the request logic, to keep retrying the request with a brain-dead back-off delay when the EC2 API call returned the "RequestLimitExceeded" error:

if ec2Err.Code == "RequestLimitExceeded" {
    time.Sleep(time.Duration(count) * time.Second)
    continue
}

Actually it is such a hack that it is really only linear back-off and not even exponential at present :-)

I see in the source JSON for the APIs there is a _retry.json that actually details all the retry conditions but it does not look like the golang request code uses this information for retry policy on failed requests. This would be the right way to handle retry rather than my hackery.

@jschneiderhan
Contributor

It looks like master of aws-sdk-go implements retries with an exponential back-off on Throttling errors: https://github.com/awslabs/aws-sdk-go/blob/master/aws/service.go#L124-L142. The fork that terraform is using doesn't include it, but perhaps this will help once it catches up with the upstream. If the version of aws-sdk-go used was more up-to-date, would it solve the problem, or would it still make sense to have something in terraform controlling the number of requests being generated?

Either way I'm very interested in helping move this forward. I'm new to golang, but very interested in learning.

@franklinwise

Happy to help, how can we move this forward?

@ocxo

ocxo commented Apr 20, 2015

I think most if not all the work has been done to support moving to the official aws sdk go which implements backoff/retry. This should be in the next release.

@clstokes
Contributor

+1 as this is really annoying and impactful.

@jschneiderhan
Contributor

I'm really hoping https://github.com/awslabs/aws-sdk-go/blob/a79c7d95c012010822e27aaa5551927f5e8a6ab6/aws/service.go#L134 helps, but I'm concerned that the default max retries is too low at 3. In my case it's the AutoScaling API that is throwing rate limit exceeded errors, and I've seen the command retry up to 9 times before it succeeds. Granted, I've been using an older version with some custom retry logic added in while I waited for the aws-sdk-go library to catch up to upstream, but I copy/pasted the logic from the upstream aws-sdk-go repo, so the behavior should be similar.

@franklinwise

@fromonesrc - When is the next release?

@ocxo

ocxo commented Apr 21, 2015

Looks like it will be in 0.5.0 (https://github.com/hashicorp/terraform/blob/master/CHANGELOG.md#050-unreleased) but I don't know when that will ship.

@promorphus

Any word on when 0.5.0 will be released? And is it possible to use the rate limiting by building what's currently in the repo or is the rate limiting a feature that hasn't been developed yet but is on the roadmap?

@davedash
Contributor

So I'm running into this issue. Seems like Route53 is VERY aggressive with throttling. I can't get plan to successfully return.

Anybody have a work-around? Otherwise I might have to downgrade temporarily.

@fishnix

fishnix commented Apr 29, 2015

I just started hitting this as well 😒 Much needed 👍

@zadunn

zadunn commented Apr 29, 2015

We are hitting this as well.

@jgillis01

I was able to hack around it by doing the following:
sudo tc qdisc add dev enp0s20u1u3 root netem delay 1000ms

This would basically delay all outbound traffic on my workstation by 1 second. There may be a more elegant solution in using tc with iptables.

@koendc

koendc commented May 3, 2015

When running terraform from master with the retry logic enabled, we were still hit by the API rate limits. After increasing MaxRetries to 11, we were no longer experiencing the issue. It looks like the default of 3 retries is not enough.

In #1787, the number of retries is made configurable, with a default of 11 (i.e., a delay of 61 seconds before the last retry).

@promorphus

@koendc, are you running on AWS or some other provider? I can change the number of retries for Openstack, but can't for AWS.

@koendc

koendc commented May 4, 2015

I should have made myself a bit more clear:

  • I'm running on AWS
  • aws-sdk-go, which Terraform uses, now has retry logic. The default is 3 retries. By compiling Terraform from master, you'll get the retry logic.
  • Even with this retry logic and the 3 retries, we were encountering rate limit errors.
  • I changed the terraform code to make the maximum number of retries configurable and I set the default max_retries to 11.
  • After this change, we were no longer encountering the errors.

@ghost

ghost commented May 3, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators May 3, 2020