Inconsistent results when creating AWS VPC infrastructure using Terraform #6813
Comments
Hey @achalupa74 – Terraform certainly does do retries, and we're very aware of the eventual consistency gotchas that come with using a platform as large as AWS! That said, we're of course not perfect and there are still scenarios we haven't covered 100%, but that's why I'm here 😄 As you can guess, some setups of sufficient size have enough moving parts that it is difficult for me to diagnose the root cause without more information. Do you possibly have an example configuration that reliably (even if only ~30% of the time) reproduces these issues? I understand if your infrastructure is sophisticated enough that trimming it down to a reproduction case is not feasible.
These errors are typically handled gracefully; I would be interested to see how you are referencing subnet IDs, if at all. If you can share a snippet of the config, that may help, but please be sure to omit any secrets!
These kinds of errors are unusual but have been reported before. I believe they are being tracked in another GitHub issue and being worked on. They are rare, but perhaps you're hitting something that makes them more common. In short, we do take these kinds of stability issues very seriously, and we're always working to make Terraform more stable and resilient. Unfortunately, I can only provide limited help without further configuration that lets me reproduce the problem and check whether it is systemic.
Thanks for the reply! We have a fairly complicated infrastructure being deployed by Terraform. The initial deployment is probably between 200 and 300 AWS resources. Short of trying to come up with a simpler example that can somewhat consistently exhibit the problem, I can show you the parts of the code that the reported errors are likely related to.
First off, the Security Group configuration is very isolated. We have a module for each security group, and the 'manhattan_master' module that creates the security group referenced in the error is attached. This is very simple and isolated. The attached module creates a security group and adds a series of SG rules to it. I can't see how to make this code any better, other than maybe adding a "depends_on" clause to every single rule?
The use of subnet IDs is a bit more complicated, but I can give you a code snippet that will hopefully lead you in the right direction.
module "dmz_subnet_1" {
  stack_name = "${var.stack_name}"
resource "aws_eip" "elastic_ip_dmz_subnet1" {
resource "aws_nat_gateway" "nat_gateway_dmz_subnet_1" {
  depends_on = [ "aws_eip.elastic_ip_dmz_subnet1", "aws_internet_gateway.internet-gateway" ]
This sequence basically:
Could our problems somehow be related to the fact that the subnet is being created in a module? Any help is much appreciated!
+1
Hi @catsby - You said: "I believe they are being tracked in another GitHub issue and being worked on."
@brendonmartino – I misspoke, I suppose; the issue I was referring to was a PR meant to fix this kind of issue: and follow up PR: Unfortunately, I do not believe either of those is in a released version of Terraform, but they can be found in v0.7.0-rc1.
Just hit a similar error:
Snippet of relevant code:
resource "aws_subnet" "private-persistence" {
  count             = "${length(split(",", var.aws_availability_zones))}"
  vpc_id            = "${aws_vpc.main.id}"
  availability_zone = "${element(split(",", var.aws_availability_zones), count.index)}"
  cidr_block        = "${cidrsubnet(var.cidr_block, 5, count.index + 10)}"
}
resource "aws_route_table_association" "private-persistence" {
  count          = "${length(split(",", var.aws_availability_zones))}"
  subnet_id      = "${element(aws_subnet.private-persistence.*.id, count.index)}"
  route_table_id = "${element(aws_route_table.private-persistence.*.id, count.index)}"
}
I'm using Terraform v0.6.16.
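For readers following the snippet, here is a rough Python sketch of the arithmetic behind cidrsubnet(var.cidr_block, 5, count.index + 10). It is not Terraform's own code: the Python function only mirrors the documented behaviour of the interpolation function, and the "10.0.0.0/16" block and the three availability zones are assumed values, since the thread does not give the real ones.

```python
import ipaddress


def cidrsubnet(prefix: str, newbits: int, netnum: int) -> str:
    """Approximate Terraform's cidrsubnet(): extend the prefix by `newbits`
    bits and return the `netnum`-th subnet of that size."""
    network = ipaddress.ip_network(prefix)
    new_prefixlen = network.prefixlen + newbits
    if new_prefixlen > network.max_prefixlen:
        raise ValueError("not enough address bits left for newbits")
    if not 0 <= netnum < 2 ** newbits:
        raise ValueError("netnum does not fit in newbits")
    # Number of host bits that remain below the new, longer prefix.
    host_bits = network.max_prefixlen - new_prefixlen
    base = int(network.network_address) + (netnum << host_bits)
    # Build an address of the same family (IPv4 or IPv6) as the input.
    address = type(network.network_address)(base)
    return f"{address}/{new_prefixlen}"


if __name__ == "__main__":
    # Hypothetical inputs: a /16 VPC block and three availability zones.
    for index in range(3):
        print(index, cidrsubnet("10.0.0.0/16", 5, index + 10))
    # 0 10.0.80.0/21
    # 1 10.0.88.0/21
    # 2 10.0.96.0/21
```

Under those assumptions each subnet gets its own non-overlapping /21, so the CIDR calculation itself is unlikely to be the source of an intermittent failure.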
Update: still intermittently seeing the same
Hello – I'm following up on this issue as some time has passed and we've since released several new versions of Terraform. Unfortunately our findings here were inconclusive; we were never able to reproduce this issue. Can anyone comment further, or supply a reproduction case? I would like to know if this is still an issue you're encountering; otherwise I'd like to close the issue.
I'm going to close this for now. Please let us know if anyone has more information or a reproduction case. Thanks!
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
We are using Terraform for laying down AWS infrastructure resources. Our VPC consists of: 10 or so subnets, a VPN server instance, 3 route 53 zones, several security groups, etc...
When we deploy the infrastructure it usually works without error. But occasionally we get errors that we can't explain. Here's an example:
I, [2016-05-20T00:29:50.251046 #3328] INFO -- : Error applying plan:
I, [2016-05-20T00:29:50.251046 #3328] INFO -- :
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : 7 error(s) occurred:
I, [2016-05-20T00:29:50.266650 #3328] INFO -- :
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * aws_subnet.sub: InvalidSubnetID.NotFound: The subnet ID 'subnet-632d513b' does not exist
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : status code: 400, request id: bebb274d-0a37-422d-b3ab-7270f8b49519
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * Resource 'aws_eip.elastic_ip_dmz_subnet2' does not have attribute 'id' for variable 'aws_eip.elastic_ip_dmz_subnet2.id'
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * Resource 'aws_security_group.manhattan_master' does not have attribute 'id' for variable 'aws_security_group.manhattan_master.id'
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * aws_security_group_rule.outbound_ssh_master: Error authorizing security group rule type egress: InvalidGroupId.Malformed: Invalid id: "${var.master_security_group_id}" (expecting "sg-...")
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : status code: 400, request id: 51d73183-12ad-4652-bcc7-5ca23da229ae
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * aws_security_group_rule.mesos_slave_master: Error authorizing security group rule type ingress: InvalidGroupId.Malformed: Invalid id: "${var.master_security_group_id}" (expecting "sg-...")
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : status code: 400, request id: f2ad1c25-d49b-4cf4-a179-46653d89e442
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * aws_security_group_rule.all_ports_master: Error authorizing security group rule type ingress: InvalidGroupId.Malformed: Invalid id: "${var.master_security_group_id}" (expecting "sg-...")
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : status code: 400, request id: cdfc616c-133e-4c87-a1ce-676d9a0e76fe
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * aws_route_table.private_route_table_1: InvalidRouteTableID.NotFound: The routeTable ID 'rtb-f1296696' does not exist
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : status code: 400, request id: 48babf0e-b9d1-47fe-aa19-283328cfcb15
We've opened a case with AWS and they came back with the following response:
Hello,
Thank you for contacting AWS Premium Support.
I understand that you are seeing a few errors while using the Terraform tool. Let me address the errors one by one:
The way around the eventual consistency model is to implement retries and exponential backoffs in the application. I am not sure if Terraform has implemented it.
Also, when you get these errors, you can go ahead and check whether the resources actually exist in the AWS console. This way you can narrow down whether it was a resource creation error or not.
Hope this information is helpful to you. Please let me know if you have further questions and I will be happy to help you.
Links:
[1] http://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency
Best regards,
Truptesh
Amazon Web Services
We value your feedback. Please rate my response using the link below.
We read this as AWS saying that the caller of the AWS SDK needs to handle this issue, and in our case that caller is Terraform. It could do so either through retries or by checking that AWS resources are fully deployed before using their IDs. Surely this problem is not unique to us? I assume AWS is one of the more popular providers among Terraform users. Has this eventual consistency issue come up before? Are there any plans or ways to address it in Terraform?
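To make the "retries and exponential backoffs" suggestion from AWS concrete, here is a minimal Python sketch of the general pattern. It is only an illustration, not how Terraform or the AWS SDK implements retries: the NotFoundYet exception and the retry_with_backoff helper are hypothetical names, and the delay values are arbitrary assumptions.

```python
import random
import time


class NotFoundYet(Exception):
    """Hypothetical stand-in for an eventual-consistency error such as
    InvalidSubnetID.NotFound returned right after a resource is created."""


def retry_with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Run `call`, retrying on NotFoundYet with exponential backoff and jitter.

    The upper bound on the wait doubles after every failed attempt
    (base_delay, 2 * base_delay, 4 * base_delay, ...), capped at max_delay.
    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except NotFoundYet:
            if attempt == max_attempts - 1:
                raise
            upper_bound = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random amount up to the current bound so
            # that many concurrent callers do not retry in lockstep.
            sleep(random.uniform(0, upper_bound))
```

A caller would wrap the request that uses a freshly created ID, for example retry_with_backoff(lambda: describe_subnet(subnet_id)), where describe_subnet is again a hypothetical function that raises NotFoundYet while the ID is not yet visible.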
Terraform Version
0.6.16
Affected Resource(s)
Please list the resources as a list, for example:
The problem is intermittent and doesn't always fail in the same way.
Terraform Configuration Files
It's a fairly large source base with some proprietary logic in it. It may be difficult to share all of the TF scripts involved.
Debug Output
Please provide a link to a GitHub Gist containing the complete debug output: https://www.terraform.io/docs/internals/debugging.html. Please do NOT paste the debug output in the issue; just paste a link to the Gist.
Expected Behavior
The VPC and supporting resources should deploy successfully EVERY time.
Actual Behavior
Terraform apply attempts fail perhaps as often as 30% of the time.
Steps to Reproduce
Please list the steps required to reproduce the issue, for example:
terraform apply
Wondering if you have any bug fixes or ideas on how to make this code run more reliably?