Inconsistent results when creating AWS VPC infrastructure using Terraform #6813

achalupa74 · 2016-05-21T03:09:49Z

We are using Terraform to lay down AWS infrastructure resources. Our VPC consists of 10 or so subnets, a VPN server instance, 3 Route 53 zones, several security groups, etc.

When we deploy the infrastructure it usually works without error. But occasionally we get errors that we can't explain. Here's an example:

I, [2016-05-20T00:29:50.251046 #3328] INFO -- : Error applying plan:
I, [2016-05-20T00:29:50.251046 #3328] INFO -- :
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : 7 error(s) occurred:
I, [2016-05-20T00:29:50.266650 #3328] INFO -- :
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * aws_subnet.sub: InvalidSubnetID.NotFound: The subnet ID 'subnet-632d513b' does not exist
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : status code: 400, request id: bebb274d-0a37-422d-b3ab-7270f8b49519
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * Resource 'aws_eip.elastic_ip_dmz_subnet2' does not have attribute 'id' for variable 'aws_eip.elastic_ip_dmz_subnet2.id'
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * Resource 'aws_security_group.manhattan_master' does not have attribute 'id' for variable 'aws_security_group.manhattan_master.id'
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * aws_security_group_rule.outbound_ssh_master: Error authorizing security group rule type egress: InvalidGroupId.Malformed: Invalid id: "${var.master_security_group_id}" (expecting "sg-...")
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : status code: 400, request id: 51d73183-12ad-4652-bcc7-5ca23da229ae
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * aws_security_group_rule.mesos_slave_master: Error authorizing security group rule type ingress: InvalidGroupId.Malformed: Invalid id: "${var.master_security_group_id}" (expecting "sg-...")
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : status code: 400, request id: f2ad1c25-d49b-4cf4-a179-46653d89e442
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * aws_security_group_rule.all_ports_master: Error authorizing security group rule type ingress: InvalidGroupId.Malformed: Invalid id: "${var.master_security_group_id}" (expecting "sg-...")
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : status code: 400, request id: cdfc616c-133e-4c87-a1ce-676d9a0e76fe
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * aws_route_table.private_route_table_1: InvalidRouteTableID.NotFound: The routeTable ID 'rtb-f1296696' does not exist
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : status code: 400, request id: 48babf0e-b9d1-47fe-aa19-283328cfcb15

We've opened a case with AWS and they came back with the following response:

Hello,

Thank you for contacting AWS Premium Support.

I understand that you are seeing a few errors while using the Terraform tool. Let me address the errors one by one:

aws_subnet.sub: InvalidSubnetID.NotFound: The subnet ID 'subnet-632d513b' does not exist
aws_route_table.private_route_table_1: InvalidRouteTableID.NotFound: The routeTable ID 'rtb-f1296696' does not exist

From the request IDs for these errors, I saw that the API calls were to create a tag on subnet-632d513b and rtb-f1296696. They failed because the subnet/route table did not exist. Now, this may be for one of two reasons: either the resource creation failed before this, or the resource was not found because of the eventual consistency model of the API calls [1].

Resource 'aws_eip.elastic_ip_dmz_subnet2' does not have attribute 'id' for variable 'aws_eip.elastic_ip_dmz_subnet2.id'
Resource 'aws_eip.elastic_ip_dmz_subnet2' does not have attribute 'id' for variable 'aws_eip.elastic_ip_dmz_subnet2.id'
Resource 'aws_security_group.manhattan_master' does not have attribute 'id' for variable 'aws_security_group.manhattan_master.id'

This is again because the subnet and security group were not created, were still in the process of being created, or were affected by the eventual consistency model of the API calls.

aws_security_group_rule.outbound_ssh_master: Error authorizing security group rule type egress: InvalidGroupId.Malformed: Invalid id: ""

This error could be a consequence of the previous errors. The security group ID could not be found, hence the variable var.master_security_group_id was not set, and this caused the API call to fail.

The way around the eventual consistency model is to implement retries and exponential backoffs in the application. I am not sure if Terraform has implemented it.

Also, when you get these errors, you can go ahead and check whether the resources actually exist in the AWS console. This way you can narrow down whether or not the issue was a resource creation error.

Hope this information is helpful to you. Please let me know if you have further questions and I will be happy to help you.

Links:
[1] http://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency

Best regards,

Truptesh
Amazon Web Services

We value your feedback. Please rate my response using the link below.

We read this as AWS saying that the caller of the AWS SDK, which in our case is Terraform, needs to handle this, either through retries or by checking that AWS resources are fully deployed before using the IDs of those resources. Surely this problem is not unique to us? I assume AWS is one of the more popular providers among Terraform users. Has this eventual consistency issue come up before? Are there any plans or ways to address it in Terraform?
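For readers unfamiliar with the "retries and exponential backoffs" that AWS Support mentions, here is a minimal sketch of the idea in Python. It is only an illustration under stated assumptions, not how Terraform or the AWS SDK actually implements retries: the names NotFoundYet, retry_with_backoff and make_flaky_tag_subnet are hypothetical, and the fake API call merely simulates a subnet that becomes visible on the third attempt.

```python
import random
import time


class NotFoundYet(Exception):
    """Hypothetical stand-in for errors such as InvalidSubnetID.NotFound."""


def retry_with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Run call(); on NotFoundYet, wait and try again with a growing delay.

    The delay ceiling doubles on every attempt (capped at max_delay) and the
    actual wait is randomized so parallel callers do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except NotFoundYet:
            if attempt == max_attempts:
                raise  # give up: the resource was probably never created
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


def make_flaky_tag_subnet(visible_after):
    """Build a fake API call that fails until its visible_after-th invocation."""
    state = {"calls": 0}

    def tag_subnet():
        state["calls"] += 1
        if state["calls"] < visible_after:
            raise NotFoundYet("The subnet ID 'subnet-632d513b' does not exist")
        return "tagged after %d calls" % state["calls"]

    return tag_subnet


if __name__ == "__main__":
    # sleep is stubbed out here so the demo finishes instantly.
    print(retry_with_backoff(make_flaky_tag_subnet(3), sleep=lambda s: None))
```

The attempt limit matters: a resource that is merely slow to become visible succeeds on a later try, while a resource whose creation really failed still surfaces the original error once the attempts run out.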

Terraform Version

0.6.16

Affected Resource(s)

Please list the resources as a list, for example:

aws_eip
aws_subnet
others...

The problem is intermittent and doesn't always fail in the same way.

Terraform Configuration Files

It's a fairly large source base with some proprietary logic in it. It may be difficult to share all of the TF scripts involved.

Debug Output

Please provide a link to a GitHub Gist containing the complete debug output: https://www.terraform.io/docs/internals/debugging.html. Please do NOT paste the debug output in the issue; just paste a link to the Gist.

Expected Behavior

The VPC and supporting resources should deploy successfully EVERY time.

Actual Behavior

terraform apply attempts fail as often as roughly 30% of the time.

Steps to Reproduce

Please list the steps required to reproduce the issue, for example:

terraform apply

Do you have any bug fixes or ideas on how to make this code run more reliably?


catsby · 2016-05-23T15:39:13Z

Hey @achalupa74 – Terraform certainly does do retries, and we're very aware of the eventual consistency gotchas that come with using a platform as large as AWS!

That said we're of course not perfect and there are still scenarios we haven't covered 100%, but that's why I'm here 😄

As you can guess, some setups of sufficient size have enough moving parts that it is difficult for me to diagnose the root cause without more information. Do you possibly have an example configuration that reliably (even if only ~30% of the time) reproduces these issues? I understand if your infrastructure is sophisticated enough that trimming it down to a reproduction case is not feasible.

* aws_subnet.sub: InvalidSubnetID.NotFound: The subnet ID 'subnet-632d513b' does not exist
* aws_route_table.private_route_table_1: InvalidRouteTableID.NotFound: The routeTable ID 'rtb-f1296696' does not exist

These errors are typically handled gracefully. I would be interested to see how you are referencing subnet IDs, if at all. If you can share a snippet of the config, that may help, but please be sure to omit any secrets!

Resource 'aws_eip.elastic_ip_dmz_subnet2' does not have attribute 'id' for variable 'aws_eip.elastic_ip_dmz_subnet2.id'
Resource 'aws_eip.elastic_ip_dmz_subnet2' does not have attribute 'id' for variable 'aws_eip.elastic_ip_dmz_subnet2.id'
Resource 'aws_security_group.manhattan_master' does not have attribute 'id' for variable 'aws_security_group.manhattan_master.id'

These kinds of errors are unusual but have been reported before. I believe they are being tracked in another GitHub issue and being worked on. They are rare, but perhaps you're hitting something that makes them more common.

In short, we do take these kinds of stability issues very seriously, and we're always working to make Terraform more stable and resilient. Unfortunately, I can only provide limited help without further configuration that lets me reproduce the problem, if it is indeed systemic.

achalupa74 · 2016-05-23T20:12:31Z

Thanks for the reply!

We have a fairly complicated infrastructure being deployed by Terraform. The initial deployment is probably between 200 and 300 AWS resources. Short of coming up with a simpler example that exhibits the problem somewhat consistently, I can show you the parts of the code that the reported errors are likely related to.

First off the Security Group configuration is very isolated. We have a module for each security group and the 'manhattan_master' module that creates the security group referenced in the error is attached.
manhattan_master_security_group.zip

This is very simple and isolated. The attached module creates a security group and adds a series of SG rules to it. I can't see how to make this code any better, other than maybe adding a "depends_on" clause to every single rule?

The use of subnet IDs is a bit more complicated, but I can give you a code snippet that will hopefully lead you in the right direction.

module "dmz_subnet_1" {
    source = "../subnet"

    stack_name = "${var.stack_name}"
    subnet_name = "dmz1"
    subnet_cidr = "${var.dmz_subnet_cidr_1}"
    availability_zone = "${var.availability_zone_1}"
    vpc_id = "${module.vpc.vpc_id}"
    route_table_id = "${aws_route_table.public_route_table.id}"
    region = "${var.region}"
    profile = "${var.default_profile}"
}

resource "aws_eip" "elastic_ip_dmz_subnet1" {
    provider = "aws.base"
    vpc = true
}

resource "aws_nat_gateway" "nat_gateway_dmz_subnet_1" {
    provider = "aws.base"
    allocation_id = "${aws_eip.elastic_ip_dmz_subnet1.id}"
    subnet_id = "${module.dmz_subnet_1.subnet_id}"

    depends_on = [ "aws_eip.elastic_ip_dmz_subnet1", "aws_internet_gateway.internet-gateway" ]
}

This sequence basically:
- creates a subnet (by calling a module)
- allocates an Elastic IP
- creates a NAT gateway that ties the subnet and the EIP together

Could our problems somehow be related to the fact that the subnet is being created in a module?

Any help is much appreciated!

brendonmartino · 2016-05-27T15:59:26Z

+1

brendonmartino · 2016-05-27T16:01:15Z

Hi @catsby - You said: "I believe they are being tracked in another GitHub issue and being worked on."
If you have that link, I would like to see that issue. Thanks

catsby · 2016-06-01T21:40:21Z

@brendonmartino – I misspoke, I suppose; what I was referring to was a PR meant to fix this kind of issue:

core: Fix interp error msgs on module vars during destroy #6557

and follow up PR:

terraform: Correct fix for destroy interp errors #6599

Unfortunately, I do not believe either of those is in a released version of Terraform yet, but both can be found in v0.7.0-rc1.

brikis98 · 2016-06-28T18:40:26Z

Just hit a similar error:

aws_subnet.private-persistence.2: InvalidSubnetID.NotFound: The subnet ID 'subnet-xxxxxxx' does not exist

Snippet of relevant code:

resource "aws_subnet" "private-persistence" {
    count = "${length(split(",", var.aws_availability_zones))}"
    vpc_id = "${aws_vpc.main.id}"
    availability_zone = "${element(split(",", var.aws_availability_zones), count.index)}"
    cidr_block = "${cidrsubnet(var.cidr_block, 5, count.index + 10)}"
}

resource "aws_route_table_association" "private-persistence" {
    count = "${length(split(",", var.aws_availability_zones))}"
    subnet_id = "${element(aws_subnet.private-persistence.*.id, count.index)}"
    route_table_id = "${element(aws_route_table.private-persistence.*.id, count.index)}"
}

I'm using Terraform v0.6.16.

brikis98 · 2016-08-29T16:41:55Z

Update: still intermittently seeing the same aws_subnet.private-persistence.2: InvalidSubnetID.NotFound: The subnet ID 'subnet-xxxxxxx' does not exist error on Terraform 0.7.2.

catsby · 2016-12-13T22:57:58Z

Hello – I'm following up on this issue as some time has passed and we've since released several new versions of Terraform.

Unfortunately our findings here were inconclusive; we were never able to reproduce this issue. Can anyone comment further, or supply a reproduction case? I would like to know if this is still an issue you're encountering, otherwise I'd like to close the issue.

catsby · 2016-12-14T15:30:34Z

I'm going to close this for now. Please let us know if anyone has more information or a reproduction case. Thanks!

ghost · 2020-04-18T02:27:44Z

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

catsby added bug provider/aws labels May 23, 2016

josh-padnick mentioned this issue May 31, 2016

Retry AWS commands that may fail and increase insufficient timeouts #6775

Closed

This was referenced Jul 6, 2016

InvalidSubnet.Conflict: The CIDR 'XXX' conflicts with another subnet #7516

Closed

Network ACLs must wait on Internet and NAT Gateways (finally found a workaround for lots of random eventual consistency errors) #7527

Closed

brikis98 mentioned this issue Aug 29, 2016

Error finding route after creating it: error finding matching route for Route table (rtb-xxxxx) and destination CIDR block (xxx.xxx.xxx.xxx/xxx) #8542

Closed

catsby added the waiting-response label Dec 13, 2016

catsby closed this as completed Dec 14, 2016

edwinsteele mentioned this issue Jan 19, 2017

Build failures related to AWS instantiation ConnectBox/connectbox-pi#67

Closed

ghost locked and limited conversation to collaborators Apr 18, 2020
