Inconsistent results when creating AWS VPC infrastructure using Terraform #6813

Closed
achalupa74 opened this issue May 21, 2016 · 10 comments
Labels
bug, provider/aws, waiting-response

Comments

@achalupa74

achalupa74 commented May 21, 2016

We are using Terraform for laying down AWS infrastructure resources. Our VPC consists of: 10 or so subnets, a VPN server instance, 3 route 53 zones, several security groups, etc...

When we deploy the infrastructure it usually works without error. But occasionally we get errors that we can't explain. Here's an example:

I, [2016-05-20T00:29:50.251046 #3328] INFO -- : Error applying plan:
I, [2016-05-20T00:29:50.251046 #3328] INFO -- :
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : 7 error(s) occurred:
I, [2016-05-20T00:29:50.266650 #3328] INFO -- :
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * aws_subnet.sub: InvalidSubnetID.NotFound: The subnet ID 'subnet-632d513b' does not exist
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : status code: 400, request id: bebb274d-0a37-422d-b3ab-7270f8b49519
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * Resource 'aws_eip.elastic_ip_dmz_subnet2' does not have attribute 'id' for variable 'aws_eip.elastic_ip_dmz_subnet2.id'
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * Resource 'aws_security_group.manhattan_master' does not have attribute 'id' for variable 'aws_security_group.manhattan_master.id'
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * aws_security_group_rule.outbound_ssh_master: Error authorizing security group rule type egress: InvalidGroupId.Malformed: Invalid id: "${var.master_security_group_id}" (expecting "sg-...")
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : status code: 400, request id: 51d73183-12ad-4652-bcc7-5ca23da229ae
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * aws_security_group_rule.mesos_slave_master: Error authorizing security group rule type ingress: InvalidGroupId.Malformed: Invalid id: "${var.master_security_group_id}" (expecting "sg-...")
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : status code: 400, request id: f2ad1c25-d49b-4cf4-a179-46653d89e442
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * aws_security_group_rule.all_ports_master: Error authorizing security group rule type ingress: InvalidGroupId.Malformed: Invalid id: "${var.master_security_group_id}" (expecting "sg-...")
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : status code: 400, request id: cdfc616c-133e-4c87-a1ce-676d9a0e76fe
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : * aws_route_table.private_route_table_1: InvalidRouteTableID.NotFound: The routeTable ID 'rtb-f1296696' does not exist
I, [2016-05-20T00:29:50.266650 #3328] INFO -- : status code: 400, request id: 48babf0e-b9d1-47fe-aa19-283328cfcb15

We've opened a case with AWS and they came back with the following response:


Hello,

Thank you for contacting AWS Premium Support.

I understand that you are seeing a few errors while using the Terraform tool. Let me address the errors one by one:

  • aws_subnet.sub: InvalidSubnetID.NotFound: The subnet ID 'subnet-632d513b' does not exist
  • aws_route_table.private_route_table_1: InvalidRouteTableID.NotFound: The routeTable ID 'rtb-f1296696' does not exist
  • From the request IDs for these errors, I saw that the API calls were to create tags on subnet-632d513b and rtb-f1296696. They failed because the subnet/route table did not exist. This may be for one of two reasons: either the resource creation failed before this, or the resource was not found because of the eventual consistency model of the API calls [1].
  • Resource 'aws_eip.elastic_ip_dmz_subnet2' does not have attribute 'id' for variable 'aws_eip.elastic_ip_dmz_subnet2.id'
  • Resource 'aws_security_group.manhattan_master' does not have attribute 'id' for variable 'aws_security_group.manhattan_master.id'
  • This is again because the subnet and security group were not created, were still being created, or were not yet visible under the eventual consistency model of the API calls.
  • aws_security_group_rule.outbound_ssh_master: Error authorizing security group rule type egress: InvalidGroupId.Malformed: Invalid id: ""
  • This error could be a consequence of the previous errors. The security group id could not be found and hence the variable var.master_security_group_id was not set and this caused the api call to fail.

The way around the eventual consistency model is to implement retries and exponential backoffs in the application. I am not sure if Terraform has implemented it.

Also, when you get these errors, you can go ahead and check whether the resources actually exist in the AWS console. This way you can narrow down whether it was a resource creation error or not.

Hope this information is helpful to you. Please let me know if you have further questions and I will be happy to help you.

Links:
[1] http://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency

Best regards,

Truptesh
Amazon Web Services


We read this as AWS saying that the caller of the AWS SDK (in our case, Terraform) needs to handle this, either through retries or by checking that AWS resources are fully deployed before using their IDs. Surely this problem is not unique to us? I assume AWS is one of the more popular providers among Terraform users. Has this eventual consistency issue come up before? Are there any plans or ways to address it in Terraform?

Terraform Version

0.6.16

Affected Resource(s)

Please list the resources as a list, for example:

  • aws_eip
  • aws_subnet
  • others...

The problem is intermittent and doesn't always fail in the same way.

Terraform Configuration Files

It's a fairly large source base with some proprietary logic in it. It may be difficult to share all of the TF scripts involved.

Debug Output

Please provide a link to a GitHub Gist containing the complete debug output: https://www.terraform.io/docs/internals/debugging.html. Please do NOT paste the debug output in the issue; just paste a link to the Gist.

Expected Behavior

The VPC and supporting resources should deploy successfully EVERY time.

Actual Behavior

terraform apply attempts fail perhaps as often as 30% of the time.

Steps to Reproduce

Please list the steps required to reproduce the issue, for example:

  1. terraform apply

Wondering if you have any bug fixes or ideas on how to make this code run more reliably?

@catsby
Contributor

catsby commented May 23, 2016

Hey @achalupa74, Terraform certainly does do retries and we're very aware of the eventual consistency gotchas that come about when using a platform as large as AWS!

That said we're of course not perfect and there are still scenarios we haven't covered 100%, but that's why I'm here 😄

As you can guess, setups of sufficient size have enough moving parts that it is difficult for me to diagnose the root cause without more information. Do you possibly have an example configuration that reproduces these issues, even if only ~30% of the time? I understand if your infrastructure is sophisticated enough that trimming it down to a reproduction case is not feasible.

* aws_subnet.sub: InvalidSubnetID.NotFound: The subnet ID 'subnet-632d513b' does not exist
* aws_route_table.private_route_table_1: InvalidRouteTableID.NotFound: The routeTable ID 'rtb-f1296696' does not exist

These errors are typically handled gracefully. I would be interested to see how you are referencing subnet IDs, if at all. If you can share a snippet of the config, that may help, but please be sure to omit any secrets!

Resource 'aws_eip.elastic_ip_dmz_subnet2' does not have attribute 'id' for variable 'aws_eip.elastic_ip_dmz_subnet2.id'
Resource 'aws_security_group.manhattan_master' does not have attribute 'id' for variable 'aws_security_group.manhattan_master.id'

These kinds of errors are unusual but have been reported before. I believe they are being tracked in another GitHub issue and being worked on. They are rare, but perhaps you're hitting something that makes them more common.

In short, we take these kinds of stability issues very seriously and we're always working to make Terraform more stable and resilient. Unfortunately, I can only provide limited help without further configuration that lets me reproduce the problem, if there is a systemic one.

@achalupa74
Author

achalupa74 commented May 23, 2016

Thanks for the reply!

We have a fairly complicated infrastructure being deployed by Terraform. The initial deployment is probably between 200 and 300 AWS resources. Short of coming up with a simpler example that exhibits the problem somewhat consistently, I can show you the parts of the code that the reported errors are likely related to.

First off the Security Group configuration is very isolated. We have a module for each security group and the 'manhattan_master' module that creates the security group referenced in the error is attached.
manhattan_master_security_group.zip

This is very simple and isolated. The attached module creates a security group and adds a series of SG rules to it. I can't see how to make this code any better, other than maybe adding a "depends_on" clause to every single rule?

The use of subnet ID's is a bit more complicated but I can give you a code snippet that will hopefully lead you in the right direction.

module "dmz_subnet_1" {
    source = "../subnet"

    stack_name = "${var.stack_name}"
    subnet_name = "dmz1"
    subnet_cidr = "${var.dmz_subnet_cidr_1}"
    availability_zone = "${var.availability_zone_1}"
    vpc_id = "${module.vpc.vpc_id}"
    route_table_id = "${aws_route_table.public_route_table.id}"
    region = "${var.region}"
    profile = "${var.default_profile}"
}

resource "aws_eip" "elastic_ip_dmz_subnet1" {
    provider = "aws.base"
    vpc = true
}

resource "aws_nat_gateway" "nat_gateway_dmz_subnet_1" {
    provider = "aws.base"
    allocation_id = "${aws_eip.elastic_ip_dmz_subnet1.id}"
    subnet_id = "${module.dmz_subnet_1.subnet_id}"

    depends_on = [ "aws_eip.elastic_ip_dmz_subnet1", "aws_internet_gateway.internet-gateway" ]
}

This sequence basically:
- creates a subnet (by calling a module)
- allocates an Elastic IP
- creates a NAT gateway that ties the subnet and the EIP together

Could our problems somehow be related to the fact that the subnet is being created in a module?

Any help is much appreciated!

@brendonmartino

+1

@brendonmartino

Hi @catsby - You said: "I believe they are being tracked in another GitHub issue and being worked on."
If you have that link, I would like to see that issue. Thanks

@catsby
Contributor

catsby commented Jun 1, 2016

@brendonmartino, I misspoke, I suppose. What I was referring to was not an issue but a PR meant to fix this kind of problem:

and follow up PR:

Unfortunately, I do not believe either of those is in a released version of Terraform, but both can be found in v0.7.0-rc1.

@brikis98
Contributor

Just hit a similar error:

aws_subnet.private-persistence.2: InvalidSubnetID.NotFound: The subnet ID 'subnet-xxxxxxx' does not exist

Snippet of relevant code:

resource "aws_subnet" "private-persistence" {
    count = "${length(split(",", var.aws_availability_zones))}"
    vpc_id = "${aws_vpc.main.id}"
    availability_zone = "${element(split(",", var.aws_availability_zones), count.index)}"
    cidr_block = "${cidrsubnet(var.cidr_block, 5, count.index + 10)}"
}

resource "aws_route_table_association" "private-persistence" {
    count = "${length(split(",", var.aws_availability_zones))}"
    subnet_id = "${element(aws_subnet.private-persistence.*.id, count.index)}"
    route_table_id = "${element(aws_route_table.private-persistence.*.id, count.index)}"
}

I'm using Terraform v0.6.16.

@brikis98
Contributor

Update: still intermittently seeing the same aws_subnet.private-persistence.2: InvalidSubnetID.NotFound: The subnet ID 'subnet-xxxxxxx' does not exist error on Terraform 0.7.2.

@catsby
Contributor

catsby commented Dec 13, 2016

Hello – I'm following up on this issue as some time has passed and we've since released several new versions of Terraform.

Unfortunately our findings here were inconclusive; we were never able to reproduce this issue. Can anyone comment further, or supply a reproduction case? I would like to know if this is still an issue you're encountering, otherwise I'd like to close the issue.

@catsby added the waiting-response label Dec 13, 2016
@catsby
Contributor

catsby commented Dec 14, 2016

I'm going to close this for now. Please let us know if anyone has more information or a reproduction case. Thanks!

@ghost

ghost commented Apr 18, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost locked and limited conversation to collaborators Apr 18, 2020