Intermittent AWS Eventual Consistency Issues on CircleCI? #5335
Comments
The InvalidInternetGatewayID.NotFound error is a known issue, tracked in #2174
@carlossg Thanks for the heads-up. That would corroborate my initial hypothesis that these are all eventual consistency issues. It's pretty crazy this hasn't been more of a problem for all Terraform users. Not sure I understand why that is.
This is more a consistency issue with the AWS API. We are seeing this
A bunch of eventual consistency problems are fixed in #6775
This issue seems related to #7038 and is biting me right now.
I may have found a workaround for this issue, which seems related to the order in which AWS network resources get created. After making all aws_route resources depend on the aws_internet_gateway resource (using depends_on), I have not run into these errors. See example below, which is a Terraform project (with 2 subnets) that VPC-peers into 2 other Terraform project VPCs, including routes and reverse routes.
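The commenter's own configuration is not reproduced here. As a minimal sketch of the pattern described, assuming hypothetical resource and variable names (`main`, `public`, `peer`, `peer_cidr_block`, `peering_connection_id`) and the quoted `depends_on` syntax used elsewhere in this thread, a peering route that never references the internet gateway can still be ordered after it:

```hcl
# Hypothetical inputs; in a real project these would come from the peer VPC.
variable "peer_cidr_block" {}
variable "peering_connection_id" {}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_internet_gateway" "main" {
  vpc_id = "${aws_vpc.main.id}"
}

resource "aws_route_table" "public" {
  vpc_id = "${aws_vpc.main.id}"
}

# This route has no implicit dependency on the internet gateway, so
# Terraform could otherwise create the two in parallel. The explicit
# depends_on forces the gateway to be created first.
resource "aws_route" "peer" {
  route_table_id            = "${aws_route_table.public.id}"
  destination_cidr_block    = "${var.peer_cidr_block}"
  vpc_peering_connection_id = "${var.peering_connection_id}"
  depends_on                = ["aws_internet_gateway.main"]
}
```

This only serializes resource creation; it does not remove the underlying eventual consistency in the AWS API, so it should be read as a mitigation rather than a fix. Newer Terraform versions write the references without quotes, as in `depends_on = [aws_internet_gateway.main]`.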
@bkc1 Your suggestion of adding a

    resource "aws_route" "nat" {
      count                  = "${var.num_availability_zones}"
      route_table_id         = "${element(aws_route_table.private.*.id, count.index)}"
      destination_cidr_block = "0.0.0.0/0"
      nat_gateway_id         = "${element(aws_nat_gateway.nat.*.id, count.index)}"
      depends_on             = ["aws_internet_gateway.main", "aws_route_table.private"]
    }

With these two additions, most of the eventual consistency errors have gone away, at least in my last ~10 or 15
An update on this. I finally realized why we see more errors in CircleCI than when running locally. It's because our Terraform test framework picks an AWS region at random, whereas I believe CircleCI runs in us-east-1. We've independently seen that running Terraform VPC commands against a region that is physically far away results in higher latency, which exposes more of the underlying eventual consistency bugs. It'd be great if the HashiCorp folks could create a canonical VPC in Terraform and test it with high latencies to smoke out these issues, since, 2 years into Terraform, they continue to be a problem.
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
I have a Terraform configuration whose job is to create a VPC. The basic structure is this:
As you can see, there are many interdependencies; however, applying and destroying this from my MacBook consistently works fine.

When I run this same Terraform configuration as part of a CircleCI build, however, I get intermittent failures. Here are some of the errors I've gotten:
Notice that sometimes I'm getting the same error, but for a different resource, though the second one occurred twice. Here are some other errors. Note that each build will fail with one of these, or in some rare cases, succeed.
Every so often the build succeeds, but even then I sometimes receive some non-fatal "diffs didn't match during apply" warnings:

So, based on these data points, and the fact that it runs fine in my local dev environment, I'm guessing there's something about the CircleCI environment. Most (all?) of these errors indicate AWS eventual consistency issues, but I'm struggling to explain why the CircleCI environment would be more likely to trigger them.
Some other data points:
- GOMAXPROCS = 27984
- GOMAXPROCS = ~4000000
- GOMAXPROCS = 1 via an env var in CircleCI did yield a successful build, but obviously slowed things down.
- terraform apply -parallelism=3 (just to reduce this down from 10) did not seem to affect things, but I'm guessing GOMAXPROCS and parallelism ultimately have the same effect.

Does anyone have any ideas for how I might further debug this or why this is happening? Thanks for your help and for this outstanding piece of software!