Network ACLs must wait on Internet and NAT Gateways (finally found a workaround for lots of random eventual consistency errors) #7527
Comments
@brikis98 wow this is some seriously great reporting. Thanks so much for all of your work in putting this together!
This is a huge finding. Give us a chance to chew on this a bit and we'll follow up! |
Hey @brikis98 sorry for the silence here. I have a question about the setup you've shared here (excellent details by the way 😄) In your
resource "aws_network_acl" "private_app_subnets" {
  vpc_id = "${var.vpc_id}"
  ...
}
Can you tell me, where does In your workaround, you have:
module "acls" {
  source = "./acls"
  # ... lots of params omitted
  vpc_ready = "${module.vpc.vpc_ready}"
}
so I'm curious where the I created a demo project based on the description of your example above, which can be found here: In that demo, I export I’d like to know how you’re setting your Thanks! |
@catsby The
module "acls" {
  source = "./acls"
  # ... lots of params omitted
  vpc_ready = "${module.vpc.vpc_ready}"
  vpc_id = "${module.vpc.vpc_id}"
}
The
output "vpc_id" {
  value = "${aws_vpc.main.id}"
} |
Hey @brikis98 thanks for getting back. I'm still not able to reproduce this with Terraform v0.6.16 or v0.7. I've updated my demo app (which requires v0.7, but I have a v0.6.16 version as well) to include subnets et al., but I'm still not hitting the issues. Looking at the errors you shared:
Can you tell me which module those resources belong to? I'm assuming the VPC module, can you confirm? Also, can you please tell me which region you are using? Please let me know if there's anything in my demo app that I can expand on to try and hit these errors. Thanks! |
@catsby All of those resources belong to the VPC module. When we run our tests, we pick a random region each time, so I saw those failures randomly in us-east-1, us-west-2, and many others. Here are a few items in our VPC module that are not in your demo app:
No idea if any of these would make a difference, but thought I'd mention them just in case. |
@catsby I just upgraded to Terraform 0.7.2, and as far as I can tell, my workaround is less effective now. I'm seeing far more eventual consistency issues in general with this new version of Terraform (e.g. #7993 (comment), #6813 (comment), #8229 (comment), #8530), and with Network ACLs in particular, I'm getting a large number of eventual consistency errors, despite this workaround, and more often than not, the templates will not apply or destroy successfully. Not sure where to go from here. |
Update: I've found, through trial and error and copying code examples I found online, that most of the issues I describe in this bug are resolved by adding two
resource "aws_route" "internet" {
  route_table_id = "${aws_route_table.public.id}"
  destination_cidr_block = "0.0.0.0/0"
  gateway_id = "${aws_internet_gateway.main.id}"
  # A workaround for a series of eventual consistency bugs in Terraform. For a list of the errors, see the related
  # bugs described in this issue: https://github.com/hashicorp/terraform/issues/8542. The workaround is based on:
  # https://github.com/hashicorp/terraform/issues/5335 and https://charity.wtf/2016/04/14/scrapbag-of-useful-terraform-tips/
  depends_on = ["aws_internet_gateway.main", "aws_route_table.public"]
}
I have no idea why that helps, but it gets rid of most issues. The only one it does NOT get rid of is #8542. |
@catsby From browsing the Terraform code, I've noticed that some of the functions, after creating a resource, start making repeated API calls to AWS until the API says the resource exists. I'm guessing this is done to ensure that anything that depends on that resource doesn't execute until information about it has propagated. The catch is that those API calls are only repeated up to some maximum timeout, such as waiting at most 15 seconds for a route to be created (see #8542 (comment)). My suspicion is that for read API calls, AWS routes you to a replica in a nearby region. For example, you might be deploying a VPC in Perhaps the reason you weren't able to repro the issues I was seeing was that you always deployed to a data center near you? Perhaps you need to try to deploy to something as far away as possible? |
Hey @brikis98, how have things been here? Last I looked I was unable to reproduce this issue. We do make repeated calls to confirm resources are fully created, and we keep adding polling to cover edge cases as we find them. Can you tell me, are you still seeing this issue with any frequency? I don't feel these kinds of eventual consistency issues are still prevalent, but I'd like your feedback before closing this issue. Please let me know! |
@catsby Thanks for checking in. I have not seen this error in a while. It's an intermittent issue by nature, so I don't know if that really means it has been fixed, but it's probably safe to close the bug for now. I can reopen if I hit this problem again. |
Thank you, @brikis98, I appreciate the quick turnaround. I wish I had a solid resolution here 😦 Please let us know if you do happen to stumble on anything conclusive in the future. Thanks! |
I had a similar issue when trying to add a route to my public subnet's route table. The route needed our site to site VPN's |
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further. |
While working on a complicated set of VPC templates that created multiple VPCs, subnets, route tables, and network ACLs, I was hitting a huge number of seemingly random eventual consistency issues, including #7038, #5335, #5185, #6813, #7516, and many others. Sometimes I'd get one error, sometimes I'd get a dozen (I listed an example in the "Actual Behavior" section below). Re-running terraform apply would get past some of these errors, only to reveal others, and often, I couldn't get the templates to apply successfully at all.
After lots of digging, I've finally found a workaround. I'm not sure if this is a bug that needs to be fixed in Terraform, or AWS, or just documentation that should be added, but I figured I'd describe my findings here in case other folks hit the same problems. See below for details of the problem plus a description of the workaround.
Terraform Version
Terraform v0.6.16
Affected Resource(s)
These errors seem to come up when you create network ACLs at the same time as you are creating a new VPC with Internet and NAT Gateways, so the affected resources are:
Terraform Configuration Files
I was creating my VPC and its Internet and NAT Gateways in one module and the Network ACLs in another. I don't know if this matters, but I figured I'd list it here just in case.
Key excerpts from the VPC module:
Example excerpts from the Network ACLs module:
Note that the exact details of the Network ACLs probably don't matter. All that matters is that you are trying to create ACLs at more or less the same time as you're creating the VPC and its subnets.
Expected Behavior
The VPC and Network ACLs should be created without errors.
Actual Behavior
I get a huge number of seemingly random errors about route tables not being found, or subnets not being found, or Network ACLs not being found, and so on. Sometimes I'd get one error, sometimes more than a dozen, as shown in this example output:
Workaround
I came across a comment by @mitchellh that said the following:
I was a bit desperate for a way forward, so I figured I'd give it a shot: I would force my Network ACLs module to wait until all the Internet Gateways, and, just in case, the NAT Gateways, and all relevant routes, were fully created. To do this, I added a new null_resource, and a corresponding output for it, to the VPC module:
Note how the null_resource explicitly depends on the Internet Gateway, NAT Gateway, and their corresponding routes to be created. I then added a vpc_ready variable to the Network ACL module, a null_resource that depends on that variable, and made sure that each ACL in those templates depends_on the null_resource.
Finally, when I use the two modules together, I set the vpc_ready input in the Network ACL module to the vpc_ready output from the VPC module to ensure the Network ACLs do not get created until all the Gateways are created:
As soon as I added this, all the errors magically went away.
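A minimal sketch of the pattern described above, in Terraform 0.6-era syntax. This is an illustration under assumptions, not the original code: aws_nat_gateway.nat, aws_route.nat, the null_resource names, and the ./vpc source path are hypothetical, while aws_internet_gateway.main, aws_route.internet, and the module "acls" wiring follow names that appear elsewhere in this thread.

```hcl
# --- VPC module (hypothetical excerpt) ---
# Completes only after the gateways and their routes have been created.
# aws_nat_gateway.nat and aws_route.nat are assumed names.
resource "null_resource" "vpc_ready" {
  depends_on = [
    "aws_internet_gateway.main",
    "aws_nat_gateway.nat",
    "aws_route.internet",
    "aws_route.nat"
  ]
}

# Exposes a value that only exists once the null_resource above is done.
output "vpc_ready" {
  value = "${null_resource.vpc_ready.id}"
}

# --- Network ACL module (hypothetical excerpt) ---
variable "vpc_id" {}
variable "vpc_ready" {}

# Referencing var.vpc_ready makes this resource wait for the VPC module's output.
resource "null_resource" "vpc_ready" {
  triggers {
    vpc_ready = "${var.vpc_ready}"
  }
}

# Every ACL in the module waits on the null_resource above.
resource "aws_network_acl" "private_app_subnets" {
  vpc_id     = "${var.vpc_id}"
  depends_on = ["null_resource.vpc_ready"]
}

# --- Root templates: wiring the two modules together ---
module "vpc" {
  source = "./vpc" # assumed path
}

module "acls" {
  source    = "./acls"
  vpc_id    = "${module.vpc.vpc_id}"
  vpc_ready = "${module.vpc.vpc_ready}"
}
```

The idea is that passing the null_resource ID across the module boundary turns an ordering requirement into an ordinary data dependency, which is something Terraform of this era did track between modules.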
Note: this workaround would be much simpler (i.e. not require any extra variables, null_resources, etc) if Terraform supported depends_on for modules (see #1178).
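For comparison, Terraform 0.13 and later accept depends_on on module blocks. A sketch of how the wiring might then look, using the newer expression syntax; the ./vpc source path is an assumption, and whether this removes the underlying eventual consistency errors has not been verified here:

```hcl
module "vpc" {
  source = "./vpc" # assumed path
}

module "acls" {
  source = "./acls"
  vpc_id = module.vpc.vpc_id

  # Wait for every resource in the VPC module (gateways, routes, subnets)
  # before creating anything in the ACL module.
  depends_on = [module.vpc]
}
```

With this form, the extra vpc_ready variable, output, and null_resources would no longer be needed.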