Intermittent network issues (read: connection reset Errors) #14163
Same here. The weird thing is that after a couple of restarts it works without any issues, so it's very inconsistent.
Getting the same issue when running plan or apply with CloudFront. It started happening intermittently about 3 days ago.
Also seeing the same issue with CloudFront. TF version: 0.12.28
Hi, today I also ran into such an error (IPs changed). :-) I have to use a proxy server in between (IP 123.x.x.x); however, I'd expect Terraform or the provider to retry.
Having the same issues in our CI/CD pipeline.
Seeing this same issue in Terraform Cloud, specifically with the cloudfront_distribution and cloudfront_origin_access_identity resources - it's happening almost daily at this point.
It would be great if we could get a Gist with debug logging enabled so we can troubleshoot further. If you are worried about any sensitive data, it can be encrypted with the HashiCorp GPG key or redacted as necessary. The maintainers will need this information to see and triage the current provider and AWS Go SDK behavior during these errors.
Cool, I'll enable debug on the workflow and post back once we catch it happening.
We're hitting this too, and have debug logs enabled. Will clear this with security and get back to you. In the meantime though, we're seeing two slightly different behaviours. Some calls cause the run to fail immediately, and others cause pauses of up to 15 minutes before a retry is attempted, at which point the plan succeeds and the CI job continues. Some of our calls go through VPC endpoints wherever possible, but where that's not possible they end up going through an internet proxy (Squid). So far, we've only seen the proxy-routed calls cause the 15-minute pause and the VPC-endpoint-routed calls cause an immediate failure, but a) there's too little data to extract any kind of pattern and b) given they're different services, the retry logic might differ between services.
GPG-encrypted logs available at https://gist.github.com/mattburgess/2a00b1e77b00368781360ac8581383b9
- analytical-dataset-generation_analytical-dataset-generation-qa_154.log.gpg - this one failed after seeing a single
- analytical-dataset-generation_analytical-dataset-generation-preprod_136.log.gpg - this one hung/paused/waited for 15 minutes having seen a
Seeing this as well.
Just got this issue too.
I had the same issue last night. Ran it again in the morning and it was fine. This is a rather intermittent issue.
Same here on TF v0.13.0 and AWS provider v3.3.0.
We're also experiencing the issue in Terraform Cloud.
Same issue here.
I've had this issue with v0.12.26 and v0.12.28. It's so persistent that we've had to wrap any Terraform execution in multiple layers of retries.
Hi folks 👋 It's not entirely clear why this is suddenly more of an issue for so many environments, except that maybe AWS' service APIs are resetting connections more aggressively. Understandably, this error is very problematic though. The challenge here is that the AWS Go SDK request handlers explicitly catch this specific "read: connection reset" condition and treat it as non-retryable, so the error surfaces immediately instead of going through the SDK's usual retry and backoff logic. Some of the upstream decision process for this can be seen in the AWS Go SDK issue tracker.
Essentially, it boils down to the upstream maintainers deciding not to retry these errors automatically.
I would personally agree with their assessment on the surface and say that the Terraform AWS Provider would not want to always retry in this case, since without some very careful investigation the potential effects would be broadly unknown. While we might be in a slightly better position than the SDK as a whole, since we are mainly dealing with management API calls (Create/Read/Update/Delete/List) rather than event/data API calls, we would still have issues with this type of retry logic, including:
This leaves us in a little bit of a bind in this project. 😖 We have been purposefully avoiding implementing any custom retryer logic to decrease maintenance and testing in that considerably harder area. Outside of that, we could implement this logic per AWS Go SDK service client, as we do today for some other retryable conditions. Another option may be to suggest this type of enhancement (or, some may say, bug fix) upstream in the AWS Go SDK codebase itself, but I'm not sure the upstream maintainers would want to get into this space either. I'm out of time to ponder this more tonight, but hopefully this initial insight can kickstart some discussions.
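As a rough illustration of the per-service-client idea, here is a minimal sketch, assuming aws-sdk-go v1; the constructor name and placement are mine, not the provider's actual code. It marks the connection-reset error as retryable on a single service client so the SDK's normal backoff applies:

```go
package main

import (
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/request"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudfront"
)

// newCloudFrontClient is a hypothetical constructor that attaches a custom
// retry handler to the CloudFront service client.
func newCloudFrontClient(sess *session.Session) *cloudfront.CloudFront {
	conn := cloudfront.New(sess)

	// Inspect each failed request; if the error looks like a TCP connection
	// reset during the read, mark it retryable so the SDK's normal
	// exponential backoff and max retries apply.
	conn.Handlers.Retry.PushBack(func(r *request.Request) {
		if r.Error != nil && strings.Contains(r.Error.Error(), "read: connection reset") {
			r.Retryable = aws.Bool(true)
		}
	})

	return conn
}

func main() {
	sess := session.Must(session.NewSession())
	_ = newCloudFrontClient(sess)
}
```

The trade-off is exactly the one described above: a handler like this retries the reset blindly for every operation on that client, including non-idempotent writes, unless it also filters by operation name.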
@bflad does it seem to you that this is particularly prevalent with the CloudFront API? I know that in my case it certainly is, and I can see from the rest of the comments in this issue that CloudFront is involved very often.
@bflad most probably a long shot, but could there be any connection with #14797 and hashicorp/terraform#25835 (comment)? It seems that after upgrading from 0.12.24 (though it could be coincidence that AWS changed their rate limiting at the same time), we have been getting both the issues described in this thread and more intermittent "No valid credential sources found for AWS Provider" errors.
@bflad Regarding the write operations, would it be possible to add the retry behaviour as an opt-in switch?
Hello. Hitting the same kind of behaviour, more frequently lately. Terraform 0.12.28, AWS provider 2.70.0, running on Concourse CI on AWS. It's almost always connection resets while waiting for a CloudFront distribution creation or update.
For what it's worth - every time I've seen this issue it's been on read calls to either CloudFront distribution configs or CloudFront origin access identities. It may not be the best way to approach solving the issue, but given that the majority of the connection reset issues seem to be with specific CloudFront read calls plus a few others, it might be worth just adding retries to individual API calls (CloudFront or otherwise) as they become problematic.
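Sketching that suggestion as a minimal, purely illustrative example (the wrapper function, timeout, and error matching are assumptions, not provider code), a single problematic read call could be wrapped in the plugin SDK's retry helper:

```go
package main

import (
	"fmt"
	"log"
	"strings"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudfront"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource"
)

// getDistributionWithRetry is a hypothetical wrapper that retries only the
// connection-reset failure mode for a single CloudFront read call.
func getDistributionWithRetry(conn *cloudfront.CloudFront, id string) (*cloudfront.GetDistributionOutput, error) {
	var out *cloudfront.GetDistributionOutput

	err := resource.Retry(2*time.Minute, func() *resource.RetryError {
		var err error
		out, err = conn.GetDistribution(&cloudfront.GetDistributionInput{Id: aws.String(id)})

		if err != nil && strings.Contains(err.Error(), "connection reset by peer") {
			// Transient network failure on a read-only call: safe to retry.
			return resource.RetryableError(err)
		}
		if err != nil {
			// Anything else surfaces immediately.
			return resource.NonRetryableError(err)
		}
		return nil
	})

	return out, err
}

func main() {
	sess := session.Must(session.NewSession())
	out, err := getDistributionWithRetry(cloudfront.New(sess), "ABCD1234567")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(aws.StringValue(out.Distribution.ARN))
}
```

Because the retry only wraps a read-only call, it avoids the write-idempotency concerns raised earlier, at the cost of having to repeat this pattern per problematic call.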
I don't use any CloudFront resources.
As mentioned above, the most pragmatic approach here may be to implement temporary quick fixes for the most problematic cases until we can determine root causes and work on more permanent solutions. In an effort to accomplish that, it would be great if we could rally around the most problematic API calls and figure out some additional debugging details along the way. If you haven't already, we would strongly encourage filing an AWS Support technical support case to alert the AWS service teams to the increased API connection reset errors; please feel free to link back to this GitHub issue. We are happy to introduce additional changes (e.g. extra logging in addition to our available debug logging) to support AWS troubleshooting efforts. Can folks please comment with the below details:
For example:

Error:
```
Error: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/distribution/ABCD1234567: read tcp 10.181.43.96:57570->54.239.29.51:443: read: connection reset by peer
```
| Question | Answer |
| --- | --- |
| Terraform Resource | aws_cloudfront_distribution |
| Terraform Operation | Read |
| AWS Service | CloudFront |
| API Call | GetDistribution |
| Terraform Environment | Corporate network |
| Terraform Concurrency | 10 (default) |
| Known HTTP Proxy | Yes (Squid X.Y.Z) |
| How Many Resources | 50 in same configuration |
| How Often | 10% of runs |
Any other relevant information. I have an initial hunch that this could be related to the recent announcement that Application and Classic Load Balancers are adding defense in depth with the introduction of Desync Mitigation Mode. Many production service APIs run on the same AWS infrastructure components that are publicly available. The underlying HTTP Desync Guardian project includes documentation and diagrams showing its behaviors; the mitigations section is particularly helpful in describing the conceptual behavior. If it is related to the above, what we may be seeing could be two-fold:
Gathering the above details may help tease this out. We may also want to create some additional AWS Go SDK tracking issues. For example, we may need the AWS Go SDK to always debug log the request of API calls, even if the request fails in this state. Currently, the debug logging seems to give just the error and not the request payload.
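As a sketch of that idea (assuming aws-sdk-go v1 and a custom HTTP client supplied to the session; this is not existing SDK behaviour), one way to always capture the outgoing request, even when the connection is reset mid-flight, is to wrap the SDK's HTTP transport:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudfront"
)

// dumpTransport is a hypothetical RoundTripper that logs every outgoing
// request before handing it to the real transport.
type dumpTransport struct {
	next http.RoundTripper
}

func (t *dumpTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// Dump the request line, headers, and body before any network I/O, so the
	// payload is still logged even if the response read later fails.
	if dump, err := httputil.DumpRequestOut(req, true); err == nil {
		log.Printf("[DEBUG] outgoing AWS request:\n%s", dump)
	}
	return t.next.RoundTrip(req)
}

func main() {
	httpClient := &http.Client{Transport: &dumpTransport{next: http.DefaultTransport}}
	sess := session.Must(session.NewSession(&aws.Config{HTTPClient: httpClient}))
	_ = cloudfront.New(sess) // requests made with this client are dumped before sending
}
```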
Error:
Terraform 0.12.28, AWS provider 2.70.0
We have a support ticket open with AWS for both this issue and #14797. We are experiencing this issue on our Jenkins hosted on EC2 - we run multiple nodes behind a NAT gateway (so a shared IP for outgoing connections).
Error:
At first I assumed this was due to Terraform polling and waiting for the distribution to be deployed.
We haven't had this happen for more than a week now.
Hi again 👋 Since it appears that this was handled on the AWS side (judging both by this issue and by the lack of Terraform support tickets), our preference is to leave things as they are for now. If this comes up again, especially since CloudFront seems to very prominently have this issue when it occurs, we can definitely think more about this network connection handling. 👍
I started seeing these today.
@tibbon I was seeing it a few minutes ago and now it's gone.
I'm seeing it and the issues persist. I've been restarting my CI pipeline for about half an hour hoping it's transient, but it's sticking around. Likewise, mine is with the iam.amazonaws.com endpoint. Edit: the 40th minute was the charm. You can force through it with enough retries. As far as I could tell, I only had 2 or 3 items that were failing. If you have many more, you might just be probabilistically stuck until the broader problem is resolved.
Same here. Started half an hour ago.
I just started hitting this issue, too. It's an old Terraform project which we run several times per week. All of a sudden, it's causing problems.
https://status.aws.amazon.com/
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!
Terraform Version
We're running a drift detection workflow using GitHub-hosted GitHub Actions runners, which simply runs terraform plan and fails if it outputs anything. This runs on a schedule every hour.
We're getting request errors, causing terraform plan to fail, around 2-3 times a day.
Some of the request errors we've encountered so far:
Most of these seem to be CloudFront and S3.
Thanks