Intermittent network issues (read: connection reset Errors) #14163

Closed
lijok opened this issue Jul 13, 2020 · 38 comments
Labels
provider Pertains to the provider itself, rather than any interaction with AWS. upstream Addresses functionality related to the cloud provider.

Comments

@lijok

lijok commented Jul 13, 2020

Terraform Version

Terraform v0.12.23

We're running a drift detection workflow on GitHub-hosted GitHub Actions runners, which simply runs terraform plan and fails if it outputs anything. This runs on a schedule every hour.
We're getting request errors, which cause terraform plan to fail, around 2-3 times a day.
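
A minimal sketch of how such a drift check could tell real drift apart from a failed run, assuming the workflow can use `terraform plan -detailed-exitcode` (0 = no changes, 1 = error, 2 = changes present) instead of checking for any output. The helper names `classify_plan_exit` and `run_drift_check` are hypothetical and not part of the reporter's workflow.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_OUTCOMES = {0: "no-drift", 1: "error", 2: "drift"}


def classify_plan_exit(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` status to a label.

    Unknown codes are treated as errors so that they are never mistaken for drift.
    """
    return PLAN_OUTCOMES.get(code, "error")


def run_drift_check(workdir: str) -> str:
    """Run the plan in `workdir` and return the outcome label (hypothetical helper)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
    )
    return classify_plan_exit(result.returncode)
```

With this split, a connection reset surfaces as `"error"` rather than as drift, so the workflow can alert on the two cases differently.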

Some of the request errors we've so far encountered:

Error: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/origin-access-identity/cloudfront/E26H********: read tcp 10.1.0.4:52046->54.239.29.26:443: read: connection reset by peer

Error: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/origin-access-identity/cloudfront/E1B3D********: read tcp 10.1.0.4:33408->54.239.29.51:443: read: connection reset by peer

Error: error listing tags for CloudFront Distribution (E24R********): RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/tagging?Resource=arn%3Aaws%3Acloudfront%3A%3A*********%3Adistribution%2FE24********: read tcp 10.1.0.4:56918->54.239.29.65:443: read: connection reset by peer

Error: error getting S3 Bucket website configuration: RequestError: send request failed
caused by: Get https://******.s3.amazonaws.com/?website=: read tcp 10.1.0.4:59070->52.216.20.56:443: read: connection reset by peer

Error: error getting S3 Bucket replication: RequestError: send request failed
caused by: Get https://*******.s3.amazonaws.com/?replication=: read tcp 10.1.0.4:60534->52.216.138.67:443: read: connection reset by peer

Most of these seem to be CloudFront and S3

Thanks

@github-actions github-actions bot added the needs-triage Waiting for first response or review from a maintainer. label Jul 13, 2020
@unfor19

unfor19 commented Jul 16, 2020

Same here with v0.12.28. I'm using drone.io's drone-terraform plugin; output log below.

The weird thing - after a couple of restarts, it works without any issues, so it's very inconsistent

...
$ terraform version
Terraform v0.12.28
$ rm -rf .terraform
$ terraform init -input=false
Initializing modules...
...
Initializing the backend...
...
Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.
...
Initializing provider plugins...
- Checking for available provider plugins...
- Downloading plugin for provider "template" (hashicorp/template) 2.1.2...
- Downloading plugin for provider "random" (hashicorp/random) 2.3.0...
- Downloading plugin for provider "aws" (hashicorp/aws) 2.70.0...
...
* provider.aws: version = "~> 2.70"
* provider.random: version = "~> 2.3"
* provider.template: version = "~> 2.1"

Terraform has been successfully initialized!
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.
...
$ terraform get
$ terraform validate
Success! The configuration is valid.
$ terraform plan -out=plan.tfout -var image_tag=drone-latest -var sha=1a2b3c4d
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.
...
TONS OF Refreshing state messages...
...
Error: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/distribution/truncated: read tcp 192.168.0.1:33534->53.229.31.61:443: read: connection reset by peer

time="2020-07-16T15:43:20Z" level=fatal msg="Failed to execute a command" error="exit status 1" 

@mo-hit

mo-hit commented Jul 16, 2020

Getting the same issue when running plan or apply, with cloudfront

Error: error listing tags for CloudFront Distribution <redacted>: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/tagging?Resource=arn%3Aaws%3Acloudfront%3A%3<redacted>%3Adistribution%2<redacted>: read tcp 192.168.1.94:51422->54.239.29.65:443: read: connection reset by peer

started happening intermittently about 3 days ago
tf 0.12.28

@lcaproni-pp

Also seen the same issue with Cloudfront:

Error: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/origin-access-identity/cloudfront/read: connection reset by peer

TF Version - 0.12.28

@tbugfinder
Contributor

Hi, today I also ran into such an error:

Error: error listing tags for ACM Certificate (arn:aws:acm:eu-west-1:111111111:certificate/800000f-1111-2222-bedb-9096d4c8a692): RequestError: send request failed
caused by: Post https://acm.eu-west-1.amazonaws.com/: read tcp 10.10.10.10:43720->123.1.1.1:443: read: connection reset by peer

(IPs changed) :-)

I have to use a proxy server in between (IP 123.x.x.x); however, I'd expect Terraform or the provider to retry.

$ terraform version
Terraform v0.12.25


provider version
2.66

@tristanhoskenjs

Having the same issues in our CI/CD pipeline

@acburdine
Contributor

acburdine commented Jul 30, 2020

Seeing this same issue in Terraform Cloud, specifically with the cloudfront_distribution and cloudfront_origin_access_identity resources - it's happening almost daily at this point.

@bflad
Contributor

bflad commented Jul 30, 2020

It would be great if we could get a Gist with debug logging enabled so we can further troubleshoot. If you are worried about any sensitive data, it can be encrypted with the HashiCorp GPG Key or redacted as necessary.

The maintainers will need this information to see and triage the current provider and AWS Go SDK behavior when these errors occur.
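
A minimal sketch of capturing such a log and redacting it before sharing, assuming Terraform's `TF_LOG` and `TF_LOG_PATH` environment variables are used to enable debug logging. The helper names and the redaction patterns (IPv4 addresses with optional ports, 12-digit account IDs) are illustrative assumptions and will not catch every sensitive value.

```python
import os
import re

# Illustrative patterns only: IPv4 addresses (optionally with a port) and
# 12-digit AWS account IDs. Resource identifiers need separate handling.
IPV4_WITH_PORT = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b")
ACCOUNT_ID = re.compile(r"\b\d{12}\b")


def debug_env(log_path: str) -> dict:
    """Return a copy of the environment with Terraform debug logging enabled."""
    env = dict(os.environ)
    env["TF_LOG"] = "DEBUG"
    env["TF_LOG_PATH"] = log_path
    return env


def redact(line: str) -> str:
    """Replace IP addresses and account IDs in one log line with placeholders."""
    line = IPV4_WITH_PORT.sub("<ip>", line)
    return ACCOUNT_ID.sub("<account-id>", line)
```

The environment from `debug_env` can be passed as `env=` to `subprocess.run(["terraform", "plan"], ...)`, and `redact` applied line by line to the resulting log file.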

@lijok
Author

lijok commented Jul 30, 2020

> It would be great if we could get a Gist with debug logging enabled so we can further troubleshoot. If you are worried about any sensitive data, it can be encrypted with the HashiCorp GPG Key or redacted as necessary.
>
> The maintainers will need this information to be able to see and triage the current provider and AWS Go SDK behavior during them.

Cool, I'll enable debug on the workflow and post back once we catch it happening

@mattburgess
Collaborator

We're hitting this too, and have debug logs enabled. I'll clear this with security and get back to you. In the meantime, though, we're seeing two slightly different behaviours.

Some calls cause the run to fail immediately, and others cause pauses of up to 15 minutes before a retry is attempted, at which point the plan succeeds and the CI job continues.

Our calls go through VPC endpoints wherever possible; where that's not possible, they go through an Internet proxy (Squid). So far, we've only seen the proxy-routed calls cause the 15-minute pause and the VPC-endpoint-routed calls cause an immediate failure, but a) there's too little data to extract any kind of pattern and b) since these are different services, the retry logic might differ between them.

@mattburgess
Collaborator

GPG-encrypted logs available at https://gist.github.com/mattburgess/2a00b1e77b00368781360ac8581383b9

analytical-dataset-generation_analytical-dataset-generation-qa_154.log.gpg - this one failed after seeing a single connection reset by peer error; no retries were attempted.

analytical-dataset-generation_analytical-dataset-generation-preprod_136.log.gpg - this one hung/paused/waited for 15 minutes having seen a connection reset by peer error, then retried and succeeded on its first retry.

@awsiv
Contributor

awsiv commented Aug 20, 2020

seeing this on v0.12.29 as well

@blakemorgan

Just got this issue on v0.13.0. The first two times it failed and the third time worked as expected. All three times it was running in a GitHub Action.

@ivorcheung

I had the same issue last night. Ran it again in the morning and it was fine. This is a rather intermittent issue.

Got this on v0.12.28

@ZsoltPath
Contributor

ZsoltPath commented Aug 25, 2020

Same here on TF v0.13.0 and AWS provider v3.3.0
And as someone mentioned above, it mainly happens when running it in GitHub actions (CI/CD).

@edwardofclt
Contributor

edwardofclt commented Aug 25, 2020

We're experiencing the issue also in Terraform Cloud using v0.12.28 & 0.12.29 and the AWS provider pinned to ~> 2.0.

Error: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/origin-access-identity/cloudfront/ABCD1234567: read tcp 10.181.43.96:56350->54.239.29.51:443: read: connection reset by peer

Error: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/distribution/ABCD1234567: read tcp 10.181.43.96:57570->54.239.29.51:443: read: connection reset by peer

@cpuspellcaster

cpuspellcaster commented Aug 25, 2020

Same issue. Terraform v0.12.29, AWS provider 3.3.0, running in CircleCI. It's intermittent but occurring in roughly 10% of the TF executions per day.

@chrusty

chrusty commented Aug 25, 2020

I've had this issue with v0.12.26 and v0.12.28. It's so persistent that we've had to wrap every Terraform execution in multiple layers of retries.
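
A minimal sketch of one such retry layer, assuming the wrapper should re-run a command only when its output contains the `connection reset by peer` message seen in this thread. The names `run_command` and `run_with_retries`, the attempt count, and the delay are hypothetical; this is not the commenter's actual wrapper.

```python
import subprocess
import time
from typing import Callable, Sequence, Tuple

RESET_MARKER = "connection reset by peer"


def run_command(cmd: Sequence[str]) -> Tuple[int, str]:
    """Run cmd and return (exit code, combined stdout and stderr)."""
    proc = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    return proc.returncode, proc.stdout


def run_with_retries(
    cmd: Sequence[str],
    attempts: int = 3,
    delay_seconds: float = 30.0,
    runner: Callable[[Sequence[str]], Tuple[int, str]] = run_command,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Re-run cmd up to `attempts` times, but only after connection resets."""
    code, output = runner(cmd)
    for _ in range(attempts - 1):
        # Stop on success, or on any failure that is not a connection reset.
        if code == 0 or RESET_MARKER not in output:
            break
        sleep(delay_seconds)
        code, output = runner(cmd)
    return code, output
```

For example, `run_with_retries(["terraform", "plan", "-input=false"])` re-runs a read-only plan. Blindly retrying `terraform apply` this way carries the idempotency risks discussed later in this thread.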

@bflad
Contributor

bflad commented Aug 26, 2020

Hi folks 👋 It's not entirely clear why this has suddenly become more of an issue for a lot more environments, except that maybe AWS' service APIs are resetting connections more aggressively. Understandably, this error is very problematic though.

The challenge here is that the AWS Go SDK request handlers explicitly catch this specific condition, an ECONNRESET type error during the read operation of an API call, to disable the retry logic. This logic has been present in the AWS Go SDK since version 1.20.2 and the Terraform AWS Provider version 2.16.0. The code can be seen here:

https://github.com/aws/aws-sdk-go/blob/fde575c64841b291899bc112dfcdc206f609a305/aws/request/connection_reset_error.go#L8-L10

Which is eventually handled here:

https://github.com/aws/aws-sdk-go/blob/fde575c64841b291899bc112dfcdc206f609a305/aws/request/retryer.go#L168-L185

Some of the upstream decision process for this can be seen here:

Essentially boiling down to this:

> The logic behind this change is that the SDK is not able to sufficiently determine the state of an API request after successfully writing the request to the transport layer, but then failing to read the corresponding response due to a connection reset occurring. This is due to the fact that the SDK has no knowledge about whether the given operation is idempotent or whether it would be safe to retry.
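
The behaviour described above can be paraphrased as follows. This is a Python sketch for illustration only, not the SDK's Go code; the function name is hypothetical, and treating other connection resets and broken pipes as retryable is an assumption about the surrounding SDK logic.

```python
def is_retryable_connection_error(message: str) -> bool:
    """Decide whether a connection error message should trigger a retry."""
    # The request was written, but the response could not be read: the
    # service may already have acted on it, so the retry is suppressed.
    if "read: connection reset" in message:
        return False
    # Assumption: other resets and broken pipes are treated as retryable.
    return "connection reset" in message or "broken pipe" in message
```

Under this rule, every error quoted in this thread (`read: connection reset by peer`) falls into the first branch, which matches the observation that no retry is attempted.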

I would personally agree with their assessment on the surface and say that the Terraform AWS Provider would not want to always retry in this case, since without some very careful investigation the potential effects would broadly be unknown. While we might be in a slightly better situation than the whole SDK since we are mainly dealing with management API calls (Create/Read/Update/Delete/List) rather than event/data API calls, we would still have issues with this type of retry logic including:

  • Duplicate, unmanaged components being created if the Create/Update API being called does not have some form of uniqueness constraint as part of its parameters (e.g. identifier/name). This could mean unexpected costs or security risks.
  • For other cases, the API would likely return a different error on the retried request since it already received the same API call (e.g. "resource not found" on duplicate Delete calls, "invalid state" on duplicate Update calls, etc.)

This leaves us in a little bit of a bind in this project. 😖 We have been purposefully avoiding implementing any custom retryer logic to decrease any maintenance and testing in that considerably harder area. Outside of that we could implement this logic per AWS Go SDK service client as we do today for some other retryable conditions (see aws/config.go), however attempting to enumerate all safely idempotent API calls is a massive undertaking, even after using loose heuristics such as trying to say all "read-only" calls such as Describe*/Get*/List* are retryable (and potentially Create*/Put*/Set* where we include a ClientToken/IdempotencyToken) for this specific handling.
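
A minimal sketch of the loose heuristic mentioned above, assuming operation names follow the usual AWS prefix conventions. The function name and parameter handling are hypothetical, and, as the comment says, prefixes alone do not prove that a call is idempotent.

```python
READ_ONLY_PREFIXES = ("Describe", "Get", "List")
TOKEN_GUARDED_PREFIXES = ("Create", "Put", "Set")
TOKEN_PARAMETERS = ("ClientToken", "IdempotencyToken")


def is_probably_safe_to_retry(operation: str, params: dict) -> bool:
    """Apply the prefix heuristic to an API operation name and its parameters."""
    # Read-only style calls are assumed not to change remote state.
    if operation.startswith(READ_ONLY_PREFIXES):
        return True
    # Writes are only considered when an idempotency token is supplied.
    if operation.startswith(TOKEN_GUARDED_PREFIXES):
        return any(params.get(name) for name in TOKEN_PARAMETERS)
    # Everything else (Delete*, Update*, ...) is left alone.
    return False
```

For instance, `GetDistribution` would be retried, while a `Create*` call without a `ClientToken` or `IdempotencyToken` would not.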

Another option may be to suggest this type of enhancement (or some may say bug fix) upstream into the AWS Go SDK codebase itself, but I'm not sure if the upstream maintainers would want to get into this space either.

I'm out of time to ponder on this more for tonight, but hopefully this initial insight can kickstart some discussions.

@chrusty

chrusty commented Aug 26, 2020

@bflad does it seem to you that this is particularly prevalent with the CloudFront API? I know that in my case it certainly is, and I can see from the rest of the comments in this issue that CloudFront is involved very often.

@lifeofguenter

lifeofguenter commented Aug 26, 2020

@bflad most probably a long shot, but would there be any connection with #14797 + hashicorp/terraform#25835 (comment) ?

It seems that after upgrading from 0.12.24 (though it could be a coincidence, and AWS may have changed their rate limiting at the same time) we have been getting both the issues described in this thread and more intermittent "No valid credential sources found for AWS Provider" errors.

@bflad bflad self-assigned this Aug 26, 2020
@bflad bflad added provider Pertains to the provider itself, rather than any interaction with AWS. upstream Addresses functionality related to the cloud provider. and removed needs-triage Waiting for first response or review from a maintainer. labels Aug 26, 2020
@ZsoltPath
Contributor

> This leaves us in a little bit of a bind in this project. 😖 We have been purposefully avoiding implementing any custom retryer logic to decrease any maintenance and testing in that considerably harder area. Outside of that we could implement this logic per AWS Go SDK service client as we do today for some other retryable conditions (see aws/config.go), however attempting to enumerate all safely idempotent API calls is a massive undertaking, even after using loose heuristics such as trying to say all "read-only" calls such as Describe*/Get*/List* are retryable (and potentially Create*/Put*/Set* where we include a ClientToken/IdempotencyToken) for this specific handling.

@bflad
I'd say retrying Describe*/Get*/List* could be harmless and would probably help a lot.
I haven't looked into the debug logs, but on the surface it happens most of the time during a Describe operation: either when collecting state at the beginning, or when TF periodically checks the status after creating a CloudFront distribution.
Both would be solved with a retry.

Regarding write operations, would it be possible to add a switch, either to the apply command or as a lifecycle option on the actual resources?
Then users can decide whether to risk it or not, depending on their actual use case.

@spouzols

Hello. Hitting the same kind of behaviour, more frequently lately. Terraform 0.12.28, AWS provider 2.70.0, running on Concourse CI on AWS. Almost always connection resets while waiting for a CloudFront distribution creation / update.

@acburdine
Contributor

acburdine commented Aug 26, 2020

for what it's worth - every time I've seen this issue it's been on read calls to either Cloudfront distribution configs or Cloudfront origin access identities.

It may not be the best way to approach solving the issue, but given that the majority of the connection reset issues seem to be with specific Cloudfront read calls + a few others, it might be worth just adding retries to individual API calls (Cloudfront or otherwise) as they become problematic?
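
A minimal sketch of that per-call approach, assuming an allow-list of (service, operation) pairs and exponential backoff. Everything here is hypothetical: `ConnectionResetDuringRead` stands in for however the caller detects the reset, and the allow-list entries are illustrative rather than a vetted list of safe calls.

```python
import time
from typing import Callable, Set, Tuple, TypeVar

T = TypeVar("T")

# Illustrative allow-list of read calls reported as problematic in this thread.
RETRYABLE_CALLS: Set[Tuple[str, str]] = {
    ("cloudfront", "GetDistribution"),
    ("cloudfront", "GetCloudFrontOriginAccessIdentity"),
}


class ConnectionResetDuringRead(Exception):
    """Hypothetical stand-in for a 'read: connection reset by peer' failure."""


def call_with_retry(
    service: str,
    operation: str,
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Invoke call(), retrying allow-listed operations with exponential backoff."""
    attempt = 1
    while True:
        try:
            return call()
        except ConnectionResetDuringRead:
            # Operations outside the allow-list fail on the first reset.
            if (service, operation) not in RETRYABLE_CALLS or attempt >= max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
            attempt += 1
```

New entries would be added to `RETRYABLE_CALLS` only as specific calls prove problematic, which keeps write operations out of scope by default.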

@tbugfinder
Contributor

I don't use any cloudfront resources.

@bflad bflad changed the title Intermittent network issues Intermittent network issues (read: connection reset Errors) Aug 26, 2020
@bflad bflad pinned this issue Aug 26, 2020
@bflad
Contributor

bflad commented Aug 26, 2020

As mentioned above, the most pragmatic approach for this may be to try and implement temporary quick fixes for the most problematic cases until we can determine root causes and work on more permanent solutions. In an effort to accomplish that, it would be great if we can rally around the most problematic API calls and see if we cannot figure out some additional debugging details along this journey.

If you haven't already, we would strongly encourage filing an AWS Support technical support case to alert the AWS service teams of the increased API connection reset errors. Please feel free to link back to this GitHub issue. We are happy to introduce additional changes (e.g. extra logging in addition to our available debug logging) to support AWS troubleshooting efforts.

Can folks please comment with the below details:

  • The error message, including the RequestError line and the caused by: line (redacting any sensitive resource identifiers and IP addresses if necessary)
  • The Terraform AWS Provider resource and operation (Create/Read (plan/refresh)/Update/Delete (destroy)) being performed by Terraform on it
  • If possible to determine from the above, the AWS service and underlying API call (can be found looking at the resource code or service API reference)
  • In general, where Terraform is running (Terraform Cloud, GitHub Actions, corporate network, etc.)
  • If there is a known HTTP proxy between where Terraform is running and the AWS API (what proxy and version would also be super helpful)
  • Roughly how often it occurs (% of runs, # of same resources in same configuration)
  • Terraform concurrency, if the -parallelism flag is set above the default of 10
  • Any other relevant information

For example:

Error:

```
Error: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/distribution/ABCD1234567: read tcp 10.181.43.96:57570->54.239.29.51:443: read: connection reset by peer
```

| Question | Answer |
| --- | --- |
| Terraform Resource | aws_cloudfront_distribution |
| Terraform Operation | Read |
| AWS Service | CloudFront |
| API Call | GetDistribution |
| Terraform Environment | Corporate network |
| Terraform Concurrency | 10 (default) |
| Known HTTP Proxy | Yes (Squid X.Y.Z) |
| How Many Resources | 50 in same configuration |
| How Often | 10% of runs |

Any other relevant information.
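
A minimal sketch for pulling the request details out of an error like the one in the example above, assuming the `caused by:` line has the shape seen in this thread (optionally with a quoted URL). The function name and the returned field names are hypothetical.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Matches lines such as:
#   caused by: Get https://host/path: read tcp 10.0.0.1:1234->1.2.3.4:443: read: ...
CAUSED_BY = re.compile(
    r'caused by: (?P<method>[A-Za-z]+) "?(?P<url>https?://[^\s"]+?)"?: '
    r"read tcp (?P<local>\S+)->(?P<remote>\S+?): (?P<reason>.+)"
)


def parse_request_error(text: str) -> Optional[dict]:
    """Extract method, host, path, endpoints and reason from an error message."""
    match = CAUSED_BY.search(text)
    if match is None:
        return None
    url = urlsplit(match.group("url"))
    return {
        "method": match.group("method").upper(),
        "host": url.netloc,
        "path": url.path,
        "local": match.group("local"),
        "remote": match.group("remote"),
        "reason": match.group("reason").strip(),
    }
```

The `host` and `path` fields are usually enough to identify the AWS service and API call for the table; truncated messages that lack the `read tcp` part return `None`.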

I have an initial hunch that this could be related to the recent AWS announcement "Application and Classic Load Balancers are adding defense in depth with the introduction of Desync Mitigation Mode". Many production service APIs are run using the same AWS infrastructure components that are publicly available. The underlying HTTP Desync Guardian project includes some documentation and diagrams to show its behaviors. The mitigations section is particularly helpful in describing the conceptual behaviors.

What we may be seeing could be two-fold if it is related to the above:

  • Terraform AWS Provider HTTP connections being disconnected due to unexpected HTTP compliance issues either by payload (mostly determined by the AWS Go SDK) or intermediate HTTP proxy behavior
  • Side effects of other HTTP connections being disconnected and affecting ours

Gathering the above details may help tease this out.


We may also want to create some additional AWS Go SDK tracking issues. For example, we may need the AWS Go SDK to always debug log the request of API calls, even if the request fails in this state. Currently, the debug logging seems to just give the error and not the request payload like:

---[ REQUEST POST-SIGN ]-----------------------------
POST / HTTP/1.1
Host: ec2.eu-west-2.amazonaws.com
User-Agent: aws-sdk-go/1.33.21 (go1.14.5; linux; amd64) APN/1.0 HashiCorp/1.0 Terraform/0.12.19 (+https://www.terraform.io)
Content-Length: 79
Content-Type: application/x-www-form-urlencoded; charset=utf-8
X-Amz-Date: 20200814T100330Z
Accept-Encoding: gzip

Action=DescribeSecurityGroups&GroupId.1=sg-12345678&Version=2016-11-15
-----------------------------------------------------

@spouzols

Error:

Error: error waiting until CloudFront Distribution (XXXXX) is deployed: RequestError: send request failed
caused by: Get https://cloudfront.amazonaws.com/2019-03-26/distribution/XXXXX: read tcp 10.x.x.x:35832->54.x.x.x:443: read: connection reset by peer

| Question | Answer |
| --- | --- |
| Terraform Resource | aws_cloudfront_distribution |
| Terraform Operation | Read |
| AWS Service | CloudFront |
| API Call | GetDistribution |
| Terraform Environment | AWS VPC (EC2, Concourse CI) |
| Terraform Concurrency | 10 (default) |
| Known HTTP Proxy | No |
| How Many Resources | 1 in same configuration |
| How Often | 80% of runs, 4/5 in 24h |

Terraform 0.12.28, AWS provider 2.70.0

@lifeofguenter

We have a support ticket request open with aws for both this issue and #14797 - especially in the latter case it would greatly help if TRACE would show complete requests + responses for us/aws to understand what is going on.

Or maybe even something separate like HTTP_TRACE that only shows requests + responses, which in most cases is the more interesting part when debugging these types of issues.

We are experiencing this issue on our Jenkins hosted on EC2 - we run multiple nodes behind a natgw (so shared IP for outgoing connections).

@lijok
Author

lijok commented Aug 27, 2020

There is definitely a problem on the AWS side
If you go to the cloudfront console and hit refresh a few times, you're now very likely to encounter this
image

@encron

encron commented Sep 1, 2020

Error:

Error: RequestError: send request failed
caused by: Get "https://cloudfront.amazonaws.com/2020-05-31/origin-access-identity/cloudfront/E33T16DJ8BRX2": read tcp 10.170.3.101:33268->54.239.29.51:443: read: connection reset by peer

| Question | Answer |
| --- | --- |
| Terraform Resource | aws_cloudfront_distribution |
| Terraform Operation | Read (refreshing state or waiting for the distribution to be deployed/destroyed) |
| AWS Service | CloudFront |
| API Call | GetDistribution |
| Terraform Environment | AWS VPC |
| Terraform Concurrency | 10 (default) |
| Known HTTP Proxy | No |
| How Many Resources | 2 |
| How Often | 90% of runs |

At first I assumed this was due to Terraform polling and waiting for the distribution to be deployed, which is why I added wait_for_deployment = false, yet that seems to have made the behaviour even worse: it now fails even when refreshing the state. I saw the bulk of the errors yesterday, when disabling CloudFront distributions also seemed to take a very long time. Upon retrying this morning, the error rate is much lower.

@lijok
Author

lijok commented Sep 11, 2020

We haven't had this happen for more than a week now
Could be fixed on aws side?

@bflad bflad unpinned this issue Sep 14, 2020
@bflad
Contributor

bflad commented Sep 14, 2020

Hi again 👋 Since it appears that this was handled on the AWS side (both in this issue and lack of Terraform support tickets), our preference will be to leave things as they are for now. If this comes up again, especially since CloudFront seems to very prominently have this issue when it occurs, we can definitely think more about this network connection handling. 👍

@tibbon

tibbon commented Oct 1, 2020

I started seeing these today.

Error: Error reading IAM policy version arn:aws:iam::XXXX:policy/OktaChildAccountPolicy: RequestError: send request failed
caused by: Post https://iam.amazonaws.com/: read tcp 192.168.1.216:52180->52.94.225.3:443: read: connection reset by peer



Error: RequestError: send request failed
caused by: Post https://iam.amazonaws.com/: read tcp 192.168.1.216:52172->52.94.225.3:443: read: connection reset by peer



Error: Error reading IAM Role Okta-Idp-cross-account-role: RequestError: send request failed
caused by: Post https://iam.amazonaws.com/: read tcp 192.168.1.216:52171->52.94.225.3:443: read: connection reset by peer

@tarazena

tarazena commented Oct 1, 2020

@tibbon I was seeing it a few minutes ago and now it's gone

@isikdos

isikdos commented Oct 1, 2020

I'm seeing it and the issues persist. I've been restarting my CI pipeline for about half an hour hoping it's transient, but it's sticking around. Likewise, mine is with iam.amazonaws.com.

Edit: 40th minute was the charm. You can force through it with enough retries. As far as I could tell, I only had 2 or 3 items that were failing. If you have many more, you might just be probabilistically stuck until the broader problem is resolved.
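
The "probabilistically stuck" point can be made concrete with a small calculation. This is a sketch under a strong assumption that each affected API call fails independently with the same probability; the function names and the numbers used in the note below are illustrative, not measurements from this thread.

```python
def run_success_probability(per_call_failure: float, calls: int) -> float:
    """Probability that all `calls` affected requests succeed in one run."""
    if not 0.0 <= per_call_failure < 1.0:
        raise ValueError("per_call_failure must be in [0, 1)")
    return (1.0 - per_call_failure) ** calls


def expected_runs_until_success(per_call_failure: float, calls: int) -> float:
    """Mean number of whole-run retries needed before one run fully succeeds."""
    return 1.0 / run_success_probability(per_call_failure, calls)
```

With an assumed 20% failure rate per call, 3 affected calls give a run about a 51% chance of succeeding (roughly 2 runs on average), whereas 30 affected calls give a chance of roughly 0.1%, so whole-run retries stop being practical.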

@dchernivetsky

Same here. Started half an hour ago.

@azemon

azemon commented Oct 1, 2020

I just started hitting this issue, too. It's an old Terraform project, which we run several times per week. All of a sudden, it's causing problems.

Error: Error reading IAM Role ABCDEF: RequestError: send request failed
caused by: Post https://iam.amazonaws.com/: read tcp 192.168.0.129:44552->52.94.225.3:443: read: connection reset by peer



Error: error finding IAM Role (GHIJKL) Policy Attachment (arn:aws:iam::aws:policy/AmazonInspectorFullAccess): RequestError: send request failed
caused by: Post https://iam.amazonaws.com/: read tcp 192.168.0.129:44872->52.94.225.3:443: read: connection reset by peer

@claco

claco commented Oct 1, 2020

https://status.aws.amazon.com/

1:50 PM PDT We are investigating increased error rates and latencies affecting IAM. IAM related requests to other AWS services may also be impacted.

@ghost

ghost commented Oct 14, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked as resolved and limited conversation to collaborators Oct 14, 2020