
Error: 14 UNAVAILABLE: TCP Read failed #692

Open
simoncirstoiu opened this issue Dec 29, 2018 · 41 comments

Comments

@simoncirstoiu

Problem description

Conversation Error: { Error: 14 UNAVAILABLE: TCP Read failed
at Object.exports.createStatusError (/relay/node_modules/grpc/src/common.js:91:15)
at ClientDuplexStream._emitStatusIfDone (/relay/node_modules/grpc/src/client.js:233:26)
at ClientDuplexStream._receiveStatus (/relay/node_modules/grpc/src/client.js:211:8)
at Object.onReceiveStatus (/relay/node_modules/grpc/src/client_interceptors.js:1306:15)
at InterceptingListener._callNext (/relay/node_modules/grpc/src/client_interceptors.js:568:42)
at InterceptingListener.onReceiveStatus (/relay/node_modules/grpc/src/client_interceptors.js:618:8)
at /relay/node_modules/grpc/src/client_interceptors.js:1123:18
code: 14,
metadata: Metadata { _internal_repr: {} },
details: 'TCP Read failed' }

Environment

Additional context

It happens randomly, but it started doing this about a month ago.

@kyle-mccarthy

I am having this same issue on Google App Engine standard, node 10.

I am using @google-cloud/logging-winston, which depends on @google-cloud/logging, which in turn depends on @google-cloud/common-grpc, which uses this package. So I can't say which layer is causing this issue.

Is this a problem with this package or how other packages implement it?

Error: 14 UNAVAILABLE: TCP Read failed
at Object.exports.createStatusError (/srv/node_modules/grpc/src/common.js:87:15)
at Object.onReceiveStatus (/srv/node_modules/grpc/src/client_interceptors.js:1188:28)
at InterceptingListener._callNext (/srv/node_modules/grpc/src/client_interceptors.js:564:42)
at InterceptingListener.onReceiveStatus (/srv/node_modules/grpc/src/client_interceptors.js:614:8)
at callback (/srv/node_modules/grpc/src/client_interceptors.js:841:24)

@sergioregueira

Related issue: googleapis/google-cloud-node#2438

@murgatroid99
Member

The status responses described here are expected behavior of gRPC itself. Generally, we expect that some responses, including UNAVAILABLE, should be retried. In these cases the library that uses gRPC should perform the retry logic, as described in that linked issue.

@sergioregueira

@murgatroid99 the problem is that the gRPC library does not reconnect after that error, so applications stop sending logs to Stackdriver. When this happens, we have to replace the affected instance with a new one.

@murgatroid99
Member

First, to make sure that we're on the same page, the way gRPC reconnection is supposed to work is that if a connection fails, then the next time the user starts a call the library starts trying to reestablish the connection. A user that retries failed UNAVAILABLE requests should trigger that reconnection.

So, are you saying that the Stackdriver library is retrying the failed request after the connection drops, and when it does so the gRPC library never reestablishes the connection? If so, that is a major bug that we are otherwise unaware of, so please file a full bug report with reproduction steps.

@nicolasnoble
Member

Emphasis on "if the library is retrying the failed request". The initial bug report seems to indicate this isn't the case.

@alexander-fenster
Contributor

To add some details here: client libraries use gRPC via the gax layer, and that gax layer can retry requests if the library is configured to do so. By 'configured so' I mean the JSON configuration file that is passed to the client library constructor. If that file marks the given method as idempotent - i.e. it is safe to retry - the gax layer will do the retries. If it's non-idempotent, like all POST methods, it won't be retried automatically, and the caller should take care of it.

In the logging library, the configuration file is here. @crwilcox made a change 2 weeks ago to treat all the logging methods as idempotent, which means that you might get a duplicated log line in case of a retry, but you should not get failures: googleapis/nodejs-logging#366. It does not fix the TCP error that makes gRPC fail, but at least it should make gax retry the failed call (we are talking only about the logging libraries here).

We will make a new release of the logging library; it should fix most of the problems related to the logging libraries. For other libraries, let's go one by one and see what's going on. In general, though, caller code should be ready to get UNAVAILABLE (it's a network, so any call can technically fail), and retrying UNAVAILABLE for non-idempotent methods is something your code should take care of.
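
For callers that do have to handle the retry themselves, the pattern can be as small as the following. This is a minimal sketch, assuming a plain grpc unary stub with a callback API; the client, method name, attempt count, and backoff values are placeholders, and it is only appropriate when your application considers the call safe to repeat:

    const grpc = require('grpc'); // grpc.status also exists in '@grpc/grpc-js'

    // Retry a unary call while it fails with UNAVAILABLE (status code 14).
    // `callFn` is any function with the (request, callback) shape, e.g. a bound
    // method of a generated client stub.
    function callWithRetry(callFn, request, attemptsLeft, delayMs, callback) {
      callFn(request, (err, response) => {
        if (err && err.code === grpc.status.UNAVAILABLE && attemptsLeft > 0) {
          // Issuing the retry is what prompts the channel to reconnect.
          setTimeout(() => {
            callWithRetry(callFn, request, attemptsLeft - 1, delayMs * 2, callback);
          }, delayMs);
          return;
        }
        callback(err, response);
      });
    }

    // Usage (hypothetical stub and method):
    // callWithRetry(client.doThing.bind(client), request, 3, 100, done);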

@sergioregueira

sergioregueira commented Feb 1, 2019

@murgatroid99 @nicolasnoble I am sorry for the delay. I have been a bit busy these days.

The same day you asked me for more information, I enabled gRPC debugging by setting some environment variables (GRPC_TRACE=tcp GRPC_VERBOSITY=DEBUG) and monitored the logs for several days.

I could verify, as you mentioned, that nodejs-logging tries to reconnect automatically when something fails. This is an example where it reconnects successfully (and logs continue being persisted):

I0120 19:12:12.874089209       7 tcp_custom.cc:128]          TCP:0x79c4ea52120 call_cb 0x79c4e995158 0x1537f52a69d0:0x79c4e995100
I0120 19:12:12.874262453       7 tcp_custom.cc:286]          TCP 0x79c4ea52120 shutdown why={"created":"@1548011532.874204613","description":"Endpoint read failed","file":"../deps/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":2510,"occurred_during_write":0,"referenced_errors":[{"created":"@1548011532.874198969","description":"Secure read failed","file":"../deps/grpc/src/core/lib/security/transport/secure_endpoint.cc","file_line":183,"referenced_errors":[{"created":"@1548011532.874048422","description":"TCP Read failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"connection reset by peer"}]}]}
D0120 19:12:12.874357814       7 dns_resolver.cc:279]        Start resolving.
I0120 19:13:20.597892586       7 tcp_client_custom.cc:150]   CLIENT_CONNECT: 0x79c4e86a890 ipv6:[2a00:1450:4013:c01::5f]:443: asynchronously connecting
I0120 19:13:20.698567576       7 tcp_custom.cc:348]          Creating TCP endpoint 0x79c4e86a890
I0120 19:13:20.698936427       7 tcp_custom.cc:217]          write complete on 0x79c4e86a890: error="No Error"

Nevertheless, I found some cases where the reconnection was not completed (and logs stopped):

I0120 23:47:28.792244320       7 tcp_custom.cc:128]          TCP:0x79c4e33d0a0 call_cb 0x79c4da4be78 0x1537f52a69d0:0x79c4da4be20
I0120 23:47:28.792345598       7 tcp_custom.cc:132]          read: error={"created":"@1548028048.792215883","description":"TCP Read failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"connection reset by peer"}
I0120 23:47:28.792489454       7 tcp_custom.cc:286]          TCP 0x79c4e33d0a0 shutdown why={"created":"@1548028048.792444724","description":"Endpoint read failed","file":"../deps/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":2510,"occurred_during_write":0,"referenced_errors":[{"created":"@1548028048.792438609","description":"Secure read failed","file":"../deps/grpc/src/core/lib/security/transport/secure_endpoint.cc","file_line":183,"referenced_errors":[{"created":"@1548028048.792215883","description":"TCP Read failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"connection reset by peer"}]}]}
D0120 23:47:28.792592164       7 dns_resolver.cc:279]        Start resolving.
***LOGS CEASED***

I think the change made by @crwilcox will reduce the probability of losing logs due to request errors, but the error I have just detailed will continue to happen.

By the way, thanks @alexander-fenster for your clear explanation!

@sergioregueira

sergioregueira commented Feb 1, 2019

More examples where it does not reconnect:

I0118 20:49:56.469767934       6 tcp_custom.cc:128]          TCP:0x6714db57150 call_cb 0x6714dbdfb78 0x15de0f0a69d0:0x6714dbdfb20
I0118 20:49:56.469810638       6 tcp_custom.cc:132]          read: error={"created":"@1547844596.469749556","description":"TCP Read failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"connection reset by peer"} 
I0118 20:49:56.469885894       6 tcp_custom.cc:234]          WRITE 0x6714db57150 (peer=ipv6:[2a00:1450:4013:c03::5f]:443): ***TRUNCATED***
I0118 20:49:56.469918688       6 tcp_custom.cc:217]          write complete on 0x6714db57150: error={"created":"@1547844596.469907967","description":"TCP Write failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"connection reset by peer"}
I0118 20:49:56.469978028       6 tcp_custom.cc:286]          TCP 0x6714db57150 shutdown why={"created":"@1547844596.469848883","description":"Delayed close due to in-progress write","file":"../deps/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":609,"referenced_errors":[{"created":"@1547844596.469834452","description":"Endpoint read failed","file":"../deps/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":2510,"occurred_during_write":0,"referenced_errors":[{"created":"@1547844596.469830392","description":"Secure read failed","file":"../deps/grpc/src/core/lib/security/transport/secure_endpoint.cc","file_line":183,"referenced_errors":[{"created":"@1547844596.469749556","description":"TCP Read failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"connection reset by peer"}]}]},{"created":"@1547844596.469907967","description":"TCP Write failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"connection reset by peer"}]}
D0118 20:49:56.470028601       6 dns_resolver.cc:279]        Start resolving.
***LOGS CEASED***
I0119 06:29:15.447665078       6 tcp_custom.cc:128]          TCP:0x710b1824f20 call_cb 0x710b19157d8 0x15fcf80a69d0:0x710b1915780
I0119 06:29:15.452015545       6 tcp_custom.cc:132]          read: error={"created":"@1547879355.447619730","description":"TCP Read failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"connection reset by peer"}
I0119 06:29:15.452104651       6 tcp_custom.cc:286]          TCP 0x710b1824f20 shutdown why={"created":"@1547879355.452064405","description":"Endpoint read failed","file":"../deps/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":2510,"occurred_during_write":0,"referenced_errors":[{"created":"@1547879355.452060353","description":"Secure read failed","file":"../deps/grpc/src/core/lib/security/transport/secure_endpoint.cc","file_line":183,"referenced_errors":[{"created":"@1547879355.447619730","description":"TCP Read failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"connection reset by peer"}]}]}
D0119 06:29:15.452271642       6 dns_resolver.cc:279]        Start resolving.
***LOGS CEASED***
I0121 06:29:17.716285929       5 tcp_custom.cc:128]          TCP:0x7a5ecde6af0 call_cb 0x7a5eced6498 0x153cd0ca69d0:0x7a5eced6440
I0121 06:29:17.716381358       5 tcp_custom.cc:132]          read: error={"created":"@1548052157.716232153","description":"TCP Read failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"connection reset by peer"}
I0121 06:29:17.716458612       5 tcp_custom.cc:286]          TCP 0x7a5ecde6af0 shutdown why={"created":"@1548052157.716404989","description":"Endpoint read failed","file":"../deps/grpc/src/core/ext/transport/chttp2/transport/chttp2_transport.cc","file_line":2510,"occurred_during_write":0,"referenced_errors":[{"created":"@1548052157.716401286","description":"Secure read failed","file":"../deps/grpc/src/core/lib/security/transport/secure_endpoint.cc","file_line":183,"referenced_errors":[{"created":"@1548052157.716232153","description":"TCP Read failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"connection reset by peer"}]}]}
D0121 06:29:17.716564239       5 dns_resolver.cc:279]        Start resolving.
***LOGS CEASED***

@crwilcox

crwilcox commented Feb 1, 2019

I released a new version of the @google-cloud/logging package that will now retry writing log entries on UNAVAILABLE: https://www.npmjs.com/package/@google-cloud/logging/v/4.3.0

For operations deemed non-idempotent, such as WriteLogEntries (the RPC that underlies many logging calls), it isn't necessarily safe for the client to retry. In this case, the decision was made that duplicated logs are preferable to missing logs. Unfortunately, there will be other non-idempotent RPCs where the trade-off is more complicated.

@murgatroid99
Member

@sergioregueira When you say that the logs ceased, do you mean that in each case, that is the last log line output from that process, and that the other logs come from different processes? If so, what is the behavior of those processes after they stop outputting logs?

Also, I don't know exactly what behavior of the gRPC library we are seeing here, but the expected behavior is not that the library always reconnects; it generally only reconnects if there is an existing pending request. Do any of the logs you have show it failing to reconnect while there is a pending request? If not, you may need to gather more trace information. The api tracer should help, so you can modify the environment variables to GRPC_TRACE=api,tcp. The full list of available tracers is in this document.
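
If it is awkward to set those variables in the runtime configuration, one option (a sketch based on an assumption: it only works if these lines run before the grpc native module is loaded anywhere in the process) is to set them at the very top of the app's entrypoint:

    // Assumption: this file is the first thing Node executes, so the variables
    // are in place before grpc reads them during initialization.
    process.env.GRPC_VERBOSITY = 'DEBUG';
    process.env.GRPC_TRACE = 'api,tcp'; // or 'all' for maximum detail

    const grpc = require('grpc'); // load grpc only after the variables are set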

@sergioregueira

sergioregueira commented Feb 1, 2019

@murgatroid99 I mean that the application continues working properly, but Stackdriver does not process any more logs from the affected instance (including the gRPC traces). Automatically generated logs (i.e. appengine.googleapis.com/request_log) are not interrupted.

When that error happens I have to delete the affected instance, because the logs generated by that instance after the error are never registered.

The logs I attached in my previous post are all I could obtain after the errors. I will update that environment variable right now to get more information. If you have additional suggestions for debugging the problem, just let me know.

@murgatroid99
Member

I just want to make sure I am understanding this correctly. You are using the Stackdriver logger to save the gRPC logs generated from that same Stackdriver logger, which you are using to debug why the logger is failing?

@sergioregueira

@murgatroid99

On the one hand, our application logs are sent to Stackdriver via nodejs-logging (which uses gRPC internally) in a structured way (using the LogEntry.jsonPayload object).

On the other hand, the gRPC logs are printed by the library itself to stdout/stderr (we did not have to code anything) and are collected automatically by Stackdriver (as a simple LogEntry.textPayload string).

In conclusion, we use logs that are NOT sent via gRPC to debug the gRPC protocol.

@murgatroid99
Member

Actually, it would probably help the most to run with GRPC_TRACE=all. That will tell us with the greatest granularity possible the very last things the gRPC library does before we stop getting logs.

@sergioregueira

sergioregueira commented Feb 10, 2019

Good news! I updated the @google-cloud/logging package to version 4.3.0 one week ago, and no more errors have been detected in the integration, staging, or production environments since then. Thanks everyone :)

P.S.: I will continue collecting all gRPC logs for a few more days, just in case.

@Jeff-Tian

Met the same issue with latest grpc npm package (npm install [email protected]).


@fajardm

fajardm commented Jun 28, 2019

same issue

@RomanVarhola

RomanVarhola commented Aug 9, 2019

I have the same issue on AWS.

@QuestionAndAnswer

Same issue on GKE

@Sieabah

Sieabah commented Aug 29, 2019

We started seeing this issue across our applications last Friday. It's as if, once the connection drops, the client never reconnects and keeps trying to use the dead connection.

@jacohend

jacohend commented Oct 3, 2019

It looks like this often occurs when a connection drops in Docker Swarm. You can use endpoint-mode: dnsrr to fix it, but apparently this will nuke all published ports except those in host mode.

@JasperWoo

We are seeing this issue when using the Firestore libraries on GCP.

@Sieabah

Sieabah commented Oct 26, 2019

We ended up having to fix this at the network level to just drop the connection.

@stuckj

stuckj commented Jan 7, 2020

I'm seeing a similar, though not identical, error from grpc while using the firebase-admin JS library (which uses google/firestore). The exact error I'm seeing is:

12:17:53 PM: 2020-01-07T17:17:53.280Z e5cf6caf-cbc5-4cd8-a1fa-a73df0be9983 INFO Error processing request { Error: 14 UNAVAILABLE: Stream refused by server
at Object.callErrorFromStatus (/var/task/node_modules/@grpc/grpc-js/build/src/call.js:30:26)
at Http2CallStream.call.on (/var/task/node_modules/@grpc/grpc-js/build/src/client.js:96:33)
at Http2CallStream.emit (events.js:203:15)
at process.nextTick (/var/task/node_modules/@grpc/grpc-js/build/src/call-stream.js:75:22)
at process._tickCallback (internal/process/next_tick.js:61:11)
code: 14,
details: 'Stream refused by server',
metadata: Metadata { internalRepr: Map {}, options: {} } }

So, stream refused by server. I would chalk this up to intermittent Firestore errors, but we see this a few times an hour and currently have a pretty limited number of requests (only about 10 an hour or so), so this is a pretty high error rate.

This is happening in a backend cloud function on AWS infrastructure. Specifically, we're using Netlify, which uses AWS Lambda as its cloud function provider.

For now, I'm going to implement some retry logic in our cloud function to see whether that resolves the issue or whether it persists on the lambda instances where it shows up.

I'll also enable the GRPC_TRACE=all GRPC_VERBOSITY=DEBUG environment variables to see if that yields any more detail.

I'll report back with whatever I find.

@iamdhrooov

I am also facing the same issue in production code. I have set up a Go gRPC server and a Node gRPC client. This happens intermittently, which makes it even more of a pain to debug. I also tried setting keepalive params and timeouts on the server and the client. Please suggest a solution for this.
I will also update if I find another workaround.
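
For reference, keepalive on the Node client is configured through channel options passed when the client is constructed. A minimal sketch only; the backend address and the option values are placeholders, not a recommendation from this thread:

    const grpc = require('grpc'); // the same channel option names work with '@grpc/grpc-js'

    // Keepalive is controlled by channel options passed to the client constructor;
    // generated service clients accept the same trailing options argument.
    const client = new grpc.Client('my-backend.example.com:443',
      grpc.credentials.createSsl(), {
        'grpc.keepalive_time_ms': 30000,          // send a keepalive ping every 30 s
        'grpc.keepalive_timeout_ms': 10000,       // wait up to 10 s for the ping ack
        'grpc.keepalive_permit_without_calls': 1  // allow pings while no call is active
      });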

@wdalmut

wdalmut commented Jan 18, 2020

👍 same issue on production code with Hyperledger Fabric v1.2

@stuckj

stuckj commented Jan 27, 2020

I don't have useful logging to provide (complications with getting the debug logs sent to our log service). But I can say that after implementing retry logic (up to 5 retries with a fixed 100 ms delay rather than an exponential backoff) the problem has completely gone away.

So I think, in my case at least, the errors are considered normal by the Firestore library and the caller is expected to retry. /shrug
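
For anyone who wants to copy that approach, it amounts to a wrapper along these lines around each call. This is a sketch of the pattern described above, not code from the Firestore library; the collection name and payload in the usage comment are made up:

    // Retry an async operation a fixed number of times with a fixed delay
    // (the defaults mirror the approach described above: 5 tries, 100 ms apart).
    async function withRetry(operation, attempts = 5, delayMs = 100) {
      let lastError;
      for (let i = 0; i < attempts; i++) {
        try {
          return await operation();
        } catch (err) {
          lastError = err;
          // Only retry transient gRPC failures (code 14 = UNAVAILABLE).
          if (err.code !== 14) throw err;
          await new Promise((resolve) => setTimeout(resolve, delayMs));
        }
      }
      throw lastError;
    }

    // Usage with a hypothetical Firestore write:
    // await withRetry(() => db.collection('events').add(payload));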

@rodrigoreis22

@stuckj I'm seeing the same pattern on our AWS Lambda functions using the firestore client (latest version).

I was able to reproduce almost 100% of the cases in our dev environment.

I think this is related to how the container works in AWS Lambda. For the first call (a cold-start lambda) the Firestore client is initialized and the data is written properly; if I do a second try right after, it works fine because the lambda is "warm".
If I then wait ~5 minutes and try to post again, I get an error almost 100% of the time, and this was when I was the only one using the dev environment.

It looks like the persisted connection (or connection pool) from Firestore is not being reused, or cannot be reused, after the lambda function has had no activity for a few minutes.

Then I tried setting provisioned concurrency for the Lambda to keep it "warm". After waiting 5 minutes and trying again, I got a different problem: a timeout.

It looks like the connection pooling doesn't work very well in a serverless environment with ephemeral containers.

@murgatroid99
Member

For anyone who is experiencing this problem: if you are using the @grpc/grpc-js library (this is probably the case if you are using one of the Google Cloud or Firebase API client libraries), I suggest updating your dependencies to pick up the newest version of the library, published a couple of days ago. That version changes some of the logic that emits this kind of error, so you should no longer see it in some cases.

@orgads
Contributor

orgads commented Mar 27, 2020

I suppose you refer to #1199? It did not fix this issue.

orgads added a commit to orgads/grpc-node that referenced this issue Mar 30, 2020
This happens when a device doesn't have IPv6 address, but the endpoint
address is resolved to both IPv4 and IPv6 addresses.

The automatic attempt to reconnect that fails is detected as another channel
failure, which leads the load balancer to abort the connection.

Fixes grpc#664 and grpc#692
@murgatroid99
Member

@orgads I think I was actually referring to #1251. I don't think #1199 is related to the problems reported here at all.

@stefanotauriello

[quotes @rodrigoreis22's comment above in full]

I'm using it in AWS Lambda, how can I fix it?

@stuckj

stuckj commented May 20, 2020

Also, you definitely cannot create the Firestore client outside of your lambda handler if you're using AWS Lambda. The client must be created fresh for every request. The way Lambda "freeze-dries" your lambda functions when they go into a cold state breaks all socket connections, which messes up the state of the Firestore client (in my experience). I've had similar problems with Winston for logging, and the solution is to initialize whatever object holds the network connections inside the lambda handler.
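
A sketch of that pattern with firebase-admin in an AWS Lambda Node handler; the handler shape, app name, and collection are illustrative assumptions, not code from this thread:

    const admin = require('firebase-admin');

    // Create the Firestore client inside the handler so each invocation gets a
    // fresh connection instead of reusing a socket frozen with the container.
    exports.handler = async (event) => {
      const app = admin.initializeApp({
        credential: admin.credential.applicationDefault()
      }, `app-${Date.now()}`); // unique app name; the default app may already exist

      try {
        const db = app.firestore();
        await db.collection('events').add({ receivedAt: new Date(), event });
        return { statusCode: 200 };
      } finally {
        await app.delete(); // release the connection before the container is frozen
      }
    };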

@stefanotauriello

stefanotauriello commented May 20, 2020 via email

@ming2020

ming2020 commented Jun 1, 2020

Try downgrading grpc to 1.22.2.

@raghuchahar007

I am also having the same issue while doing a simple query on Firestore:
{ Error: 14 UNAVAILABLE: DNS resolution failed
at Object.exports.createStatusError (/node_modules/grpc/src/common.js:91:15)
at ClientReadableStream._emitStatusIfDone (/node_modules/grpc/src/client.js:233:26)
at ClientReadableStream._readsDone (/node_modules/grpc/src/client.js:199:8)
at /node_modules/grpc/src/client_interceptors.js:683:15

Version:
firebase-admin: 7.0.0
Node: 6.X

Please help!

@alexander-fenster
Contributor

@raghuchahar007 what are your gRPC versions? Please post the output of npm ls grpc @grpc/grpc-js google-gax.

@grahamjenson

We have a similar issue, and we worked around it by putting the following in .npmrc:

    build_from_source=grpc
    grpc_alpn=false

I am hoping to find a way to do this without rebuilding, but this is my best guess.

@dooleyb1

Resolved personally by updating node modules FROM:

    "firebase-admin": "^8.7.0",
    "googleapis": "^42.0.0",

TO:

    "firebase-admin": "^9.1.1",
    "googleapis": "^59.0.0",

@mayankpunetha007

Having the same problem with:
"@grpc/grpc-js": "^1.3.7",
"@grpc/proto-loader": "^0.6.5"

Setup is:
Local Node Server ---HTTPS----> Google Load Balancer ------HTTP2----> GRPC backend
Always getting:
Error: 14 UNAVAILABLE: Connection dropped

Any luck with this for anyone else?
