Error: 14 UNAVAILABLE: TCP Read failed #692
Comments
I am having this same issue on Google App Engine standard, Node 10. I am using @google-cloud/logging-winston, which depends on @google-cloud/logging, which in turn depends on @google-cloud/common-grpc, which uses this package. So I can't tell what is causing this issue. Is this a problem with this package or with how the other packages use it?
Related issue: googleapis/google-cloud-node#2438
The status responses described here are expected behavior of gRPC itself. Generally, we expect that some responses, including UNAVAILABLE errors like this one, will occur occasionally and should be handled by the caller.
@murgatroid99 the problem is that the gRPC library does not reconnect after that error, so applications stop sending logs to Stackdriver. When this happens, we have to replace the affected instance with a new one.
First, to make sure that we're on the same page: the way gRPC reconnection is supposed to work is that if a connection fails, then the next time the user starts a call the library starts trying to reestablish the connection. A user that retries failed requests will therefore trigger reconnection attempts. So, are you saying that the Stackdriver library is retrying the failed request after the connection drops, and when it does so the gRPC library never reestablishes the connection? If so, that is a major bug that we are otherwise unaware of, so please file a full bug report with reproduction steps.
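As an aside for readers hitting this: the reconnection behavior described above can be exercised explicitly. The sketch below assumes the grpc package and a placeholder backend address; waitForReady (or simply starting another call) is what prompts the channel to try reconnecting after a failure.

```js
const grpc = require('grpc'); // the same API exists on @grpc/grpc-js clients

// Placeholder address; generated service clients extend grpc.Client and
// expose the same waitForReady method.
const client = new grpc.Client('my-backend.example.com:443',
                               grpc.credentials.createSsl());

// After a dropped connection the channel sits idle (or in backoff) until
// something needs it again. waitForReady, like a new call, triggers the
// reconnection attempt.
const deadline = new Date(Date.now() + 5000);
client.waitForReady(deadline, (err) => {
  if (err) {
    console.error('channel did not become ready:', err.message);
  } else {
    console.log('channel is ready (reconnected if it had dropped)');
  }
});
```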
Emphasis on "if the library is retrying the failed request". The initial bug report seems to indicate this isn't the case.
To add some details here. Client libraries use gRPC via the gax layer, and that gax layer can send retries if the library is configured to do so. By 'configured so' I mean the JSON configuration file that is passed to the client library constructor. If this file marks a given method as idempotent - i.e. safe to retry - the gax layer will do the retries. If it's non-idempotent, just like all POST methods, it won't be retried automatically and retrying should be taken care of by the caller.

In the logging library, the configuration file is here. @crwilcox made a change 2 weeks ago to treat all the logging methods as idempotent, which means that you might get a duplicated log line in case of a retry, but you should not get failures: googleapis/nodejs-logging#366. It does not fix the TCP error that makes gRPC fail, but at least it should make gax retry the failed call (we are now talking only about the logging libraries). Let us make a new release of the logging library; it should fix most of the problems related to the logging libraries.

For other libraries, let's go one by one and see what's going on, but in general, the caller code should be ready to get transient errors such as UNAVAILABLE and retry.
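To make the configuration being described more concrete, here is roughly the shape of a gax-style client config entry; the service and method names match the Logging API, but the numbers are illustrative rather than copied from the linked file.

```js
// Illustrative gax-style client config (values are examples, not the real ones).
const clientConfig = {
  interfaces: {
    'google.logging.v2.LoggingServiceV2': {
      retry_codes: {
        // Codes that gax may retry for methods marked "idempotent".
        idempotent: ['DEADLINE_EXCEEDED', 'UNAVAILABLE'],
        non_idempotent: []
      },
      retry_params: {
        default: {
          initial_retry_delay_millis: 100,
          retry_delay_multiplier: 1.3,
          max_retry_delay_millis: 60000,
          initial_rpc_timeout_millis: 20000,
          rpc_timeout_multiplier: 1,
          max_rpc_timeout_millis: 20000,
          total_timeout_millis: 600000
        }
      },
      methods: {
        // Pointing a method at the "idempotent" retry codes is what makes gax
        // retry it automatically; methods left as "non_idempotent" must be
        // retried by the caller.
        WriteLogEntries: {
          timeout_millis: 60000,
          retry_codes_name: 'idempotent',
          retry_params_name: 'default'
        }
      }
    }
  }
};

module.exports = clientConfig;
```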
@murgatroid99 @nicolasnoble I am sorry for the delay, I was a bit busy these days. The same day you asked me for more information, I enabled gRPC debugging by setting some environment variables (GRPC_TRACE and GRPC_VERBOSITY). I could check, as you mentioned, that the library does try to reconnect when a failed request is retried.
Nevertheless, I found some cases where the reconnection was not completed (and logs stopped):
I think the change done by @crwilcox will reduce the probability of lost logs due to request errors, but the error that I have just detailed will continue happening. By the way, thanks @alexander-fenster for your clear explanation!
More examples where it does not reconnect:
I released a new version of the logging library with the change described above, so gax should now retry those calls. For operations deemed non-idempotent, retries still need to be handled by the caller.
@sergioregueira When you say that the logs ceased, do you mean that in each case that is the last log line output from that process, and that the other logs come from different processes? If so, what is the behavior of those processes after they stop outputting logs?

Also, I don't know exactly what behavior of the gRPC library we are seeing here, but the expected behavior is not that the library always reconnects; it generally only reconnects if there is an existing pending request. Do any of the logs you have show that it is failing to reconnect while there is a pending request? If not, you may need to get more trace information to see that. The GRPC_TRACE environment variable controls which internal traces the library outputs.
@murgatroid99 I mean the application continues working properly, but Stackdriver does not process any more logs from the affected instance (including gRPC traces). When that error happens I have to delete the affected instance, because the logs generated by that instance after the error are not registered. The logs I attached in my previous post are all I could obtain after the errors. I will update that environment variable right now to get more information. If you have additional suggestions to debug the problem, just let me know.
I just want to make sure I am understanding this correctly. You are using the Stackdriver logger to save the gRPC logs generated from that same Stackdriver logger, which you are using to debug why the logger is failing?
On the one hand, our application logs are sent to Stackdriver via @google-cloud/logging-winston. On the other hand, gRPC logs are printed by the library itself to stderr. In conclusion, we use logs that are NOT sent via gRPC to debug the gRPC protocol issue.
Actually, it would probably help the most to run with GRPC_VERBOSITY=DEBUG and GRPC_TRACE=all to get the most complete trace output.
Good news! I have updated the library and the environment variables as suggested, and the error has not reappeared so far. P.S.: I will continue logging all gRPC traces for a few more days just in case.
Met the same issue with the latest version.
Same issue.
I have the same issue on AWS.
Same issue on GKE.
We started seeing this issue across our applications last Friday. It's as if, once the connection drops, the client never reconnects and keeps trying to use the dead connection.
Looks like this often occurs when a connection drops in Swarm. You can use
We are seeing this issue when using the Firestore libraries in GCP.
We ended up having to fix this at the network level to just drop the connection.
I'm seeing a similar, though not identical, error from grpc while using the firebase-admin JS library (which uses google/firestore). The exact error I'm seeing is:
This is happening in a backend cloud function on AWS infrastructure. Specifically, we're using Netlify, which uses AWS Lambda as its cloud function provider. For now, I'm going to implement some retry logic in our cloud function to see whether that resolves the issue or whether it persists on any Lambda instances where it shows up. I'll also enable the gRPC debug environment variables to capture more detail. I'll report back with whatever I find.
Also facing the same issue in production code. We have set up a Go gRPC server and a Node gRPC client. This happens intermittently, which makes it even more of a pain to debug. I also tried setting keepalive params and timeouts on both the server and the client. Please suggest a solution for this.
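For anyone experimenting with keepalive as mentioned above, this is a minimal sketch of the channel options a Node client can pass; the address and numbers are placeholders, and tuning keepalive only helps detect dead connections sooner rather than being a guaranteed fix.

```js
const grpc = require('grpc'); // the same channel options work with @grpc/grpc-js

// Placeholder address; a generated service client accepts the same options
// object as its third constructor argument.
const client = new grpc.Client('my-go-server.example.com:50051',
                               grpc.credentials.createInsecure(), {
  // Ping the server every 30s so a silently dropped TCP connection is
  // detected instead of only failing on the next call.
  'grpc.keepalive_time_ms': 30000,
  // Treat the connection as dead if a ping is not acknowledged within 10s.
  'grpc.keepalive_timeout_ms': 10000,
  // Allow keepalive pings even when there are no active calls. Note that the
  // Go server may need a matching keepalive.EnforcementPolicy, or it will
  // close the connection with "too_many_pings".
  'grpc.keepalive_permit_without_calls': 1
});
```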
👍 same issue on production code with Hyperledger Fabric v1.2
I don't have useful logging to provide (complications with getting the debug logs to send to our log service). But I can say that after implementing retry logic (up to 5 retries with a fixed 100 ms delay rather than exponential backoff) the problem has completely gone away. So I think in my case, at least, the errors are considered normal by the Firestore library and require the user to retry. /shrug
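For reference, the retry logic described above amounts to something like the sketch below; the helper name and the Firestore call in the usage comment are made up, and only UNAVAILABLE (code 14) is retried.

```js
// Generic retry helper: up to 5 attempts with a fixed 100 ms delay,
// retrying only UNAVAILABLE (gRPC status code 14).
const UNAVAILABLE = 14;
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetries(operation, attempts = 5, delayMs = 100) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await operation();
    } catch (err) {
      if (err.code !== UNAVAILABLE) throw err; // only retry the transient error
      lastError = err;
      await sleep(delayMs); // fixed delay, deliberately not exponential
    }
  }
  throw lastError;
}

// Example usage with a hypothetical Firestore write:
// await withRetries(() => db.collection('events').doc(id).set(payload));
```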
@stuckj I'm seeing the same pattern on our AWS Lambda functions using the Firestore client (latest version). I was able to reproduce it in almost 100% of cases in our dev environment. I think this is related to how the container works in AWS Lambda. For the first call (cold-start Lambda) the Firestore client is initialized and the data is written properly; if I do a second try right after, it works fine because the Lambda is "warm". It looks like the persisted connection/connection pool from Firestore is not being reused, or not able to be reused, after the Lambda function has had no activity for a few minutes. I then tried setting provisioned concurrency for Lambda to try to keep it "warm": after waiting 5 minutes and trying again I got a different problem, a timeout. It looks like the connection pooling as designed doesn't work very well in a serverless environment with ephemeral containers.
For anyone who is experiencing this problem, if you are using the
I suppose you refer to #1199? It did not fix this issue.
I'm using it in AWS Lambda, how can I fix it?
Also, you definitely cannot create the Firestore client outside of your Lambda handler if you're using AWS Lambda. The client must be created fresh for every request. The way Lambda "freeze-dries" your lambda functions when they go into a cold state will break all socket connections, which will mess up the state of the Firestore client (in my experience). I've had similar problems with Winston for logging, and the solution is to initialize whatever object has the network connections inside of the lambda.
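A minimal sketch of that pattern, assuming the @google-cloud/firestore client, a Lambda-style handler, and made-up collection/document names; credentials are assumed to come from environment configuration.

```js
const { Firestore } = require('@google-cloud/firestore');

exports.handler = async (event) => {
  // Create the client inside the handler so a thawed ("freeze-dried")
  // container never reuses sockets that died while the function was frozen.
  const db = new Firestore();

  const snapshot = await db.collection('requests').doc('latest').get();
  return {
    statusCode: 200,
    body: JSON.stringify(snapshot.exists ? snapshot.data() : null)
  };
};
```

Creating the client per invocation adds a little latency on every request, but it avoids the dead-socket state described above.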
Thanks
Try reducing
I am also having the same issue while doing a simple query on Firestore. Please help!
@raghuchahar007 what are your gRPC versions? Please post the output of npm ls grpc.
We have a similar issue and by putting in the
I am hoping to find a way to do this without rebuilding, but this is my best guess.
Resolved personally by updating node modules FROM:
TO:
Having the same problem. Any luck with this for anyone else?
Problem description
Conversation Error: { Error: 14 UNAVAILABLE: TCP Read failed
at Object.exports.createStatusError (/relay/node_modules/grpc/src/common.js:91:15)
at ClientDuplexStream._emitStatusIfDone (/relay/node_modules/grpc/src/client.js:233:26)
at ClientDuplexStream._receiveStatus (/relay/node_modules/grpc/src/client.js:211:8)
at Object.onReceiveStatus (/relay/node_modules/grpc/src/client_interceptors.js:1306:15)
at InterceptingListener._callNext (/relay/node_modules/grpc/src/client_interceptors.js:568:42)
at InterceptingListener.onReceiveStatus (/relay/node_modules/grpc/src/client_interceptors.js:618:8)
at /relay/node_modules/grpc/src/client_interceptors.js:1123:18
code: 14,
metadata: Metadata { _internal_repr: {} },
details: 'TCP Read failed' }
Environment
├─┬ [email protected]
│ └── [email protected] deduped
└── [email protected]
Additional context
It happens randomly, but it started doing that about a month ago.