URGENT: [composer] GRPC randomly throws ServiceException("Socket closed") #2427
Comments
@dwsupplee firestore, not spanner
@lukasgit what version of the grpc extension do you have installed? You can find this value by running:
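Presumably something along the lines of `php --ri grpc` on the command line; from within PHP itself, a minimal check looks like this (a sketch, assuming the extension is loaded for the SAPI you run it under):

```php
<?php
// Prints the installed grpc extension version, or bool(false) if the
// extension is not loaded for this SAPI.
var_dump(phpversion('grpc'));
```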
Thanks @lukasgit. Does it seem to always be related to a particular service call, or does it happen randomly? Would you be able to enable gRPC debugging and share the results relating to a failure? You can set the following environment variables to turn on debugging:
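These are presumably gRPC's standard debug switches, GRPC_VERBOSITY and GRPC_TRACE (the concrete values below are typical examples, not taken from this thread). A rough sketch of setting them from PHP; note that later comments find putenv() may be too late for the extension to pick them up:

```php
<?php
// Presumed gRPC debug switches; set before any gRPC client is created.
putenv('GRPC_VERBOSITY=DEBUG'); // one of DEBUG, INFO, ERROR
putenv('GRPC_TRACE=all');       // or a comma-separated list of tracers
```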
@jdpedrie randomly, and I'm not seeing any gRPC debugging info in the PHP logs.
@lukasgit I'm sorry I was unclear. Setting those variables will cause additional logging data to be written to stderr.
@jdpedrie not seeing any gRPC debugging info on stderr. php.ini:
@jdpedrie ^^
Hey @lukasgit, sorry for not responding sooner. I'm working on this though; I hope to have more by the end of the day, or tomorrow at the latest.
@jdpedrie great, thanks for the update.
Are you using PHP-FPM? Make sure your configuration has:
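The directive in question is presumably php-fpm's environment handling: FPM clears the worker environment by default, so variables exported to the master process never reach PHP. A sketch of the relevant pool settings (the pool file name and the values are assumptions):

```ini
; php-fpm pool configuration (e.g. www.conf)
; Either stop FPM from clearing the environment entirely:
clear_env = no
; ...or pass the gRPC debug variables through explicitly:
env[GRPC_VERBOSITY] = DEBUG
env[GRPC_TRACE] = all
```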
@jdpedrie still nothing... here is what I have configured: nginx-1.17.5 (dev environment compiled with --with-debug):
php-fpm (php-7.3.11):
php code:
Hi @lukasgit, I'm sorry again for the back-and-forth. I've been trying without much success to capture the gRPC debugging data from nginx/php-fpm. Do you have control over the system which is managing the FPM daemon?
Additionally, I spoke with a contact on the gRPC team, and he suggested you toggle the debug and verbosity a bit differently than what I advised earlier:
I've opened an issue on gRPC to improve the utilities for capturing gRPC debugging information in PHP.
Hi @jdpedrie, no worries on the back-and-forth. Whatever it takes for us to resolve this issue. I do have full control over the development system. Following your instructions, still nothing related to gRPC in
nginx-1.17.5 (dev environment compiled with --with-debug):
php-fpm (php-7.3.11):
php code:
Hi @lukasgit, @stanley-cheung, the person I've been talking to on the gRPC team, mentioned that in his tests using Apache, he found that setting the environment variables with putenv, and even in the server configuration, was too late and they were ineffective. In his test using Docker, he set them in the Dockerfile and had better luck. Could you try setting them at the highest level possible?
Hi @lukasgit, I am one of the maintainers of the grpc PHP extension. The only thing that seems to work for me (in terms of the php-fpm + nginx setup) is to do a combination of these two:
I have tried all of those, but even with them set, the logs are still hard to capture in this setup. So I am currently working on adding a php.ini option to the grpc extension so that we can divert all those grpc logs into a separate log file, tracked here. Will keep you updated. But also, just so I know: for the initial error, how rarely or how often does it happen? Are we talking about 1 in every 10 requests, or 1 in every 10,000 requests? Just want to see what the scale of things is.
Hi @stanley-cheung, thanks for jumping in on this issue. I will hang tight until the php.ini option for the grpc extension is available. As for how many times the error occurs per x requests, it varies widely. Sometimes it happens 7 times out of 50 requests; sometimes 0 times out of 1,000 requests. We haven't run a stress test of 10,000 requests against grpc yet.
I started this PR: grpc/grpc#20991. It works for the most part but may need some more polish. I am slightly concerned about the lack of file rotation / capping of the log file's size: as it stands, the log file will grow unbounded. There might be a need for a separate cron job or something to monitor and regularly truncate this log file.
Hi @stanley-cheung, any update on this issue? Thanks :)
@lukasgit Sorry for the late reply. We are very close to cutting a new release candidate.
@stanley-cheung since our last chat, we upgraded the composer version and the error has only happened once (yesterday). So yeah, it's still very random. I can definitely run your next RC; just provide me with instructions when you're ready. Thanks!
For anyone else reading this thread, the relevant line goes in php.ini. Also, don't forget to make your log file writeable.
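Based on the options that the PR above adds to the grpc extension, the php.ini snippet presumably looks something like this (option names assumed from the extension's documentation; the log path is only an example):

```ini
; php.ini — gRPC extension logging (options added by grpc/grpc#20991)
grpc.grpc_verbosity = debug
grpc.grpc_trace = all
grpc.log_filename = /var/log/grpc/php_grpc.log ; must be writable by the PHP worker
```

As noted earlier in the thread, the extension does not rotate or cap this file, so something external (a cron job or logrotate) needs to truncate it periodically.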
@stanley-cheung @dwsupplee @schmidt-sebastian getting close to a year later. This is a pretty serious bug that affects production. Do you have a fix, or at the very least an update?
This issue was mis-categorized as a "question" and thus slipped through our SLO filters. I've adjusted it. cc: @dwsupplee
Thank you @meredithslota
I think the same issue might cause:
Santa clou(d)s please help us... |
Hi @lukasgit, an update on this: I will respond here once I have a sense of what improvements can be made from the library's side at this point in time.
Huge thanks for taking the initiative on resolving this issue. As a workaround, we run a function that attempts to execute interactions with Firestore; if an error occurs, the function logs the error and tries again, up to five times, before giving up. This seems to work, as the first attempt fails (randomly) while the second attempt succeeds. NOTE: in the last six months we've noticed a substantial decrease in these errors. Unfortunately, we're now receiving new error messages since November 9, 2021 (note these show the same pattern):
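For reference, a minimal sketch of the kind of retry wrapper described above; the exact code was not posted, so the function name, the logging call, and the backoff are illustrative assumptions, with only the five-attempt limit taken from the description:

```php
<?php
use Google\Cloud\Core\Exception\ServiceException;

/**
 * Illustrative retry wrapper (not the poster's actual code): runs $operation,
 * logging and retrying on transient gRPC failures such as "Socket closed".
 */
function runWithRetries(callable $operation, int $maxAttempts = 5)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            return $operation();
        } catch (ServiceException $e) {
            error_log(sprintf('Firestore attempt %d/%d failed: %s', $attempt, $maxAttempts, $e->getMessage()));
            if ($attempt === $maxAttempts) {
                throw $e; // give up after the final attempt
            }
            usleep(100000 * $attempt); // simple linear backoff before retrying
        }
    }
}

// Example usage, assuming $firestore is an existing FirestoreClient instance:
// $docs = runWithRetries(function () use ($firestore) {
//     return $firestore->collection('users')->documents();
// });
```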
We are seeing high numbers of these errors in production when talking to Firestore in Datastore mode, from App Engine 2nd-generation runtimes.
Hi @iamacarpet, can you tell me how frequently you are experiencing the errors (once a week, once a day, etc.)? Meanwhile, I'll try to set up a check in App Engine as well.
Hi @lukasgit, it would be great if you could also paste in your workaround code (the retry mechanism). I want to understand whether there is some difference from the built-in retry mechanism, because retries should happen automatically up to 5 times unless the option is overridden (which I am sure isn't the case here).
@saranshdhingra I'll share some info from our new project that has been in testing. Here is the traffic level (it started getting real visitor traffic 5 days ago): And the error we saw in Error Reporting: The error has dropped off now, as we were already using the latest version of affordablemobiles/GServerlessSupportLaravel@303ae7c. We used to see a lot of "Connection reset by peer" before it switched from Datastore to Firestore in Datastore mode, so it appears this error is the equivalent. Example usage: Since deploying this, the errors appear to have gone away. We are currently in the process of deploying to our other projects; one of them has an error rate that looks like this:
Thanks a lot for the detailed info, @iamacarpet
@saranshdhingra thanks. If it helps, the error thrown is weird: it's the only exception we've seen thrown without a stack trace, and from what I can tell, it has something to do with being thrown from within the binary gRPC module. Could be unrelated, but mentioning it in case it helps.
My thoughts exactly, @iamacarpet. FYI, we have not experienced any further "UNAVAILABLE" errors since December 21, 2021, using the latest versions:
I haven't seen any errors in the project I created to replicate this error. But I want to let you know that we have efforts planned for improving the error messages across the library for all products. I will try to come back to this and see if our efforts make such situations better.
Hey there @saranshdhingra, could you share what OS you're running on and what version of the gRPC library you're using where you do not see the "socket closed" issue? I'm inclined to think this is an issue in the network stack below gRPC, or perhaps in how gRPC is interacting with the native network stack. We are running on Ubuntu 18.04 with gRPC v1.28.1 and see this issue with some frequency (on the order of tens of minutes during a long-lived bidirectional streaming connection). I'm curious whether this issue was solved or whether it just did not reproduce in your environment and hence was closed? Thanks!
@jaryder I'm going to copy your comment into a new issue for investigation, since it's still occurring.
@dwsupplee @jdpedrie this issue still randomly persists.
URGENT REQUEST. We're part of the Google Cloud Startup program and launching this year... a fix would be greatly appreciated so we can move to production.
Log output: