
URGENT: [composer] GRPC randomly throws ServiceException("Socket closed") #2427

Closed
lukasgit opened this issue Oct 26, 2019 · 97 comments
Labels: api: firestore · priority: p2 · type: bug · 🚨 This issue needs some love.

Comments

@lukasgit

@dwsupplee @jdpedrie this issue still randomly persists.

URGENT REQUEST. We're part of the Google Cloud Startup program and launching this year... a fix would be greatly appreciated so we can move to production.

Log output:

PHP Fatal error:  Uncaught Google\Cloud\Core\Exception\ServiceException: {
    "message": "Socket closed",
    "code": 14,
    "status": "UNAVAILABLE",
    "details": []
} in /[...]/Google/composer-google-cloud/vendor/google/cloud/Core/src/GrpcRequestWrapper.php:257
Stack trace:
#0 /[...]/Google/composer-google-cloud/vendor/google/cloud/Core/src/GrpcRequestWrapper.php(194): Google\Cloud\Core\GrpcRequestWrapper->convertToGoogleException(Object(Google\ApiCore\ApiException))
#1 [internal function]: Google\Cloud\Core\GrpcRequestWrapper->handleStream(Object(Google\ApiCore\ServerStream))
#2 /[...]/Google/composer-google-cloud/vendor/google/cloud/Firestore/src/SnapshotTrait.php(122): Generator->current()
#3 /[...]/Google/composer-google-clo in /[...]/Google/composer-google-cloud/vendor/google/cloud/Core/src/GrpcRequestWrapper.php on line 257
@lukasgit lukasgit changed the title [composer] GRPC randomly throws ServiceException("Socket closed") URGENT: [composer] GRPC randomly throws ServiceException("Socket closed") Oct 26, 2019
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Oct 26, 2019
@dwsupplee dwsupplee added api: spanner Issues related to the Spanner API. type: question Request for information or clarification. Not an issue. and removed triage me I really want to be triaged. labels Oct 28, 2019
@lukasgit
Author

@dwsupplee firestore, not spanner

@dwsupplee dwsupplee added api: firestore Issues related to the Firestore API. and removed api: spanner Issues related to the Spanner API. labels Oct 29, 2019
@jdpedrie
Contributor

@lukasgit what version of the grpc extension do you have installed? You can find this value by running pecl list if you installed via PECL, or in the output of php -i or phpinfo().

@lukasgit
Author

@jdpedrie

Package  Version State
grpc     1.23.1  stable
protobuf 3.10.0  stable

@jdpedrie
Contributor

jdpedrie commented Oct 31, 2019

Thanks @lukasgit.

Does it seem to always be related to a particular service call, or does it happen randomly?

Would you be able to enable gRPC debugging and share the results relating to a failure? You can set the following environment variables to turn on debugging: GRPC_VERBOSITY=debug GRPC_TRACE=all. Please note that the debug output may include sensitive information such as access tokens, so if you'd prefer, you may send it to me via the email address on my profile.

@lukasgit
Author

lukasgit commented Oct 31, 2019

@jdpedrie randomly, and I'm not seeing any gRPC debugging info in the php logs:

putenv('GRPC_VERBOSITY=debug');
putenv('GRPC_TRACE=all');

@jdpedrie
Contributor

@lukasgit I'm sorry I was unclear. Setting those variables will cause additional logging data to be written to stderr, not to the PHP log.

@lukasgit
Author

lukasgit commented Nov 1, 2019

@jdpedrie not seeing any gRPC debugging info to stderr.

php.ini
display_errors = stderr

@lukasgit
Author

lukasgit commented Nov 6, 2019

@jdpedrie ^^

@jdpedrie
Contributor

jdpedrie commented Nov 6, 2019

Hey @lukasgit, sorry for not responding sooner. I'm working on this, though; I hope to have more by the end of the day, or tomorrow at the latest.

@lukasgit
Author

lukasgit commented Nov 6, 2019

@jdpedrie great, thanks for the update.

@jdpedrie
Contributor

jdpedrie commented Nov 6, 2019

Are you using PHP-FPM? Make sure your configuration has catch_workers_output=yes. This should cause the workers' stderr output to be written to the server error log.

@lukasgit
Author

lukasgit commented Nov 6, 2019

@jdpedrie still nothing... here is what I have configured:

nginx-1.17.5 (dev environment compiled with --with-debug):

error_log logs/error.log debug;

php-fpm (php-7.3.11):

catch_workers_output = yes
env[GRPC_VERBOSITY] = debug
env[GRPC_TRACE] = all

php code:

putenv('GRPC_VERBOSITY=debug');
putenv('GRPC_TRACE=all');

@jdpedrie
Contributor

jdpedrie commented Nov 8, 2019

Hi @lukasgit, I'm sorry again for the back-and-forth. I've been trying, without much success, to capture the gRPC debugging data from nginx/php-fpm.

Do you have control over the system that manages the FPM daemon (supervisord or systemd, for instance)? Are you able to configure php-fpm to start with stderr explicitly redirected to a file?

php-fpm 2>/var/log/php-fpm.log

Additionally, I spoke with a contact on the gRPC team, and he suggested you toggle the debug and verbosity a bit differently than what I advised earlier:

GRPC_VERBOSITY=debug                                                                                    
GRPC_TRACE=api,call_error,channel,client_channel_call,connectivity_state,handshaker,http,subchannel,tcp

I've opened an issue on gRPC to improve the utilities for capturing gRPC debugging information in PHP.

@lukasgit
Author

lukasgit commented Nov 8, 2019

Hi @jdpedrie, no worries on the back-and-forth. Whatever it takes for us to resolve this issue.

I do have full control over the development system. Following your instructions, still nothing related to gRPC in /var/log/php-fpm.log :

[08-Nov-2019 21:24:18] NOTICE: fpm is running, pid 1104
[08-Nov-2019 21:24:18] NOTICE: ready to handle connections

nginx-1.17.5 (dev environment compiled with --with-debug):

error_log logs/error.log debug;

php-fpm (php-7.3.11):

catch_workers_output = yes
env[GRPC_VERBOSITY] = debug
env[GRPC_TRACE] = api,call_error,channel,client_channel_call,connectivity_state,handshaker,http,subchannel,tcp

php code:

putenv('GRPC_VERBOSITY=debug');
putenv('GRPC_TRACE=api,call_error,channel,client_channel_call,connectivity_state,handshaker,http,subchannel,tcp');

@jdpedrie
Contributor

jdpedrie commented Nov 8, 2019

Hi @lukasgit,

@stanley-cheung, the person I've been talking to on the gRPC team, mentioned that in his tests using Apache, setting the environment variables with putenv() (and even in the server configuration) happened too late to take effect. In his test using Docker, he set them in the Dockerfile and had better luck. Could you try setting them at the highest level possible? Perhaps /etc/environment or similar.
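
For example, something like this (a sketch only; adjust to however your FPM master process is launched):

# /etc/environment (or an ENV line in the Dockerfile, or Environment= in a systemd unit)
GRPC_VERBOSITY=debug
GRPC_TRACE=api,call_error,channel,client_channel_call,connectivity_state,handshaker,http,subchannel,tcp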

@stanley-cheung
Contributor

Hi @lukasgit, I am one of the maintainers of the grpc PHP extension.

The only thing that seems to work for me (in a php-fpm + nginx setup) is the combination of these two:

  • Set the environment variables at the top of my Dockerfile (I am using FROM php:7.2-fpm-stretch as a base, for example)
  • Run php-fpm 2>/var/log/php-fpm.log to start the service (instead of relying on the standard CMD) to redirect stderr away to my desired log file

I have tried catch_workers_output = yes and env[GRPC_VERBOSITY] = debug in the FPM www.conf file, but none of those work. Doing putenv() in a .php script is definitely too late.
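
Concretely, the combination that works for me looks roughly like this (a sketch of the two points above; the log path is just an example):

# Dockerfile (set the env vars near the top, before php-fpm starts)
FROM php:7.2-fpm-stretch
ENV GRPC_VERBOSITY=debug
ENV GRPC_TRACE=api,call_error,channel,client_channel_call,connectivity_state,handshaker,http,subchannel,tcp

# start the service with stderr redirected to a file (instead of the image's default CMD)
CMD ["sh", "-c", "php-fpm 2>/var/log/php-fpm.log"]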

But even with the php-fpm 2>mylog trick, the log output is kind of ugly: every line is prefixed with "[pool www] child 12 said into stdout:" and long lines get broken up.

So I am currently working on adding php.ini options to the grpc extension so that all of the grpc logs can be diverted into a separate log file; it's tracked here. Will keep you updated.

Also, just so I know: how rare/often does the initial error happen? Are we talking about 1 in every 10 requests, or 1 in every 10,000? I just want to get a sense of the scale.

@lukasgit
Author

lukasgit commented Nov 9, 2019

Hi @stanley-cheung thanks for jumping in on this issue.

I will hang tight until the php.ini option for the grpc extension is available.

As for how many times the error occurs per x requests, it varies widely. Sometimes it happens 7 times out of 50 requests; sometimes 0 times out of 1,000 requests. We haven't run a stress test of 10,000 requests against gRPC yet.

@stanley-cheung
Contributor

I started this PR: grpc/grpc#20991. It works for the most part but may need some more polish. I am slightly concerned about the lack of file rotation / capping of the log file size: as it stands, the log file will grow unbounded. A separate cron job (or something similar) may be needed to monitor and regularly truncate the log file.

@lukasgit
Author

lukasgit commented Dec 4, 2019

Hi @stanley-cheung any update on this issue? Thanks :)

@stanley-cheung
Contributor

@lukasgit Sorry for the late reply. We are very close to cutting a 1.26.0RC1 release candidate, which will contain that fix. Once that's done, will you be able to install it in your environment and enable the gRPC log to a separate log file? You may need to keep an eye on the file's growth, since the error doesn't happen frequently.
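
(If you installed the extension via PECL, the release candidate can presumably be installed with an explicit version once it's published, e.g. pecl install grpc-1.26.0RC1.)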

@lukasgit
Author

lukasgit commented Dec 5, 2019

@stanley-cheung since our last chat, we upgraded the Composer-installed packages and the error has only happened once (yesterday). So yeah, it's still very random. I can definitely run your next RC; just provide me with instructions when you're ready. Thanks!

@tayeke

tayeke commented Jun 17, 2021

For anyone else reading this thread: the line grpc.grpc_trace=all,-timer_check actually broke my grpc (something about the minus syntax). It made the issue I was originally trying to resolve 1000% worse. This time around, while debugging, I happened to try the other syntax offered above instead: grpc.grpc_trace=api,call_error,channel,client_channel_call,connectivity_state,handshaker,http,subchannel,tcp.

Also, don't forget to make your log file writable: sudo chmod 666 /var/log/grpc.log
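
For reference, the full set of php.ini entries this relies on (assuming the grpc extension is 1.26+ with the logging options discussed earlier in this thread) looks like:

grpc.grpc_verbosity=debug
grpc.grpc_trace=api,call_error,channel,client_channel_call,connectivity_state,handshaker,http,subchannel,tcp
grpc.log_filename=/var/log/grpc.log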

@lukasgit
Author

lukasgit commented Jul 8, 2021

Why is the ball getting dropped here? I've provided ample data and we've identified the problem. Now we're waiting for a fix on your end. Numerous people across many threads and community platforms are complaining about this issue. @stanley-cheung @dwsupplee @schmidt-sebastian

@stanley-cheung @dwsupplee @schmidt-sebastian getting close to a year later. This is a pretty serious bug that affects production. Do you have a fix or at the very least an update?

@meredithslota meredithslota added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed type: question Request for information or clarification. Not an issue. labels Sep 17, 2021
@meredithslota

This issue was mis-categorized as a "question" and thus slipped through our SLO filters. I've adjusted it. cc: @dwsupplee

@lukasgit
Author

Thank you @meredithslota

@sl0wik

sl0wik commented Oct 26, 2021

I think the same issue might cause { "message": "Connection reset by peer", "code": 14, "status": "UNAVAILABLE", "details": [] }.

@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Nov 3, 2021
@illambo

illambo commented Nov 16, 2021

Santa clou(d)s please help us...

@saranshdhingra
Contributor

Hi @lukasgit
Thanks a lot for your patience with this issue. I understand it's a bit frustrating to still be getting the errors after such a long time.

Update on this:
I created a setup based on this comment (#2427 (comment)) and it's been running for almost 10 days. So far I have had 1 occurrence of "UNAVAILABLE", with both application logs and gRPC logs enabled.

I will report back here once I have a sense of what improvements can be made on the library's side at this point.

@lukasgit
Author

Hi @saranshdhingra

Huge thanks for taking initiative on resolving this issue.

As a workaround, we run a function that attempts each interaction with Firestore; if an error occurs, the function logs the error and tries again, up to five times, before giving up. This seems to work: the first attempt fails (randomly), while the second attempt succeeds. NOTE: in the last six months we've noticed a substantial decrease in "message": "Socket closed" / "status": "UNAVAILABLE" errors.
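
Roughly, the wrapper looks like this (a simplified sketch; the function name and document path are placeholders, not our exact production code):

<?php
use Google\Cloud\Core\Exception\ServiceException;
use Google\Cloud\Firestore\FirestoreClient;

// Simplified sketch of the retry wrapper described above.
function firestoreWithRetry(callable $operation, int $maxAttempts = 5)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            return $operation();
        } catch (ServiceException $e) {
            error_log("FIRESTORE error (attempt {$attempt}): " . $e->getMessage());
            if ($attempt === $maxAttempts) {
                throw $e; // give up after the final attempt
            }
        }
    }
}

$firestore = new FirestoreClient();
$snapshot = firestoreWithRetry(function () use ($firestore) {
    return $firestore->document('users/example-doc')->snapshot();
});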

Unfortunately, since November 9, 2021 we've been receiving these new error messages (note they show the same "code": 14):

[09-Nov-2021 19:13:39 UTC] FIRESTORE error:  Google\Cloud\Core\Exception\ServiceException: {
    "message": "GOAWAY received",
    "code": 14,
    "status": "UNAVAILABLE",
    "details": []
} in (redacted)/google/cloud/Core/src/GrpcRequestWrapper.php:257
Stack trace:
#0 (redacted)/google/cloud/Core/src/GrpcRequestWrapper.php(194): Google\Cloud\Core\GrpcRequestWrapper->convertToGoogleException()
#1 [internal function]: Google\Cloud\Core\GrpcRequestWrapper->handleStream()
#2 (redacted)/google/cloud/Firestore/src/SnapshotTrait.php(122): Generator->current()
#3 (redacted)/google/cloud/Firestore/src/SnapshotTrait.php(61): Google\Cloud\Firestore\DocumentReference->getSnapshot()
#4 (redacted)/google/cloud/Firestore/src/DocumentReference.php(347): Google\Cloud\Firestore\DocumentReference->createSnapshot()
#5 (redacted): Google\Cloud\Firestore\DocumentReference->snapshot()
#6 (redacted): require('(redacted)...')
#7 (redacted): require('(redacted)...')
#8 (redacted): require('(redacted)...')
#9 (redacted): (redacted)
#10 {main} in (redacted)
[09-Nov-2021 19:46:01 UTC] FIRESTORE error:  Google\Cloud\Core\Exception\ServiceException: {
    "message": "GOAWAY received",
    "code": 14,
    "status": "UNAVAILABLE",
    "details": []
} in (redacted)/google/cloud/Core/src/GrpcRequestWrapper.php:257
Stack trace:
#0 (redacted)/google/cloud/Core/src/GrpcRequestWrapper.php(194): Google\Cloud\Core\GrpcRequestWrapper->convertToGoogleException()
#1 [internal function]: Google\Cloud\Core\GrpcRequestWrapper->handleStream()
#2 (redacted)/google/cloud/Firestore/src/SnapshotTrait.php(122): Generator->current()
#3 (redacted)/google/cloud/Firestore/src/SnapshotTrait.php(61): Google\Cloud\Firestore\DocumentReference->getSnapshot()
#4 (redacted)/google/cloud/Firestore/src/DocumentReference.php(347): Google\Cloud\Firestore\DocumentReference->createSnapshot()
#5 (redacted): Google\Cloud\Firestore\DocumentReference->snapshot()
#6 (redacted): require('(redacted)...')
#7 (redacted): require('(redacted)...')
#8 (redacted): require('(redacted)...')
#9 (redacted): (redacted)
#10 {main} in (redacted)
[21-Dec-2021 05:24:05 UTC] FIRESTORE error:  Google\Cloud\Core\Exception\ServiceException: {
    "message": "GOAWAY received",
    "code": 14,
    "status": "UNAVAILABLE",
    "details": []
} in (redacted)/google/cloud/Core/src/GrpcRequestWrapper.php:257
Stack trace:
#0 (redacted)/google/cloud/Core/src/GrpcRequestWrapper.php(194): Google\Cloud\Core\GrpcRequestWrapper->convertToGoogleException()
#1 [internal function]: Google\Cloud\Core\GrpcRequestWrapper->handleStream()
#2 (redacted)/google/cloud/Firestore/src/SnapshotTrait.php(122): Generator->current()
#3 (redacted)/google/cloud/Firestore/src/SnapshotTrait.php(61): Google\Cloud\Firestore\DocumentReference->getSnapshot()
#4 (redacted)/google/cloud/Firestore/src/DocumentReference.php(347): Google\Cloud\Firestore\DocumentReference->createSnapshot()
#5 (redacted): Google\Cloud\Firestore\DocumentReference->snapshot()
#6 (redacted): require('(redacted)...')
#7 (redacted): require('(redacted)...')
#8 (redacted): require('(redacted)...')
#9 (redacted): (redacted)
#10 {main} in (redacted)

@iamacarpet

We are seeing high numbers of these errors in production when talking to Datastore in Firestore mode, from App Engine 2nd Generation runtimes.

iamacarpet added a commit to affordablemobiles/GServerlessSupportLaravel that referenced this issue Feb 14, 2022
@saranshdhingra
Contributor

We are seeing high numbers of these errors in production when talking to Datastore in Firestore mode, from App Engine 2nd Generation runtimes.

Hi @iamacarpet
Thanks for the info.
I have now been running the test script constantly for 6 weeks, and it's extremely hard to reproduce this.

Can you tell me how frequently you are experiencing the errors (once a week, once a day, etc.)?
It would be great if you could send the error message (either pasted here or in an online paste), making sure no sensitive information is present (if it's exactly like the error in the comment above, you don't need to).

Meanwhile, I'll try to set up a check in App Engine as well.

@saranshdhingra
Contributor

Hi @lukasgit
Can you tell me how frequently the errors have occurred since your last comment? My setup hasn't detected any errors in a while.

It would also be great if you could paste your workaround code (the retry mechanism). I want to understand whether it differs from the built-in retry mechanism, because retries should happen automatically up to 5 times unless the option is overridden (which I am sure isn't the case here).
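
For clarity, the option I'm referring to is the client-level 'retries' setting; as a sketch (with a placeholder project ID), it can be set explicitly like this:

<?php
use Google\Cloud\Firestore\FirestoreClient;

// Sketch only: 'my-project' is a placeholder; 'retries' controls the built-in retry count.
$firestore = new FirestoreClient([
    'projectId' => 'my-project',
    'retries' => 5,
]);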

@iamacarpet

@saranshdhingra I'll share some info from our new project that has been in testing. Here is the traffic level (it started getting real visitor traffic 5 days ago):

[screenshot of the project's request traffic]

And the error we saw in Error Reporting:

[screenshot of the error in Error Reporting]

The error has dropped off now. We were already using the latest version of google-cloud-php, but we changed all of our models to use ExponentialBackoff for reads and writes, and updated this function to better map to this error:

affordablemobiles/GServerlessSupportLaravel@303ae7c

We used to see a lot of "Connection reset by peer" before the project switched from Datastore to Firestore in Datastore mode, so this error appears to be the equivalent.

Example usage:

https://github.com/a1comms/GaeSupportLaravel/blob/php7.4-laravel6.0/src/A1comms/GaeSupportLaravel/Session/DatastoreSessionHandler.php#L145

https://github.com/a1comms/GaeSupportLaravel/blob/php7.4-laravel6.0/src/A1comms/GaeSupportLaravel/Session/DatastoreSessionHandler.php#L179

Since deploying this, the errors appear to have gone away.
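
For anyone who doesn't want to click through, the pattern boils down to something like this (a minimal sketch with a hypothetical Datastore kind/key, not the exact code from the links above):

<?php
use Google\Cloud\Core\ExponentialBackoff;
use Google\Cloud\Datastore\DatastoreClient;

// Wrap a read in ExponentialBackoff so transient failures
// (e.g. UNAVAILABLE / "Socket closed" / "Connection reset by peer") are retried.
$datastore = new DatastoreClient();
$backoff = new ExponentialBackoff(5); // up to 5 retry attempts

$key = $datastore->key('sessions', 'example-session-id'); // hypothetical kind/name
$entity = $backoff->execute(function () use ($datastore, $key) {
    return $datastore->lookup($key);
});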

We are currently in the process of deploying this to our other projects; one of them has an error rate that looks like this:

[screenshot of the error rate in another project]

@saranshdhingra
Contributor

Thanks a lot for the detailed info, @iamacarpet.
Although "UNAVAILABLE" errors are retryable, I'm not sure why that's not happening here.
I'll try to dig in and update.

@iamacarpet

@saranshdhingra thanks, if it helps, the error thrown is weird.

It’s the only exception we’ve seen thrown without a stack trace, and from what I can tell, it’s something to do with being thrown from within the binary gRPC module.

Could be unrelated but in case it helps

@lukasgit
Author

lukasgit commented Mar 3, 2022

@saranshdhingra thanks, if it helps, the error thrown is weird.

It’s the only exception we’ve seen thrown without a stack trace, and from what I can tell, it’s something to do with being thrown from within the binary gRPC module.

Could be unrelated but in case it helps

My thoughts exactly, @iamacarpet.

FYI, we have not experienced any further "UNAVAILABLE" errors since December 21, 2021 using the latest versions:

google/grpc-gcp (v0.2.0)
google/protobuf (v3.19.4)
grpc/grpc (1.42.0)

@saranshdhingra
Contributor

I haven't seen any errors in the project I created to replicate this error.
So, I will be closing this for now.

But I want to let you know that we have work planned to improve the error messages across the library for all products.

I will try to come back to this and see if our efforts make such situations better.
Thanks a lot for the patience you folks showed for this particular error :)

@jaryder

jaryder commented Nov 16, 2022

Hey there @saranshdhingra, could you share what OS you're running on and what version of the gRPC library you're using where you do not see the "socket closed" issue? I'm inclined to think this is an issue in the network stack below gRPC, or perhaps in how gRPC interacts with the native network stack.

We are running on Ubuntu 18.04 with gRPC v1.28.1 and see this issue with some frequency (on the order of once every tens of minutes during a long-lived bidirectional streaming connection).

I'm curious whether this issue was solved, or whether it just did not repro in your environment and was therefore closed? Thanks!

@meredithslota

@jaryder I'm going to copy your comment into a new issue for investigation, since it's still occurring.
