
URGENT: [composer] GRPC randomly throws ServiceException("Socket closed") #2427

Closed
lukasgit opened this issue Oct 26, 2019 · 97 comments
Labels: api: firestore · priority: p2 · type: bug · 🚨 This issue needs some love.

Comments

@lukasgit

@dwsupplee @jdpedrie this issue still randomly persists.

URGENT REQUEST. We're part of the Google Cloud Startup program and launching this year... a fix would be greatly appreciated so we can move to production.

Log output:

PHP Fatal error:  Uncaught Google\Cloud\Core\Exception\ServiceException: {
    "message": "Socket closed",
    "code": 14,
    "status": "UNAVAILABLE",
    "details": []
} in /[...]/Google/composer-google-cloud/vendor/google/cloud/Core/src/GrpcRequestWrapper.php:257
Stack trace:
#0 /[...]/Google/composer-google-cloud/vendor/google/cloud/Core/src/GrpcRequestWrapper.php(194): Google\Cloud\Core\GrpcRequestWrapper->convertToGoogleException(Object(Google\ApiCore\ApiException))
#1 [internal function]: Google\Cloud\Core\GrpcRequestWrapper->handleStream(Object(Google\ApiCore\ServerStream))
#2 /[...]/Google/composer-google-cloud/vendor/google/cloud/Firestore/src/SnapshotTrait.php(122): Generator->current()
#3 /[...]/Google/composer-google-clo in /[...]/Google/composer-google-cloud/vendor/google/cloud/Core/src/GrpcRequestWrapper.php on line 257
@lukasgit lukasgit changed the title [composer] GRPC randomly throws ServiceException("Socket closed") URGENT: [composer] GRPC randomly throws ServiceException("Socket closed") Oct 26, 2019
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Oct 26, 2019
@dwsupplee dwsupplee added api: spanner Issues related to the Spanner API. type: question Request for information or clarification. Not an issue. and removed triage me I really want to be triaged. labels Oct 28, 2019
@lukasgit
Author

@dwsupplee firestore, not spanner

@dwsupplee dwsupplee added api: firestore Issues related to the Firestore API. and removed api: spanner Issues related to the Spanner API. labels Oct 29, 2019
@jdpedrie
Contributor

@lukasgit what version of the grpc extension do you have installed? You can find this value by running pecl list if you installed via PECL, or in the output of php -i or phpinfo().

@lukasgit
Author

@jdpedrie

Package  Version State
grpc     1.23.1  stable
protobuf 3.10.0  stable

@jdpedrie
Contributor

jdpedrie commented Oct 31, 2019

Thanks @lukasgit.

Does it seem to always be related to a particular service call, or does it happen randomly?

Would you be able to enable gRPC debugging and share the results relating to a failure? You can set the following environment variables to turn on debugging: GRPC_VERBOSITY=debug GRPC_TRACE=all. Please note that the debug output may include sensitive information such as access tokens, so if you'd prefer, you may send it to me via the email address on my profile.

@lukasgit
Author

lukasgit commented Oct 31, 2019

@jdpedrie randomly, and I'm not seeing any gRPC debugging info in the php logs:

putenv('GRPC_VERBOSITY=debug');
putenv('GRPC_TRACE=all');

@jdpedrie
Contributor

@lukasgit I'm sorry I was unclear. Setting those variables will cause additional logging data to be written to stderr, not to the PHP log.

@lukasgit
Author

lukasgit commented Nov 1, 2019

@jdpedrie not seeing any gRPC debugging info to stderr.

php.ini
display_errors = stderr

@lukasgit
Author

lukasgit commented Nov 6, 2019

@jdpedrie ^^

@jdpedrie
Contributor

jdpedrie commented Nov 6, 2019

Hey @lukasgit, sorry for not responding sooner. I'm working on this, though; I hope to have more by the end of the day, or tomorrow at the latest.

@lukasgit
Author

lukasgit commented Nov 6, 2019

@jdpedrie great, thanks for the update.

@jdpedrie
Contributor

jdpedrie commented Nov 6, 2019

Are you using PHP-FPM? Make sure your configuration has catch_workers_output=yes. This should cause the workers' stderr output to be written to the server error log.

@lukasgit
Author

lukasgit commented Nov 6, 2019

@jdpedrie still nothing... here is what I have configured:

nginx-1.17.5 (dev environment compiled with --with-debug):

error_log logs/error.log debug;

php-fpm (php-7.3.11):

catch_workers_output = yes
env[GRPC_VERBOSITY] = debug
env[GRPC_TRACE] = all

php code:

putenv('GRPC_VERBOSITY=debug');
putenv('GRPC_TRACE=all');

@jdpedrie
Contributor

jdpedrie commented Nov 8, 2019

Hi @lukasgit, I'm sorry again for the back-and-forth. I've been trying, without much success, to capture the gRPC debugging data from nginx/php-fpm.

Do you have control over the system that manages the FPM daemon (supervisord or systemd, for instance)? Are you able to configure php-fpm to start with stderr explicitly redirected to a file?

php-fpm 2>/var/log/php-fpm.log

Additionally, I spoke with a contact on the gRPC team, and he suggested you toggle the debug and verbosity a bit differently than what I advised earlier:

GRPC_VERBOSITY=debug                                                                                    
GRPC_TRACE=api,call_error,channel,client_channel_call,connectivity_state,handshaker,http,subchannel,tcp

I've opened an issue on gRPC to improve the utilities for capturing gRPC debugging information in PHP.

@lukasgit
Author

lukasgit commented Nov 8, 2019

Hi @jdpedrie, no worries on the back-and-forth. Whatever it takes for us to resolve this issue.

I do have full control over the development system. Following your instructions, still nothing related to gRPC in /var/log/php-fpm.log :

[08-Nov-2019 21:24:18] NOTICE: fpm is running, pid 1104
[08-Nov-2019 21:24:18] NOTICE: ready to handle connections

nginx-1.17.5 (dev environment compiled with --with-debug):

error_log logs/error.log debug;

php-fpm (php-7.3.11):

catch_workers_output = yes
env[GRPC_VERBOSITY] = debug
env[GRPC_TRACE] = api,call_error,channel,client_channel_call,connectivity_state,handshaker,http,subchannel,tcp

php code:

putenv('GRPC_VERBOSITY=debug');
putenv('GRPC_TRACE=api,call_error,channel,client_channel_call,connectivity_state,handshaker,http,subchannel,tcp');

@jdpedrie
Contributor

jdpedrie commented Nov 8, 2019

Hi @lukasgit,

@stanley-cheung, the person I've been talking to on the gRPC team, mentioned that in his tests using Apache, setting the environment variables with putenv() (and even in the server configuration) happened too late to take effect. In his test using Docker, he set them in the Dockerfile and had better luck. Could you try setting them at the highest level possible? Perhaps /etc/environment or similar.
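
For example, something like this (a sketch only; adjust to however your FPM master process is launched):

# /etc/environment (or an ENV line in the Dockerfile, or Environment= in a systemd unit)
GRPC_VERBOSITY=debug
GRPC_TRACE=api,call_error,channel,client_channel_call,connectivity_state,handshaker,http,subchannel,tcp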

@stanley-cheung
Contributor

Hi @lukasgit, I am one of the maintainers of the grpc PHP extension.

The only thing that seems to work for me (in a php-fpm + nginx setup) is the combination of these two:

  • Set the environment variables at the top of my Dockerfile (I am using FROM php:7.2-fpm-stretch as a base, for example)
  • Run php-fpm 2>/var/log/php-fpm.log to start the service (instead of relying on the standard CMD) to redirect stderr away to my desired log file

I have tried catch_workers_output = yes and env[GRPC_VERBOSITY] = debug in the FPM www.conf file, but none of those work. Doing putenv() in a .php script is definitely too late.
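
Concretely, the combination that works for me looks roughly like this (a sketch of the two points above; the log path is just an example):

# Dockerfile (set the env vars near the top, before php-fpm starts)
FROM php:7.2-fpm-stretch
ENV GRPC_VERBOSITY=debug
ENV GRPC_TRACE=api,call_error,channel,client_channel_call,connectivity_state,handshaker,http,subchannel,tcp

# start the service with stderr redirected to a file (instead of the image's default CMD)
CMD ["sh", "-c", "php-fpm 2>/var/log/php-fpm.log"]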

But even with the php-fpm 2>mylog trick, the log output is kind of ugly: every line is prefixed with "[pool www] child 12 said into stdout:" and long lines get broken up.

So I am currently working on adding php.ini options to the grpc extension so that all of the grpc logs can be diverted into a separate log file; it's tracked here. Will keep you updated.

Also, just so I know: how rare/often does the initial error happen? Are we talking about 1 in every 10 requests, or 1 in every 10,000? I just want to get a sense of the scale.

@lukasgit
Author

lukasgit commented Nov 9, 2019

Hi @stanley-cheung thanks for jumping in on this issue.

I will hang tight until the php.ini option for the grpc extension is available.

As for how many times the error occurs per x requests, it varies widely. Sometimes it happens 7 times out of 50 requests; sometimes 0 times out of 1,000 requests. We haven't run a stress test of 10,000 requests against gRPC yet.

@stanley-cheung
Contributor

I started this PR: grpc/grpc#20991. It works for the most part but may need some more polish. I am slightly concerned about the lack of file rotation / capping of the log file size: as it stands, the log file will grow unbounded. A separate cron job (or something similar) may be needed to monitor and regularly truncate the log file.

@lukasgit
Author

lukasgit commented Dec 4, 2019

Hi @stanley-cheung any update on this issue? Thanks :)

@stanley-cheung
Contributor

@lukasgit Sorry for the late reply. We are very close to cutting a 1.26.0RC1 release candidate, which will contain that fix. Once that's done, will you be able to install it in your environment and enable the gRPC log to a separate log file? You may need to keep an eye on the file's growth, since the error doesn't happen frequently.
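
(If you installed the extension via PECL, the release candidate can presumably be installed with an explicit version once it's published, e.g. pecl install grpc-1.26.0RC1.)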

@lukasgit
Author

lukasgit commented Dec 5, 2019

@stanley-cheung since our last chat, we upgraded the Composer-installed packages and the error has only happened once (yesterday). So yeah, it's still very random. I can definitely run your next RC; just provide me with instructions when you're ready. Thanks!

@tayeke

tayeke commented Jun 17, 2021

For anyone else reading this thread: the line grpc.grpc_trace=all,-timer_check actually broke my grpc (something about the minus syntax). It made the issue I was originally trying to resolve 1000% worse. This time around, while debugging, I happened to try the other syntax offered above instead: grpc.grpc_trace=api,call_error,channel,client_channel_call,connectivity_state,handshaker,http,subchannel,tcp.

Also, don't forget to make your log file writable: sudo chmod 666 /var/log/grpc.log
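
For reference, the full set of php.ini entries this relies on (assuming the grpc extension is 1.26+ with the logging options discussed earlier in this thread) looks like:

grpc.grpc_verbosity=debug
grpc.grpc_trace=api,call_error,channel,client_channel_call,connectivity_state,handshaker,http,subchannel,tcp
grpc.log_filename=/var/log/grpc.log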

@lukasgit
Author

lukasgit commented Jul 8, 2021

Why is the ball getting dropped here? I've provided ample data and we've identified the problem. Now we're waiting for a fix on your end. Numerous people across many threads and community platforms are complaining about this issue. @stanley-cheung @dwsupplee @schmidt-sebastian

@stanley-cheung @dwsupplee @schmidt-sebastian getting close to a year later. This is a pretty serious bug that affects production. Do you have a fix or at the very least an update?

@meredithslota meredithslota added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed type: question Request for information or clarification. Not an issue. labels Sep 17, 2021
@meredithslota

This issue was mis-categorized as a "question" and thus slipped through our SLO filters. I've adjusted it. cc: @dwsupplee

@lukasgit
Author

Thank you @meredithslota

@sl0wik

sl0wik commented Oct 26, 2021

I think the same issue might cause { "message": "Connection reset by peer", "code": 14, "status": "UNAVAILABLE", "details": [] }.

@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Nov 3, 2021
@illambo

illambo commented Nov 16, 2021

Santa clou(d)s please help us...

@saranshdhingra
Contributor

Hi @lukasgit
Thanks a lot for your patience with this issue. I understand it's a bit frustrating to still be getting the errors after such a long time.

Update on this:
I created a setup based on this comment (#2427 (comment)) and it's been running for almost 10 days. So far I have had 1 occurrence of "UNAVAILABLE", with both application logs and gRPC logs enabled.

I will report back here once I have a sense of what improvements can be made on the library's side at this point.

@lukasgit
Author

Hi @saranshdhingra

Huge thanks for taking initiative on resolving this issue.

As a workaround, we run a function that attempts each interaction with Firestore; if an error occurs, the function logs the error and tries again, up to five times, before giving up. This seems to work: the first attempt fails (randomly), while the second attempt succeeds. NOTE: in the last six months we've noticed a substantial decrease in "message": "Socket closed" / "status": "UNAVAILABLE" errors.
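
Roughly, the wrapper looks like this (a simplified sketch; the function name and document path are placeholders, not our exact production code):

<?php
use Google\Cloud\Core\Exception\ServiceException;
use Google\Cloud\Firestore\FirestoreClient;

// Simplified sketch of the retry wrapper described above.
function firestoreWithRetry(callable $operation, int $maxAttempts = 5)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            return $operation();
        } catch (ServiceException $e) {
            error_log("FIRESTORE error (attempt {$attempt}): " . $e->getMessage());
            if ($attempt === $maxAttempts) {
                throw $e; // give up after the final attempt
            }
        }
    }
}

$firestore = new FirestoreClient();
$snapshot = firestoreWithRetry(function () use ($firestore) {
    return $firestore->document('users/example-doc')->snapshot();
});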

Unfortunately, since November 9, 2021 we've been receiving these new error messages (note they show the same "code": 14):

[09-Nov-2021 19:13:39 UTC] FIRESTORE error:  Google\Cloud\Core\Exception\ServiceException: {
    "message": "GOAWAY received",
    "code": 14,
    "status": "UNAVAILABLE",
    "details": []
} in (redacted)/google/cloud/Core/src/GrpcRequestWrapper.php:257
Stack trace:
#0 (redacted)/google/cloud/Core/src/GrpcRequestWrapper.php(194): Google\Cloud\Core\GrpcRequestWrapper->convertToGoogleException()
#1 [internal function]: Google\Cloud\Core\GrpcRequestWrapper->handleStream()
#2 (redacted)/google/cloud/Firestore/src/SnapshotTrait.php(122): Generator->current()
#3 (redacted)/google/cloud/Firestore/src/SnapshotTrait.php(61): Google\Cloud\Firestore\DocumentReference->getSnapshot()
#4 (redacted)/google/cloud/Firestore/src/DocumentReference.php(347): Google\Cloud\Firestore\DocumentReference->createSnapshot()
#5 (redacted): Google\Cloud\Firestore\DocumentReference->snapshot()
#6 (redacted): require('(redacted)...')
#7 (redacted): require('(redacted)...')
#8 (redacted): require('(redacted)...')
#9 (redacted): (redacted)
#10 {main} in (redacted)
[09-Nov-2021 19:46:01 UTC] FIRESTORE error:  Google\Cloud\Core\Exception\ServiceException: {
    "message": "GOAWAY received",
    "code": 14,
    "status": "UNAVAILABLE",
    "details": []
} in (redacted)/google/cloud/Core/src/GrpcRequestWrapper.php:257
Stack trace:
#0 (redacted)/google/cloud/Core/src/GrpcRequestWrapper.php(194): Google\Cloud\Core\GrpcRequestWrapper->convertToGoogleException()
#1 [internal function]: Google\Cloud\Core\GrpcRequestWrapper->handleStream()
#2 (redacted)/google/cloud/Firestore/src/SnapshotTrait.php(122): Generator->current()
#3 (redacted)/google/cloud/Firestore/src/SnapshotTrait.php(61): Google\Cloud\Firestore\DocumentReference->getSnapshot()
#4 (redacted)/google/cloud/Firestore/src/DocumentReference.php(347): Google\Cloud\Firestore\DocumentReference->createSnapshot()
#5 (redacted): Google\Cloud\Firestore\DocumentReference->snapshot()
#6 (redacted): require('(redacted)...')
#7 (redacted): require('(redacted)...')
#8 (redacted): require('(redacted)...')
#9 (redacted): (redacted)
#10 {main} in (redacted)
[21-Dec-2021 05:24:05 UTC] FIRESTORE error:  Google\Cloud\Core\Exception\ServiceException: {
    "message": "GOAWAY received",
    "code": 14,
    "status": "UNAVAILABLE",
    "details": []
} in (redacted)/google/cloud/Core/src/GrpcRequestWrapper.php:257
Stack trace:
#0 (redacted)/google/cloud/Core/src/GrpcRequestWrapper.php(194): Google\Cloud\Core\GrpcRequestWrapper->convertToGoogleException()
#1 [internal function]: Google\Cloud\Core\GrpcRequestWrapper->handleStream()
#2 (redacted)/google/cloud/Firestore/src/SnapshotTrait.php(122): Generator->current()
#3 (redacted)/google/cloud/Firestore/src/SnapshotTrait.php(61): Google\Cloud\Firestore\DocumentReference->getSnapshot()
#4 (redacted)/google/cloud/Firestore/src/DocumentReference.php(347): Google\Cloud\Firestore\DocumentReference->createSnapshot()
#5 (redacted): Google\Cloud\Firestore\DocumentReference->snapshot()
#6 (redacted): require('(redacted)...')
#7 (redacted): require('(redacted)...')
#8 (redacted): require('(redacted)...')
#9 (redacted): (redacted)
#10 {main} in (redacted)

@iamacarpet

We are seeing high numbers of these errors in production when talking to Datastore in Firestore mode, from App Engine 2nd Generation runtimes.

iamacarpet added a commit to affordablemobiles/GServerlessSupportLaravel that referenced this issue Feb 14, 2022
@saranshdhingra
Contributor

We are seeing high numbers of these errors in production when talking to Datastore in Firestore mode, from App Engine 2nd Generation runtimes.

Hi @iamacarpet
Thanks for the info.
I have now been running the test script constantly for 6 weeks, and it's extremely hard to reproduce this.

Can you tell me how frequently you are experiencing the errors (once a week, once a day, etc.)?
It would be great if you could send the error message (either pasted here or in an online paste), making sure no sensitive information is present (if it's exactly like the error in the comment above, you don't need to).

Meanwhile, I'll try to set up a check in App Engine as well.

@saranshdhingra
Contributor

Hi @lukasgit
Can you tell me how frequently the errors have occurred since your last comment? My setup hasn't detected any errors in a while.

It would also be great if you could paste your workaround code (the retry mechanism). I want to understand whether it differs from the built-in retry mechanism, because retries should happen automatically up to 5 times unless the option is overridden (which I am sure isn't the case here).
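
For clarity, the option I'm referring to is the client-level 'retries' setting; as a sketch (with a placeholder project ID), it can be set explicitly like this:

<?php
use Google\Cloud\Firestore\FirestoreClient;

// Sketch only: 'my-project' is a placeholder; 'retries' controls the built-in retry count.
$firestore = new FirestoreClient([
    'projectId' => 'my-project',
    'retries' => 5,
]);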

@iamacarpet

@saranshdhingra I'll share some info from our new project that has been in testing. Here is the traffic level (it started getting real visitor traffic 5 days ago):

[screenshot of the project's request traffic]

And the error we saw in Error Reporting:

[screenshot of the error in Error Reporting]

The error has dropped off now. We were already using the latest version of google-cloud-php, but we changed all of our models to use ExponentialBackoff for reads and writes, and updated this function to better map to this error:

affordablemobiles/GServerlessSupportLaravel@303ae7c

We used to see a lot of "Connection reset by peer" before the project switched from Datastore to Firestore in Datastore mode, so this error appears to be the equivalent.

Example usage:

https://github.com/a1comms/GaeSupportLaravel/blob/php7.4-laravel6.0/src/A1comms/GaeSupportLaravel/Session/DatastoreSessionHandler.php#L145

https://github.com/a1comms/GaeSupportLaravel/blob/php7.4-laravel6.0/src/A1comms/GaeSupportLaravel/Session/DatastoreSessionHandler.php#L179

Since deploying this, the errors appear to have gone away.
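
For anyone who doesn't want to click through, the pattern boils down to something like this (a minimal sketch with a hypothetical Datastore kind/key, not the exact code from the links above):

<?php
use Google\Cloud\Core\ExponentialBackoff;
use Google\Cloud\Datastore\DatastoreClient;

// Wrap a read in ExponentialBackoff so transient failures
// (e.g. UNAVAILABLE / "Socket closed" / "Connection reset by peer") are retried.
$datastore = new DatastoreClient();
$backoff = new ExponentialBackoff(5); // up to 5 retry attempts

$key = $datastore->key('sessions', 'example-session-id'); // hypothetical kind/name
$entity = $backoff->execute(function () use ($datastore, $key) {
    return $datastore->lookup($key);
});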

We are currently in the process of deploying this to our other projects; one of them has an error rate that looks like this:

[screenshot of the error rate in another project]

@saranshdhingra
Contributor

Thanks a lot for the detailed info, @iamacarpet.
Although "UNAVAILABLE" errors are retryable, I'm not sure why that's not happening here.
I'll try to dig in and update.

@iamacarpet

@saranshdhingra thanks, if it helps, the error thrown is weird.

It’s the only exception we’ve seen thrown without a stack trace, and from what I can tell, it’s something to do with being thrown from within the binary gRPC module.

Could be unrelated but in case it helps

@lukasgit
Author

lukasgit commented Mar 3, 2022

@saranshdhingra thanks, if it helps, the error thrown is weird.

It’s the only exception we’ve seen thrown without a stack trace, and from what I can tell, it’s something to do with being thrown from within the binary gRPC module.

Could be unrelated but in case it helps

My thoughts exactly, @iamacarpet.

FYI, we have not experienced any further "UNAVAILABLE" errors since December 21, 2021 using the latest versions:

google/grpc-gcp (v0.2.0)
google/protobuf (v3.19.4)
grpc/grpc (1.42.0)

@saranshdhingra
Contributor

I haven't seen any errors in the project I created to replicate this error.
So, I will be closing this for now.

But I want to let you know that we have work planned to improve the error messages across the library for all products.

I will try to come back to this and see if our efforts make such situations better.
Thanks a lot for the patience you folks showed for this particular error :)

@jaryder

jaryder commented Nov 16, 2022

Hey there @saranshdhingra, could you share what OS you're running on and what version of the gRPC library you're using where you do not see the "socket closed" issue? I'm inclined to think this is an issue in the network stack below gRPC, or perhaps in how gRPC interacts with the native network stack.

We are running on Ubuntu 18.04 with gRPC v1.28.1 and see this issue with some frequency (on the order of once every tens of minutes during a long-lived bidirectional streaming connection).

I'm curious whether this issue was solved, or whether it just did not repro in your environment and was therefore closed? Thanks!

@meredithslota

@jaryder I'm going to copy your comment into a new issue for investigation, since it's still occurring.
