Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Call stuck connecting when server is initially unavailable, but becomes available later #1434

Closed
kskalski opened this issue May 21, 2020 · 15 comments · Fixed by #1446
Closed

Comments

@kskalski
Copy link
Contributor

Problem description

I have a (C#) grpc server and (grpc-js) client, which I start simultaneously and connect them over loopback (127.0.0.1). Usually the server becomes ready after the client, thus client will often observe nothing listening on a port initially.

Prior to grpc-js 0.7.5 it worked all quite fine, client and some initial calls that I start waited for the server to get ready and connected silently. Apparently 0.7.5 introduced some fixes that made calls fail-fast when client is not connected, so I modified code passing waitForReady:true in metadata.

Unfortunately, this didn't bring the previous behavior. Now the client / calls are stuck and never connect successfully.

Reproduction steps

  1. clone https://github.com/kskalski/snippets/tree/master/GrpcJsElectronSharp
  2. cd GrpcJsElectronSharp && ./build.sh (adjust paths as needed in the script)
  3. run ./Server.exe
  4. run ./Client.exe
  5. observe server receiving messages whenever Client's window is minimized (handler triggering message send)
  6. stop Client and Server
  7. run ./Client.exe
  8. run ./Server.exe
  9. observe server no longer receives messages when Client's window is minimized
    (the calls are stuck on connecting)

Environment

  • OS name, version and architecture: Windows 10, x64
  • Node version: Electron 8.3.0 (Node 12)
  • Node installation method Electron
  • If applicable, compiler version .NET 4.6
  • Package name and version 0.7.5 (it happens also on 1.0.3, but I was able to backtrack that the problem appeared between 0.7.4 and 0.7.5)

Additional context

I poked around a bit and it seems that there is TRANSIENT_FAILURE in channel.js (this piece of code executed):

case picker_1.PickResultType.TRANSIENT_FAILURE:
                if (callMetadata.getOptions().waitForReady) {
                    this.pickQueue.push({ callStream, callMetadata });

but nothing happens afterwards, there is never completion or other result of re-connection attempt

@murgatroid99
Copy link
Member

I just published grpc-js version 1.0.4 with a change that may impact what you are seeing here. Can you try again with that version? If that doesn't fix it, can you run the client in the failing configuration with the environment variables GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG and share the output?

@kskalski
Copy link
Contributor Author

Unfortunately it didn't help. I upated the repro project to use 1.0.4, so you can try yourself, but here are the traces as requested:

When I start the client first:

2020-05-22T09:14:01.406Z | resolving_load_balancer | dns:127.0.0.1:15745 IDLE -> IDLE
2020-05-22T09:14:01.406Z | connectivity_state | dns:127.0.0.1:15745 IDLE -> IDLE
2020-05-22T09:14:01.406Z | dns_resolver | Resolver constructed for target dns:127.0.0.1:15745
2020-05-22T09:14:01.408Z | channel | dns:127.0.0.1:15745 createCall [0] method="/ElectorGateway/WindowControl", deadline=Infinity
2020-05-22T09:14:01.409Z | call_stream | [0] Sending metadata
2020-05-22T09:14:01.409Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
(electron) The default value of app.allowRendererProcessReuse is deprecated, it is currently "false".  It will change to be "true" in Electron 9.  For more information please check https://github.com/electron/electron/issues/18397
2020-05-22T09:14:01.483Z | dns_resolver | Resolution update requested for target dns:127.0.0.1:15745
2020-05-22T09:14:01.483Z | dns_resolver | Returning IP address for target dns:127.0.0.1:15745
2020-05-22T09:14:01.484Z | resolving_load_balancer | dns:127.0.0.1:15745 IDLE -> CONNECTING
2020-05-22T09:14:01.485Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2020-05-22T09:14:01.485Z | connectivity_state | dns:127.0.0.1:15745 IDLE -> CONNECTING
2020-05-22T09:14:01.485Z | resolving_load_balancer | dns:127.0.0.1:15745 CONNECTING -> CONNECTING
2020-05-22T09:14:01.485Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2020-05-22T09:14:01.485Z | connectivity_state | dns:127.0.0.1:15745 CONNECTING -> CONNECTING
2020-05-22T09:14:01.489Z | pick_first | Connect to address list 127.0.0.1:15745
2020-05-22T09:14:01.490Z | subchannel | 127.0.0.1:15745 refcount 0 -> 1
2020-05-22T09:14:01.490Z | subchannel | 127.0.0.1:15745 refcount 1 -> 2
2020-05-22T09:14:01.491Z | pick_first | Start connecting to subchannel with address 127.0.0.1:15745
2020-05-22T09:14:01.491Z | pick_first | IDLE -> CONNECTING
2020-05-22T09:14:01.491Z | resolving_load_balancer | dns:127.0.0.1:15745 CONNECTING -> CONNECTING
2020-05-22T09:14:01.492Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2020-05-22T09:14:01.492Z | connectivity_state | dns:127.0.0.1:15745 CONNECTING -> CONNECTING
2020-05-22T09:14:01.492Z | subchannel | 127.0.0.1:15745 IDLE -> CONNECTING
2020-05-22T09:14:01.493Z | pick_first | CONNECTING -> CONNECTING
2020-05-22T09:14:01.493Z | resolving_load_balancer | dns:127.0.0.1:15745 CONNECTING -> CONNECTING
2020-05-22T09:14:01.493Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2020-05-22T09:14:01.494Z | connectivity_state | dns:127.0.0.1:15745 CONNECTING -> CONNECTING
2020-05-22T09:14:03.548Z | subchannel | 127.0.0.1:15745 connection closed with error connect ECONNREFUSED 127.0.0.1:15745
2020-05-22T09:14:03.548Z | subchannel | 127.0.0.1:15745 CONNECTING -> TRANSIENT_FAILURE
2020-05-22T09:14:03.549Z | pick_first | CONNECTING -> TRANSIENT_FAILURE
2020-05-22T09:14:03.549Z | resolving_load_balancer | dns:127.0.0.1:15745 CONNECTING -> TRANSIENT_FAILURE
2020-05-22T09:14:03.549Z | channel | Pick result: TRANSIENT_FAILURE subchannel: undefined status: 14 No connection established
2020-05-22T09:14:03.550Z | connectivity_state | dns:127.0.0.1:15745 CONNECTING -> TRANSIENT_FAILURE

nothing else happens afterwards even when I start the server, when the message is supposed to be send there is:

2020-05-22T09:15:22.325Z | call_stream | [0] write() called with message of length 25
2020-05-22T09:15:22.326Z | call_stream | [0] deferring writing data chunk of length 30

on the other hand, when the server already running, it's like this:

2020-05-22T09:16:36.611Z | resolving_load_balancer | dns:127.0.0.1:15745 IDLE -> IDLE
2020-05-22T09:16:36.611Z | connectivity_state | dns:127.0.0.1:15745 IDLE -> IDLE
2020-05-22T09:16:36.611Z | dns_resolver | Resolver constructed for target dns:127.0.0.1:15745
2020-05-22T09:16:36.614Z | channel | dns:127.0.0.1:15745 createCall [0] method="/ElectorGateway/WindowControl", deadline=Infinity
2020-05-22T09:16:36.615Z | call_stream | [0] Sending metadata
2020-05-22T09:16:36.616Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
(electron) The default value of app.allowRendererProcessReuse is deprecated, it is currently "false".  It will change to be "true" in Electron 9.  For more information please check https://github.com/electron/electron/issues/18397
2020-05-22T09:16:36.679Z | dns_resolver | Resolution update requested for target dns:127.0.0.1:15745
2020-05-22T09:16:36.688Z | dns_resolver | Returning IP address for target dns:127.0.0.1:15745
2020-05-22T09:16:36.688Z | resolving_load_balancer | dns:127.0.0.1:15745 IDLE -> CONNECTING
2020-05-22T09:16:36.689Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2020-05-22T09:16:36.689Z | connectivity_state | dns:127.0.0.1:15745 IDLE -> CONNECTING
2020-05-22T09:16:36.689Z | resolving_load_balancer | dns:127.0.0.1:15745 CONNECTING -> CONNECTING
2020-05-22T09:16:36.689Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2020-05-22T09:16:36.690Z | connectivity_state | dns:127.0.0.1:15745 CONNECTING -> CONNECTING
2020-05-22T09:16:36.694Z | pick_first | Connect to address list 127.0.0.1:15745
2020-05-22T09:16:36.698Z | subchannel | 127.0.0.1:15745 refcount 0 -> 1
2020-05-22T09:16:36.699Z | subchannel | 127.0.0.1:15745 refcount 1 -> 2
2020-05-22T09:16:36.700Z | pick_first | Start connecting to subchannel with address 127.0.0.1:15745
2020-05-22T09:16:36.700Z | pick_first | IDLE -> CONNECTING
2020-05-22T09:16:36.701Z | resolving_load_balancer | dns:127.0.0.1:15745 CONNECTING -> CONNECTING
2020-05-22T09:16:36.701Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2020-05-22T09:16:36.702Z | connectivity_state | dns:127.0.0.1:15745 CONNECTING -> CONNECTING
2020-05-22T09:16:36.702Z | subchannel | 127.0.0.1:15745 IDLE -> CONNECTING
2020-05-22T09:16:36.703Z | pick_first | CONNECTING -> CONNECTING
2020-05-22T09:16:36.703Z | resolving_load_balancer | dns:127.0.0.1:15745 CONNECTING -> CONNECTING
2020-05-22T09:16:36.704Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2020-05-22T09:16:36.704Z | connectivity_state | dns:127.0.0.1:15745 CONNECTING -> CONNECTING
2020-05-22T09:16:36.712Z | subchannel | 127.0.0.1:15745 CONNECTING -> READY
2020-05-22T09:16:36.713Z | pick_first | Pick subchannel with address 127.0.0.1:15745
2020-05-22T09:16:36.713Z | pick_first | CONNECTING -> READY
2020-05-22T09:16:36.713Z | resolving_load_balancer | dns:127.0.0.1:15745 CONNECTING -> READY
2020-05-22T09:16:36.713Z | channel | Pick result: COMPLETE subchannel: 127.0.0.1:15745 status: undefined undefined
2020-05-22T09:16:36.714Z | connectivity_state | dns:127.0.0.1:15745 CONNECTING -> READY
2020-05-22T09:16:36.714Z | subchannel | 127.0.0.1:15745 refcount 2 -> 3
2020-05-22T09:16:36.714Z | subchannel | 127.0.0.1:15745 refcount 3 -> 2
2020-05-22T09:16:36.717Z | subchannel | Starting stream with headers
                grpc-accept-encoding: identity,deflate,gzip
                accept-encoding: identity,gzip
                :authority: 127.0.0.1
                user-agent: grpc-node-js/1.0.4
                content-type: application/grpc
                :method: POST
                :path: /ElectorGateway/WindowControl
                te: trailers

2020-05-22T09:16:36.725Z | call_stream | [0] attachHttp2Stream from subchannel 127.0.0.1:15745
2020-05-22T09:16:36.725Z | subchannel | 127.0.0.1:15745 callRefcount 0 -> 1
sending
2020-05-22T09:16:40.331Z | call_stream | [0] write() called with message of length 25
2020-05-22T09:16:40.332Z | call_stream | [0] sending data chunk of length 30

@murgatroid99
Copy link
Member

That first log is the complete output when you start the client, then the server?

@kskalski
Copy link
Contributor Author

Yes, starting the server doesn't produce any reaction in the client logs.

@murgatroid99
Copy link
Member

OK, there's definitely a bug. There should have been more to that log with or without a server. I will look into it.

@murgatroid99
Copy link
Member

I modified your example code to work on Node, and when I ran the client I saw the trace output I expected: the client enters a continuous loop of trying and failing to connect. This might be an Electron-specific issue.

@kskalski
Copy link
Contributor Author

Might be. Just a note that 0.7.4 works stable (>=0.7.5 stopped working) and trace from it looks quite different, it never even enters TRANSIENT_FAILURE state:

2020-05-27T02:26:02.504Z | resolving_load_balancer | 127.0.0.1:15745 IDLE -> IDLE
2020-05-27T02:26:02.504Z | connectivity_state | 127.0.0.1:15745 IDLE -> IDLE
2020-05-27T02:26:02.504Z | dns_resolver | Resolver constructed for target 127.0.0.1:15745
2020-05-27T02:26:02.507Z | channel | 127.0.0.1:15745 createCall [0] method="/ElectorGateway/WindowControl", deadline=Infinity
2020-05-27T02:26:02.509Z | call_stream | [0] Sending metadata
2020-05-27T02:26:02.512Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
(electron) The default value of app.allowRendererProcessReuse is deprecated, it is currently "false".  It will change to be "true" in Electron 9.  For more information please check https://github.com/electron/electron/issues/18397
2020-05-27T02:26:02.581Z | dns_resolver | Resolution update requested for target 127.0.0.1:15745
2020-05-27T02:26:02.582Z | dns_resolver | Returning IP address for target 127.0.0.1:15745
2020-05-27T02:26:02.584Z | resolving_load_balancer | 127.0.0.1:15745 IDLE -> CONNECTING
2020-05-27T02:26:02.585Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2020-05-27T02:26:02.585Z | connectivity_state | 127.0.0.1:15745 IDLE -> CONNECTING
2020-05-27T02:26:02.585Z | resolving_load_balancer | 127.0.0.1:15745 CONNECTING -> CONNECTING
2020-05-27T02:26:02.585Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2020-05-27T02:26:02.585Z | connectivity_state | 127.0.0.1:15745 CONNECTING -> CONNECTING
2020-05-27T02:26:02.600Z | pick_first | IDLE -> IDLE
2020-05-27T02:26:02.603Z | resolving_load_balancer | 127.0.0.1:15745 CONNECTING -> IDLE
2020-05-27T02:26:02.603Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2020-05-27T02:26:02.603Z | connectivity_state | 127.0.0.1:15745 CONNECTING -> IDLE
2020-05-27T02:26:02.604Z | pick_first | Connect to address list 127.0.0.1:15745
2020-05-27T02:26:02.606Z | subchannel | 127.0.0.1:15745 refcount 0 -> 1
2020-05-27T02:26:02.606Z | subchannel | 127.0.0.1:15745 refcount 1 -> 2
2020-05-27T02:26:02.607Z | pick_first | Start connecting to subchannel with address 127.0.0.1:15745
2020-05-27T02:26:02.607Z | pick_first | IDLE -> CONNECTING
2020-05-27T02:26:02.607Z | resolving_load_balancer | 127.0.0.1:15745 IDLE -> CONNECTING
2020-05-27T02:26:02.607Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2020-05-27T02:26:02.607Z | connectivity_state | 127.0.0.1:15745 IDLE -> CONNECTING
2020-05-27T02:26:02.608Z | subchannel | 127.0.0.1:15745 IDLE -> CONNECTING
2020-05-27T02:26:02.613Z | pick_first | CONNECTING -> CONNECTING
2020-05-27T02:26:02.619Z | resolving_load_balancer | 127.0.0.1:15745 CONNECTING -> CONNECTING
2020-05-27T02:26:02.620Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2020-05-27T02:26:02.620Z | connectivity_state | 127.0.0.1:15745 CONNECTING -> CONNECTING
2020-05-27T02:26:03.621Z | subchannel | 127.0.0.1:15745 CONNECTING -> IDLE
2020-05-27T02:26:03.647Z | pick_first | CONNECTING -> IDLE
2020-05-27T02:26:03.652Z | subchannel | 127.0.0.1:15745 refcount 2 -> 1
2020-05-27T02:26:03.669Z | resolving_load_balancer | 127.0.0.1:15745 CONNECTING -> IDLE
2020-05-27T02:26:03.671Z | channel | Pick result: QUEUE subchannel: undefined status: undefined undefined
2020-05-27T02:26:03.675Z | connectivity_state | 127.0.0.1:15745 CONNECTING -> IDLE
2020-05-27T02:26:03.686Z | dns_resolver | Resolution update requested for target 127.0.0.1:15745

I've seen there were some bugs being fixed w.r.t. to refcounting in that release and some other bugs in subsequent releases. I wonder if the reason it stalls in 1.0.4 is the same as for 0.7.5, maybe there are two problems. Any idea how to debug it further?

@murgatroid99
Copy link
Member

The subchannel is supposed to go into the TRANSIENT_FAILURE state when it fails to connect, so that log from 0.7.4 is buggy in a different way.

I have now managed to replicate the log you are seeing using Electron, with the exact same code that had the behavior I expected using Node. So, this is definitely Electron-specific behavior.

@murgatroid99
Copy link
Member

I figured out the problem, and the fix is in #1446. It's not exactly Electron-specific behavior; the bug is a race condition that Electron triggers a lot more reliably than Node does.

@badsyntax
Copy link
Contributor

I had the same issue with a similar scenario to OP and had to manually check the server socket was open before connecting the client. Great to see a fix will be added for this 👍

@mr-tech
Copy link

mr-tech commented Jun 3, 2020

Do you think this would this also apply if a client was making two (or more) unary requests to a server that doesn't exist? Error code 14 returns for the first and the connectivity status of the channel goes from 3 (idle) to 0 (connecting). The second request then hangs indefinitely for me. grpc 1.24.2.

@murgatroid99
Copy link
Member

grpc 1.24.2 is the other implementation. It shares no code with the library that has the bug that this issue is about. So, that sounds similar but it is definitely not the same bug.

@murgatroid99
Copy link
Member

I have published @grpc/grpc-js version 1.0.5 with a fix for this bug.

@kskalski
Copy link
Contributor Author

kskalski commented Jun 4, 2020

Thanks, works fine now.

@sidd-kulk
Copy link

I am on the version ^1.1.7. The bug is showing up still. google pub sub version is 1.1.6.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants