Only create keepalive timeout if there isn't one pending #2304

nicknotfun · 2022-12-28T22:39:54Z

From https://github.com/grpc/proposal/blob/master/A8-client-side-keepalive.md we have the keepalive time and timeout; the intention being that a keepalive ping is sent at-least every time and if there is no bytes of response within timeout the connection is considered dead.

The current grpc-js implementation has an issue in that if time is less than the actual ping response time (due to network, GC, whatever at the remote end) - a new ping would be scheduled, with a new timeout, and the old reference to the timeout is lost and thus will never be cancelled. This inevitably drops the connection.

As the intention of the gRFC [in my reading] was 'any byte' of response, I think it is then reasonable to only have a single timeout active at a time, and despite potentially multiple ping() calls, any response will close that timeout until the next ping.

Unsure on intended testing as I couldn't find an example current test- but from detailed gRPC tracing in our application I believe we are running into this issue, an example log:

28-12-2022T16:37:41 D 2022-12-28T21:37:41.594Z | keepalive | (44) 127.0.0.1:3052 Received ping response
28-12-2022T16:37:42 D 2022-12-28T21:37:42.594Z | keepalive | (44) 127.0.0.1:3052 Sending ping with timeout 10000ms
28-12-2022T16:37:42 D 2022-12-28T21:37:42.885Z | call_stream | [7] receive HTTP/2 data frame of length 17
28-12-2022T16:37:42 D 2022-12-28T21:37:42.885Z | call_stream | [7] parsed message of length 17
28-12-2022T16:37:42 D 2022-12-28T21:37:42.885Z | call_stream | [7] filterReceivedMessage of length 17
28-12-2022T16:37:42 D 2022-12-28T21:37:42.885Z | call_stream | [7] pushing to reader message of length 12
28-12-2022T16:37:42 D 2022-12-28T21:37:42.885Z | keepalive | (44) 127.0.0.1:3052 Received ping response
28-12-2022T16:37:43 D 2022-12-28T21:37:43.594Z | keepalive | (44) 127.0.0.1:3052 Ping timeout passed without response
28-12-2022T16:37:43 D 2022-12-28T21:37:43.594Z | subchannel | (44) 127.0.0.1:3052 READY -> TRANSIENT_FAILURE

It is not clear which keepalive is responsible for that timeout, however the connection did get a ping response less than a second prior so I don't think it should be considered a timeout (our settings are 1000ms/10000ms for time and timeout)

linux-foundation-easycla · 2022-12-28T22:39:56Z

The committers listed above are authorized under a signed CLA.

✅ login: nicknotfun / name: Nick Cooper (7b43638)

nicknotfun · 2022-12-28T23:01:21Z

Note, an alternative would be to quash any new ping if one is in-flight (early exit instead of conditional create) - I wasn't sure which is preferred. As the data should be reliable, i'd also expect the first ping to eventually get response.

One reason not to do this is it makes the keepalive-load from a client predictable and constant.

nicknotfun · 2022-12-29T01:00:09Z

Oh, i'm also willing to rework this to be closer to the spec overall if desired, namely a keepalive should be scheduled relative to the last ack.

murgatroid99

I think this is a good change, with just style issue to fix.

murgatroid99 · 2023-01-03T18:43:34Z

packages/grpc-js/src/subchannel.ts

@@ -121,7 +121,7 @@ export class Subchannel {
  /**
   * Timer reference tracking when the most recent ping will be considered lost
   */
-  private keepaliveTimeoutId: NodeJS.Timer;
+  private keepaliveTimeoutId?: NodeJS.Timer;


This field should be possibly null instead of optional. See for example session and channelzSocketRef in this class.

murgatroid99 · 2023-01-04T01:14:38Z

Update: I opened #2308, which is a big refactor of the Subchannel class that moves all of this code around. I integrated this change into it, so I don't think it makes sense to keep this PR open. If necessary, we can further refine the keepalive behavior after that PR is merged.

nicknotfun · 2023-01-04T02:43:13Z

Sounds good to me!

only create timeout if there isn't one pending

7b43638

murgatroid99 requested changes Jan 3, 2023

View reviewed changes

murgatroid99 mentioned this pull request Jan 4, 2023

grpc-js: Refactor Transport and SubchannelConnector out of Subchannel #2308

Merged

murgatroid99 closed this Jan 4, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Only create keepalive timeout if there isn't one pending #2304

Only create keepalive timeout if there isn't one pending #2304

nicknotfun commented Dec 28, 2022

linux-foundation-easycla bot commented Dec 28, 2022 •

edited

Loading

nicknotfun commented Dec 28, 2022 •

edited

Loading

nicknotfun commented Dec 29, 2022

murgatroid99 left a comment

murgatroid99 Jan 3, 2023

murgatroid99 commented Jan 4, 2023

nicknotfun commented Jan 4, 2023

Only create keepalive timeout if there isn't one pending #2304

Only create keepalive timeout if there isn't one pending #2304

Conversation

nicknotfun commented Dec 28, 2022

linux-foundation-easycla bot commented Dec 28, 2022 • edited Loading

nicknotfun commented Dec 28, 2022 • edited Loading

nicknotfun commented Dec 29, 2022

murgatroid99 left a comment

Choose a reason for hiding this comment

murgatroid99 Jan 3, 2023

Choose a reason for hiding this comment

murgatroid99 commented Jan 4, 2023

nicknotfun commented Jan 4, 2023

linux-foundation-easycla bot commented Dec 28, 2022 •

edited

Loading

nicknotfun commented Dec 28, 2022 •

edited

Loading