Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-23.1: rpc: increase gRPC server timeout from 1x to 2x NetworkTimeout #109620

Merged
merged 1 commit into from
Aug 29, 2023

Conversation

blathers-crl[bot]
Copy link

@blathers-crl blathers-crl bot commented Aug 28, 2023

Backport 1/1 commits from #109578 on behalf of @erikgrinaker.

/cc @cockroachdb/release


This is intended as a conservative backport that changes as little as possible. For 23.2, we should restructure these settings a bit, possibly by removing NetworkTimeout and using independent timeouts for each component/parameter, since they have unique considerations (e.g. whether they are enforced above the Go runtime or by the OS, to what extent they are subject to RPC head-of-line blocking, etc).


This patch increases the gRPC server timeout from 1x to 2x NetworkTimeout. This timeout determines how long the server will wait for a TCP send to receive a TCP ack before automatically closing the connection. gRPC enforces this via the OS TCP stack by setting TCP_USER_TIMEOUT on the network socket.

While NetworkTimeout should be sufficient here, we have seen instances where this is affected by node load or other factors, so we set it to 2x NetworkTimeout to avoid spurious closed connections. An aggressive timeout is not particularly beneficial here, because the client-side timeout (in our case the CRDB RPC heartbeat) is what matters for recovery time following network or node outages -- the server side doesn't really care if the connection remains open for a bit longer.

Touches #109317.

Epic: none
Release note (ops change): The default gRPC server-side send timeout has been increased from 2 seconds to 4 seconds (1x to 2x of COCKROACH_NETWORK_TIMEOUT), to avoid spurious connection failures in certain scenarios. This can be controlled via the new environment variable COCKROACH_RPC_SERVER_TIMEOUT.


Release justification: Some current 23.1 deployments are hitting a timeout that is related to setting this to 2s. Bumping the default to 4s and disconnecting from the other parameters will help.

This patch increases the gRPC server timeout from 1x to 2x
NetworkTimeout. This timeout determines how long the server will wait
for a TCP send to receive a TCP ack before automatically closing the
connection. gRPC enforces this via the OS TCP stack by setting
TCP_USER_TIMEOUT on the network socket.

While NetworkTimeout should be sufficient here, we have seen instances
where this is affected by node load or other factors, so we set it to 2x
NetworkTimeout to avoid spurious closed connections. An aggressive
timeout is not particularly beneficial here, because the client-side
timeout (in our case the CRDB RPC heartbeat) is what matters for
recovery time following network or node outages -- the server side
doesn't really care if the connection remains open for a bit longer.

Epic: none
Release note (ops change): The default gRPC server-side send timeout has
been increased from 2 seconds to 4 seconds (1x to 2x of
COCKROACH_NETWORK_TIMEOUT), to avoid spurious connection failures in
certain scenarios. This can be controlled via the new environment
variable COCKROACH_RPC_SERVER_TIMEOUT.
@blathers-crl blathers-crl bot requested a review from a team as a code owner August 28, 2023 19:22
@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-23.1-109578 branch from b05ca5c to 4ec6a7e Compare August 28, 2023 19:22
@blathers-crl blathers-crl bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels Aug 28, 2023
@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-23.1-109578 branch from e634e41 to 71541c3 Compare August 28, 2023 19:22
@blathers-crl
Copy link
Author

blathers-crl bot commented Aug 28, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@blathers-crl blathers-crl bot requested review from andrewbaptist, sean- and a team August 28, 2023 19:22
@blathers-crl blathers-crl bot added the backport Label PR's that are backports to older release branches label Aug 28, 2023
@blathers-crl
Copy link
Author

blathers-crl bot commented Aug 28, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@andrewbaptist andrewbaptist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport Label PR's that are backports to older release branches blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants