
server: make SQL connection keep alive easily configurable #115833

Merged
merged 2 commits into cockroachdb:master from setKeepAliveProperly on Jan 17, 2024

Conversation

Collaborator

@fqazi fqazi commented Dec 7, 2023

Previously, the only way to control the keep-alive time for CRDB was through the environment variable COCKROACH_SQL_TCP_KEEP_ALIVE, which set only the keep-alive idle and probe times. This was difficult to use, and the default meant that CRDB kept connections alive for 10 minutes of idle time. To make this more configurable, we will introduce two cluster settings: server.sql_tcp_keep_alive_probe_interval and server.sql_tcp_keep_alive_timeout. These settings will control the keep-alive probe / idle intervals and the probe count (derived from the timeout) after which the connection will be dropped.

Fixes: #115422 (sql: detect broken TCP client connections sooner)
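
For context, here is a minimal sketch of how such duration settings are registered, modeled on the settings.RegisterDurationSetting call quoted in the review below. The defaults are placeholders and the probe-count derivation is an assumption rather than this PR's final diff (during review the names were also changed to server.sql_tcp_keep_alive.interval and server.sql_tcp_keep_alive.count):

	package server

	import (
		"time"

		"github.com/cockroachdb/cockroach/pkg/settings"
	)

	var KeepAliveProbeInterval = settings.RegisterDurationSetting(
		settings.ApplicationLevel,
		"server.sql_tcp_keep_alive_probe_interval",
		"time between TCP keep-alive probes sent on SQL connections",
		time.Second*10, // placeholder default
	)

	var KeepAliveTimeout = settings.RegisterDurationSetting(
		settings.ApplicationLevel,
		"server.sql_tcp_keep_alive_timeout",
		"maximum time an unresponsive TCP connection to the server is kept alive",
		time.Minute*2, // default value discussed in the review below
	)

	// The probe count handed to the OS is then derived as roughly
	// timeout / probe interval, rounded up.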

@fqazi fqazi requested a review from erikgrinaker December 7, 2023 21:48
@cockroach-teamcity
Member

This change is Reviewable

@fqazi fqazi force-pushed the setKeepAliveProperly branch 11 times, most recently from eb76445 to d22e3be Compare December 8, 2023 19:10
@fqazi fqazi marked this pull request as ready for review December 8, 2023 19:10
@fqazi fqazi requested a review from a team December 8, 2023 19:10
@fqazi fqazi requested review from a team as code owners December 8, 2023 19:10
@fqazi fqazi requested review from abarganier and removed request for a team December 8, 2023 19:10
@fqazi fqazi force-pushed the setKeepAliveProperly branch 2 times, most recently from b398efe to c753ad9 Compare December 9, 2023 02:59
Contributor

@erikgrinaker erikgrinaker left a comment


This change may be too large to backport. Are we planning a more targeted fix for backports?

Reviewed 5 of 5 files at r1, 8 of 8 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @fqazi)


-- commits line 27 at r2:
This also changes the default values, and removes the env var -- we should call that out as well.


pkg/server/tcp_keepalive_manager.go line 28 at r2 (raw file):

var KeepAliveTimeout = settings.RegisterDurationSetting(
	settings.ApplicationLevel,
	"server.sql_tcp_keep_alive.timeout",

Did you consider simply exposing the parameters directly, like Postgres does?
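
(For reference, Postgres exposes these directly as the tcp_keepalives_idle, tcp_keepalives_interval, and tcp_keepalives_count server parameters.)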


pkg/server/tcp_keepalive_manager.go line 29 at r2 (raw file):

	settings.ApplicationLevel,
	"server.sql_tcp_keep_alive.timeout",
	"maximum time for which an idle TCP connection to the server will be kept alive",

Consider "unresponsive" instead of "idle". It's important to distinguish this from the case of a connected live client that's simply not doing anything.


pkg/server/tcp_keepalive_manager.go line 30 at r2 (raw file):

	"server.sql_tcp_keep_alive.timeout",
	"maximum time for which an idle TCP connection to the server will be kept alive",
	time.Minute*2,

2 minutes seems very high here. I think the only benefit of a high timeout is to tolerate high network latency and transient disconnects, while the downsides can be quite severe (workload stall). Is it important to handle a 2 minute network latency or disconnect?


pkg/server/tcp_keepalive_manager.go line 79 at r2 (raw file):

	// probe count.
	maximumConnectionTime := KeepAliveTimeout.Get(&k.settings.SV)
	probeFrequency := KeepAliveProbeFrequency.Get(&k.settings.SV)

We don't guard against either of these being set to 0 (KeepAliveTimeout = 0 will result in division by 0 below).

We should either take 0 to mean no probes are sent (like in Postgres), or disallow it.
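
A sketch of the first option, treating 0 as disabling probes the way Postgres does; this illustrates the suggestion and is not the code the PR ended up with:

	maximumConnectionTime := KeepAliveTimeout.Get(&k.settings.SV)
	probeFrequency := KeepAliveProbeFrequency.Get(&k.settings.SV)
	if maximumConnectionTime == 0 || probeFrequency == 0 {
		// Zero disables keep-alive probing entirely, mirroring the
		// semantics of Postgres's tcp_keepalives_* parameters, and
		// avoids the division by zero below.
		return nil
	}
	// Round up so that short timeouts still yield at least one probe.
	probeCount := int((maximumConnectionTime + probeFrequency - 1) / probeFrequency)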


pkg/server/tcp_keepalive_manager_test.go line 31 at r2 (raw file):

)

func TestKeepAliveManager(t *testing.T) {

We should have an end-to-end test as well, which tests that an unresponsive SQL client connection is in fact disconnected and its session torn down within the expected time.

@abarganier abarganier removed their request for review December 12, 2023 15:27
@fqazi fqazi force-pushed the setKeepAliveProperly branch from c753ad9 to 49cc4ba Compare January 10, 2024 18:42
@fqazi fqazi requested a review from a team as a code owner January 10, 2024 18:42
@fqazi fqazi requested review from DarrylWong and renatolabs and removed request for a team January 10, 2024 18:42
@fqazi fqazi force-pushed the setKeepAliveProperly branch from 49cc4ba to 5d427d3 Compare January 10, 2024 18:45
Collaborator Author

@fqazi fqazi left a comment


I'm tempted to say we just narrow down the bits for Linux and make them an environment variable, enabled by default. What do you think?

@rafiss @erikgrinaker RFAL

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @erikgrinaker, and @renatolabs)


pkg/server/tcp_keepalive_manager.go line 28 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Did you consider simply exposing the parameters directly, like Postgres does?

Yeah, let's keep this simple. I named it:
server.sql_tcp_keep_alive.count


pkg/server/tcp_keepalive_manager.go line 29 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Consider "unresponsive" instead of "idle". It's important to distinguish this from the case of a connected live client that's simply not doing anything.

Good point, changed this to:

maximum number of probes that will be sent out before a connection dropped because its unresponsive (Linux and Darwin only)


pkg/server/tcp_keepalive_manager.go line 30 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

2 minutes seems very high here. I think the only benefit of a high timeout is to tolerate high network latency and transient disconnects, while the downsides can be quite severe (workload stall). Is it important to handle a 2 minute network latency or disconnect?

I reduced this to a minute, let me know what your gut feel is for this setting.

@fqazi fqazi force-pushed the setKeepAliveProperly branch from 5d427d3 to e233d7f Compare January 10, 2024 20:27
Contributor

@erikgrinaker erikgrinaker left a comment


Reviewed 5 of 5 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @fqazi, and @renatolabs)


pkg/cmd/roachtest/tests/network.go line 333 at r3 (raw file):

			return err
		}
		// We expect 2 since the query will itself

nit: comment seems stale (count != 1)


pkg/cmd/roachtest/tests/network.go line 349 at r3 (raw file):

# Drop any client traffic to CRDB.
sudo iptables -A INPUT -p tcp --dport 26257 -j DROP;

This should be --sport 26257, not --dport. The traffic is coming from the CRDB process on the remote node (so its source port is 26257), to a random client port (dport) on this node.

As is, the client will receive packets from the server, but can't send packets to it. That still works, but we may as well sever the connection completely.
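
With that change, the rule pair on the client node would look like this (an illustration of the reviewer's suggestion, not the exact diff):

	sudo iptables -A INPUT -p tcp --sport 26257 -j DROP;
	sudo iptables -A OUTPUT -p tcp --dport 26257 -j DROP;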


pkg/cmd/roachtest/tests/network.go line 352 at r3 (raw file):

sudo iptables -A OUTPUT -p tcp --dport 26257 -j DROP;

sudo iptables-save

nit: does this have any purpose? This just saves the ruleset to disk in case the host restarts; the rules are already applied.


pkg/cmd/roachtest/tests/network.go line 363 at r3 (raw file):

		const restoreNet = `
set -e;
sudo iptables -D INPUT -p tcp --dport 26257 -j DROP;

nit: can also just do iptables -F INPUT to remove all rules.


pkg/server/tcp_keepalive_manager.go line 30 at r2 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

I reduced this to a minute, let me know what your gut feel is for this setting.

I think that's still towards the higher end. For comparison, CRDB RPC connections use a keepalive timeout of 10 seconds, and that's only because gRPC doesn't allow lower values for various reasons.

// 10 seconds is the minimum keepalive interval permitted by gRPC.
// Setting it to a value lower than this will lead to gRPC adjusting to this
// value and annoyingly logging "Adjusting keepalive ping interval to minimum
// period of 10s". See grpc/grpc-go#2642.
const minimumClientKeepaliveInterval = 10 * time.Second

This basically says "how long do we want to allow a client to disconnect from the network and then reconnect without closing its session"? Since we'll be holding onto the transaction's locks in the meanwhile, I'm inclined to make this more aggressive -- it seems better for a client to retry instead if this should happen. Maybe 20-30 seconds? That still feels pretty lenient -- I don't think it's unreasonable to disconnect a client who falls off the network for that long. Then again, it's probably going to break something for someone somewhere, and 1 minute would be the conservative option -- it's still way better than 10 minutes.

I guess there's also a question of how chatty we want to be on idle client connections.

I'll leave this to SQL Foundations' discretion, since you probably know more about client environments and such -- personally I'd lean towards something like 30 seconds, but 1 minute also seems reasonable. Hard to say what the optimal value is.


pkg/server/tcp_keepalive_manager.go line 29 at r3 (raw file):

	settings.ApplicationLevel,
	"server.sql_tcp_keep_alive.count",
	"maximum number of probes that will be sent out before a connection dropped because "+

nit: "a connection is dropped" and "it's unresponsive"

@fqazi fqazi force-pushed the setKeepAliveProperly branch from e233d7f to c4d2a10 Compare January 12, 2024 21:06
Collaborator Author

@fqazi fqazi left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @erikgrinaker, and @renatolabs)


pkg/cmd/roachtest/tests/network.go line 352 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: does this have any purpose? This just saves the ruleset to disk in case the host restarts, the rules are already applied.

Done.

Good point, cleaned this up.


pkg/server/tcp_keepalive_manager.go line 30 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

I think that's still towards the higher end. For comparison, CRDB RPC connections use a keepalive timeout of 10 seconds, and that's only because gRPC doesn't allow lower values for various reasons.

// 10 seconds is the minimum keepalive interval permitted by gRPC.
// Setting it to a value lower than this will lead to gRPC adjusting to this
// value and annoyingly logging "Adjusting keepalive ping interval to minimum
// period of 10s". See grpc/grpc-go#2642.
const minimumClientKeepaliveInterval = 10 * time.Second

This basically says "how long do we want to allow a client to disconnect from the network and then reconnect without closing its session"? Since we'll be holding onto the transaction's locks in the meanwhile, I'm inclined to make this more aggressive -- it seems better for a client to retry instead if this should happen. Maybe 20-30 seconds? That still feels pretty lenient -- I don't think it's unreasonable to disconnect a client who falls off the network for that long. Then again, it's probably going to break something for someone somewhere, and 1 minute would be the conservative option -- it's still way better than 10 minutes.

I guess there's also a question of how chatty we want to be on idle client connections.

I'll leave this to SQL Foundations' discretion, since you probably know more about client environments and such -- personally I'd lean towards something like 30 seconds, but 1 minute also seems reasonable. Hard to say what the optimal value is.

Done.

Yeah, I dropped this down to ~30 seconds with a probe every 10 seconds. Also, I was probably being too cautious, since I wouldn't expect packet loss that heavy to a database server.

@fqazi fqazi force-pushed the setKeepAliveProperly branch from c4d2a10 to 5f21d54 Compare January 16, 2024 19:45
@fqazi
Collaborator Author

fqazi commented Jan 16, 2024

@erikgrinaker TFTR!

bors r+

@craig
Contributor

craig bot commented Jan 16, 2024

Build failed (retrying...):

@renatolabs
Contributor

This PR has drifted and the roachtest changes here no longer compile on master because of #117320.

TL;DR: The list of nodes to run a command on should be passed to RunE using option.WithNodes(nodes).

bors r-

@craig
Contributor

craig bot commented Jan 16, 2024

Canceled.

Previously, we relied solely on setting the keep-alive probe interval
and keep-alive idle time to control how long a connection stayed active.
Unfortunately, this did not have the desired effect, since the number of
probes was not set. As a result, unresponsive connections could be kept
around for up to 10 minutes. To allow better configuration of these
settings, we need an API to set the keep-alive count on a socket. This
patch adds a sysutil to control the keep-alive count to address this.

Release note: None
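
On Linux, such a sysutil boils down to setting TCP_KEEPCNT on the socket. A rough sketch using golang.org/x/sys/unix follows; the helper name and signature are illustrative assumptions, not necessarily what this patch adds:

	package sysutil

	import (
		"syscall"

		"golang.org/x/sys/unix"
	)

	// SetKeepAliveCount sets TCP_KEEPCNT: the number of unacknowledged
	// keep-alive probes the kernel sends before dropping the connection.
	func SetKeepAliveCount(conn syscall.RawConn, count int) error {
		var sockErr error
		err := conn.Control(func(fd uintptr) {
			sockErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_KEEPCNT, count)
		})
		if err != nil {
			return err
		}
		return sockErr
	}

The syscall.RawConn would typically come from calling SyscallConn() on the accepted *net.TCPConn.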
@fqazi fqazi force-pushed the setKeepAliveProperly branch from 5f21d54 to e903271 Compare January 16, 2024 21:38
Previously, we only had an environment variable for controlling the
keep-alive probe interval and keep-alive idle time for pgwire
connections. Unfortunately, this defaulted to around 10 minutes on
Linux, since the probe count was not configurable either. To address
this, this patch adds the settings server.sql_tcp_keep_alive.count and
server.sql_tcp_keep_alive.interval, which allow much finer control than
before. These settings can be used to pick an overall connection
timeout for unresponsive connections as well as the probe interval.

Fixes: cockroachdb#115422
Release note (sql change): Added configurable settings for the total
number of TCP keep-alive probes (server.sql_tcp_keep_alive.count) and
the TCP probe interval (server.sql_tcp_keep_alive.interval) for SQL
connections. These settings subsume and remove the
COCKROACH_SQL_TCP_KEEP_ALIVE environment variable.
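
Assuming the names above, operators would then tune these like any other cluster settings, for example (the values here are illustrative, not the defaults):

	SET CLUSTER SETTING server.sql_tcp_keep_alive.interval = '10s';
	SET CLUSTER SETTING server.sql_tcp_keep_alive.count = 3;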
@fqazi fqazi force-pushed the setKeepAliveProperly branch from e903271 to af23e06 Compare January 16, 2024 21:39
@fqazi
Collaborator Author

fqazi commented Jan 16, 2024

bors r+

@craig
Contributor

craig bot commented Jan 17, 2024

Build succeeded:

@craig craig bot merged commit b743201 into cockroachdb:master Jan 17, 2024
8 of 9 checks passed