server: make SQL connection keep alive easily configurable #115833
Conversation
Force-pushed eb76445 to d22e3be
Force-pushed b398efe to c753ad9
This change may be too large to backport. Are we planning a more targeted fix for backports?
Reviewed 5 of 5 files at r1, 8 of 8 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @fqazi)
-- commits
line 27 at r2:
This also changes the default values, and removes the env var -- we should call that out as well.
pkg/server/tcp_keepalive_manager.go
line 28 at r2 (raw file):
var KeepAliveTimeout = settings.RegisterDurationSetting(
	settings.ApplicationLevel,
	"server.sql_tcp_keep_alive.timeout",
Did you consider simply exposing the parameters directly, like Postgres does?
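For illustration, exposing the knobs directly the way Postgres does (tcp_keepalives_count, tcp_keepalives_interval, tcp_keepalives_idle) might look roughly like the sketch below. The registration calls are approximated from the snippet quoted above, and the defaults are purely illustrative, not the values this PR settles on.

package server

import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/settings"
)

// Sketch: one cluster setting per kernel knob, mirroring Postgres. A third
// setting for the idle time before the first probe could be added the same way.
var KeepAliveProbeCount = settings.RegisterIntSetting(
	settings.ApplicationLevel,
	"server.sql_tcp_keep_alive.count",
	"number of unanswered keep-alive probes before an unresponsive SQL connection is dropped",
	3, // illustrative default
)

var KeepAliveProbeFrequency = settings.RegisterDurationSetting(
	settings.ApplicationLevel,
	"server.sql_tcp_keep_alive.interval",
	"time between keep-alive probes sent on an unresponsive SQL connection",
	10*time.Second, // illustrative default
)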
pkg/server/tcp_keepalive_manager.go
line 29 at r2 (raw file):
	settings.ApplicationLevel,
	"server.sql_tcp_keep_alive.timeout",
	"maximum time for which an idle TCP connection to the server will be kept alive",
Consider "unresponsive" instead of "idle". It's important to distinguish this from the case of a connected live client that's simply not doing anything.
pkg/server/tcp_keepalive_manager.go
line 30 at r2 (raw file):
"server.sql_tcp_keep_alive.timeout", "maximum time for which an idle TCP connection to the server will be kept alive", time.Minute*2,
2 minutes seems very high here. I think the only benefit of a high timeout is to tolerate high network latency and transient disconnects, while the downsides can be quite severe (workload stall). Is it important to handle a 2 minute network latency or disconnect?
pkg/server/tcp_keepalive_manager.go
line 79 at r2 (raw file):
	// probe count.
	maximumConnectionTime := KeepAliveTimeout.Get(&k.settings.SV)
	probeFrequency := KeepAliveProbeFrequency.Get(&k.settings.SV)
We don't guard against either of these being set to 0 (KeepAliveTimeout = 0
will result in division by 0 below).
We should either take 0 to mean no probes are sent (like in Postgres), or disallow it.
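A minimal sketch of the kind of guard meant here, assuming 0 is taken to mean keep-alive is disabled (the helper name and shape are hypothetical, not the patch's code):

package server

import "time"

// deriveProbeCount converts the overall timeout and probe interval into a
// probe count. A zero (or negative) timeout or interval disables keep-alive
// entirely, avoiding the division by zero mentioned above.
func deriveProbeCount(timeout, interval time.Duration) (count int, enabled bool) {
	if timeout <= 0 || interval <= 0 {
		return 0, false
	}
	count = int(timeout / interval)
	if count < 1 {
		count = 1
	}
	return count, true
}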
pkg/server/tcp_keepalive_manager_test.go
line 31 at r2 (raw file):
)

func TestKeepAliveManager(t *testing.T) {
We should have an end-to-end test as well, which tests that an unresponsive SQL client connection is in fact disconnected and its session torn down within the expected time.
Force-pushed c753ad9 to 49cc4ba
Force-pushed 49cc4ba to 5d427d3
I'm tempted to say we just narrow down the bits for linux and make them an environment variable enabled by default. What do you think?
@rafiss @erikgrinaker RFAL
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @erikgrinaker, and @renatolabs)
pkg/server/tcp_keepalive_manager.go
line 28 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Did you consider simply exposing the parameters directly, like Postgres does?
Yeah, let's keep this simple, and I named it as:
server.sql_tcp_keep_alive.count
pkg/server/tcp_keepalive_manager.go
line 29 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Consider "unresponsive" instead of "idle". It's important to distinguish this from the case of a connected live client that's simply not doing anything.
Good point, changed this to:
maximum number of probes that will be sent out before a connection dropped because its unresponsive (Linux and Darwin only)
pkg/server/tcp_keepalive_manager.go
line 30 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
2 minutes seems very high here. I think the only benefit of a high timeout is to tolerate high network latency and transient disconnects, while the downsides can be quite severe (workload stall). Is it important to handle a 2 minute network latency or disconnect?
I reduced this to a minute, let me know what your gut feel is for this setting.
Force-pushed 5d427d3 to e233d7f
Reviewed 5 of 5 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @fqazi, and @renatolabs)
pkg/cmd/roachtest/tests/network.go
line 333 at r3 (raw file):
		return err
	}
	// We expect 2 since the query will itself
nit: comment seems stale (count != 1)
pkg/cmd/roachtest/tests/network.go
line 349 at r3 (raw file):
# Drop any client traffic to CRDB.
sudo iptables -A INPUT -p tcp --dport 26257 -j DROP;
This should be --sport 26257, not --dport. The traffic is coming from the CRDB process on the remote node (so its source port is 26257), to a random client port (dport) on this node.
As is, the client will receive packets from the server, but can't send packets to it. That still works, but we may as well sever the connection completely.
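A sketch of the corrected rule set, written as the kind of Go constant the roachtest embeds (the constant name is hypothetical; the substantive change is --sport on the INPUT rule):

package tests

// blockNet drops packets arriving from CRDB (source port 26257) and packets
// destined for CRDB (destination port 26257), severing the connection in
// both directions.
const blockNet = `
set -e;
sudo iptables -A INPUT  -p tcp --sport 26257 -j DROP;
sudo iptables -A OUTPUT -p tcp --dport 26257 -j DROP;
`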
pkg/cmd/roachtest/tests/network.go
line 352 at r3 (raw file):
sudo iptables -A OUTPUT -p tcp --dport 26257 -j DROP;
sudo iptables-save
nit: does this have any purpose? This just saves the ruleset to disk in case the host restarts, the rules are already applied.
pkg/cmd/roachtest/tests/network.go
line 363 at r3 (raw file):
const restoreNet = `
set -e;
sudo iptables -D INPUT -p tcp --dport 26257 -j DROP;
nit: can also just do iptables -F INPUT to remove all rules.
pkg/server/tcp_keepalive_manager.go
line 30 at r2 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
I reduced this to a minute, let me know what your gut feel is for this setting.
I think that's still towards the higher end. For comparison, CRDB RPC connections use a keepalive timeout of 10 seconds, and that's only because gRPC doesn't allow lower values for various reasons.
cockroach/pkg/rpc/keepalive.go
Lines 55 to 59 in 2e6eb2c
// 10 seconds is the minimum keepalive interval permitted by gRPC.
// Setting it to a value lower than this will lead to gRPC adjusting to this
// value and annoyingly logging "Adjusting keepalive ping interval to minimum
// period of 10s". See grpc/grpc-go#2642.
const minimumClientKeepaliveInterval = 10 * time.Second
This basically says "how long do we want to allow a client to disconnect from the network and then reconnect without closing its session"? Since we'll be holding onto the transaction's locks in the meanwhile, I'm inclined to make this more aggressive -- it seems better for a client to retry instead if this should happen. Maybe 20-30 seconds? That still feels pretty lenient -- I don't think it's unreasonable to disconnect a client who falls off the network for that long. Then again, it's probably going to break something for someone somewhere, and 1 minute would be the conservative option -- it's still way better than 10 minutes.
I guess there's also a question of how chatty we want to be on idle client connections.
I'll leave this to SQL Foundations' discretion, since you probably know more about client environments and such -- personally I'd lean towards something like 30 seconds, but 1 minute also seems reasonable. Hard to say what the optimal value is.
pkg/server/tcp_keepalive_manager.go
line 29 at r3 (raw file):
	settings.ApplicationLevel,
	"server.sql_tcp_keep_alive.count",
	"maximum number of probes that will be sent out before a connection dropped because "+
nit: "a connection is dropped" and "it's unresponsive"
Force-pushed e233d7f to c4d2a10
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @erikgrinaker, and @renatolabs)
pkg/cmd/roachtest/tests/network.go
line 352 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: does this have any purpose? This just saves the ruleset to disk in case the host restarts, the rules are already applied.
Done.
Good point, cleaned this up.
pkg/server/tcp_keepalive_manager.go
line 30 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I think that's still towards the higher end. For comparison, CRDB RPC connections use a keepalive timeout of 10 seconds, and that's only because gRPC doesn't allow lower values for various reasons.
cockroach/pkg/rpc/keepalive.go
Lines 55 to 59 in 2e6eb2c
// 10 seconds is the minimum keepalive interval permitted by gRPC.
// Setting it to a value lower than this will lead to gRPC adjusting to this
// value and annoyingly logging "Adjusting keepalive ping interval to minimum
// period of 10s". See grpc/grpc-go#2642.
const minimumClientKeepaliveInterval = 10 * time.Second
This basically says "how long do we want to allow a client to disconnect from the network and then reconnect without closing its session"? Since we'll be holding onto the transaction's locks in the meanwhile, I'm inclined to make this more aggressive -- it seems better for a client to retry instead if this should happen. Maybe 20-30 seconds? That still feels pretty lenient -- I don't think it's unreasonable to disconnect a client who falls off the network for that long. Then again, it's probably going to break something for someone somewhere, and 1 minute would be the conservative option -- it's still way better than 10 minutes.
I guess there's also a question of how chatty we want to be on idle client connections.
I'll leave this to SQL Foundations' discretion, since you probably know more about client environments and such -- personally I'd lean towards something like 30 seconds, but 1 minute also seems reasonable. Hard to say what the optimal value is.
Done.
Yeah, I dropped this down to ~30 seconds with a probe every 10 seconds. Also, I'm probably being too cautious, since I wouldn't expect packet loss that heavy to a database server.
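For reference, with a 30-second timeout and a 10-second probe interval the numbers work out roughly as in the sketch below. It assumes the idle time before the first probe equals the probe interval, which may not match the implementation exactly.

package main

import (
	"fmt"
	"time"
)

func main() {
	idle := 10 * time.Second     // assumed time before the first probe
	interval := 10 * time.Second // time between unanswered probes
	timeout := 30 * time.Second  // overall keep-alive budget

	probes := int(timeout / interval) // 3 unanswered probes before the drop
	detection := idle + time.Duration(probes)*interval
	fmt.Printf("probes=%d worst-case detection=%s\n", probes, detection) // probes=3 worst-case detection=40s
}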
Force-pushed c4d2a10 to 5f21d54
@erikgrinaker TFTR! bors r+
Build failed (retrying...):
This PR has drifted and the roachtest changes here no longer compile on master because of #117320. TLDR: The list of nodes to run a command in should be passed to …
bors r-
Canceled.
Previously, we relied solely on setting the keep-alive probe interval and keep-alive idle time to control how long a connection stayed active. Unfortunately, this did not have the desired effect, since the number of probes was not set: unresponsive connections could be kept around for up to 10 minutes. To allow better configuration of these settings, we need an API to set the keep-alive count on a socket. This patch adds a sysutil to control the keep-alive count.

Release note: None
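The kernel knob involved is TCP_KEEPCNT. Below is a minimal sketch of setting it on an accepted connection using golang.org/x/sys/unix on Linux/Darwin; it illustrates the socket option, not the sysutil this patch actually adds.

package sysutil

import (
	"net"

	"golang.org/x/sys/unix"
)

// setKeepAliveCount is a hypothetical helper: after count unanswered
// keep-alive probes, the kernel drops the connection.
func setKeepAliveCount(conn *net.TCPConn, count int) error {
	raw, err := conn.SyscallConn()
	if err != nil {
		return err
	}
	var sockErr error
	if err := raw.Control(func(fd uintptr) {
		sockErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_KEEPCNT, count)
	}); err != nil {
		return err
	}
	return sockErr
}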
Force-pushed 5f21d54 to e903271
Previously, we only had an environment variable for controlling the keep-alive probe interval and keep-alive idle time for pgwire connections. Unfortunately, the effective timeout defaulted to around 10 minutes on Linux, since the probe count was not configurable either. To address this, this patch adds the settings server.sql_tcp_keep_alive.count and server.sql_tcp_keep_alive.interval, which allow much better control than before. Together they determine the probe interval and the overall timeout after which an unresponsive connection is dropped.

Fixes: cockroachdb#115422

Release note (sql change): Add configurable settings for the total number of TCP keep-alive probes (server.sql_tcp_keep_alive.count) and the TCP probe interval (server.sql_tcp_keep_alive.interval) for SQL connections. The COCKROACH_SQL_TCP_KEEP_ALIVE environment variable is removed, subsumed by these settings.
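As an aside for readers on newer Go: since Go 1.23 the standard library exposes the same three knobs these settings drive, which makes their meaning easy to see in isolation. The snippet below is unrelated to CRDB's implementation; the address and values are illustrative.

package main

import (
	"log"
	"net"
	"time"
)

func main() {
	conn, err := net.Dial("tcp", "localhost:26257") // illustrative address
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Idle: how long a connection sits quiet before the first probe.
	// Interval: how long between unanswered probes.
	// Count: how many unanswered probes before the kernel drops the connection.
	err = conn.(*net.TCPConn).SetKeepAliveConfig(net.KeepAliveConfig{
		Enable:   true,
		Idle:     10 * time.Second,
		Interval: 10 * time.Second,
		Count:    3,
	})
	if err != nil {
		log.Fatal(err)
	}
}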
Force-pushed e903271 to af23e06
bors r+
Build succeeded:
Previously, the only way to control keep-alive behavior for CRDB was the environment variable COCKROACH_SQL_TCP_KEEP_ALIVE, which set only the keep-alive idle and probe times. This was difficult to use, and the defaults meant that CRDB kept unresponsive connections around for up to 10 minutes. To make this more configurable, we introduce two cluster settings: server.sql_tcp_keep_alive_probe_interval and server.sql_tcp_keep_alive_timeout. These settings control the keep-alive probe/idle intervals and the probe count (derived from the timeout) before the connection is dropped.
Fixes: #115422