Skip to content

Commit

Permalink
server, all: rename some cluster settings
Browse files Browse the repository at this point in the history
The following settings have been renamed, to emphasize how the initial
drain wait is a *mandatory waits*, whereas the others are
simply *timeouts*.

Release note (cli change): The following cluster settings have been
renamed; the previous names remain available for
backward-compatibility.

| Previous name                         | New Name                                           |
|---------------------------------------|----------------------------------------------------|
| `server.shutdown.drain_wait`          | `server.shutdown.initial_wait`                     |
| `server.shutdown.lease_transfer_wait` | `server.shutdown.lease_transfer_iteration.timeout` |
| `server.shutdown.query_wait`          | `server.shutdown.queries.timeout`                  |
| `server.shutdown.connection_wait`     | `server.shutdown.transactions.timeout`             |
| `server.shutdown.jobs_wait`           | `server.shutdown.jobs.timeout`                     |
  • Loading branch information
knz authored and Thomas Hardy committed Oct 4, 2023
1 parent 4de2c7a commit 044867a
Show file tree
Hide file tree
Showing 15 changed files with 88 additions and 73 deletions.
30 changes: 15 additions & 15 deletions docs/RFCS/20211207_graceful_draining.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,11 +26,11 @@ in which the server disallows new connections and waits until all sessions
are closed by clients to shut down. While in CockroachDB’s case, the
draining process consists of two consecutive periods, `drain_wait` and
`query_wait`. Their duration is configured via two
[cluster settings][cockroach cluster settings],`server.shutdown.drain_wait`
and `server.shutdown.query_wait`, with the former “controlling the amount of
[cluster settings][cockroach cluster settings],`server.shutdown.initial_wait`
and `server.shutdown.transactions.timeout`, with the former “controlling the amount of
time a server waits in an unready state before proceeding with the rest of the
shutdown process”, and the latter specifying “the amount of time a server waits
for existent queries to finish”. CockroachDB expects all SQL connections to be
for current txns to finish”. CockroachDB expects all SQL connections to be
closed by clients before the end of `drain_wait`. After `drain_wait` ends and
`query_wait` starts, the server shuts down all SQL connections that are not with
inflight queries (such as open transaction sessions). After the end of
Expand All @@ -50,7 +50,7 @@ and the workaround setting for `drain_wait`.

![Current Draining Process][current draining pic]
The current workaround that CockroachDB provides is to set the cluster setting
`server.shutdown.drain_wait` longer than the sum of the maximum lifetime of a
`server.shutdown.initial_wait` longer than the sum of the maximum lifetime of a
connection in the connection pool (`MaxLifeTime`), the time for the load
balancer to stop routing new SQL connections, and the max transaction latency.
This setting will let connections reach `MaxLifeTime` and be recycled by
Expand All @@ -61,9 +61,9 @@ from a different node.

The downsides of this workaround are:

1. The setting `server.shutdown.drain_wait` has to be adjusted based on
1. The setting `server.shutdown.initial_wait` has to be adjusted based on
clients’ configuration of the connection pooling and load balancer.
2. The server is doing a hard wait till the end of `server.shutdown.drain_wait`.
2. The server is doing a hard wait till the end of `server.shutdown.initial_wait`.
So though setting a long `drain_wait` can help ensure all SQL connections
are closed before the (true) draining starts, it brings a long downtime
for this node.
Expand Down Expand Up @@ -165,8 +165,8 @@ connectionWait = settings.RegisterDurationSetting(
continue with the rest of the shutdown process before the end of
connection_wait.”,
0*time.Second,
settings.WithName("server.shutdown.connections.timeout"),
).WithPublic()
```
The default duration length of `connection_wait` is set to be 0 seconds.
Expand Down Expand Up @@ -250,11 +250,11 @@ timeline for the draining:
- lease transfer stage (unchanged): The server keeps retrying until all range
leases and raft leaderships on this draining node have been tranfered.
**In each iteration**, the kvserver transfers the leases with timeout
specified by `server.shutdown.lease_transfer_wait`.
specified by `server.shutdown.lease_transfer_iteration.timeout`.
(Which means the total duration of the lease transfer stage is not totally
determined by the value of `server.shutdown.lease_transfer_wait"`.
determined by the value of `server.shutdown.lease_transfer_iteration.timeout"`.
We [propose][Lease Transfer Issue] to rename this cluster setting with
`server.shutdown.lease_transfer_timeout"`, and document it seperately with
`server.shutdown.lease_transfer_iteration.timeout"`, and document it seperately with
the other three waiting periods.)
# Demo with changes from Technical Design
Expand Down Expand Up @@ -313,7 +313,7 @@ during draining.
upgrade? Which migrations are needed?
- If certain nodes in the cluster are not updated to support this feature,
it may lead to confusion. We may need a version gate for
`server.shutdown.connection_wait`.
`server.shutdown.connections.timeout`.
- Is the result/usage of this change different for CC end-users than for
on-prem deployments? How?
- The new functionality in this design is expected to be used in multitenent
Expand Down Expand Up @@ -372,20 +372,20 @@ during draining.
system during each waiting period. We can refer to the `final state of this
design` in this RFC.
- How to correctly tune the cluster settings
- `server.shutdown.drain_wait` (the wait for the load balancer to detect
- `server.shutdown.initial_wait` (the wait for the load balancer to detect
the draining status of the node) should be in coordination with the load
balancer's settings, such as the health check interval.
- `server.shutdown.connection_wait` (the timeout to wait for all SQL
- `server.shutdown.connections.timeout` (the timeout to wait for all SQL
connections to be closed) should be longer than the maximum life
time of a connection as specified in the connection pool.
- `server.shutdown.query_wait` (the timeout to wait all SQL queries to
- `server.shutdown.transactions.timeout` (the timeout to wait all SQL txns to
finish) should be greater than any of the followings:
- The longest possible transaction in the workload expected to complete
successfully under normal operating conditions.
- The `sql.defaults.idle_in_transaction_session_timeout` cluster setting.
- The `sql.defaults.statement_timeout` cluster setting.
Note that the cluster setting `server.shutdown.drain_wait` is of different
Note that the cluster setting `server.shutdown.initial_wait` is of different
meaning of the `--drain-wait` flag under the `cockroach node drain` command.
The latter specifies the "_amount of time to wait for the node to drain
before returning to the client_", and it is usually set with several
Expand Down
8 changes: 4 additions & 4 deletions docs/generated/settings/settings-for-tenants.txt
Original file line number Diff line number Diff line change
Expand Up @@ -87,10 +87,10 @@ server.oidc_authentication.provider_url string sets OIDC provider URL ({provide
server.oidc_authentication.redirect_url string https://localhost:8080/oidc/v1/callback sets OIDC redirect URL via a URL string or a JSON string containing a required `redirect_urls` key with an object that maps from region keys to URL strings (URLs should point to your load balancer and must route to the path /oidc/v1/callback) application
server.oidc_authentication.scopes string openid sets OIDC scopes to include with authentication request (space delimited list of strings, required to start with `openid`) application
server.rangelog.ttl duration 720h0m0s if nonzero, entries in system.rangelog older than this duration are periodically purged application
server.shutdown.connection_wait duration 0s the maximum amount of time a server waits for all SQL connections to be closed before proceeding with a drain. (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting) application
server.shutdown.drain_wait duration 0s the amount of time a server waits in an unready state before proceeding with a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting. --drain-wait is to specify the duration of the whole draining process, while server.shutdown.drain_wait is to set the wait time for health probes to notice that the node is not ready.) application
server.shutdown.jobs_wait duration 10s the maximum amount of time a server waits for all currently executing jobs to notice drain request and to perform orderly shutdown application
server.shutdown.query_wait duration 10s the timeout for waiting for active queries to finish during a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting) application
server.shutdown.connections.timeout duration 0s the maximum amount of time a server waits for all SQL connections to be closed before proceeding with a drain. (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting) application
server.shutdown.initial_wait duration 0s the amount of time a server waits in an unready state before proceeding with a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting. --drain-wait is to specify the duration of the whole draining process, while server.shutdown.initial_wait is to set the wait time for health probes to notice that the node is not ready.) application
server.shutdown.jobs.timeout duration 10s the maximum amount of time a server waits for all currently executing jobs to notice drain request and to perform orderly shutdown application
server.shutdown.transactions.timeout duration 10s the timeout for waiting for active transactions to finish during a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting) application
server.time_until_store_dead duration 5m0s the time after which if there is no new gossiped information about a store, it is considered dead application
server.user_login.cert_password_method.auto_scram_promotion.enabled boolean true whether to automatically promote cert-password authentication to use SCRAM application
server.user_login.downgrade_scram_stored_passwords_to_bcrypt.enabled boolean true if server.user_login.password_encryption=crdb-bcrypt, this controls whether to automatically re-encode stored passwords using scram-sha-256 to crdb-bcrypt application
Expand Down
10 changes: 5 additions & 5 deletions docs/generated/settings/settings.html
Original file line number Diff line number Diff line change
Expand Up @@ -112,11 +112,11 @@
<tr><td><div id="setting-server-oidc-authentication-redirect-url" class="anchored"><code>server.oidc_authentication.redirect_url</code></div></td><td>string</td><td><code>https://localhost:8080/oidc/v1/callback</code></td><td>sets OIDC redirect URL via a URL string or a JSON string containing a required `redirect_urls` key with an object that maps from region keys to URL strings (URLs should point to your load balancer and must route to the path /oidc/v1/callback)</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-server-oidc-authentication-scopes" class="anchored"><code>server.oidc_authentication.scopes</code></div></td><td>string</td><td><code>openid</code></td><td>sets OIDC scopes to include with authentication request (space delimited list of strings, required to start with `openid`)</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-server-rangelog-ttl" class="anchored"><code>server.rangelog.ttl</code></div></td><td>duration</td><td><code>720h0m0s</code></td><td>if nonzero, entries in system.rangelog older than this duration are periodically purged</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-server-shutdown-connection-wait" class="anchored"><code>server.shutdown.connection_wait</code></div></td><td>duration</td><td><code>0s</code></td><td>the maximum amount of time a server waits for all SQL connections to be closed before proceeding with a drain. (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-server-shutdown-drain-wait" class="anchored"><code>server.shutdown.drain_wait</code></div></td><td>duration</td><td><code>0s</code></td><td>the amount of time a server waits in an unready state before proceeding with a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting. --drain-wait is to specify the duration of the whole draining process, while server.shutdown.drain_wait is to set the wait time for health probes to notice that the node is not ready.)</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-server-shutdown-jobs-wait" class="anchored"><code>server.shutdown.jobs_wait</code></div></td><td>duration</td><td><code>10s</code></td><td>the maximum amount of time a server waits for all currently executing jobs to notice drain request and to perform orderly shutdown</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-server-shutdown-lease-transfer-wait" class="anchored"><code>server.shutdown.lease_transfer_wait</code></div></td><td>duration</td><td><code>5s</code></td><td>the timeout for a single iteration of the range lease transfer phase of draining (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)</td><td>Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-server-shutdown-query-wait" class="anchored"><code>server.shutdown.query_wait</code></div></td><td>duration</td><td><code>10s</code></td><td>the timeout for waiting for active queries to finish during a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-server-shutdown-connection-wait" class="anchored"><code>server.shutdown.connections.timeout</code></div></td><td>duration</td><td><code>0s</code></td><td>the maximum amount of time a server waits for all SQL connections to be closed before proceeding with a drain. (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-server-shutdown-drain-wait" class="anchored"><code>server.shutdown.initial_wait</code></div></td><td>duration</td><td><code>0s</code></td><td>the amount of time a server waits in an unready state before proceeding with a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting. --drain-wait is to specify the duration of the whole draining process, while server.shutdown.initial_wait is to set the wait time for health probes to notice that the node is not ready.)</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-server-shutdown-jobs-wait" class="anchored"><code>server.shutdown.jobs.timeout</code></div></td><td>duration</td><td><code>10s</code></td><td>the maximum amount of time a server waits for all currently executing jobs to notice drain request and to perform orderly shutdown</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-server-shutdown-lease-transfer-wait" class="anchored"><code>server.shutdown.lease_transfer_iteration.timeout</code></div></td><td>duration</td><td><code>5s</code></td><td>the timeout for a single iteration of the range lease transfer phase of draining (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)</td><td>Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-server-shutdown-query-wait" class="anchored"><code>server.shutdown.transactions.timeout</code></div></td><td>duration</td><td><code>10s</code></td><td>the timeout for waiting for active transactions to finish during a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-server-time-until-store-dead" class="anchored"><code>server.time_until_store_dead</code></div></td><td>duration</td><td><code>5m0s</code></td><td>the time after which if there is no new gossiped information about a store, it is considered dead</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-server-user-login-cert-password-method-auto-scram-promotion-enabled" class="anchored"><code>server.user_login.cert_password_method.auto_scram_promotion.enabled</code></div></td><td>boolean</td><td><code>true</code></td><td>whether to automatically promote cert-password authentication to use SCRAM</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-server-user-login-downgrade-scram-stored-passwords-to-bcrypt-enabled" class="anchored"><code>server.user_login.downgrade_scram_stored_passwords_to_bcrypt.enabled</code></div></td><td>boolean</td><td><code>true</code></td><td>if server.user_login.password_encryption=crdb-bcrypt, this controls whether to automatically re-encode stored passwords using scram-sha-256 to crdb-bcrypt</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
Expand Down
2 changes: 1 addition & 1 deletion pkg/ccl/changefeedccl/changefeed_processors.go
Original file line number Diff line number Diff line change
Expand Up @@ -581,7 +581,7 @@ var aggregatorHeartbeatFrequency = settings.RegisterDurationSetting(
settings.ApplicationLevel,
"changefeed.aggregator.heartbeat",
"changefeed aggregator will emit a heartbeat message to the coordinator with this frequency; 0 disables. "+
"The setting value should be <=1/2 of server.shutdown.jobs_wait period",
"The setting value should be <=1/2 of server.shutdown.jobs.timeout period",
4*time.Second,
settings.NonNegativeDuration,
)
Expand Down
2 changes: 1 addition & 1 deletion pkg/ccl/serverccl/server_controller_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -474,7 +474,7 @@ func TestServerStartStop(t *testing.T) {

// Don't wait for graceful jobs shutdown in this test since
// we want to make sure test completes reasonably quickly.
_, err = db2.Exec("SET CLUSTER SETTING server.shutdown.jobs_wait='0s'")
_, err = db2.Exec("SET CLUSTER SETTING server.shutdown.jobs.timeout='0s'")
require.NoError(t, err)

return nil
Expand Down
16 changes: 9 additions & 7 deletions pkg/cli/rpc_node_shutdown.go
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,8 @@ import (
"math"
"time"

"github.com/cockroachdb/cockroach/pkg/kv/kvserver"
"github.com/cockroachdb/cockroach/pkg/server"
"github.com/cockroachdb/cockroach/pkg/server/serverpb"
"github.com/cockroachdb/cockroach/pkg/util/grpcutil"
"github.com/cockroachdb/cockroach/pkg/util/log"
Expand Down Expand Up @@ -66,11 +68,11 @@ func doDrain(
if err := timeutil.RunWithTimeout(ctx, "get-drain-settings", 5*time.Second, func(ctx context.Context) error {
shutdownSettings, err := c.Settings(ctx, &serverpb.SettingsRequest{
Keys: []string{
"server.shutdown.drain_wait",
"server.shutdown.connection_wait",
"server.shutdown.jobs_wait",
"server.shutdown.query_wait",
"server.shutdown.lease_transfer_wait",
string(server.DrainWait.InternalKey()),
string(server.ConnectionShutdownTimeout.InternalKey()),
string(server.JobShutdownTimeout.InternalKey()),
string(server.QueryShutdownTimeout.InternalKey()),
string(kvserver.LeaseTransferPerIterationTimeout.InternalKey()),
},
UnredactedValues: true,
})
Expand All @@ -85,7 +87,7 @@ func doDrain(
}
minWait += wait
// query_wait is used twice during draining, so count it twice here.
if k == "server.shutdown.query_wait" {
if k == string(server.QueryShutdownTimeout.InternalKey()) {
minWait += wait
}
}
Expand All @@ -107,7 +109,7 @@ func doDrain(
if errors.HasType(err, (*timeutil.TimeoutError)(nil)) || grpcutil.IsTimeout(err) {
log.Infof(ctx, "drain timed out: %v", err)
err = errors.New("drain timeout, consider adjusting --drain-wait, especially under " +
"custom server.shutdown.{drain,query,connection,lease_transfer}_wait cluster settings")
"custom server.shutdown cluster settings")
}
return
}
Expand Down
Loading

0 comments on commit 044867a

Please sign in to comment.