115833: server: make SQL connection keep alive easily configurable r=fqazi a=fqazi

Previously, the only way to control the keep-alive time for CRDB was through the environment variable COCKROACH_SQL_TCP_KEEP_ALIVE, which set only the keep-alive idle and probe times. This was difficult to use, and the default meant that CRDB kept idle connections alive for 10 minutes. To make this more configurable, we introduce two cluster settings, server.sql_tcp_keep_alive.interval and server.sql_tcp_keep_alive.count (see the example after this description): the interval controls both the idle time before probing starts and the time between probes, and the count controls how many unanswered probes are sent before the connection is dropped.

Fixes: #115422
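
As a hedged illustration (the connection URL and chosen values are assumptions, not part of this commit), the new settings can be tuned from any SQL session; here via Go's database/sql with lib/pq, which this repository already uses in its roachtests:

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // register postgres driver
)

func main() {
	// Hypothetical connection URL; adjust for the target deployment.
	db, err := sql.Open("postgres", "postgres://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// Probe after 5s of idle time and every 5s thereafter; drop the
	// connection after 3 unanswered probes (~15s of unresponsiveness).
	for _, stmt := range []string{
		`SET CLUSTER SETTING server.sql_tcp_keep_alive.interval = '5s'`,
		`SET CLUSTER SETTING server.sql_tcp_keep_alive.count = 3`,
	} {
		if _, err := db.Exec(stmt); err != nil {
			log.Fatal(err)
		}
	}
}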

117560: kv: use correct strength for {Get,Scan,ReverseScan} for SKIP LOCKED r=nvanbenschoten a=arulajmani

See individual commits. 
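
As the kvnemesis testdata updates below show, shared locks combined with the SKIP LOCKED wait policy previously returned an "unsupported" error and now succeed. A minimal sketch of the combination from a client's perspective (the cluster URL and table t(k INT PRIMARY KEY, v STRING) are hypothetical):

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // register postgres driver
)

func main() {
	// Hypothetical local cluster and table.
	db, err := sql.Open("postgres", "postgres://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// FOR SHARE acquires shared locks; SKIP LOCKED skips over rows that are
	// already locked by other transactions instead of blocking on them.
	rows, err := db.Query(`SELECT k, v FROM t FOR SHARE SKIP LOCKED`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var k int
		var v string
		if err := rows.Scan(&k, &v); err != nil {
			log.Fatal(err)
		}
		fmt.Println(k, v)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}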

117654: build: add TestMetaTwoInstance nightly run scripts r=RaduBerinde a=RaduBerinde

We also change the race build to run both flavors.

Fixes: cockroachdb/pebble#3175
Release note: None

117834: parser: fix string repr of ALTER ... START REPLICATION r=dt a=dt

Release note: none.
Epic: none.

Co-authored-by: Faizan Qazi <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: David Taylor <[email protected]>
6 people committed Jan 17, 2024
5 parents 07f8311 + af23e06 + ed0ccb4 + bd500e8 + 067a73a commit b743201
Showing 40 changed files with 2,114 additions and 337 deletions.
@@ -19,7 +19,7 @@ exit_status=0
$BAZEL_BIN/pkg/cmd/bazci/bazci_/bazci --formatter=pebble-metamorphic -- test --config=race --config=ci \
@com_github_cockroachdb_pebble//internal/metamorphic:metamorphic_test \
--test_env TC_SERVER_URL=$TC_SERVER_URL \
--test_timeout=14400 '--test_filter=TestMeta$' \
--test_timeout=14400 \
--test_sharding_strategy=disabled \
--define gotags=bazel,invariants \
--run_under "@com_github_cockroachdb_stress//:stress -bazel -shardable-artifacts 'XML_OUTPUT_FILE=$BAZEL_BIN/pkg/cmd/bazci/bazci_/bazci merge-test-xmls' -maxtime 3h -maxfails 1 -timeout 30m -stderr -p 1" \
@@ -0,0 +1,32 @@
#!/usr/bin/env bash
#
# This script is run by the Pebble Nightly Metamorphic Two Instance - TeamCity
# build configuration.

set -euo pipefail

dir="$(dirname $(dirname $(dirname $(dirname "${0}"))))"

source "$dir/teamcity-support.sh" # For $root
source "$dir/teamcity-bazel-support.sh" # For run_bazel

mkdir -p artifacts

# Pull in the latest version of Pebble from upstream. The benchmarks run
# against the tip of the 'master' branch. We do this by `go get`ting the
# latest version of the module, and then running `mirror` to update `DEPS.bzl`
# accordingly.
bazel run @go_sdk//:bin/go get github.com/cockroachdb/pebble@master
# Just dump the diff to see what, if anything, has changed.
git diff
NEW_DEPS_BZL_CONTENT=$(bazel run //pkg/cmd/mirror/go:mirror)
echo "$NEW_DEPS_BZL_CONTENT" > DEPS.bzl

# Use the Pebble SHA from the version in the modified go.mod file.
# Note that we need to pluck the Git SHA from the go.sum-style version, i.e.
# v0.0.0-20220214174839-6af77d5598c9 => 6af77d5598c9
PEBBLE_SHA=$(grep 'github\.com/cockroachdb/pebble' go.mod | cut -d'-' -f3)
echo "Pebble module Git SHA: $PEBBLE_SHA"

BAZEL_SUPPORT_EXTRA_DOCKER_ARGS="-e BUILD_VCS_NUMBER=$PEBBLE_SHA -e GITHUB_API_TOKEN -e GITHUB_REPO -e TC_BUILDTYPE_ID -e TC_BUILD_BRANCH -e TC_BUILD_ID -e TC_SERVER_URL" \
run_bazel build/teamcity/cockroach/nightlies/pebble_nightly_metamorphic_two_instance_impl.sh
@@ -0,0 +1,30 @@
#!/usr/bin/env bash

dir="$(dirname $(dirname $(dirname $(dirname "${0}"))))"

set -euxo pipefail
ARTIFACTS_DIR=/artifacts/meta
mkdir -p $ARTIFACTS_DIR

echo "TC_SERVER_URL is $TC_SERVER_URL"

bazel build //pkg/cmd/bazci --config=ci

BAZEL_BIN=$(bazel info bazel-bin --config ci)

exit_status=0
# NB: If adjusting the metamorphic test flags below, be sure to also update
# pkg/cmd/github-post/main.go to ensure the GitHub issue poster includes the
# correct flags in the reproduction command.
$BAZEL_BIN/pkg/cmd/bazci/bazci_/bazci --formatter=pebble-metamorphic -- test --config=ci \
@com_github_cockroachdb_pebble//internal/metamorphic:metamorphic_test \
--test_env TC_SERVER_URL=$TC_SERVER_URL \
--test_timeout=25200 '--test_filter=TestMetaTwoInstance$' \
--define gotags=bazel,invariants \
--run_under "@com_github_cockroachdb_stress//:stress -bazel -shardable-artifacts 'XML_OUTPUT_FILE=$BAZEL_BIN/pkg/cmd/bazci/bazci_/bazci merge-test-xmls' -maxtime 3h -maxfails 1 -timeout 20m -stderr -p 1" \
--test_arg -dir --test_arg $ARTIFACTS_DIR \
--test_arg -ops --test_arg "uniform:5000-10000" \
--test_output streamed \
|| exit_status=$?

exit $exit_status
2 changes: 2 additions & 0 deletions docs/generated/settings/settings-for-tenants.txt
@@ -95,6 +95,8 @@ server.shutdown.connections.timeout duration 0s the maximum amount of time a ser
server.shutdown.initial_wait duration 0s the amount of time a server waits in an unready state before proceeding with a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting. --drain-wait is to specify the duration of the whole draining process, while server.shutdown.initial_wait is to set the wait time for health probes to notice that the node is not ready.) application
server.shutdown.jobs.timeout duration 10s the maximum amount of time a server waits for all currently executing jobs to notice drain request and to perform orderly shutdown application
server.shutdown.transactions.timeout duration 10s the timeout for waiting for active transactions to finish during a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting) application
server.sql_tcp_keep_alive.count integer 3 maximum number of probes that will be sent out before a connection is dropped because it's unresponsive (Linux and Darwin only) application
server.sql_tcp_keep_alive.interval duration 10s time between keep alive probes and idle time before probes are sent out application
server.time_until_store_dead duration 5m0s the time after which if there is no new gossiped information about a store, it is considered dead application
server.user_login.cert_password_method.auto_scram_promotion.enabled boolean true whether to automatically promote cert-password authentication to use SCRAM application
server.user_login.downgrade_scram_stored_passwords_to_bcrypt.enabled boolean true if server.user_login.password_encryption=crdb-bcrypt, this controls whether to automatically re-encode stored passwords using scram-sha-256 to crdb-bcrypt application
2 changes: 2 additions & 0 deletions docs/generated/settings/settings.html
@@ -124,6 +124,8 @@
<tr><td><div id="setting-server-shutdown-jobs-wait" class="anchored"><code>server.shutdown.jobs.timeout</code></div></td><td>duration</td><td><code>10s</code></td><td>the maximum amount of time a server waits for all currently executing jobs to notice drain request and to perform orderly shutdown</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-server-shutdown-lease-transfer-wait" class="anchored"><code>server.shutdown.lease_transfer_iteration.timeout</code></div></td><td>duration</td><td><code>5s</code></td><td>the timeout for a single iteration of the range lease transfer phase of draining (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)</td><td>Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-server-shutdown-query-wait" class="anchored"><code>server.shutdown.transactions.timeout</code></div></td><td>duration</td><td><code>10s</code></td><td>the timeout for waiting for active transactions to finish during a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-server-sql-tcp-keep-alive-count" class="anchored"><code>server.sql_tcp_keep_alive.count</code></div></td><td>integer</td><td><code>3</code></td><td>maximum number of probes that will be sent out before a connection is dropped because it&#39;s unresponsive (Linux and Darwin only)</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-server-sql-tcp-keep-alive-interval" class="anchored"><code>server.sql_tcp_keep_alive.interval</code></div></td><td>duration</td><td><code>10s</code></td><td>time between keep alive probes and idle time before probes are sent out</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-server-time-until-store-dead" class="anchored"><code>server.time_until_store_dead</code></div></td><td>duration</td><td><code>5m0s</code></td><td>the time after which if there is no new gossiped information about a store, it is considered dead</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-server-user-login-cert-password-method-auto-scram-promotion-enabled" class="anchored"><code>server.user_login.cert_password_method.auto_scram_promotion.enabled</code></div></td><td>boolean</td><td><code>true</code></td><td>whether to automatically promote cert-password authentication to use SCRAM</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
<tr><td><div id="setting-server-user-login-downgrade-scram-stored-passwords-to-bcrypt-enabled" class="anchored"><code>server.user_login.downgrade_scram_stored_passwords_to_bcrypt.enabled</code></div></td><td>boolean</td><td><code>true</code></td><td>if server.user_login.password_encryption=crdb-bcrypt, this controls whether to automatically re-encode stored passwords using scram-sha-256 to crdb-bcrypt</td><td>Serverless/Dedicated/Self-Hosted</td></tr>
4 changes: 2 additions & 2 deletions pkg/cmd/bazci/githubpost/githubpost.go
@@ -704,8 +704,8 @@ func formatPebbleMetamorphicIssue(
s := f.testMessage[i+len(seedHeader):]
s = strings.TrimSpace(s)
s = strings.TrimSpace(s[:strings.Index(s, "\n")])
repro = fmt.Sprintf("go test -tags 'invariants' -exec 'stress -p 1' "+
`-timeout 0 -test.v -run TestMeta$ ./internal/metamorphic -seed %s -ops "uniform:5000-10000"`, s)
repro = fmt.Sprintf(`go test -tags 'invariants' -exec 'stress -p 1' `+
`-timeout 0 -test.v -run '%s$' ./internal/metamorphic -seed %s -ops "uniform:5000-10000"`, f.testName, s)
}
}
return issues.UnitTestFormatter, issues.PostRequest{
2 changes: 1 addition & 1 deletion pkg/cmd/bazci/githubpost/githubpost_test.go
@@ -334,7 +334,7 @@ TestXXA - 1.00s
testName: "TestMeta",
title: "internal/metamorphic: TestMeta failed",
message: "panic: induced panic",
expRepro: `go test -tags 'invariants' -exec 'stress -p 1' -timeout 0 -test.v -run TestMeta$ ./internal/metamorphic -seed 1600209371838097000 -ops "uniform:5000-10000"`,
expRepro: `go test -tags 'invariants' -exec 'stress -p 1' -timeout 0 -test.v -run 'TestMeta$' ./internal/metamorphic -seed 1600209371838097000 -ops "uniform:5000-10000"`,
labels: []string{"metamorphic-failure", "C-test-failure", "release-blocker"},
},
},
108 changes: 107 additions & 1 deletion pkg/cmd/roachtest/tests/network.go
@@ -12,7 +12,6 @@ package tests

import (
"context"
"errors"
"fmt"
"os"
"path/filepath"
@@ -26,7 +25,9 @@ import (
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test"
"github.com/cockroachdb/cockroach/pkg/roachprod/install"
"github.com/cockroachdb/cockroach/pkg/testutils"
"github.com/cockroachdb/cockroach/pkg/util/ctxgroup"
"github.com/cockroachdb/cockroach/pkg/util/timeutil"
errors "github.com/cockroachdb/errors"
_ "github.com/lib/pq" // register postgres driver
"github.com/stretchr/testify/require"
)
@@ -297,6 +298,101 @@ sudo iptables-save
m.Wait()
}

// runClientNetworkConnectionTimeout simulates a scenario where the client and
// server lose connectivity while a connection is idle. The purpose of this
// test is to confirm that the keep-alive settings are enforced.
func runClientNetworkConnectionTimeout(ctx context.Context, t test.Test, c cluster.Cluster) {
n := c.Spec().NodeCount
serverNodes, clientNode := c.Range(1, n-1), c.Nodes(n)
settings := install.MakeClusterSettings(install.SecureOption(true))
c.Start(ctx, t.L(), option.DefaultStartOpts(), settings, serverNodes)
certsDir := "/home/ubuntu/certs"
t.L().Printf("connecting to cluster from roachtest...")
db, err := c.ConnE(ctx, t.L(), 1)
require.NoError(t, err)
defer db.Close()

grp := ctxgroup.WithContext(ctx)
// Start a connection from the client node, which will run a
// long transaction (i.e. just the sleep builtin).
var runOutput install.RunResultDetails
grp.GoCtx(func(ctx context.Context) error {
ips, err := c.ExternalIP(ctx, t.L(), c.Node(1))
if err != nil {
return err
}
commandThatWillDisconnect := fmt.Sprintf(`./cockroach sql --certs-dir %s --url "postgres://root@%s:26257" -e "SELECT pg_sleep(600)"`, certsDir, ips[0])
t.L().Printf("Executing long running query: %s", commandThatWillDisconnect)
output, err := c.RunWithDetails(ctx, t.L(), clientNode, commandThatWillDisconnect)
runOutput = output[0]
return err
})
// Confirm that the connection was started.
testutils.SucceedsSoon(t, func() error {
row := db.QueryRow("SELECT count(*) FROM [SHOW CLUSTER SESSIONS] WHERE active_queries='SELECT pg_sleep(600)'")
var count int
if err := row.Scan(&count); err != nil {
return err
}
// Wait for the query to start up.
if count != 1 {
return errors.AssertionFailedf("unexepcted count :%v", count)
}
return nil
})

const netConfigCmd = `
# ensure any failure fails the entire script.
set -e;
# Setting default filter policy
sudo iptables -P INPUT ACCEPT;
sudo iptables -P OUTPUT ACCEPT;
# Drop any client traffic to CRDB.
sudo iptables -A INPUT -p tcp --sport 26257 -j DROP;
sudo iptables -A OUTPUT -p tcp --dport 26257 -j DROP;
`
t.L().Printf("blocking networking on client; config cmd:\n%s", netConfigCmd)
blockStartTime := timeutil.Now()
require.NoError(t, c.RunE(ctx, option.WithNodes(clientNode), netConfigCmd))

// (attempt to) restore iptables when the test ends, so that the client
// can be investigated afterward.
defer func() {
const restoreNet = `
set -e;
sudo iptables -F INPUT;
sudo iptables -F OUTPUT;
`
t.L().Printf("restoring iptables; config cmd:\n%s", restoreNet)
require.NoError(t, c.RunE(ctx, option.WithNodes(clientNode), restoreNet))
}()

// We expect the connection to time out within 30 seconds based on
// the default settings (a 10s probe interval times three probes). We
// will wait for up to 1 minute for the connection to drop.
testutils.SucceedsWithin(t, func() error {
row := db.QueryRow("SELECT count(*) FROM [SHOW CLUSTER SESSIONS] WHERE active_queries='SELECT pg_sleep(600)'")
var count int
if err := row.Scan(&count); err != nil {
return err
}
if count != 0 {
return errors.AssertionFailedf("unexepcted count :%d", count)
}
return nil
},
time.Minute)
// Confirm it took at least 30 seconds for the connection to clear out.
require.Greaterf(t, timeutil.Since(blockStartTime), time.Second*30, "connection dropped earlier than expected")
t.L().Printf("Connection was dropped after %s", timeutil.Since(blockStartTime))
// We expect the connection to be dropped with the lower keep-alive settings.
require.NoError(t, grp.Wait())
require.Contains(t, runOutput.Stderr, "If the server is running, check --host client-side and --advertise server-side",
"Did not detect connection failure %s %d", runOutput.Stderr, runOutput.RemoteExitStatus)
}

func registerNetwork(r registry.Registry) {
const numNodes = 4
r.Add(registry.TestSpec{
@@ -310,4 +406,14 @@
runNetworkAuthentication(ctx, t, c)
},
})

r.Add(registry.TestSpec{
Name: "network/client-connection-timeout",
Owner: registry.OwnerSQLFoundations,
Cluster: r.MakeClusterSpec(2), // One server and client
CompatibleClouds: registry.AllExceptAWS,
Suites: registry.Suites(registry.Nightly),
Leases: registry.MetamorphicLeases,
Run: runClientNetworkConnectionTimeout,
})
}
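
As background for the keep-alive PR above, here is a minimal sketch — not CRDB's actual implementation — of how an interval-style setting maps onto an accepted connection with Go's standard library. Note that the probe count has no knob at this level and typically requires platform-specific socket options, consistent with the "Linux and Darwin only" note in the setting description above.

package keepalive

import (
	"net"
	"time"
)

// ConfigureKeepAlive enables TCP keep-alive on conn. Go's SetKeepAlivePeriod
// sets both the idle time before probing starts and the interval between
// probes. Non-TCP connections (e.g. unix sockets) are left untouched.
func ConfigureKeepAlive(conn net.Conn, interval time.Duration) error {
	tcpConn, ok := conn.(*net.TCPConn)
	if !ok {
		return nil
	}
	if err := tcpConn.SetKeepAlive(true); err != nil {
		return err
	}
	return tcpConn.SetKeepAlivePeriod(interval)
}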
@@ -1,3 +1,3 @@
echo
----
db0.GetForShareSkipLocked(ctx, tk(1)) // usage of shared locks in conjunction with skip locked wait policy is currently unsupported
db0.GetForShareSkipLocked(ctx, tk(1)) // @<ts> (v1, <nil>)
@@ -1,3 +1,3 @@
echo
----
db0.GetForShareSkipLockedGuaranteedDurability(ctx, tk(1)) // usage of shared locks in conjunction with skip locked wait policy is currently unsupported
db0.GetForShareSkipLockedGuaranteedDurability(ctx, tk(1)) // @<ts> (v1, <nil>)
@@ -1,3 +1,3 @@
echo
----
db0.ReverseScanForShareSkipLocked(ctx, tk(1), tk(2), 0) // usage of shared locks in conjunction with skip locked wait policy is currently unsupported
db0.ReverseScanForShareSkipLocked(ctx, tk(1), tk(2), 0) // @<ts> (/Table/100/"0000000000000001":v21, <nil>)
@@ -1,3 +1,3 @@
echo
----
db0.ReverseScanForShareSkipLockedGuaranteedDurability(ctx, tk(1), tk(2), 0) // usage of shared locks in conjunction with skip locked wait policy is currently unsupported
db0.ReverseScanForShareSkipLockedGuaranteedDurability(ctx, tk(1), tk(2), 0) // @<ts> (/Table/100/"0000000000000001":v21, <nil>)
@@ -1,3 +1,3 @@
echo
----
db0.ScanForShareSkipLocked(ctx, tk(1), tk(3), 0) // usage of shared locks in conjunction with skip locked wait policy is currently unsupported
db0.ScanForShareSkipLocked(ctx, tk(1), tk(3), 0) // @<ts> (/Table/100/"0000000000000001":v1, <nil>)
@@ -1,3 +1,3 @@
echo
----
db0.ScanForShareSkipLockedGuaranteedDurability(ctx, tk(1), tk(3), 0) // usage of shared locks in conjunction with skip locked wait policy is currently unsupported
db0.ScanForShareSkipLockedGuaranteedDurability(ctx, tk(1), tk(3), 0) // @<ts> (/Table/100/"0000000000000001":v1, <nil>)
4 changes: 2 additions & 2 deletions pkg/kv/kvserver/batcheval/BUILD.bazel
@@ -46,7 +46,7 @@ go_library(
"command.go",
"declare.go",
"eval_context.go",
"intent.go",
"lock.go",
"ranges.go",
"split_stats_helper.go",
"stateloader.go",
@@ -121,8 +121,8 @@ go_test(
"cmd_scan_test.go",
"cmd_truncate_log_test.go",
"declare_test.go",
"intent_test.go",
"knobs_use_range_tombstones_test.go",
"lock_test.go",
"main_test.go",
"ranges_test.go",
"testutils_test.go",
7 changes: 4 additions & 3 deletions pkg/kv/kvserver/batcheval/cmd_get.go
@@ -33,8 +33,9 @@ func Get(
h := cArgs.Header
reply := resp.(*kvpb.GetResponse)

if err := maybeDisallowSkipLockedRequest(h, args.KeyLockingStrength); err != nil {
return result.Result{}, err
var lockTableForSkipLocked storage.LockTableView
if h.WaitPolicy == lock.WaitPolicy_SkipLocked {
lockTableForSkipLocked = newRequestBoundLockTableView(cArgs.Concurrency, args.KeyLockingStrength)
}

getRes, err := storage.MVCCGet(ctx, readWriter, args.Key, h.Timestamp, storage.MVCCGetOptions{
@@ -45,7 +46,7 @@
ScanStats: cArgs.ScanStats,
Uncertainty: cArgs.Uncertainty,
MemoryAccount: cArgs.EvalCtx.GetResponseMemoryAccount(),
LockTable: cArgs.Concurrency,
LockTable: lockTableForSkipLocked,
DontInterleaveIntents: cArgs.DontInterleaveIntents,
MaxKeys: cArgs.Header.MaxSpanRequestKeys,
TargetBytes: cArgs.Header.TargetBytes,
7 changes: 4 additions & 3 deletions pkg/kv/kvserver/batcheval/cmd_reverse_scan.go
@@ -35,8 +35,9 @@ func ReverseScan(
h := cArgs.Header
reply := resp.(*kvpb.ReverseScanResponse)

if err := maybeDisallowSkipLockedRequest(h, args.KeyLockingStrength); err != nil {
return result.Result{}, err
var lockTableForSkipLocked storage.LockTableView
if h.WaitPolicy == lock.WaitPolicy_SkipLocked {
lockTableForSkipLocked = newRequestBoundLockTableView(cArgs.Concurrency, args.KeyLockingStrength)
}

var res result.Result
@@ -58,7 +59,7 @@
FailOnMoreRecent: args.KeyLockingStrength != lock.None,
Reverse: true,
MemoryAccount: cArgs.EvalCtx.GetResponseMemoryAccount(),
LockTable: cArgs.Concurrency,
LockTable: lockTableForSkipLocked,
DontInterleaveIntents: cArgs.DontInterleaveIntents,
ReadCategory: readCategory,
}