72991: server,sql: implement connection_wait for graceful draining r=ZhouXing19 a=ZhouXing19

Currently, the draining process consists of three consecutive periods:

1. Server enters the "unready" state: The `/health?ready=1` HTTP endpoint starts reporting that the node is shutting down, but new SQL connections and new queries are still allowed. The server does a hard wait until the timeout expires. This phase's duration is set with the cluster setting `server.shutdown.drain_wait`.

2. Drain SQL connections: New SQL connections are not allowed. SQL connections with no queries in flight are closed by the server immediately. The remaining SQL connections are terminated by the server as soon as their queries finish. The server exits this phase early if all queries finish before the timeout. This phase's maximum duration is set with the cluster setting `server.shutdown.query_wait`.

3. Drain range leases: The server keeps retrying forever until all range leases on the draining node have been transferred. Each retry iteration's duration is specified by the cluster setting `server.shutdown.lease_transfer_wait`.

This commit reorganizes the draining process by adding a phase in which the server waits for SQL connections to be closed; if all SQL connections are closed before the timeout, the server proceeds to the next draining phase.

The newly proposed draining process is:

1. (unchanged) Server enters the "unready" state: The `/health?ready=1` HTTP endpoint starts reporting that the node is shutting down, but new SQL connections and new queries are still allowed. The server does a hard wait until the timeout expires. This phase's duration is set with the cluster setting `server.shutdown.drain_wait`.

2. (new phase) Wait for SQL connections to close: New SQL connections are no longer allowed. The server waits for the remaining SQL connections to close or for the timeout to expire. Once all SQL connections are closed, draining proceeds to the next phase. The maximum duration of this phase is determined by the cluster setting `server.shutdown.connection_wait`.

3. (unchanged) Drain SQL connections: New SQL connections are not allowed. SQL connections with no queries in flight are closed by the server immediately. The remaining SQL connections are terminated by the server as soon as their queries finish. The server exits this phase early if all queries finish before the timeout. This phase's maximum duration is set with the cluster setting `server.shutdown.query_wait`.

4. (unchanged) Drain range leases: The server keeps retrying forever until all range leases on the draining node have been transferred. Each retry iteration's duration is specified by the cluster setting `server.shutdown.lease_transfer_wait`.

The duration of the new phase ("Wait for SQL connections to close") is set in the same way as the other three draining phases (a rough sketch of the wait loop follows the example below):
```
SET CLUSTER SETTING server.shutdown.connection_wait = '40s'
```
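As a rough illustration — a minimal sketch, not the actual server code; `waitForSQLConnsToClose`, its polling interval, and the simulated connection counter are assumptions for the example — the new phase boils down to rejecting new SQL connections and then polling the open-connection count until it hits zero or `server.shutdown.connection_wait` expires:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// waitForSQLConnsToClose polls the open-connection count until it drops to
// zero (early exit) or the connection_wait timeout elapses. It returns true
// if every connection closed before the deadline.
func waitForSQLConnsToClose(openConnCount func() int64, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if openConnCount() == 0 {
			return true
		}
		time.Sleep(50 * time.Millisecond)
	}
	return false // remaining connections are handled by the query_wait phase
}

func main() {
	var open int64 = 3 // pretend three SQL connections are still open
	go func() {
		for atomic.LoadInt64(&open) > 0 {
			time.Sleep(100 * time.Millisecond)
			atomic.AddInt64(&open, -1) // clients disconnect one by one
		}
	}()
	ok := waitForSQLConnsToClose(func() int64 { return atomic.LoadInt64(&open) }, time.Second)
	fmt.Println("all connections closed before connection_wait expired:", ok)
}
```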

Resolves #66319

Release note (ops change): add `server.shutdown.connection_wait` to the
draining process configuration. This provides a workaround for customers who
encounter intermittent blips and failed requests while performing operations
that involve restarting nodes.

Release justification: Low risk, high benefit changes to existing functionality
(optimize the node draining process).

76430: [CRDB-9550] kv: adjust number of voters needed calculation when determining replication status r=Santamaura a=Santamaura

Currently, when a range has non-voting replicas and is queried through replication
stats, it is reported as underreplicated. This is because, when a zone is configured
to have non-voting replicas, the over/under-replicated counts compare the number of
current voters to the total number of replicas, which is erroneous. Instead, we now
compare the current number of voters to the total number of voters if it has been set,
and otherwise defer to the total number of replicas.
This patch ignores the desired non-voter count for the purposes of this report, for
better or worse. Resolves #69335.
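A minimal sketch of the adjusted check (the struct and field names below are illustrative, not the actual kvserver types):

```go
package main

import "fmt"

type zoneConfig struct {
	NumReplicas int32 // total desired replicas (voters + non-voters)
	NumVoters   int32 // desired voters; 0 means "not set"
}

// neededVoters is the target the report compares live voters against.
func neededVoters(zc zoneConfig) int32 {
	if zc.NumVoters > 0 {
		return zc.NumVoters
	}
	return zc.NumReplicas
}

func isUnderReplicated(currentVoters int32, zc zoneConfig) bool {
	return currentVoters < neededVoters(zc)
}

func main() {
	// Zone configured for 3 voters plus 2 non-voters; the range has 3 live voters.
	zc := zoneConfig{NumReplicas: 5, NumVoters: 3}
	fmt.Println(isUnderReplicated(3, zc)) // false with the fix; the old check compared 3 < 5
}
```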

Release justification: low risk bug fix

Release note (bug fix): use the total number of voters, if set, when determining
replication status.

Before change:
![Screen Shot 2022-02-11 at 10 03 57 AM](https://user-images.githubusercontent.com/17861665/153615571-85163409-5bac-40f4-9669-20dce77185cf.png)

After change:
![Screen Shot 2022-02-11 at 9 53 04 AM](https://user-images.githubusercontent.com/17861665/153615316-785b156b-bd23-4cfa-a76d-7c9fa47fbf1e.png)

77315: backupccl: backup correctly tries reading from base directory if latest/checkpoint files aren't found r=DarrylWong a=DarrylWong

Previously, we only tried reading from the base directory if we caught an ErrFileDoesNotExist error. However,
this does not account for the error returned when the progress/latest directories don't exist.
This change makes the code correctly retry reading from the base directory in that case as well.

We also put the latest directory inside a metadata directory, to avoid potential
conflicts between a latest file and a latest directory living in the same base directory.

This change also wraps the errors in findLatestFile and readLatestCheckpointFile for more clarity when
reads from both the base and the latest/progress directories fail.
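For illustration only — the real change is in `findLatestFile` and `readLatestCheckpointFile` (see the diff further down); this standalone sketch shows the fall-back-and-wrap pattern with a hypothetical `readFile` helper and illustrative paths:

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// readLatest tries the metadata/latest directory first and falls back to the
// base directory on any error, wrapping the result so both failures are visible.
func readLatest(readFile func(path string) ([]byte, error)) ([]byte, error) {
	b, err := readFile("metadata/latest/LATEST")
	if err == nil {
		return b, nil
	}
	// Fall back regardless of which error occurred: a missing file and a
	// missing directory both end up here.
	b, baseErr := readFile("LATEST")
	if baseErr != nil {
		return nil, fmt.Errorf("LATEST file could not be read in base or metadata directory: %w",
			errors.Join(err, baseErr))
	}
	return b, nil
}

func main() {
	// Simulate storage where neither location exists.
	_, err := readLatest(func(string) ([]byte, error) { return nil, os.ErrNotExist })
	fmt.Println(err)
}
```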

Fixes #77312

Release justification: Low risk bug fix
Release note: none

77406: backupccl: test ignore ProtectionPolicy for exclude_data_from_backup r=dt a=adityamaru

This change adds an end-to-end test to ensure that a table excluded
from backup does not hold up GC on its replica, even in the presence
of a protected timestamp record covering the replica.

From a user's point of view, this allows them to mark a table whose
row data will be excluded from backup and to set that table's gc.ttl
to a very low value. Backups that write PTS records will no longer
hold up GC on such low-GC-TTL tables.
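From the SQL side, the flow described above looks roughly like the sketch below, which reuses the statements exercised by the test; the driver, connection string, table name, and TTL value are assumptions for the example:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // assumed Postgres-wire driver; any pgwire driver works
)

func main() {
	// Connection string is illustrative; adjust for your cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	stmts := []string{
		`CREATE TABLE IF NOT EXISTS scratch (k INT PRIMARY KEY, v BYTES)`,
		// Row data in scratch is excluded from backups...
		`ALTER TABLE scratch SET (exclude_data_from_backup = true)`,
		// ...so an aggressive GC TTL is safe even while backups hold PTS records.
		`ALTER TABLE scratch CONFIGURE ZONE USING gc.ttlseconds = 60`,
	}
	for _, s := range stmts {
		if _, err := db.Exec(s); err != nil {
			log.Fatal(err)
		}
	}
}
```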

Fixes: #73536

Release note: None

Release justification: low risk update to new functionality

77450: ui: add selected period as part of cached key r=maryliag a=maryliag

Previously, the fingerprint ID and the app name were used
as the key for the statement details cache. This commit adds
the start and end times (when they exist) to the key, so
the details are correctly assigned to the selected period.

This commit also rounds the selected period to the hour,
since that is what the persisted statistics use: the start value
is truncated to the hour, and the end value is truncated to the
hour plus one hour. For example:
start: 17:45:23  ->  17:00:00
end:   20:14:32  ->  21:00:00
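A small sketch of that rounding rule (the actual change lives in the TypeScript UI cache; the Go below merely restates the rule and the key construction with illustrative names):

```go
package main

import (
	"fmt"
	"time"
)

// roundPeriod truncates the start down to the hour and rounds the end up by
// truncating to the hour and adding one hour, matching the example above.
func roundPeriod(start, end time.Time) (time.Time, time.Time) {
	return start.Truncate(time.Hour), end.Truncate(time.Hour).Add(time.Hour)
}

// cacheKey combines fingerprint ID, app name, and the rounded period.
func cacheKey(fingerprintID, appName string, start, end time.Time) string {
	s, e := roundPeriod(start, end)
	return fmt.Sprintf("%s/%s/%d/%d", fingerprintID, appName, s.Unix(), e.Unix())
}

func main() {
	start := time.Date(2022, 3, 10, 17, 45, 23, 0, time.UTC)
	end := time.Date(2022, 3, 10, 20, 14, 32, 0, time.UTC)
	s, e := roundPeriod(start, end)
	fmt.Println(s.Format("15:04:05"), e.Format("15:04:05")) // 17:00:00 21:00:00
	fmt.Println(cacheKey("fp123", "myapp", start, end))
}
```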

Partially addresses #72129

Release note: None
Release Justification: Low risk, high benefit change

77597: kv: Add `created` column to `active_range_feeds` table. r=miretskiy a=miretskiy

Add a `created` column to the `active_range_feeds` table.
This column is initialized to the time when the partial range feed
was created. This allows us to determine, among other things,
whether the rangefeed is currently performing a catchup scan
(i.e. its resolved column is 0) and how long the scan has been running.
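A hedged usage sketch of the new column: the `created` and `resolved` columns come from the description above, while the other column names, the driver, and the connection string are assumptions for the example.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // assumed Postgres-wire driver
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// How long has each partial rangefeed been running, and is it still in a
	// catchup scan (resolved not yet advanced)?
	rows, err := db.Query(`
		SELECT range_id, now() - created AS running_for, resolved
		FROM crdb_internal.active_range_feeds`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var rangeID int64
		var runningFor, resolved string
		if err := rows.Scan(&rangeID, &runningFor, &resolved); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("r%d running for %s (resolved=%s)\n", rangeID, runningFor, resolved)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```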

Release Notes (enterprise): Add a `created` time column
to the `crdb_internal.active_range_feeds` virtual table to improve observability
and debuggability of the rangefeed system.

Fixes #77581

Release Justification: Low impact observability/debuggability improvement.

Co-authored-by: Jane Xing <[email protected]>
Co-authored-by: Santamaura <[email protected]>
Co-authored-by: Darryl <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Marylia Gutierrez <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
7 people committed Mar 10, 2022
7 parents 25fe0db + 345bb31 + e6a483c + c9e9855 + 3470def + a432d7f + 3657162 commit 8d046fc
Showing 41 changed files with 1,166 additions and 313 deletions.
3 changes: 2 additions & 1 deletion docs/generated/settings/settings-for-tenants.txt
@@ -65,7 +65,8 @@ server.oidc_authentication.provider_url string sets OIDC provider URL ({provide
server.oidc_authentication.redirect_url string https://localhost:8080/oidc/v1/callback sets OIDC redirect URL via a URL string or a JSON string containing a required `redirect_urls` key with an object that maps from region keys to URL strings (URLs should point to your load balancer and must route to the path /oidc/v1/callback)
server.oidc_authentication.scopes string openid sets OIDC scopes to include with authentication request (space delimited list of strings, required to start with `openid`)
server.rangelog.ttl duration 720h0m0s if nonzero, range log entries older than this duration are deleted every 10m0s. Should not be lowered below 24 hours.
server.shutdown.drain_wait duration 0s the amount of time a server waits in an unready state before proceeding with a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)
server.shutdown.connection_wait duration 0s the maximum amount of time a server waits for all SQL connections to be closed before proceeding with a drain. (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)
server.shutdown.drain_wait duration 0s the amount of time a server waits in an unready state before proceeding with a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting. --drain-wait is to specify the duration of the whole draining process, while server.shutdown.drain_wait is to set the wait time for health probes to notice that the node is not ready.)
server.shutdown.lease_transfer_wait duration 5s the timeout for a single iteration of the range lease transfer phase of draining (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)
server.shutdown.query_wait duration 10s the timeout for waiting for active queries to finish during a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)
server.time_until_store_dead duration 5m0s the time after which if there is no new gossiped information about a store, it is considered dead
3 changes: 2 additions & 1 deletion docs/generated/settings/settings.html
@@ -77,7 +77,8 @@
<tr><td><code>server.oidc_authentication.redirect_url</code></td><td>string</td><td><code>https://localhost:8080/oidc/v1/callback</code></td><td>sets OIDC redirect URL via a URL string or a JSON string containing a required `redirect_urls` key with an object that maps from region keys to URL strings (URLs should point to your load balancer and must route to the path /oidc/v1/callback) </td></tr>
<tr><td><code>server.oidc_authentication.scopes</code></td><td>string</td><td><code>openid</code></td><td>sets OIDC scopes to include with authentication request (space delimited list of strings, required to start with `openid`)</td></tr>
<tr><td><code>server.rangelog.ttl</code></td><td>duration</td><td><code>720h0m0s</code></td><td>if nonzero, range log entries older than this duration are deleted every 10m0s. Should not be lowered below 24 hours.</td></tr>
<tr><td><code>server.shutdown.drain_wait</code></td><td>duration</td><td><code>0s</code></td><td>the amount of time a server waits in an unready state before proceeding with a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)</td></tr>
<tr><td><code>server.shutdown.connection_wait</code></td><td>duration</td><td><code>0s</code></td><td>the maximum amount of time a server waits for all SQL connections to be closed before proceeding with a drain. (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)</td></tr>
<tr><td><code>server.shutdown.drain_wait</code></td><td>duration</td><td><code>0s</code></td><td>the amount of time a server waits in an unready state before proceeding with a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting. --drain-wait is to specify the duration of the whole draining process, while server.shutdown.drain_wait is to set the wait time for health probes to notice that the node is not ready.)</td></tr>
<tr><td><code>server.shutdown.lease_transfer_wait</code></td><td>duration</td><td><code>5s</code></td><td>the timeout for a single iteration of the range lease transfer phase of draining (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)</td></tr>
<tr><td><code>server.shutdown.query_wait</code></td><td>duration</td><td><code>10s</code></td><td>the timeout for waiting for active queries to finish during a drain (note that the --drain-wait parameter for cockroach node drain may need adjustment after changing this setting)</td></tr>
<tr><td><code>server.time_until_store_dead</code></td><td>duration</td><td><code>5m0s</code></td><td>the time after which if there is no new gossiped information about a store, it is considered dead</td></tr>
7 changes: 3 additions & 4 deletions pkg/ccl/backupccl/backup_destination.go
@@ -346,11 +346,10 @@ func findLatestFile(
) (ioctx.ReadCloserCtx, error) {
latestFile, err := exportStore.ReadFile(ctx, latestHistoryDirectory+"/"+latestFileName)
if err != nil {
if !errors.Is(err, cloud.ErrFileDoesNotExist) {
return nil, err
}

latestFile, err = exportStore.ReadFile(ctx, latestFileName)
if err != nil {
return nil, errors.Wrap(err, "LATEST file could not be read in base or metadata directory")
}
}
return latestFile, err
}
19 changes: 10 additions & 9 deletions pkg/ccl/backupccl/backup_job.go
@@ -19,7 +19,6 @@ import (
"github.com/cockroachdb/cockroach/pkg/build"
"github.com/cockroachdb/cockroach/pkg/ccl/utilccl"
"github.com/cockroachdb/cockroach/pkg/cloud"
"github.com/cockroachdb/cockroach/pkg/clusterversion"
"github.com/cockroachdb/cockroach/pkg/gossip"
"github.com/cockroachdb/cockroach/pkg/jobs"
"github.com/cockroachdb/cockroach/pkg/jobs/joberror"
@@ -516,6 +515,10 @@ func (b *backupResumer) Resume(ctx context.Context, execCtx interface{}) error {
MaxRetries: 5,
}

if err := p.ExecCfg().JobRegistry.CheckPausepoint("backup.before_flow"); err != nil {
return err
}

// We want to retry a backup if there are transient failures (i.e. worker nodes
// dying), so if we receive a retryable error, re-plan and retry the backup.
var res roachpb.RowCount
@@ -805,17 +808,15 @@ func (b *backupResumer) deleteCheckpoint(
return err
}
defer exportStore.Close()
// The checkpoint in the base directory should only exist if the cluster
// version isn't BackupDoesNotOverwriteLatestAndCheckpoint.
if !cfg.Settings.Version.IsActive(ctx, clusterversion.BackupDoesNotOverwriteLatestAndCheckpoint) {
err = exportStore.Delete(ctx, backupProgressDirectory)
if err != nil {
return err
}
// We first attempt to delete from base directory to account for older
// backups, and then from the progress directory.
err = exportStore.Delete(ctx, backupManifestCheckpointName)
if err != nil {
log.Warningf(ctx, "unable to delete checkpointed backup descriptor file in base directory: %+v", err)
}
return exportStore.Delete(ctx, backupProgressDirectory+"/"+backupManifestCheckpointName)
}(); err != nil {
log.Warningf(ctx, "unable to delete checkpointed backup descriptor: %+v", err)
log.Warningf(ctx, "unable to delete checkpointed backup descriptor file in progress directory: %+v", err)
}
}

26 changes: 8 additions & 18 deletions pkg/ccl/backupccl/backup_tenant_test.go
@@ -6,7 +6,7 @@
//
// https://github.com/cockroachdb/cockroach/blob/master/licenses/CCL.txt

package backupccl_test
package backupccl

import (
"context"
@@ -15,9 +15,11 @@ import (
"github.com/cockroachdb/cockroach/pkg/base"
_ "github.com/cockroachdb/cockroach/pkg/cloud/impl" // register cloud storage providers
"github.com/cockroachdb/cockroach/pkg/jobs"
"github.com/cockroachdb/cockroach/pkg/jobs/jobspb"
"github.com/cockroachdb/cockroach/pkg/roachpb"
_ "github.com/cockroachdb/cockroach/pkg/sql/importer"
"github.com/cockroachdb/cockroach/pkg/testutils"
"github.com/cockroachdb/cockroach/pkg/testutils/jobutils"
"github.com/cockroachdb/cockroach/pkg/testutils/serverutils"
"github.com/cockroachdb/cockroach/pkg/testutils/sqlutils"
"github.com/cockroachdb/cockroach/pkg/testutils/testcluster"
@@ -41,6 +43,7 @@ func TestBackupTenantImportingTable(t *testing.T) {
TestingKnobs: base.TestingKnobs{JobsTestingKnobs: jobs.NewTestingKnobsWithShortIntervals()},
})
defer tSQL.Close()
runner := sqlutils.MakeSQLRunner(tSQL)

if _, err := tSQL.Exec("SET CLUSTER SETTING jobs.debug.pausepoints = 'import.after_ingest';"); err != nil {
t.Fatal(err)
@@ -54,23 +57,10 @@ func TestBackupTenantImportingTable(t *testing.T) {
if _, err := tSQL.Exec("IMPORT INTO x CSV DATA ('workload:///csv/bank/bank?rows=100&version=1.0.0')"); !testutils.IsError(err, "pause") {
t.Fatal(err)
}
var jobID int
if err := tSQL.QueryRow(`SELECT job_id FROM [show jobs] WHERE job_type = 'IMPORT'`).Scan(&jobID); err != nil {
t.Fatal(err)
}
tc.Servers[0].JobRegistry().(*jobs.Registry).TestingNudgeAdoptionQueue()
// wait for it to pause

testutils.SucceedsSoon(t, func() error {
var status string
if err := tSQL.QueryRow(`SELECT status FROM [show jobs] WHERE job_id = $1`, jobID).Scan(&status); err != nil {
t.Fatal(err)
}
if status == string(jobs.StatusPaused) {
return nil
}
return errors.Newf("%s", status)
})
var jobID jobspb.JobID
err := tSQL.QueryRow(`SELECT job_id FROM [show jobs] WHERE job_type = 'IMPORT'`).Scan(&jobID)
require.NoError(t, err)
jobutils.WaitForJobToPause(t, runner, jobID)

// tenant now has a fully ingested, paused import, so back them up.
const dst = "userfile:///t"
170 changes: 147 additions & 23 deletions pkg/ccl/backupccl/backup_test.go
@@ -52,6 +52,7 @@ import (
"github.com/cockroachdb/cockroach/pkg/kv"
"github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord"
"github.com/cockroachdb/cockroach/pkg/kv/kvserver"
"github.com/cockroachdb/cockroach/pkg/kv/kvserver/protectedts"
"github.com/cockroachdb/cockroach/pkg/roachpb"
"github.com/cockroachdb/cockroach/pkg/security"
"github.com/cockroachdb/cockroach/pkg/settings/cluster"
@@ -1790,7 +1791,7 @@ func createAndWaitForJob(
t, `INSERT INTO system.jobs (created, status, payload, progress) VALUES ($1, $2, $3, $4) RETURNING id`,
timeutil.FromUnixMicros(now), jobs.StatusRunning, payload, progressBytes,
).Scan(&jobID)
jobutils.WaitForJob(t, db, jobID)
jobutils.WaitForJobToSucceed(t, db, jobID)
}

// TestBackupRestoreResume tests whether backup and restore jobs are properly
@@ -2015,7 +2016,7 @@ func TestBackupRestoreControlJob(t *testing.T) {
t.Fatalf("%d: expected 'job paused' error, but got %+v", i, err)
}
sqlDB.Exec(t, fmt.Sprintf(`RESUME JOB %d`, jobID))
jobutils.WaitForJob(t, sqlDB, jobID)
jobutils.WaitForJobToSucceed(t, sqlDB, jobID)
}

sqlDB.CheckQueryResults(t,
@@ -2051,7 +2052,7 @@ func TestBackupRestoreControlJob(t *testing.T) {
sqlDB.CheckQueryResults(t, fmt.Sprintf("SHOW BACKUP '%s'", noOfflineDir), [][]string{})
}
sqlDB.Exec(t, fmt.Sprintf(`RESUME JOB %d`, jobID))
jobutils.WaitForJob(t, sqlDB, jobID)
jobutils.WaitForJobToSucceed(t, sqlDB, jobID)
}
sqlDB.CheckQueryResults(t,
`SELECT count(*) FROM pause.bank`,
@@ -2071,7 +2072,7 @@ func TestBackupRestoreControlJob(t *testing.T) {
if err != nil {
t.Fatalf("error while running backup %+v", err)
}
jobutils.WaitForJob(t, sqlDB, backupJobID)
jobutils.WaitForJobToSucceed(t, sqlDB, backupJobID)

sqlDB.Exec(t, `DROP DATABASE data`)

@@ -9051,7 +9052,7 @@ func TestBackupWorkerFailure(t *testing.T) {
}

// But the job should be restarted and succeed eventually.
jobutils.WaitForJob(t, sqlDB, jobID)
jobutils.WaitForJobToSucceed(t, sqlDB, jobID)

// Drop database and restore to ensure that the backup was successful.
sqlDB.Exec(t, `DROP DATABASE data`)
@@ -9400,7 +9401,7 @@ DROP INDEX foo@bar;
close(allowGC)

// Wait for the GC to complete.
jobutils.WaitForJob(t, sqlRunner, gcJobID)
jobutils.WaitForJobToSucceed(t, sqlRunner, gcJobID)
waitForTableSplit(t, conn, "foo", "test")

// This backup should succeed since the spans being backed up have a default
@@ -9642,7 +9643,7 @@ func TestExportRequestBelowGCThresholdOnDataExcludedFromBackup(t *testing.T) {
}
args.ServerArgs.Knobs.JobsTestingKnobs = jobs.NewTestingKnobsWithShortIntervals()
args.ServerArgs.ExternalIODir = localExternalDir
tc := testcluster.StartTestCluster(t, 3, args)
tc := testcluster.StartTestCluster(t, 1, args)
defer tc.Stopper().Stop(ctx)

tc.WaitForNodeLiveness(t)
@@ -9684,21 +9685,6 @@ func TestExportRequestBelowGCThresholdOnDataExcludedFromBackup(t *testing.T) {
require.NoError(t, err)
}
}
const processedPattern = `(?s)shouldQueue=true.*processing replica.*GC score after GC`
processedRegexp := regexp.MustCompile(processedPattern)

gcSoon := func() {
testutils.SucceedsSoon(t, func() error {
upsertUntilBackpressure()
s, repl := getStoreAndReplica(t, tc, conn, "foo", "defaultdb")
trace, _, err := s.ManuallyEnqueue(ctx, "mvccGC", repl, false)
require.NoError(t, err)
if !processedRegexp.MatchString(trace.String()) {
return errors.Errorf("%q does not match %q", trace.String(), processedRegexp)
}
return nil
})
}

waitForTableSplit(t, conn, "foo", "defaultdb")
waitForReplicaFieldToBeSet(t, tc, conn, "foo", "defaultdb", func(r *kvserver.Replica) (bool, error) {
@@ -9710,7 +9696,15 @@ func TestExportRequestBelowGCThresholdOnDataExcludedFromBackup(t *testing.T) {

var tsBefore string
require.NoError(t, conn.QueryRow("SELECT cluster_logical_timestamp()").Scan(&tsBefore))
gcSoon()
upsertUntilBackpressure()
runGCAndCheckTrace(ctx, t, tc, conn, false /* skipShouldQueue */, "foo", "defaultdb", func(traceStr string) error {
const processedPattern = `(?s)shouldQueue=true.*processing replica.*GC score after GC`
processedRegexp := regexp.MustCompile(processedPattern)
if !processedRegexp.MatchString(traceStr) {
return errors.Errorf("%q does not match %q", traceStr, processedRegexp)
}
return nil
})

_, err = conn.Exec(fmt.Sprintf("BACKUP TABLE foo TO $1 AS OF SYSTEM TIME '%s'", tsBefore), localFoo)
testutils.IsError(err, "must be after replica GC threshold")
@@ -9728,6 +9722,115 @@ func TestExportRequestBelowGCThresholdOnDataExcludedFromBackup(t *testing.T) {
require.NoError(t, err)
}

// TestExcludeDataFromBackupDoesNotHoldupGC tests that a table marked as
// `exclude_data_from_backup` and with a protected timestamp record covering it
// does not hold up GC, since its data is not going to be backed up.
func TestExcludeDataFromBackupDoesNotHoldupGC(t *testing.T) {
defer leaktest.AfterTest(t)()
defer log.Scope(t).Close(t)

ctx, cancel := context.WithCancel(context.Background())
defer cancel()

dir, dirCleanupFn := testutils.TempDir(t)
defer dirCleanupFn()
params := base.TestClusterArgs{}
params.ServerArgs.ExternalIODir = dir
params.ServerArgs.Knobs.Store = &kvserver.StoreTestingKnobs{
DisableGCQueue: true,
DisableLastProcessedCheck: true,
}
params.ServerArgs.Knobs.ProtectedTS = &protectedts.TestingKnobs{
EnableProtectedTimestampForMultiTenant: true}
params.ServerArgs.Knobs.JobsTestingKnobs = jobs.NewTestingKnobsWithShortIntervals()
tc := testcluster.StartTestCluster(t, 1, params)
defer tc.Stopper().Stop(ctx)

tc.WaitForNodeLiveness(t)
require.NoError(t, tc.WaitForFullReplication())

conn := tc.ServerConn(0)
runner := sqlutils.MakeSQLRunner(conn)
runner.Exec(t, `SET CLUSTER SETTING kv.rangefeed.enabled = true`)
// speeds up the test
runner.Exec(t, `SET CLUSTER SETTING kv.closed_timestamp.target_duration = '100ms'`)
runner.Exec(t, `SET CLUSTER SETTING kv.protectedts.poll_interval = '10ms'`)

runner.Exec(t, `CREATE DATABASE test;`)
runner.Exec(t, `CREATE TABLE test.foo (k INT PRIMARY KEY, v BYTES)`)

// Exclude the table from backup so that it does not hold up GC.
runner.Exec(t, `ALTER TABLE test.foo SET (exclude_data_from_backup = true)`)

const tableRangeMaxBytes = 1 << 18
runner.Exec(t, "ALTER TABLE test.foo CONFIGURE ZONE USING "+
"gc.ttlseconds = 1, range_max_bytes = $1, range_min_bytes = 1<<10;", tableRangeMaxBytes)

rRand, _ := randutil.NewTestRand()
upsertUntilBackpressure := func() {
for {
_, err := conn.Exec("UPSERT INTO test.foo VALUES (1, $1)",
randutil.RandBytes(rRand, 1<<15))
if testutils.IsError(err, "backpressure") {
break
}
require.NoError(t, err)
}
}

// Wait for the span config fields to apply.
waitForTableSplit(t, conn, "foo", "test")
waitForReplicaFieldToBeSet(t, tc, conn, "foo", "test", func(r *kvserver.Replica) (bool, error) {
if !r.ExcludeDataFromBackup() {
return false, errors.New("waiting for exclude_data_from_backup to be applied")
}
conf := r.SpanConfig()
if conf.TTL() != 1*time.Second {
return false, errors.New("waiting for gc.ttlseconds to be applied")
}
if r.GetMaxBytes() != tableRangeMaxBytes {
return false, errors.New("waiting for range_max_bytes to be applied")
}
return true, nil
})

runner.Exec(t, `SET CLUSTER SETTING jobs.debug.pausepoints = 'backup.before_flow'`)
if _, err := conn.Exec(`BACKUP DATABASE test INTO $1`, localFoo); !testutils.IsError(err, "pause") {
t.Fatal(err)
}
// We pause the backup resumer before it plans its flow so this timestamp
// should be very close to the timestamp protected by the record written by
// the backup.
afterBackup := tc.Server(0).Clock().Now()
var jobID jobspb.JobID
err := conn.QueryRow(`SELECT job_id FROM [show jobs] WHERE job_type = 'BACKUP'`).Scan(&jobID)
require.NoError(t, err)
jobutils.WaitForJobToPause(t, runner, jobID)

// Ensure that the replica sees the ProtectionPolicies.
waitForReplicaFieldToBeSet(t, tc, conn, "foo", "test", func(r *kvserver.Replica) (bool, error) {
if len(r.SpanConfig().GCPolicy.ProtectionPolicies) == 0 {
return false, errors.New("no protection policy applied to replica")
}
return true, nil
})

// Now that the backup has written a PTS record protecting the database, we
// check that the replica corresponding to `test.foo` continues to GC data
// since it has been marked as `exclude_data_from_backup`.
upsertUntilBackpressure()
runGCAndCheckTrace(ctx, t, tc, conn, false /* skipShouldQueue */, "foo", "test", func(traceStr string) error {
const processedPattern = `(?s)shouldQueue=true.*processing replica.*GC score after GC`
processedRegexp := regexp.MustCompile(processedPattern)
if !processedRegexp.MatchString(traceStr) {
return errors.Errorf("%q does not match %q", traceStr, processedRegexp)
}
thresh := thresholdFromTrace(t, traceStr)
require.Truef(t, afterBackup.Less(thresh), "%v >= %v", afterBackup, thresh)
return nil
})
}

// TestBackupRestoreSystemUsers tests RESTORE SYSTEM USERS feature which allows user to
// restore users from a backup into current cluster and regrant roles.
func TestBackupRestoreSystemUsers(t *testing.T) {
Expand Down Expand Up @@ -9815,3 +9918,24 @@ func TestBackupRestoreSystemUsers(t *testing.T) {
})
})
}

// TestUserfileNormalizationIncrementalShowBackup tests to see that file
// paths given by SHOW BACKUP on Incremental Backups work on userfiles.
// Specifically we are looking to see that no normalization is needed
// for filepaths, as userfiles do not support file system semantics.
func TestUserfileNormalizationIncrementalShowBackup(t *testing.T) {
defer leaktest.AfterTest(t)()
defer log.Scope(t).Close(t)

const numAccounts = 1
const userfile = "'userfile:///a'"
_, sqlDB, _, cleanupFn := backupRestoreTestSetup(t, singleNode, numAccounts, InitManualReplication)
defer cleanupFn()

query := fmt.Sprintf("BACKUP bank TO %s", userfile)
sqlDB.Exec(t, query)
query = fmt.Sprintf("BACKUP bank TO %s", userfile)
sqlDB.Exec(t, query)
query = fmt.Sprintf("SHOW BACKUP %s", userfile)
sqlDB.Exec(t, query)
}