Skip to content

Commit

Permalink
Browse files Browse the repository at this point in the history
96673: backup: fail instead of pausing when out of retries r=dt a=dt

This changes backups to fail with the most recent error instead of pausing when they run out of retries while hitting non-permanent errors. This is desired as a failed backup is expected to be more likely to be seen and noticed in monitoring (number of failed jobs increments, they show up red on jobs page, etc) and becuase a backup schedule it then able to trigger subsequent jobs instead of wiating on the incomplete paused job that will never resume or complete without human intervention.

Release note (ops change): a BACKUP which encounters too many retryable errors will now fail instead of pausing to allow subsequent backups the chance to succeed.

Epic: CRDB-21953

96702: clusterversion: don't assume BinaryMinSupportedVersionKey is first r=RaduBerinde a=RaduBerinde

The code that applies a "dev offset" to versions assumes that the minimum supported version key is the first (non-primordial) key. This forces us to remove keys when bumping `BinaryMinSupportedVersionKey`.

This change improves this code to specifically check for `BinaryMinSupportedVersionKey` so that we can clean up old keys after bumping the min supported version.

Release note: None
Epic: none

96709: roachtest: fix inconsistency test r=pavelkalinnikov a=pavelkalinnikov

Fix usage of "find" util so that the test works on Linux as well.

Fixes #96707
Release note: none
Epic: none

Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
  • Loading branch information
4 people committed Feb 7, 2023
4 parents cc99062 + 8f8767c + 4ed1551 + 0c1013d commit 9444613
Show file tree
Hide file tree
Showing 3 changed files with 11 additions and 14 deletions.
12 changes: 5 additions & 7 deletions pkg/ccl/backupccl/backup_job.go
Original file line number Diff line number Diff line change
Expand Up @@ -722,14 +722,12 @@ func (b *backupResumer) Resume(ctx context.Context, execCtx interface{}) error {
}
}

// We have exhausted retries, but we have not seen a "PermanentBulkJobError" so
// it is possible that this is a transient error that is taking longer than
// our configured retry to go away.
//
// Let's pause the job instead of failing it so that the user can decide
// whether to resume it or cancel it.
// We have exhausted retries without getting a "PermanentBulkJobError", but
// something must be wrong if we keep seeing errors so give up and fail to
// ensure that any alerting on failures is triggered and that any subsequent
// schedule runs are not blocked.
if err != nil {
return jobs.MarkPauseRequestError(errors.Wrap(err, "exhausted retries"))
return errors.Wrap(err, "exhausted retries")
}

var backupDetails jobspb.BackupDetails
Expand Down
9 changes: 4 additions & 5 deletions pkg/clusterversion/cockroach_versions.go
Original file line number Diff line number Diff line change
Expand Up @@ -756,21 +756,20 @@ var versionsSingleton = func() keyedVersions {
// version! For example, on a cluster that is on released version 3, a dev
// binary containing versions 1, 2, 3, and 4 started with this flag would
// renumber only 2-4 to be +1M. It would then step from 3 "up" to 1000002 --
// which conceptually is actually back down to 2 -- then back to to 1000003,
// which conceptually is actually back down to 2 -- then back to 1000003,
// then on to 1000004, etc.
skipFirst := allowUpgradeToDev
first := true
for i := range rawVersionsSingleton {
// VPrimordial versions are not offset; they don't matter for the logic
// offsetting is used for.
if rawVersionsSingleton[i].Major == rawVersionsSingleton.MustByKey(VPrimordialMax).Major {
continue
}

if skipFirst && first {
first = false
if allowUpgradeToDev && rawVersionsSingleton[i].Key <= BinaryMinSupportedVersionKey {
// Support upgrading from the non-development version of BinaryMinSupportedVersionKey.
continue
}

rawVersionsSingleton[i].Major += DevOffset
}
}
Expand Down
4 changes: 2 additions & 2 deletions pkg/cmd/roachtest/tests/inconsistency.go
Original file line number Diff line number Diff line change
Expand Up @@ -145,9 +145,9 @@ func runInconsistency(ctx context.Context, t test.Test, c cluster.Cluster) {
c.Run(ctx, c.Node(n), "find {store-dir}/auxiliary/checkpoints -name checkpoint.txt")
// The checkpoint can be inspected by the tooling.
c.Run(ctx, c.Node(n), "./cockroach debug range-descriptors "+
"$(find {store-dir}/auxiliary/checkpoints/* -type d -depth 0 | head -n1)")
"$(find {store-dir}/auxiliary/checkpoints/* -maxdepth 0 -type d | head -n1)")
c.Run(ctx, c.Node(n), "./cockroach debug range-data --limit 10 "+
"$(find {store-dir}/auxiliary/checkpoints/* -type d -depth 0 | head -n1) 1")
"$(find {store-dir}/auxiliary/checkpoints/* -maxdepth 0 -type d | head -n1) 1")
}

// NB: we can't easily verify the error because there's a lot of output which
Expand Down

0 comments on commit 9444613

Please sign in to comment.