backupccl,sql/importer: pause jobs on exhausting retries
If a job has exhausted its retry quota for transient errors,
we pause the job instead of failing it. This way it is up to
the user to decide whether to resume or cancel the job. This
prevents large amounts of work from being thrown away because
of transient errors that need longer than the retry limit to go away.

Informs: #36887

Release note (sql change): BACKUP, IMPORT and RESTORE jobs
will be paused instead of entering a failed state if they continue
to encounter transient errors once they have retried a maximum number
of times. The user is responsible for cancelling or resuming the job
from this state.
adityamaru committed Apr 22, 2022
1 parent 0daa129 commit 8fa69c4
Showing 3 changed files with 22 additions and 3 deletions.
9 changes: 8 additions & 1 deletion pkg/ccl/backupccl/backup_job.go
@@ -582,8 +582,15 @@ func (b *backupResumer) Resume(ctx context.Context, execCtx interface{}) error {
 			return errors.Wrap(reloadBackupErr, "could not reload backup manifest when retrying")
 		}
 	}
+
+	// We have exhausted retries, but we have not seen a "PermanentBulkJobError" so
+	// it is possible that this is a transient error that is taking longer than
+	// our configured retry to go away.
+	//
+	// Let's pause the job instead of failing it so that the user can decide
+	// whether to resume it or cancel it.
 	if err != nil {
-		return errors.Wrap(err, "exhausted retries")
+		return jobs.MarkPauseRequestError(errors.Wrap(err, "exhausted retries"))
 	}
 
 	var backupDetails jobspb.BackupDetails
8 changes: 7 additions & 1 deletion pkg/ccl/backupccl/restore_job.go
@@ -168,8 +168,14 @@ func restoreWithRetry(
 		log.Warningf(restoreCtx, `encountered retryable error: %+v`, err)
 	}
 
+	// We have exhausted retries, but we have not seen a "PermanentBulkJobError" so
+	// it is possible that this is a transient error that is taking longer than
+	// our configured retry to go away.
+	//
+	// Let's pause the job instead of failing it so that the user can decide
+	// whether to resume it or cancel it.
 	if err != nil {
-		return roachpb.RowCount{}, errors.Wrap(err, "exhausted retries")
+		return res, jobs.MarkPauseRequestError(errors.Wrap(err, "exhausted retries"))
 	}
 	return res, nil
 }
8 changes: 7 additions & 1 deletion pkg/sql/importer/import_job.go
@@ -1288,8 +1288,14 @@ func ingestWithRetry(
 		log.Warningf(ctx, `encountered retryable error: %+v`, err)
 	}
 
+	// We have exhausted retries, but we have not seen a "PermanentBulkJobError" so
+	// it is possible that this is a transient error that is taking longer than
+	// our configured retry to go away.
+	//
+	// Let's pause the job instead of failing it so that the user can decide
+	// whether to resume it or cancel it.
 	if err != nil {
-		return roachpb.BulkOpSummary{}, errors.Wrap(err, "exhausted retries")
+		return res, jobs.MarkPauseRequestError(errors.Wrap(err, "exhausted retries"))
 	}
 	return res, nil
 }