sql: GC job may fail with error which should be retried #65000
Labels
A-jobs
A-schema-changes
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Comments

ajwerner added the C-bug, A-schema-changes, and A-jobs labels on May 11, 2021.
For the backport to 21.1 and 20.2, we should classify errors using the linked function (or some part of it); for the fuller solution we can just retry everything with backoff.

For the backport, something like:

```diff
--- a/pkg/sql/gcjob/gc_job.go
+++ b/pkg/sql/gcjob/gc_job.go
@@ -80,7 +80,12 @@ func performGC(
 }

 // Resume is part of the jobs.Resumer interface.
+func (r schemaChangeGCResumer) Resume(ctx context.Context, execCtx interface{}) (err error) {
+	defer func() {
+		if err != nil && !isPermanent(err) {
+			err = jobs.NewRetryJobError(err.Error())
+		}
+	}()
 	p := exe
```
sajjadrizvi pushed a commit to sajjadrizvi/cockroach that referenced this issue on Jun 1, 2021
In the previous implementation, failed GC jobs were not retried, regardless of whether the failure was permanent or transient. As a result, a GC job's failure risked orphaning data that could never be reclaimed. This commit adds a mechanism to retry failed GC jobs whose failures are not permanent. No limit is set on the number of retries. For the time being, the failure type is determined based on the failure categorization of schema-change jobs. This behavior is expected to change once an exponential backoff mechanism is implemented for failed jobs (cockroachdb#44594). Release note: None Fixes: cockroachdb#65000
craig bot pushed a commit that referenced this issue on Jun 1, 2021
65867: changefeedccl: Fix flaky tests. r=miretskiy a=miretskiy
Fix flaky test and re-enable it to run under stress. The problem was that the transaction executed by the table feed can be restarted. If that happens, we would see the same keys again, but because we had side effects inside the transaction (marking the keys seen), we would not emit those keys, causing the test to hang. The stress race was failing because of both transaction restarts and the 10ms resolved-timestamp frequency (with so many resolved timestamps being generated, the table feed transaction was always getting restarted). Fixes #57754 Fixes #65168 Release notes: None

65868: storage: expose pebble.IteratorStats through {MVCC,Engine}Iterator r=sumeerbhola a=sumeerbhola
These will potentially be aggregated before being exposed in trace statements, EXPLAIN ANALYZE, etc. Release note: None

65900: roachtest: fix ruby-pg test suite r=rafiss a=RichardJCai
Update blocklist with passing test. The "not run" test causing a failure is a test that is no longer failing; since it is not failing, it shows up under "not run". Release note: None

65910: sql/gcjob: retry failed GC jobs r=ajwerner a=sajjadrizvi
In the previous implementation, failed GC jobs were not retried, regardless of whether the failure was permanent or transient. As a result, a GC job's failure risked orphaning data that could never be reclaimed. This commit adds a mechanism to retry failed GC jobs whose failures are not permanent. No limit is set on the number of retries. For the time being, the failure type is determined based on the failure categorization of schema-change jobs. This behavior is expected to change once an exponential backoff mechanism is implemented for failed jobs (#44594). Fixes: #65000 Release note: None

65925: ccl/importccl: skip TestImportPgDumpSchemas/inject-error-ensure-cleanup r=tbg a=adityamaru
Refs: #65878 Reason: flaky test Generated by bin/skip-test. Release justification: non-production code changes Release note: None

65933: kv/kvserver: skip TestReplicateQueueDeadNonVoters under race r=sumeerbhola a=sumeerbhola
Refs: #65932 Reason: flaky test Generated by bin/skip-test. Release justification: non-production code changes Release note: None

65934: kv/kvserver: skip TestReplicateQueueSwapVotersWithNonVoters under race r=sumeerbhola a=sumeerbhola
Refs: #65932 Reason: flaky test Generated by bin/skip-test. Release justification: non-production code changes Release note: None

65936: jobs: fix flakey TestMetrics r=fqazi a=ajwerner
Fixes #65735 The test needed to wait for the job to be fully marked as paused. Release note: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: richardjcai <[email protected]>
Co-authored-by: Sajjad Rizvi <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
sajjadrizvi pushed a commit to sajjadrizvi/cockroach that referenced this issue on Jun 1, 2021
In the previous implementation, failed GC jobs were not retried, regardless of whether the failure was permanent or transient. As a result, a GC job's failure risked orphaning data that could never be reclaimed. This commit adds a mechanism to retry failed GC jobs whose failures are not permanent. No limit is set on the number of retries. For the time being, the failure type is determined based on the failure categorization of schema-change jobs. This behavior is expected to change once an exponential backoff mechanism is implemented for failed jobs (cockroachdb#44594). Release note: None Fixes: cockroachdb#65000
sajjadrizvi pushed a commit to sajjadrizvi/cockroach that referenced this issue on Jun 1, 2021
In the previous implementation, failed GC jobs were not retried, regardless of whether the failure was permanent or transient. As a result, a GC job's failure risked orphaning data that could never be reclaimed. This patch adds a mechanism to retry failed GC jobs whose failures are not permanent. No limit is set on the number of retries. For the time being, the failure type is determined based on the failure categorization of schema-change jobs. This behavior is expected to change once an exponential backoff mechanism is implemented for failed jobs (cockroachdb#44594). This is a backport of cockroachdb#65910. Release note: None Fixes: cockroachdb#65000
sajjadrizvi pushed a commit to sajjadrizvi/cockroach that referenced this issue on Jun 2, 2021
In the previous implementation, failed GC jobs were not retried, regardless of whether the failure was permanent or transient. As a result, a GC job's failure risked orphaning data that could never be reclaimed. This patch adds a mechanism to retry failed GC jobs whose failures are not permanent. No limit is set on the number of retries. For the time being, the failure type is determined based on the failure categorization of schema-change jobs. This behavior is expected to change once an exponential backoff mechanism is implemented for failed jobs (cockroachdb#44594). This is a backport of cockroachdb#65910. Release note: None Fixes: cockroachdb#65000
Describe the problem
The GC job should very rarely fail. There is a general fear that some errors may be permanent. Nevertheless, today roughly no errors get retried. That's madness. This can leave orphaned data in the keyspace.
To Reproduce
Kill nodes during the execution of a GC job. See an error.
Expected behavior
The job will get retried.
Additional context
Relates to and may replace #55740.
We could (should?) switch the default to retry everything once we have backoff (#44594).
In the schema changer we have some bespoke retry logic that we could mostly reuse.
cockroach/pkg/sql/schema_changer.go, lines 141 to 145 at 8bfaaff