jobs: perform exponential backoff to reduce impact of jobs that cause panics #44594
Labels: A-jobs; O-sre (for issues SRE opened or otherwise cares about tracking); T-sql-foundations (SQL Foundations Team, formerly SQL Schema + SQL Sessions)

Comments
joshimhoff added the O-sre and A-jobs labels on Jan 31, 2020.
We have solved 3. with #44786
sajjadrizvi pushed a commit to sajjadrizvi/cockroach that referenced this issue on Jun 1, 2021:

In the previous implementation, failed GC jobs were not retried, regardless of whether the failure was permanent or transient. As a result, a GC job's failure risked orphaned data that could never be reclaimed. This commit adds a mechanism to retry failed GC jobs whose failures are not permanent, with no limit on the number of retries. For the time being, the failure type is determined based on the failure categorization of schema-change jobs. This behavior is expected to change once an exponential backoff mechanism is implemented for failed jobs (cockroachdb#44594).

Release note: None
Fixes: cockroachdb#65000
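Schematically, the retry policy this commit describes is the loop below. This is only a sketch with hypothetical names (`runGCJob`, `errPermanent`), not the actual gcjob code:

```go
// Sketch of "retry unless the failure is permanent", with no retry limit.
package main

import (
	"errors"
	"fmt"
	"time"
)

// errPermanent marks failures that should never be retried (hypothetical).
var errPermanent = errors.New("permanent failure")

// runGCJob stands in for one attempt of a GC job; here it fails
// transiently twice before succeeding.
func runGCJob(attempt int) error {
	if attempt < 3 {
		return fmt.Errorf("transient failure on attempt %d", attempt)
	}
	return nil
}

func main() {
	for attempt := 1; ; attempt++ {
		err := runGCJob(attempt)
		if err == nil {
			fmt.Println("GC job succeeded")
			return
		}
		if errors.Is(err, errPermanent) {
			fmt.Println("giving up:", err) // permanent: do not retry
			return
		}
		fmt.Println("retrying after transient failure:", err)
		time.Sleep(10 * time.Millisecond) // constant delay; backoff comes in #44594
	}
}
```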
craig bot pushed a commit that referenced this issue on Jun 1, 2021:
65867: changefeedccl: Fix flaky tests. r=miretskiy a=miretskiy
Fix a flaky test and re-enable it to run under stress. The problem was that the transaction executed by the table feed can be restarted. If that happens, we would see the same keys again, but because we had side effects inside the transaction (marking the keys seen), we would not emit those keys, causing the test to hang. The stress race was failing because of both transaction restarts and the 10ms resolved-timestamp frequency (with so many resolved timestamps being generated, the table-feed transaction was always getting restarted). Fixes #57754. Fixes #65168. Release notes: None

65868: storage: expose pebble.IteratorStats through {MVCC,Engine}Iterator r=sumeerbhola a=sumeerbhola
These will potentially be aggregated before being exposed in trace statements, EXPLAIN ANALYZE, etc. Release note: None

65900: roachtest: fix ruby-pg test suite r=rafiss a=RichardJCai
Update the blocklist with a passing test. The "not run" test causing a failure is no longer failing; since it is not failing, it shows up under "not run". Release note: None

65910: sql/gcjob: retry failed GC jobs r=ajwerner a=sajjadrizvi
The same GC-job retry change described in the commit above. Release note: None. Fixes: #65000

65925: ccl/importccl: skip TestImportPgDumpSchemas/inject-error-ensure-cleanup r=tbg a=adityamaru
Refs: #65878. Reason: flaky test. Generated by bin/skip-test. Release justification: non-production code changes. Release note: None

65933: kv/kvserver: skip TestReplicateQueueDeadNonVoters under race r=sumeerbhola a=sumeerbhola
Refs: #65932. Reason: flaky test. Generated by bin/skip-test. Release justification: non-production code changes. Release note: None

65934: kv/kvserver: skip TestReplicateQueueSwapVotersWithNonVoters under race r=sumeerbhola a=sumeerbhola
Refs: #65932. Reason: flaky test. Generated by bin/skip-test. Release justification: non-production code changes. Release note: None

65936: jobs: fix flaky TestMetrics r=fqazi a=ajwerner
Fixes #65735. The test needed to wait for the job to be fully marked as paused. Release note: None

Co-authored-by: Yevgeniy Miretskiy, sumeerbhola, richardjcai, Sajjad Rizvi, Aditya Maru, Andrew Werner
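The 65867 fix rests on a general rule: a transaction that may be restarted must not perform side effects inside its closure, because a retry replays the same reads. A minimal sketch of the pitfall and the fix, with entirely hypothetical names (this is not the changefeed code):

```go
// Side effects (marking keys seen, emitting) are deferred until after
// the possibly-retried transaction closure, so a restart that replays
// the same reads cannot suppress emission of those keys.
package main

import "fmt"

// runTxn simulates a transaction runner that retries its closure once
// after a restart, replaying the same reads.
func runTxn(fn func() ([]string, error)) ([]string, error) {
	keys, err := fn()
	if err != nil {
		keys, err = fn() // retried attempt sees the same keys again
	}
	return keys, err
}

func main() {
	seen := map[string]bool{}
	attempts := 0
	keys, _ := runTxn(func() ([]string, error) {
		attempts++
		read := []string{"a", "b"} // keys read inside the txn
		if attempts == 1 {
			return nil, fmt.Errorf("txn restarted")
		}
		return read, nil
	})
	// The side effect happens exactly once, outside the closure.
	for _, k := range keys {
		if !seen[k] {
			seen[k] = true
			fmt.Println("emit", k)
		}
	}
}
```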
sajjadrizvi pushed a commit to sajjadrizvi/cockroach that referenced this issue on Jun 1, 2021 (and again on Jun 2):

The same GC-job retry change as above, as a backport of cockroachdb#65910.

Release note: None
Fixes: cockroachdb#65000
I am currently implementing job retries with exponential backoff. I plan to implement it in the following way:
This looks good. Let's not touch the last bullet in the first commit.
The last bullet just mentions the current behavior; we are not modifying anything in the system.
sajjadrizvi pushed a commit to sajjadrizvi/cockroach that referenced this issue on Jul 22, 2021:

In the previous implementation, failed jobs were retried at a constant interval. This commit enables jobs to be retried with exponentially increasing delays, subject to an upper bound. It also makes it possible to retry jobs that currently are not retried at all when they fail due to transient problems.

Release note: None
Fixes: cockroachdb#44594
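Roughly, the delay schedule this describes doubles from an initial value up to a cap. A minimal sketch, with illustrative constants (not cockroach's defaults):

```go
// Capped exponential backoff: the delay before retry n doubles from
// initialDelay until it reaches maxDelay.
package main

import (
	"fmt"
	"time"
)

const (
	initialDelay = 30 * time.Second // illustrative, not a real default
	maxDelay     = 10 * time.Minute // illustrative upper bound
)

// backoff returns the delay before retry number n (n >= 1).
func backoff(n int) time.Duration {
	d := initialDelay
	for i := 1; i < n; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

func main() {
	for n := 1; n <= 8; n++ {
		fmt.Printf("retry %d after %s\n", n, backoff(n))
	}
}
```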
(The same commit was pushed to sajjadrizvi/cockroach again on Jul 26, Jul 28, Aug 2, and Aug 10, 2021.)
craig bot pushed a commit that referenced this issue on Aug 14, 2021:
66889: jobs: retry jobs with exponential backoff r=ajwerner a=sajjadrizvi
This commit adds a mechanism to retry jobs with exponentially increasing delays. This is achieved through two new columns in the system.jobs table, last_run and num_runs. In addition, this commit adds cluster settings to control the exponential-backoff parameters, initial delay and max delay, via the corresponding settings `jobs.registry.retry.initial_delay` and `jobs.registry.retry.max_delay`. Finally, it adds a new partial index on the jobs table that improves the performance of the periodic queries run by the registry on each node.

Release note (general change): Retries of jobs that fail due to a retriable error or to job-coordinator failure are now delayed using exponential backoff. Before this change, jobs that failed in a retryable manner would be resumed immediately on a different coordinator. This change reduces the impact of recurrently failing jobs on the cluster. It adds two new cluster settings that control this behavior: "jobs.registry.retry.initial_delay" and "jobs.registry.retry.max_delay", which respectively control the initial delay and the maximum delay between resumptions.

Fixes #44594
Fixes #65080

68212: colexec: add optimized versions of aggregate window functions r=DrewKimball a=DrewKimball

**colexecwindow: add sliding window functionality to window framer** This commit adds a method `slidingWindowIntervals` to `windowFramer` operators that returns a set of `toAdd` intervals and a set of `toRemove` intervals, which indicate the rows that should be added to the current aggregation and those that should be removed, respectively. This will be used to implement the sliding-window optimization for aggregate window functions such as `sum`.

**colexecwindow: implement sliding window aggregator** This commit supplies a new operator, `slidingWindowAggregator`, used for any window aggregate function that implements the `slidingWindowAggregateFunc` interface. Rather than aggregating over the entire window frame for each row, the `slidingWindowAggregator` operator aggregates over the rows that are in the current window frame but were not in the previous one, and removes from the aggregation the rows that were in the previous window frame but not the current one. This allows window aggregate functions to be evaluated in linear rather than quadratic time.

**colexec: implement sliding window optimization for sum window function** This commit modifies the `sum` aggregate window function to implement `slidingWindowAggregateFunc`, which allows it to be used in a sliding-window context. This yields linear rather than quadratic scaling in the worst case, and allows the vectorized engine to meet or exceed parity with the row engine for `sum` window functions.

**colexec: implement sliding window optimization for count window function** This commit modifies the `count` aggregate operator to implement the `slidingWindowAggregateFunc` interface so that it can be used with the sliding-window optimization.

**colexec: implement sliding window optimization for average window function** This commit modifies the `average` aggregate operator to implement the `slidingWindowAggregateFunc` interface so that it can be used with the sliding-window optimization.

**colexec: optimize count_rows window function** This commit implements an optimized version of `count_rows` that calculates the size of the window frame as soon as the frame is computed. This means that most of the overhead for `count_rows` now comes from calculating the window frame, which is worst-case linear time (previously, the step to retrieve the size of the frame was quadratic, though with a small constant).

**colexec: optimize min and max window functions with default exclusion** This commit modifies the `min` and `max` aggregate window functions to implement the `slidingWindowAggregateFunc` interface, which allows them to be used in a sliding-window context. However, this is only usable when the window frame never shrinks, i.e. it always contains all rows from the previous frame. This commit also provides implementations of `min` and `max` for use when the window frame can shrink: the indices of the "next best" minimum or maximum values are stored in a priority queue that is updated for each row. The priority queue allows the `min` and `max` operators to avoid fully re-aggregating over the window frame even when the previous best value goes out of scope. Note that this implementation currently does not handle a non-default exclusion clause, in which case we must fall back to the quadratic approach. Fixes: #37039

Release note (performance improvement): The vectorized engine can now use the sliding-window approach to execute common aggregate functions as window functions. This allows aggregate window functions to be evaluated in linear rather than quadratic time. Currently, sum, count, average, min, and max are executed using this approach.

68433: sql: implemented placement restricted syntax for domiciling r=pawalt a=pawalt
This PR combines the existing restricted-placement zone-config logic with the stubbed syntax to create an end-to-end PLACEMENT RESTRICTED implementation. The cluster setting for domiciling and telemetry will be added in a later PR. Release note: None

68818: changefeedccl: mark avro format as no longer experimental r=[miretskiy,spiffyeng] a=HonoreDB
The avro format for changefeeds now supports all column types and has been in production use for several releases. We now allow format=avro rather than format=experimental_avro. The old string remains supported because job payloads can persist across upgrades and downgrades. Release note (enterprise change): the changefeed avro format is no longer marked experimental.

Co-authored-by: Sajjad Rizvi, Drew Kimball, Peyton Walters, Aaron Zinger
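To illustrate the sliding-window idea from 68212: rather than re-aggregating the whole frame for every row, add the values entering the frame and remove those leaving it. A toy Go sketch (a simplification; the actual operator works over the `toAdd`/`toRemove` intervals produced by the window framer):

```go
// Toy sliding-window sum for the frame
// ROWS BETWEEN 2 PRECEDING AND CURRENT ROW: a running sum is updated
// with the rows entering and leaving the frame, so the whole pass is
// linear rather than quadratic in the number of rows.
package main

import "fmt"

func main() {
	vals := []int{3, 1, 4, 1, 5, 9, 2, 6}
	sum, prevStart, prevEnd := 0, 0, 0
	for i := range vals {
		start := i - 2 // frame start for row i
		if start < 0 {
			start = 0
		}
		end := i + 1 // frame end (exclusive) for row i
		for j := prevEnd; j < end; j++ { // rows entering the frame
			sum += vals[j]
		}
		for j := prevStart; j < start; j++ { // rows leaving the frame
			sum -= vals[j]
		}
		prevStart, prevEnd = start, end
		fmt.Printf("row %d: frame sum = %d\n", i, sum)
	}
}
```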
exalate-issue-sync bot added the T-sql-foundations label (SQL Foundations Team, formerly SQL Schema + SQL Sessions) and removed the T-sql-schema-deprecated label (use T-sql-foundations instead) on May 10, 2023.
Is your feature request related to a problem? Please describe.
This bug leads to panics when users run the IMPORT INTO job on 19.2.2: #44252.
The impact can be very high. See this graph of the SQL prober error rate:
[graph: 50-100% error rate for 1hr!]
The nodes crash at a fast enough rate that (a) the cluster is more or less entirely unavailable to the customer for the duration of the incident and (b) it is hard for an operator to get a SQL connection that lives long enough to cancel the problematic jobs (this is why it takes around 1hr to mitigate).
How can we reduce impact / make it easier to mitigate this issue?
This bug tracks 1. only.
I'm suggesting concrete solutions to get a conversation started, but I'm more interested in tackling the very high impact than in any particular solution!
Describe the solution you'd like
If a job fails, the job system could perform an exponential backoff before retrying it. This would reduce the impact of a job that causes panics: the amount of time between panics would increase over time, which would also make it easier for an operator to cancel the job.
I don't know that the job system isn't ALREADY doing this; if so, my bad! I do see the cluster setting jobs.registry.leniency, whose description reads "the amount of time to defer any attempts to reschedule a job". That doesn't sound like an exponential backoff. On the CC side, we should set this cluster setting so as to reduce the impact of jobs that cause panics, IMHO.
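For illustration, bumping that setting from a SQL client could look like the sketch below; the connection string and the `5m` value are placeholders, not recommendations:

```go
// Illustrative only: raise jobs.registry.leniency via a SQL connection.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver commonly used with CockroachDB
)

func main() {
	// Placeholder connection string for a local insecure cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Defer rescheduling of failed jobs for longer between attempts.
	if _, err := db.Exec(`SET CLUSTER SETTING jobs.registry.leniency = '5m'`); err != nil {
		log.Fatal(err)
	}
}
```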
Describe alternatives you've considered
See 1, 2, and 3 from the above list.
@ajwerner @pbardea @spaskob @carloruiz @DuskEagle @chrisseto @vilterp @vladdy
Epic: CRDB-7912