jobs: perform exponential backoff to reduce impact of jobs that cause panics #44594
Labels: A-jobs; O-sre (for issues SRE opened or otherwise cares about tracking); T-sql-foundations (SQL Foundations Team, formerly SQL Schema + SQL Sessions)

Comments
joshimhoff added the O-sre and A-jobs labels on Jan 31, 2020.
We have solved 3. with #44786
sajjadrizvi pushed a commit to sajjadrizvi/cockroach that referenced this issue on Jun 1, 2021:

In the previous implementation, failed GC jobs were not retried, regardless of whether the failure was permanent or transient. As a result, a GC job's failure risked orphaned data that could never be reclaimed. This commit adds a mechanism to retry failed GC jobs whose failures are not permanent, with no limit on the number of retries. For the time being, the failure type is determined based on the failure categorization of schema-change jobs. This behavior is expected to change once an exponential backoff mechanism is implemented for failed jobs (cockroachdb#44594).

Release note: None
Fixes: cockroachdb#65000
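Schematically, the retry policy this commit describes is the loop below. This is only a sketch with hypothetical names (`runGCJob`, `errPermanent`), not the actual gcjob code:

```go
// Sketch of "retry unless the failure is permanent", with no retry limit.
package main

import (
	"errors"
	"fmt"
	"time"
)

// errPermanent marks failures that should never be retried (hypothetical).
var errPermanent = errors.New("permanent failure")

// runGCJob stands in for one attempt of a GC job; here it fails
// transiently twice before succeeding.
func runGCJob(attempt int) error {
	if attempt < 3 {
		return fmt.Errorf("transient failure on attempt %d", attempt)
	}
	return nil
}

func main() {
	for attempt := 1; ; attempt++ {
		err := runGCJob(attempt)
		if err == nil {
			fmt.Println("GC job succeeded")
			return
		}
		if errors.Is(err, errPermanent) {
			fmt.Println("giving up:", err) // permanent: do not retry
			return
		}
		fmt.Println("retrying after transient failure:", err)
		time.Sleep(10 * time.Millisecond) // constant delay; backoff comes in #44594
	}
}
```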
craig bot pushed a commit that referenced this issue on Jun 1, 2021:
65867: changefeedccl: Fix flaky tests. r=miretskiy a=miretskiy
Fix a flaky test and re-enable it to run under stress. The problem was that the transaction executed by the table feed can be restarted. If that happens, we would see the same keys again, but because we had side effects inside the transaction (marking the keys seen), we would not emit those keys, causing the test to hang. The stress race was failing because of both transaction restarts and the 10ms resolved-timestamp frequency (with so many resolved timestamps being generated, the table-feed transaction was always getting restarted). Fixes #57754. Fixes #65168. Release notes: None

65868: storage: expose pebble.IteratorStats through {MVCC,Engine}Iterator r=sumeerbhola a=sumeerbhola
These will potentially be aggregated before being exposed in trace statements, EXPLAIN ANALYZE, etc. Release note: None

65900: roachtest: fix ruby-pg test suite r=rafiss a=RichardJCai
Update the blocklist with a passing test. The "not run" test causing a failure is no longer failing; since it is not failing, it shows up under "not run". Release note: None

65910: sql/gcjob: retry failed GC jobs r=ajwerner a=sajjadrizvi
The same GC-job retry change described in the commit above. Release note: None. Fixes: #65000

65925: ccl/importccl: skip TestImportPgDumpSchemas/inject-error-ensure-cleanup r=tbg a=adityamaru
Refs: #65878. Reason: flaky test. Generated by bin/skip-test. Release justification: non-production code changes. Release note: None

65933: kv/kvserver: skip TestReplicateQueueDeadNonVoters under race r=sumeerbhola a=sumeerbhola
Refs: #65932. Reason: flaky test. Generated by bin/skip-test. Release justification: non-production code changes. Release note: None

65934: kv/kvserver: skip TestReplicateQueueSwapVotersWithNonVoters under race r=sumeerbhola a=sumeerbhola
Refs: #65932. Reason: flaky test. Generated by bin/skip-test. Release justification: non-production code changes. Release note: None

65936: jobs: fix flaky TestMetrics r=fqazi a=ajwerner
Fixes #65735. The test needed to wait for the job to be fully marked as paused. Release note: None

Co-authored-by: Yevgeniy Miretskiy, sumeerbhola, richardjcai, Sajjad Rizvi, Aditya Maru, Andrew Werner
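The 65867 fix rests on a general rule: a transaction that may be restarted must not perform side effects inside its closure, because a retry replays the same reads. A minimal sketch of the pitfall and the fix, with entirely hypothetical names (this is not the changefeed code):

```go
// Side effects (marking keys seen, emitting) are deferred until after
// the possibly-retried transaction closure, so a restart that replays
// the same reads cannot suppress emission of those keys.
package main

import "fmt"

// runTxn simulates a transaction runner that retries its closure once
// after a restart, replaying the same reads.
func runTxn(fn func() ([]string, error)) ([]string, error) {
	keys, err := fn()
	if err != nil {
		keys, err = fn() // retried attempt sees the same keys again
	}
	return keys, err
}

func main() {
	seen := map[string]bool{}
	attempts := 0
	keys, _ := runTxn(func() ([]string, error) {
		attempts++
		read := []string{"a", "b"} // keys read inside the txn
		if attempts == 1 {
			return nil, fmt.Errorf("txn restarted")
		}
		return read, nil
	})
	// The side effect happens exactly once, outside the closure.
	for _, k := range keys {
		if !seen[k] {
			seen[k] = true
			fmt.Println("emit", k)
		}
	}
}
```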
sajjadrizvi pushed a commit to sajjadrizvi/cockroach that referenced this issue on Jun 1, 2021 (and again on Jun 2):

The same GC-job retry change as above, as a backport of cockroachdb#65910.

Release note: None
Fixes: cockroachdb#65000
I am currently implementing job retries with exponential backoff. I plan to implement it in the following way:
This looks good. Let's not touch the last bullet in the first commit.
The last bullet just mentions the current behavior; we are not modifying anything in the system.
sajjadrizvi pushed a commit to sajjadrizvi/cockroach that referenced this issue on Jul 22, 2021:

In the previous implementation, failed jobs were retried at a constant interval. This commit enables jobs to be retried with exponentially increasing delays, subject to an upper bound. It also makes it possible to retry jobs that currently are not retried at all when they fail due to transient problems.

Release note: None
Fixes: cockroachdb#44594
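Roughly, the delay schedule this describes doubles from an initial value up to a cap. A minimal sketch, with illustrative constants (not cockroach's defaults):

```go
// Capped exponential backoff: the delay before retry n doubles from
// initialDelay until it reaches maxDelay.
package main

import (
	"fmt"
	"time"
)

const (
	initialDelay = 30 * time.Second // illustrative, not a real default
	maxDelay     = 10 * time.Minute // illustrative upper bound
)

// backoff returns the delay before retry number n (n >= 1).
func backoff(n int) time.Duration {
	d := initialDelay
	for i := 1; i < n; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

func main() {
	for n := 1; n <= 8; n++ {
		fmt.Printf("retry %d after %s\n", n, backoff(n))
	}
}
```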
(The same commit was pushed to sajjadrizvi/cockroach again on Jul 26, Jul 28, Aug 2, and Aug 10, 2021.)
craig bot pushed a commit that referenced this issue on Aug 14, 2021:
66889: jobs: retry jobs with exponential backoff r=ajwerner a=sajjadrizvi
This commit adds a mechanism to retry jobs with exponentially increasing delays. This is achieved through two new columns in the system.jobs table, last_run and num_runs. In addition, this commit adds cluster settings to control the exponential-backoff parameters, initial delay and max delay, via the corresponding settings `jobs.registry.retry.initial_delay` and `jobs.registry.retry.max_delay`. Finally, it adds a new partial index on the jobs table that improves the performance of the periodic queries run by the registry on each node.

Release note (general change): Retries of jobs that fail due to a retriable error or to job-coordinator failure are now delayed using exponential backoff. Before this change, jobs that failed in a retryable manner would be resumed immediately on a different coordinator. This change reduces the impact of recurrently failing jobs on the cluster. It adds two new cluster settings that control this behavior: "jobs.registry.retry.initial_delay" and "jobs.registry.retry.max_delay", which respectively control the initial delay and the maximum delay between resumptions.

Fixes #44594
Fixes #65080

68212: colexec: add optimized versions of aggregate window functions r=DrewKimball a=DrewKimball

**colexecwindow: add sliding window functionality to window framer** This commit adds a method `slidingWindowIntervals` to `windowFramer` operators that returns a set of `toAdd` intervals and a set of `toRemove` intervals, which indicate the rows that should be added to the current aggregation and those that should be removed, respectively. This will be used to implement the sliding-window optimization for aggregate window functions such as `sum`.

**colexecwindow: implement sliding window aggregator** This commit supplies a new operator, `slidingWindowAggregator`, used for any window aggregate function that implements the `slidingWindowAggregateFunc` interface. Rather than aggregating over the entire window frame for each row, the `slidingWindowAggregator` operator aggregates over the rows that are in the current window frame but were not in the previous one, and removes from the aggregation the rows that were in the previous window frame but not the current one. This allows window aggregate functions to be evaluated in linear rather than quadratic time.

**colexec: implement sliding window optimization for sum window function** This commit modifies the `sum` aggregate window function to implement `slidingWindowAggregateFunc`, which allows it to be used in a sliding-window context. This yields linear rather than quadratic scaling in the worst case, and allows the vectorized engine to meet or exceed parity with the row engine for `sum` window functions.

**colexec: implement sliding window optimization for count window function** This commit modifies the `count` aggregate operator to implement the `slidingWindowAggregateFunc` interface so that it can be used with the sliding-window optimization.

**colexec: implement sliding window optimization for average window function** This commit modifies the `average` aggregate operator to implement the `slidingWindowAggregateFunc` interface so that it can be used with the sliding-window optimization.

**colexec: optimize count_rows window function** This commit implements an optimized version of `count_rows` that calculates the size of the window frame as soon as the frame is computed. This means that most of the overhead for `count_rows` now comes from calculating the window frame, which is worst-case linear time (previously, the step to retrieve the size of the frame was quadratic, though with a small constant).

**colexec: optimize min and max window functions with default exclusion** This commit modifies the `min` and `max` aggregate window functions to implement the `slidingWindowAggregateFunc` interface, which allows them to be used in a sliding-window context. However, this is only usable when the window frame never shrinks, i.e. it always contains all rows from the previous frame. This commit also provides implementations of `min` and `max` for use when the window frame can shrink: the indices of the "next best" minimum or maximum values are stored in a priority queue that is updated for each row. The priority queue allows the `min` and `max` operators to avoid fully re-aggregating over the window frame even when the previous best value goes out of scope. Note that this implementation currently does not handle a non-default exclusion clause, in which case we must fall back to the quadratic approach. Fixes: #37039

Release note (performance improvement): The vectorized engine can now use the sliding-window approach to execute common aggregate functions as window functions. This allows aggregate window functions to be evaluated in linear rather than quadratic time. Currently, sum, count, average, min, and max are executed using this approach.

68433: sql: implemented placement restricted syntax for domiciling r=pawalt a=pawalt
This PR combines the existing restricted-placement zone-config logic with the stubbed syntax to create an end-to-end PLACEMENT RESTRICTED implementation. The cluster setting for domiciling and telemetry will be added in a later PR. Release note: None

68818: changefeedccl: mark avro format as no longer experimental r=[miretskiy,spiffyeng] a=HonoreDB
The avro format for changefeeds now supports all column types and has been in production use for several releases. We now allow format=avro rather than format=experimental_avro. The old string remains supported because job payloads can persist across upgrades and downgrades. Release note (enterprise change): the changefeed avro format is no longer marked experimental.

Co-authored-by: Sajjad Rizvi, Drew Kimball, Peyton Walters, Aaron Zinger
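To illustrate the sliding-window idea from 68212: rather than re-aggregating the whole frame for every row, add the values entering the frame and remove those leaving it. A toy Go sketch (a simplification; the actual operator works over the `toAdd`/`toRemove` intervals produced by the window framer):

```go
// Toy sliding-window sum for the frame
// ROWS BETWEEN 2 PRECEDING AND CURRENT ROW: a running sum is updated
// with the rows entering and leaving the frame, so the whole pass is
// linear rather than quadratic in the number of rows.
package main

import "fmt"

func main() {
	vals := []int{3, 1, 4, 1, 5, 9, 2, 6}
	sum, prevStart, prevEnd := 0, 0, 0
	for i := range vals {
		start := i - 2 // frame start for row i
		if start < 0 {
			start = 0
		}
		end := i + 1 // frame end (exclusive) for row i
		for j := prevEnd; j < end; j++ { // rows entering the frame
			sum += vals[j]
		}
		for j := prevStart; j < start; j++ { // rows leaving the frame
			sum -= vals[j]
		}
		prevStart, prevEnd = start, end
		fmt.Printf("row %d: frame sum = %d\n", i, sum)
	}
}
```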
exalate-issue-sync bot added the T-sql-foundations label (SQL Foundations Team, formerly SQL Schema + SQL Sessions) and removed the T-sql-schema-deprecated label (use T-sql-foundations instead) on May 10, 2023.
Is your feature request related to a problem? Please describe.
This bug leads to panics when users run the IMPORT INTO job on 19.2.2: #44252.
The impact can be very high. See this graph of the SQL prober error rate:
[graph: 50-100% error rate for 1hr!]
The nodes crash at a fast enough rate that (a) the cluster is more or less entirely unavailable to the customer for the duration of the incident and (b) it is hard for an operator to get a SQL connection that lives long enough to cancel the problematic jobs (this is why it takes around 1hr to mitigate).
How can we reduce impact / make it easier to mitigate this issue?
This bug tracks 1. only.
I'm suggesting concrete solutions to get a conversation started, but I'm more interested in tackling the very high impact than in any particular solution!
Describe the solution you'd like
If a job fails, the job system could perform an exponential backoff before retrying it. This would reduce the impact of a job that causes panics: the amount of time between panics would increase over time, which would also make it easier for an operator to cancel the job.
I don't know that the job system isn't ALREADY doing this; if so, my bad! I do see the cluster setting jobs.registry.leniency, whose description reads "the amount of time to defer any attempts to reschedule a job". That doesn't sound like an exponential backoff. On the CC side, we should set this cluster setting so as to reduce the impact of jobs that cause panics, IMHO.
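For illustration, bumping that setting from a SQL client could look like the sketch below; the connection string and the `5m` value are placeholders, not recommendations:

```go
// Illustrative only: raise jobs.registry.leniency via a SQL connection.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver commonly used with CockroachDB
)

func main() {
	// Placeholder connection string for a local insecure cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Defer rescheduling of failed jobs for longer between attempts.
	if _, err := db.Exec(`SET CLUSTER SETTING jobs.registry.leniency = '5m'`); err != nil {
		log.Fatal(err)
	}
}
```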
Describe alternatives you've considered
See 1, 2, and 3 from the above list.
@ajwerner @pbardea @spaskob @carloruiz @DuskEagle @chrisseto @vilterp @vladdy
Epic: CRDB-7912