jobs: perform exponential backoff to reduce impact of jobs that cause panics #44594

Closed
joshimhoff opened this issue Jan 31, 2020 · 5 comments · Fixed by #66889
Labels: A-jobs, O-sre (For issues SRE opened or otherwise cares about tracking), T-sql-foundations (SQL Foundations Team, formerly SQL Schema + SQL Sessions)

Comments

@joshimhoff
Collaborator

joshimhoff commented Jan 31, 2020

Is your feature request related to a problem? Please describe.
This bug leads to panics when users run the IMPORT INTO job on 19.2.2: #44252.

The impact can be very high. See this graph of the SQL prober error rate:

[Image: graph of the SQL prober error rate]

50-100% error rate for 1hr!

The nodes crash at a fast enough rate that (a) the cluster is more or less entirely unavailable to the customer for the duration of the incident and (b) it is hard for an operator to get a SQL connection that lives long enough to cancel the problematic jobs (this is why it takes around 1hr to mitigate).

How can we reduce impact / make it easier to mitigate this issue?

  1. If a job fails, the job system could do an exponential backoff.
  2. If a job fails repeatedly and the job system detects that the failures are caused by dying CRDB nodes, the job system could mark the job as a "job of death" and not retry it.
  3. If an SRE passes a command line flag to CRDB, the job system could not pick up any jobs.

This bug tracks 1 only.

I'm suggesting concrete solutions to get a conversation started, but I am more interested in reducing the very high impact than in any particular fix!

Describe the solution you'd like
If a job fails, the job system could do an exponential backoff. This would reduce the impact of a job that causes panics. The amount of time between panics would increase over time. This would also make it easier for an operator to cancel the job.

I don't know that the job system is not ALREADY doing this. If so, my bad! I do see the cluster setting jobs.registry.leniency. The description for this cluster setting reads "the amount of time to defer any attempts to reschedule a job". Doesn't sound like an exponential backoff.

On the CC side, we should set this cluster setting so as to reduce impact of jobs that cause panics, IMHO.

Describe alternatives you've considered
See 1, 2, and 3 from the above list.

@ajwerner @pbardea @spaskob @carloruiz @DuskEagle @chrisseto @vilterp @vladdy

Epic: CRDB-7912

@joshimhoff joshimhoff added O-sre For issues SRE opened or otherwise cares about tracking. A-jobs labels Jan 31, 2020
@ajwerner
Contributor

We have solved 3. with #44786

sajjadrizvi pushed a commit to sajjadrizvi/cockroach that referenced this issue Jun 1, 2021
In the previous implementation, failed GC jobs were not retried, regardless of
whether the failure was permanent or transient. As a result, a GC job's failure
risked orphaned data, which cannot be reclaimed.

This commit adds a mechanism to retry failed GC jobs whose failure is not
permanent. No limit is set on the number of retries. For the time being, the
failure type is determined based on the failure categorization of schema-change
jobs. This behavior is expected to change once an exponential backoff mechanism
is implemented for failed jobs (cockroachdb#44594).

Release note: None

Fixes: cockroachdb#65000
craig bot pushed a commit that referenced this issue Jun 1, 2021
65867: changefeedccl: Fix flaky tests. r=miretskiy a=miretskiy

Fix flaky test and re-enable it to run under stress.
The problem was that the transaction executed by the table feed can
be restarted.  If that happens, then we would see the same keys again,
but because we had side effects inside transaction (marking the keys
seen), we would not emit those keys causing the test to be hung.
The stress race was failing because of both transaction restarts and
the 10ms resolved timestamp frequency (with so many resolved timestamps
being generated, the table feed transaction was always getting
restarted).

Fixes #57754
Fixes #65168

Release Notes: None

65868: storage: expose pebble.IteratorStats through {MVCC,Engine}Iterator r=sumeerbhola a=sumeerbhola

These will potentially be aggregated before exposing in trace
statements, EXPLAIN ANALYZE etc.

Release note: None

65900: roachtest: fix ruby-pg test suite r=rafiss a=RichardJCai

Update blocklist with passing test.
The failure was caused by a test that is no longer failing: because it
passes, it shows up under "not run".

Release note: None

65910: sql/gcjob: retry failed GC jobs r=ajwerner a=sajjadrizvi

In the previous implementation, failed GC jobs were not retried, regardless of
whether the failure was permanent or transient. As a result, a GC job's failure
risked orphaned data, which cannot be reclaimed.

This commit adds a mechanism to retry failed GC jobs whose failure is not
permanent. No limit is set on the number of retries. For the time being, the
failure type is determined based on the failure categorization of schema-change
jobs. This behavior is expected to change once an exponential backoff mechanism
is implemented for failed jobs (#44594).

Release note: None

Fixes: #65000

Release note (<category, see below>): <what> <show> <why>

65925: ccl/importccl: skip TestImportPgDumpSchemas/inject-error-ensure-cleanup r=tbg a=adityamaru

Refs: #65878

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None

65933: kv/kvserver: skip TestReplicateQueueDeadNonVoters under race r=sumeerbhola a=sumeerbhola

Refs: #65932

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None

65934: kv/kvserver: skip TestReplicateQueueSwapVotersWithNonVoters under race r=sumeerbhola a=sumeerbhola

Refs: #65932

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None

65936: jobs: fix flakey TestMetrics r=fqazi a=ajwerner

Fixes #65735

The test needed to wait for the job to be fully marked as paused.

Release note: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: richardjcai <[email protected]>
Co-authored-by: Sajjad Rizvi <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
sajjadrizvi pushed a commit to sajjadrizvi/cockroach that referenced this issue Jun 1, 2021
In the previous implementation, failed GC jobs were not retried,
regardless of whether the failure was permanent or transient. As a
result, a GC job's failure risked orphaned data, which cannot be
reclaimed.

This patch adds a mechanism to retry failed GC jobs whose failure is
not permanent. No limit is set on the number of retries. For the time
being, the failure type is determined based on the failure
categorization of schema-change jobs. This behavior is expected to
change once an exponential backoff mechanism is implemented for failed
jobs (cockroachdb#44594).

This is a backport of cockroachdb#65910.

Release note: None

Fixes: cockroachdb#65000
@jlinder jlinder added the T-sql-schema-deprecated Use T-sql-foundations instead label Jun 16, 2021
@sajjadrizvi

I am currently implementing job retries with exponential backoff. I plan to implement it in the following way:

  • Add two new columns in system.jobs: (1) num_runs that counts the number of times the job has run, and (2) last_run timestamp.
  • We add a new index in the jobs table on claim_session_id, created, and status columns, storing last_run, num_runs, and claim_instance_id. The index optimizes claiming the jobs that have their next execution time before the current time.
  • When the registry of a node runs its next adoption loop, it claims only those jobs whose calculated next execution time is before the current time. The next execution time can be calculated as ((last_run::int + base*(2^num_runs))::timestamp). Base is a configurable number in seconds, e.g., 10 seconds.
  • When a claimed job is processed, we increment num_runs and update last_run to now().
  • When a job fails, the job moves to its reverting state if it is not retriable. Otherwise, the job remains in the running state and it is retried in the next job-adoption phase.

@ajwerner
Contributor

ajwerner commented Jun 23, 2021

This looks good. Let's not touch the last bullet in the first commit.

@sajjadrizvi

The last bullet is just mentioning the current behavior. We are not modifying anything in the system.

sajjadrizvi pushed a commit to sajjadrizvi/cockroach that referenced this issue Jul 22, 2021
In the previous implementation, failed jobs were retried at a constant
interval. This commit retries jobs with exponentially increasing delays, up
to an upper bound. It also makes it possible to retry jobs that are not
currently retried when they fail due to transient problems.

Release note: None

Fixes: cockroachdb#44594
craig bot pushed a commit that referenced this issue Aug 14, 2021
66889: jobs: retry jobs with exponential backoff r=ajwerner a=sajjadrizvi

This commit adds a mechanism to retry jobs with exponentially increasing
delays. This is achieved through two new columns in system.jobs table,
last_run and num_runs. In addition, this commit adds cluster settings
to control exponential-backoff parameters, initial delay and max delay,
with corresponding settings `jobs.registry.retry.initial_delay` and
`jobs.registry.retry.max_delay`. Finally, this commit adds a new
partial-index in the jobs table that improves the performance of periodic 
queries run by registry in each node.

Release note (general change): Retries of jobs that fail due to a retriable
error or a job coordinator failure are now delayed using exponential
backoff. Before this change, jobs that failed in a retriable manner were
resumed immediately on a different coordinator. This change reduces the
impact of recurrently failing jobs on the cluster. It adds two new cluster
settings that control this behavior, "jobs.registry.retry.initial_delay"
and "jobs.registry.retry.max_delay", which respectively control the initial
delay and the maximum delay between resumptions.

Fixes #44594
Fixes #65080

68212: colexec: add optimized versions of aggregate window functions r=DrewKimball a=DrewKimball

**colexecwindow: add sliding window functionality to window framer**

This commit adds a method `slidingWindowIntervals` to `windowFramer`
operators that returns a set of `toAdd` intervals and a set of
`toRemove` intervals, which indicate the rows that should be added
to the current aggregation and those that should be removed, respectively.
This will be used to implement the sliding window optimization for
aggregate window functions such as `sum`.

**colexecwindow: implement sliding window aggregator**

This commit supplies a new operator, `slidingWindowAggregator`, which
is used for any window aggregate functions that implement the
`slidingWindowAggregateFunc` interface. Rather than aggregating over
the entire window frame for each row, the `slidingWindowAggregator`
operator aggregates over the rows that are in the current window
frame but were not in the previous, and removes from the aggregation
the rows that were in the previous window frame but not the current.
This allows window aggregate functions to be evaluated in linear rather
than quadratic time.
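
The idea can be illustrated outside the vectorized engine with a small Go sketch. It assumes a frame of the form "`preceding` rows before the current row through the current row"; the function name `slidingSums` and this frame shape are choices made for the example and are not the `slidingWindowAggregator` code itself.

```go
package main

import "fmt"

// slidingSums returns, for each row i, the sum of vals over the frame
// [max(0, i-preceding), i]. Each row is added to the running sum once and
// removed at most once, so the total work is linear in len(vals).
func slidingSums(vals []int64, preceding int) []int64 {
	out := make([]int64, len(vals))
	var sum int64
	start := 0 // first row still included in the running sum
	for i, v := range vals {
		// Row i is in the current frame but was not in the previous one.
		sum += v
		newStart := i - preceding
		if newStart < 0 {
			newStart = 0
		}
		// Rows before newStart were in the previous frame but not this one.
		for ; start < newStart; start++ {
			sum -= vals[start]
		}
		out[i] = sum
	}
	return out
}

func main() {
	fmt.Println(slidingSums([]int64{1, 2, 3, 4, 5}, 2))
}
```

A naive implementation would re-sum up to `preceding+1` rows for every output row, which is where the quadratic worst case comes from.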

**colexec: implement sliding window optimization for sum window function**

This commit modifies the `sum` aggregate window function to implement
the `slidingWindowAggregateFunc`, which allows it to be used in a
sliding window context. This yields linear rather than quadratic scaling
in the worst case, and allows the vectorized engine to meet or exceed
parity with the row engine for `sum` window functions.

**colexec: implement sliding window optimization for count window function**

This commit modifies the count aggregate operator to implement the
`slidingWindowAggregateFunc` interface so that it can be used with
the sliding window optimization.

**colexec: implement sliding window optimization for average window function**

This commit modifies the `average` aggregate operator to implement the
`slidingWindowAggregateFunc` interface so that it can be used with the
sliding window optimization.

**colexec: optimize count_rows window function**

This commit implements an optimized version of `count_rows` that
calculates the size of the window frame as soon as the window frame
is calculated. This means that most of the overhead for `count_rows`
now comes from calculating the window frame, which is worst-case
linear time (previously, the step to retrieve the size of the frame
was quadratic, though with a small constant).

**colexec: optimize min and max window functions with default exclusion**

This commit modifies the 'min' and 'max' aggregate window functions
to implement the `slidingWindowAggregateFunc` interface, which allows
them to be used in a sliding window context. However, this is only
usable when the window frame never shrinks, i.e. it always contains
all rows from the previous frame.

This commit also provides implementations of `min` and `max` for use
when the window frame can shrink. The indices of the 'next best'
minimum or maximum values are stored in a priority queue that is
updated for each row. Using the priority queue allows the `min` and
`max` operators to avoid fully aggregating over the window frame
even when the previous best value goes out of scope. Note that this
implementation currently does not handle the case of non-default
exclusion clause, in which case we must fall back to the quadratic
approach.
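
A simplified Go sketch of the priority-queue idea for `min` over a shrinking frame is shown below. It assumes the same "`preceding` rows through the current row" frame shape and uses lazy deletion of out-of-frame indices; the names `idxHeap` and `slidingMin` are hypothetical, and the real operator may maintain its queue differently.

```go
package main

import (
	"container/heap"
	"fmt"
)

// idxHeap is a min-heap of row indices ordered by the value at each index.
type idxHeap struct {
	idx  []int
	vals []int64
}

func (h *idxHeap) Len() int           { return len(h.idx) }
func (h *idxHeap) Less(a, b int) bool { return h.vals[h.idx[a]] < h.vals[h.idx[b]] }
func (h *idxHeap) Swap(a, b int)      { h.idx[a], h.idx[b] = h.idx[b], h.idx[a] }
func (h *idxHeap) Push(x interface{}) { h.idx = append(h.idx, x.(int)) }
func (h *idxHeap) Pop() interface{} {
	n := len(h.idx)
	x := h.idx[n-1]
	h.idx = h.idx[:n-1]
	return x
}

// slidingMin returns, for each row i, the minimum of vals over the frame
// [max(0, i-preceding), i]. It assumes preceding >= 0.
func slidingMin(vals []int64, preceding int) []int64 {
	out := make([]int64, len(vals))
	h := &idxHeap{vals: vals}
	for i := range vals {
		heap.Push(h, i)
		start := i - preceding
		// Lazily discard candidates that have left the frame. Row i itself
		// is always in the frame, so the heap never becomes empty here.
		for h.idx[0] < start {
			heap.Pop(h)
		}
		out[i] = vals[h.idx[0]]
	}
	return out
}

func main() {
	fmt.Println(slidingMin([]int64{1, 5, 4, 3, 2}, 1))
}
```

When the current minimum leaves the frame, the next-best candidate is already at the top of the heap, so the frame does not have to be rescanned; the cost is O(n log n) overall instead of quadratic.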

Fixes: #37039

Release note (performance improvement): The vectorized engine can now
use the sliding-window approach to execute common aggregate functions 
as window functions. This allows aggregate window functions to be evaluated
in linear rather than quadratic time. Currently, sum, count, average, min, and 
max are executed using this approach.

68433: sql: implemented placement restricted syntax for domiciling r=pawalt a=pawalt

This PR combines the existing restricted placement zone config logic
with the stubbed syntax to create an end-to-end PLACEMENT RESTRICTED
implementation.

Release note: None

Note that the cluster setting for domiciling and telemetry will be added in a later PR.

68818: changefeedccl: mark avro format as no longer experimental r=[miretskiy,spiffyeng] a=HonoreDB

The avro format for changefeeds now supports all column types
and has been in production use for several releases.
We'll now allow format=avro rather than format=experimental_avro
The old string will remain supported because job payloads can
persist across upgrades and downgrades.

Release note (enterprise change): changefeed avro format no longer marked experimental

Co-authored-by: Sajjad Rizvi <[email protected]>
Co-authored-by: Drew Kimball <[email protected]>
Co-authored-by: Peyton Walters <[email protected]>
Co-authored-by: Aaron Zinger <[email protected]>
@craig craig bot closed this as completed in be82c0e Aug 14, 2021
@exalate-issue-sync exalate-issue-sync bot added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-schema-deprecated Use T-sql-foundations instead labels May 10, 2023