jobs: allow jobs to pause themselves on "failure" instead of terminal failure #36887

danhhz · 2019-04-16T20:20:12Z

A failed job is terminal and a new job has to be created, but a paused job can be resumed. In many cases, something transient fails the job and it's an unnecessary hassle to start from the beginning. During cluster instability or chaos, for example, all sorts of errors can be returned, but once the cluster recovers, it'd be nice if an operator could resume the backup or restore or changefeed.

There are a few UX issues to work out before doing this. First is how the error will be surfaced to the user. Second is whether a paused job should eventually time out and fail. The latter is important to IMPORT/RESTORE, for example, which clean up the partial data when they fail.

pbardea · 2021-02-23T18:59:43Z

Also adding this to the bulk io backlog. Something like this would be even more useful on the stream ingestion job, since we may have already ingested days' worth of data and don't want to wipe it all for what could be a transient failure.

ajwerner · 2021-02-24T00:25:32Z

This seems extremely similar to, though slightly different from, #59542

Previously RESTORE jobs would automatically revert on failure. It may be advantageous from a debugging perspective to allow the job to pause at the exact moment of failure to identify the source of the errors. This patch adds a WITH option DEBUG_PAUSE_ON which allows the job to pause itself when it encounters the events described in this option. Currently the only value it can take is 'error', which allows jobs to self pause on errors. The job can then be RESUMEd after the error has been fixed, or CANCELed if the desired behavior is to rollback the job.

Resolves cockroachdb#36887

Release note (enterprise change): Added new DEBUG_PAUSE_ON option to RESTORE jobs to allow for self pause on errors.

Release Justification: low-risk as it is opt-in debugging tool off by default.

If a job has exhausted its retry quota for transient errors we pause the job instead of failing it. This way it is up to the user to decide if they want to resume/cancel the job. This will prevent large amounts of work from being thrown away because of transient errors that just need > retry limit to go away.

Informs: cockroachdb#36887

Release note (sql change): BACKUP, IMPORT and RESTORE jobs will be paused instead of entering a failed state if they continue to encounter transient errors once they have retried a maximum number of times. The user is responsible for cancelling or resuming the job from this state.

80403: backupccl,sql/importer: pause jobs on exhausting retries r=dt a=adityamaru

If a job has exhausted its retry quota for transient errors we pause the job instead of failing it. This way it is up to the user to decide if they want to resume/cancel the job. This will prevent large amounts of work from being thrown away because of transient errors that just need > retry limit to go away.

Informs: #36887

Release note (sql change): BACKUP, IMPORT and RESTORE jobs will be paused instead of entering a failed state if they continue to encounter transient errors once they have retried a maximum number of times. The user is responsible for cancelling or resuming the job from this state.

Co-authored-by: Aditya Maru


dt · 2023-08-21T14:20:45Z

Any job resumer can just emit a pause-request error to pause itself, and the cluster setting for debug pause points can be used to set an "after_exec_error" pause point that applies to any job. Closing this as done.

danhhz added A-disaster-recovery A-cdc Change Data Capture labels Apr 16, 2019

danhhz mentioned this issue Apr 17, 2019

changefeedccl: make a rangefeeds are not enabled error pause the job #34691

Closed

awoods187 added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Apr 22, 2019

dt changed the title ~~backupccl/restoreccl/changefeedccl: many job "failures" should pause instead of fail~~ jobs: allow jobs to pause themselves on "failure" instead of terminal failure Mar 6, 2020

pbardea mentioned this issue Oct 22, 2020

bulkio: Add onfailure option to restore #55856

Closed

elinorgarcia added the T-cdc label Dec 7, 2020

mwang1026 added the T-disaster-recovery label May 19, 2021

jlinder added the T-sql-schema-deprecated Use T-sql-foundations instead label Jun 16, 2021

miretskiy mentioned this issue Jul 27, 2021

Resume CHANGEFEED from last committed cursor timestamp #65573

Closed

rhu713 self-assigned this Aug 26, 2021

rhu713 mentioned this issue Aug 26, 2021

backupccl: allow restore jobs to self pause on errors #69422

Merged

shermanCRL added the A-jobs label Apr 14, 2022

blathers-crl bot added the T-jobs label Apr 14, 2022

shermanCRL assigned adityamaru Apr 19, 2022

shermanCRL removed A-cdc Change Data Capture T-cdc labels Apr 20, 2022

adityamaru mentioned this issue Apr 22, 2022

backupccl,sql/importer: pause jobs on exhausting retries #80403

Merged

blathers-crl bot mentioned this issue Apr 23, 2022

release-22.1: backupccl,sql/importer: pause jobs on exhausting retries #80434

Merged

jlinder added sync-me-3 and removed sync-me-3 labels May 24, 2022

healthy-pod added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-schema-deprecated Use T-sql-foundations instead labels May 17, 2023

rafiss removed the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Jun 5, 2023

adityamaru removed their assignment Jul 10, 2023

dt closed this as completed Aug 21, 2023

exalate-issue-sync bot removed the T-disaster-recovery label Aug 21, 2023

github-project-automation bot added this to Disaster Recovery Backlog Aug 28, 2024

github-project-automation bot moved this to Done in Disaster Recovery Backlog Aug 28, 2024
