jobs: allow jobs to pause themselves on "failure" instead of terminal failure #36887

Closed
danhhz opened this issue Apr 16, 2019 · 4 comments · Fixed by #69422
Assignees
Labels
A-disaster-recovery A-jobs C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-jobs

Comments

Contributor

danhhz commented Apr 16, 2019

A failed job is terminal and a new job has to be created, but a paused job can be resumed. In many cases, something transient fails the job and it's an unnecessary hassle to start from the beginning. During cluster instability or chaos, for example, all sorts of errors can be returned, but once the cluster recovers, it'd be nice if an operator could resume the backup or restore or changefeed.

There are a few UX issues to work out before doing this. First is how the error will be surfaced to the user. Second is whether a paused job should eventually time out and fail. The latter matters for IMPORT/RESTORE, for example, which clean up their partial data when they fail.

@danhhz danhhz added A-disaster-recovery A-cdc Change Data Capture labels Apr 16, 2019
@awoods187 awoods187 added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Apr 22, 2019
@dt dt changed the title backupccl/restoreccl/changefeedccl: many job "failures" should pause instead of fail jobs: allow jobs to pause themselves on "failure" instead of terminal failure Mar 6, 2020
Contributor

pbardea commented Feb 23, 2021

Also adding this to the bulk IO backlog. Something like this would be even more useful on the stream ingestion job, since we may have already ingested days' worth of data and don't want to wipe it all for what could be a transient failure.

Contributor

ajwerner commented Feb 24, 2021

This seems extremely similar to, though slightly different from, #59542.

@jlinder jlinder added the T-sql-schema-deprecated Use T-sql-foundations instead label Jun 16, 2021
@rhu713 rhu713 self-assigned this Aug 26, 2021
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Aug 30, 2021
Previously RESTORE jobs would automatically revert on failure. It may be
advantageous from a debugging perspective to allow the job to pause at the exact
moment of failure to identify the source of the errors. This patch adds a WITH
option DEBUG_PAUSE_ON which allows the job to pause itself when it encounters
the events described in this option. Currently the only value it can take is
'error', which allows jobs to self pause on errors. The job can then be RESUMEd
after the error has been fixed, or CANCELed if the desired behavior is to
roll back the job.

Resolves cockroachdb#36887

Release note (enterprise change): Added new DEBUG_PAUSE_ON option to RESTORE
jobs to allow for self pause on errors.

Release Justification: low-risk as it is opt-in debugging tool off by default.
@blathers-crl blathers-crl bot added the T-jobs label Apr 14, 2022
@shermanCRL shermanCRL removed A-cdc Change Data Capture T-cdc labels Apr 20, 2022
adityamaru added a commit to adityamaru/cockroach that referenced this issue Apr 22, 2022
If a job has exhausted its retry quota for transient errors,
we pause the job instead of failing it. This way it is up to
the user to decide whether to resume or cancel the job. This
prevents large amounts of work from being thrown away because
of transient errors that need more than the retry limit to clear.

Informs: cockroachdb#36887

Release note (sql change): BACKUP, IMPORT and RESTORE jobs
will be paused instead of entering a failed state if they continue
to encounter transient errors once they have retried a maximum number
of times. The user is responsible for cancelling or resuming the job
from this state.
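
The behavior described in this commit message can be sketched in Go. This is a minimal illustration, not the code from the referenced commit: `JobState`, `errTransient`, and `runWithRetryQuota` are hypothetical names invented for the example, and the sketch assumes transient errors are identified by wrapping a sentinel error and that `maxRetries` is non-negative.

```go
package main

import (
	"errors"
	"fmt"
)

// JobState is a hypothetical, simplified job status used only in this sketch.
type JobState string

const (
	StateSucceeded JobState = "succeeded"
	StatePaused    JobState = "paused"
	StateFailed    JobState = "failed"
)

// errTransient is a hypothetical sentinel; errors wrapping it are retried.
var errTransient = errors.New("transient error")

// runWithRetryQuota calls step up to maxRetries+1 times.
// A nil error ends the job successfully, a non-transient error fails it
// terminally, and running out of retries on transient errors pauses it.
func runWithRetryQuota(step func(attempt int) error, maxRetries int) (JobState, error) {
	var lastErr error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		lastErr = step(attempt)
		if lastErr == nil {
			return StateSucceeded, nil
		}
		if !errors.Is(lastErr, errTransient) {
			// Permanent errors still fail the job.
			return StateFailed, lastErr
		}
	}
	// Quota exhausted: pause so the operator can resume or cancel,
	// rather than discarding the work done so far.
	return StatePaused, fmt.Errorf("retry quota exhausted after %d attempts: %w", maxRetries+1, lastErr)
}
```

Calling `runWithRetryQuota` with a step that keeps returning an error wrapping `errTransient` yields `StatePaused` together with the last error, which a caller could surface to the operator as the reason for the pause.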
craig bot pushed a commit that referenced this issue Apr 23, 2022
80403: backupccl,sql/importer: pause jobs on exhausting retries r=dt a=adityamaru

Co-authored-by: Aditya Maru <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Apr 23, 2022
@healthy-pod healthy-pod added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-schema-deprecated Use T-sql-foundations instead labels May 17, 2023
@rafiss rafiss removed the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Jun 5, 2023
@adityamaru adityamaru removed their assignment Jul 10, 2023
Member

dt commented Aug 21, 2023

Any job resumer can just emit a pause-request error to pause itself, and the cluster setting for debug pause points can be used to set an "after_exec_error" pause point that applies to any job. Closing this as done.
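
As a rough illustration of the two mechanisms mentioned above, the following Go sketch shows a resumer wrapping an error in a pause-request marker, and a registry-like function that also pauses on any error when an `after_exec_error` pause point is set. Everything except the pause point name is an assumption: `pauseRequestError`, `markPauseRequest`, `pausePoints`, `resume`, `nextState`, and `errUnavailable` are hypothetical stand-ins, not the actual jobs package API.

```go
package main

import (
	"errors"
	"fmt"
)

// pauseRequestError is a hypothetical marker type: an error wrapped in it
// asks for the job to be paused rather than failed.
type pauseRequestError struct{ cause error }

func (e *pauseRequestError) Error() string { return "pause requested: " + e.cause.Error() }
func (e *pauseRequestError) Unwrap() error { return e.cause }

// markPauseRequest wraps err in the pause-request marker.
func markPauseRequest(err error) error { return &pauseRequestError{cause: err} }

// pausePoints stands in for a cluster-wide debug setting that lists
// named pause points (hypothetical representation).
var pausePoints = map[string]bool{}

// errUnavailable is an example error used to exercise the sketch.
var errUnavailable = errors.New("node unavailable")

// resume is a toy resumer: if ingestion fails, it requests a pause
// instead of returning a plain error.
func resume(ingest func() error) error {
	if err := ingest(); err != nil {
		return markPauseRequest(fmt.Errorf("ingesting: %w", err))
	}
	return nil
}

// nextState decides what happens to a job after its resumer returns.
func nextState(resumeErr error) string {
	if resumeErr == nil {
		return "succeeded"
	}
	var p *pauseRequestError
	if errors.As(resumeErr, &p) {
		// The resumer explicitly asked to be paused.
		return "paused"
	}
	if pausePoints["after_exec_error"] {
		// The debug pause point turns any execution error into a pause.
		return "paused"
	}
	return "failed"
}
```

With no pause point set, `nextState(errUnavailable)` returns `"failed"`, while `nextState(resume(...))` for a failing ingest returns `"paused"`. Setting `pausePoints["after_exec_error"] = true` makes the plain error pause the job as well.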
