jobs: allow jobs to pause themselves on "failure" instead of terminal failure #36887
Comments
Also adding this to the bulk IO backlog. Something like this would be even more useful on the stream ingestion job, since we may have already ingested days' worth of data and don't want to wipe it all for what could be a transient failure.
This seems extremely similar to, though slightly different from, #59542.
Previously, RESTORE jobs would automatically revert on failure. From a debugging perspective, it can be useful to let the job pause at the exact moment of failure so the source of the error can be identified. This patch adds a WITH option, DEBUG_PAUSE_ON, which lets the job pause itself when it encounters the events named in the option. Currently the only accepted value is 'error', which makes the job pause itself on errors. The job can then be RESUMEd after the error has been fixed, or CANCELed if the desired behavior is to roll back the job.

Resolves cockroachdb#36887

Release note (enterprise change): Added new DEBUG_PAUSE_ON option to RESTORE jobs to allow for self-pause on errors.

Release Justification: low-risk, as it is an opt-in debugging tool that is off by default.
If a job has exhausted its retry quota for transient errors, we pause the job instead of failing it. It is then up to the user to decide whether to resume or cancel the job. This prevents large amounts of work from being thrown away because of transient errors that need more than the retry limit to clear.

Informs: cockroachdb#36887

Release note (sql change): BACKUP, IMPORT and RESTORE jobs will be paused instead of entering a failed state if they continue to encounter transient errors once they have retried a maximum number of times. The user is responsible for cancelling or resuming the job from this state.
80403: backupccl,sql/importer: pause jobs on exhausting retries r=dt a=adityamaru

If a job has exhausted its retry quota for transient errors, we pause the job instead of failing it. It is then up to the user to decide whether to resume or cancel the job. This prevents large amounts of work from being thrown away because of transient errors that need more than the retry limit to clear.

Informs: #36887

Release note (sql change): BACKUP, IMPORT and RESTORE jobs will be paused instead of entering a failed state if they continue to encounter transient errors once they have retried a maximum number of times. The user is responsible for cancelling or resuming the job from this state.

Co-authored-by: Aditya Maru
Any job resumer can just emit a pause-request error to pause itself, and the cluster setting for debug pause points can be used to set an "after_exec_error" pause point that applies to any job. Closing this as done.
A failed job is terminal and a new job has to be created, but a paused job can be resumed. In many cases, something transient fails the job and it's an unnecessary hassle to start from the beginning. During cluster instability or chaos, for example, all sorts of errors can be returned, but once the cluster recovers, it'd be nice if an operator could resume the backup, restore, or changefeed.
There are a few UX issues to work out before doing this. The first is how the error will be surfaced to the user. The second is whether a paused job should eventually time out and fail. The latter matters for IMPORT/RESTORE, for example, which clean up their partial data when they fail.