
Jobs system could detect jobs that are likely causing panics ("jobs of death") and not run them anymore #44596

Closed
joshimhoff opened this issue Jan 31, 2020 · 2 comments
Labels
A-jobs O-sre For issues SRE opened or otherwise cares about tracking. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments


joshimhoff commented Jan 31, 2020

Is your feature request related to a problem? Please describe.
This bug leads to panics when users run the IMPORT INTO job on 19.2.2: #44252.

The impact can be very high. See this graph of the SQL prober error rate:

[Graph: SQL prober error rate during the incident]

50-100% error rate for 1hr!

The nodes crash quickly enough that (a) the cluster is more or less entirely unavailable to the customer for the duration of the incident, and (b) it is hard for an operator to hold a SQL connection open long enough to cancel the problematic jobs (which is why mitigation took around an hour).

How can we reduce impact / make it easier to mitigate this issue?

  1. If a job fails, the job system could do an exponential backoff.
  2. If a job fails repeatedly and the job system detects that the failures are caused by dying CRDB nodes, the job system could mark the job as a "job of death" and not retry it.
  3. If an operator passes a command line flag to CRDB, the job system would not pick up any jobs.

This issue tracks option 2 only.

I'm suggesting concrete solutions to get a conversation started, but I am more interested in reducing the very high impact of this problem than in any particular solution.

Describe the solution you'd like
If a job fails repeatedly and the job system detects that the failures are caused by dying CRDB nodes, the job system could mark the job as a "job of death" and not retry it. I recognize this detection is complicated to get right (how to know the job is CAUSING the crash), but still I wonder about this idea.
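For illustration, option 2 could be sketched roughly like this. This is a minimal standalone example, not CockroachDB's actual jobs API; the type, function names, and threshold are all assumptions. The hard part the issue calls out (proving the job *caused* the crash) is reduced here to a simple heuristic: the last N failures all coincided with the death of the executing node.

```go
package main

import "fmt"

// jobAttempt records one failed execution of a job. nodeCrashed is a
// hypothetical signal meaning the node running the job died during the
// attempt (e.g. the attempt ended without a clean error being recorded).
type jobAttempt struct {
	nodeCrashed bool
}

// deathThreshold is an illustrative number of consecutive crash-correlated
// failures before a job is quarantined as a "job of death".
const deathThreshold = 3

// isJobOfDeath reports whether the most recent attempts look like a crash
// loop caused by the job itself, in which case the jobs system would mark
// the job and stop retrying it.
func isJobOfDeath(history []jobAttempt) bool {
	if len(history) < deathThreshold {
		return false
	}
	// Inspect only the last deathThreshold attempts.
	for _, a := range history[len(history)-deathThreshold:] {
		if !a.nodeCrashed {
			// At least one recent failure was unrelated to a node death,
			// so keep retrying normally.
			return false
		}
	}
	return true
}

func main() {
	crashLoop := []jobAttempt{{true}, {true}, {true}}
	mixed := []jobAttempt{{true}, {false}, {true}}
	fmt.Println(isJobOfDeath(crashLoop)) // quarantine, don't retry
	fmt.Println(isJobOfDeath(mixed))     // keep retrying (with backoff)
}
```

A real implementation would have to be much more careful: the "node crashed" signal is only circumstantial, so a false positive would permanently park a job that merely had bad luck with node restarts.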

Describe alternatives you've considered
See 1, 2, and 3 from the above list.

@ajwerner @pbardea @spaskob @carloruiz @DuskEagle @chrisseto @vilterp @vladdy

Jira issue: CRDB-5217

@joshimhoff joshimhoff changed the title Jobs system could detect jobs that are likely causing panics and not run them anymore Jobs system could detect jobs that are likely causing panics ("jobs of death") and not run them anymore Jan 31, 2020
@joshimhoff joshimhoff added O-sre For issues SRE opened or otherwise cares about tracking. A-jobs labels Jan 31, 2020
@joshimhoff

Probably best to start with #52815 & #51643.

@jlinder jlinder added the T-sql-schema-deprecated Use T-sql-foundations instead label Jun 16, 2021

postamar commented Mar 8, 2022

This issue has effectively been addressed by the addition of exponential backoff for job retries.

@postamar postamar closed this as completed Mar 8, 2022
@exalate-issue-sync exalate-issue-sync bot added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-schema-deprecated Use T-sql-foundations instead labels May 10, 2023