Jobs system could detect jobs that are likely causing panics ("jobs of death") and not run them anymore #44596
Labels
A-jobs
O-sre
For issues SRE opened or otherwise cares about tracking.
T-sql-foundations
SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Is your feature request related to a problem? Please describe.
This bug leads to panics when users run the IMPORT INTO job on 19.2.2: #44252.
The impact can be very high. See this graph of the SQL prober error rate:
50-100% error rate for 1hr!
The nodes crash at a fast enough rate that (a) the cluster is more or less entirely unavailable to the customer for the duration of the incident and (b) it is hard for an operator to get a SQL connection that lives long enough to cancel the problematic jobs (this is why it takes around 1hr to mitigate).
How can we reduce impact / make it easier to mitigate this issue?
This bug tracks 2 only.
I'm suggesting concrete solutions to get a conversation started, but I am more interested in reducing the very high impact of this kind of failure than in any particular solution!
Describe the solution you'd like
If a job fails repeatedly and the job system detects that the failures are caused by dying CRDB nodes, the job system could mark the job as a "job of death" and not retry it. I recognize this detection is complicated to get right (how to know the job is CAUSING the crash), but still I wonder about this idea.
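To make the idea concrete, here is a minimal sketch in Go of the bookkeeping such a detector might need. Everything in it is hypothetical: `crashTracker`, `recordNodeCrash`, `shouldRun`, and the threshold are names made up for illustration, not part of the CockroachDB jobs system. It also sidesteps the hard part (knowing the job actually caused the crash) by using a simple heuristic: count how many times a node died while running a given job, and stop retrying once that count reaches a limit. A real implementation would presumably have to persist these counts (for example alongside the job record) before the job resumes, since in-memory state is lost when the node dies.

```go
package main

import "fmt"

// crashTracker is a hypothetical helper, not a CockroachDB API. It counts,
// per job, how many times a node died while that job was running on it.
type crashTracker struct {
	maxCrashes  int            // crashes tolerated before the job is quarantined
	crashes     map[int64]int  // job ID -> number of suspected crashes
	quarantined map[int64]bool // job IDs marked as "jobs of death"
}

func newCrashTracker(maxCrashes int) *crashTracker {
	return &crashTracker{
		maxCrashes:  maxCrashes,
		crashes:     make(map[int64]int),
		quarantined: make(map[int64]bool),
	}
}

// recordNodeCrash notes that a node died while running jobID. It returns
// true once the job has been marked as a job of death.
func (t *crashTracker) recordNodeCrash(jobID int64) bool {
	t.crashes[jobID]++
	if t.crashes[jobID] >= t.maxCrashes {
		t.quarantined[jobID] = true
	}
	return t.quarantined[jobID]
}

// shouldRun reports whether the job system may still adopt or retry the job.
func (t *crashTracker) shouldRun(jobID int64) bool {
	return !t.quarantined[jobID]
}

func main() {
	tracker := newCrashTracker(3)
	const importJob = int64(42) // assumed ID of the crashing IMPORT INTO job

	// Three nodes in a row die while running the same job.
	for i := 0; i < 3; i++ {
		tracker.recordNodeCrash(importJob)
	}

	fmt.Println(tracker.shouldRun(importJob)) // false: quarantined, not retried
	fmt.Println(tracker.shouldRun(7))         // true: unrelated job is unaffected
}
```

A counting heuristic like this can produce false positives (a healthy job that happens to be running during unrelated crashes), so the threshold and the decision to quarantine rather than fail the job outright are design choices that would need discussion.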
Describe alternatives you've considered
See 1, 2, and 3 from the above list.
@ajwerner @pbardea @spaskob @carloruiz @DuskEagle @chrisseto @vilterp @vladdy
Jira issue: CRDB-5217