Database jobs still running during database maintenance #694

evansd · 2023-12-18T12:08:05Z

There's always going to be a delay between us detecting that the db has entered maintenance mode and bringing down all relevant jobs. This leaves the possibility of spurious errors (or even, possibly, spurious results) from jobs which run during this window.

Having not seen this type of error in practice before, we've had two in quite close succession recently:

This job where it errored because the CodedEvent_SNOMED table had disappeared (Slack thread).
And this job where the only plausible explanation was that the population query ran against different data than the final results query.

We've raised with TPP the idea of building in a delay on their end between announcing the start of maintenance mode and actually making changes to the database, to give us time to shut down gracefully:
Asked TPP:
https://bennettoxford.slack.com/archives/C010SJ89SA3/p1701786382187729

bloodearnest · 2023-12-18T14:58:58Z

Found two more this year with CodedEvent_SNOMED table disappearing, one in July and one in August.

https://jobs.opensafely.org/brit-antibiotic-research/adverse-event-prediction-model/20070/x7lsis3ix4qafzn4/
https://jobs.opensafely.org/effect-of-covid-19-on-prescribing-of-dependence-forming-medicines-and-the-associated-health-utilisation/the_effects_of_covid-19_on_dfms_prescribing/18897/ytxekbf4zrzpy752/

However, both of these were cohortextractor jobs, which explicitly handle this case by exiting with error code 4. job-runner reports that code to the user rather than letting it bubble up to INTERNAL_ERROR.
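For illustration, the general pattern of turning a "database went away" failure into a dedicated exit code might look something like the sketch below. This is not cohortextractor's actual code: `DatabaseUnavailableError`, `run_job` and `failing_job` are hypothetical names, and only the exit code value 4 comes from the jobs linked above.

```python
import sys

# Exit code 4 is the value mentioned above; the constant name is made up.
DATABASE_UNAVAILABLE_EXIT_CODE = 4


class DatabaseUnavailableError(Exception):
    """Hypothetical error: a table vanished or the database changed mid-job."""


def run_job(job):
    """Run `job` and return a process exit code instead of raising."""
    try:
        job()
    except DatabaseUnavailableError as exc:
        # A user-facing message plus a dedicated code, so the caller can tell
        # this apart from an unexpected internal failure.
        print(f"Database unavailable, please retry later: {exc}", file=sys.stderr)
        return DATABASE_UNAVAILABLE_EXIT_CODE
    return 0


def failing_job():
    # Stand-in for a query that hits a table removed during maintenance.
    raise DatabaseUnavailableError("table CodedEvent_SNOMED not found")


exit_code = run_job(failing_job)
```

A real entry point would pass the returned value to `sys.exit`. Whether ehrql should adopt the same convention is a separate question.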

This is possibly left-over defence in depth from before we had maintenance mode, perhaps? Or

I didn't find any other instances of the 2nd type of failure, which presumably only ehrql would generate anyway.

bloodearnest · 2023-12-18T15:02:05Z

One thing we could do is this:

Grab a list of all maintenance mode timings from the TPP db (it's an append-only table).
Grab the timestamps at which job-runner detected maintenance mode (from the job-runner logs).
Compare the lag between the two.

Worth pursuing?
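A rough sketch of the comparison step, assuming both sets of timestamps have already been extracted into Python `datetime` lists. The function name `detection_lags` and the sample timestamps are made up for illustration. How the timings are actually read from the TPP table or parsed from the job-runner logs is not covered here.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def detection_lags(db_starts, runner_detections):
    """Pair each maintenance start with the first detection at or after it.

    Returns a list of (start, lag) tuples; lag is None when no later
    detection exists.
    """
    detections = sorted(runner_detections)
    lags = []
    for start in sorted(db_starts):
        i = bisect_left(detections, start)
        detected = detections[i] if i < len(detections) else None
        lag = detected - start if detected is not None else None
        lags.append((start, lag))
    return lags


# Made-up timestamps, purely for illustration.
db_starts = [
    datetime(2023, 12, 5, 14, 0, 0),
    datetime(2023, 12, 12, 14, 0, 0),
]
runner_detections = [
    datetime(2023, 12, 5, 14, 2, 30),
    datetime(2023, 12, 12, 14, 0, 45),
]

for start, lag in detection_lags(db_starts, runner_detections):
    print(start.isoformat(), lag)
```

Note that if job-runner missed a maintenance window entirely, this pairing would match that start with the next window's detection and report a very large lag, which is itself a useful signal.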

bloodearnest self-assigned this Dec 18, 2023
