Database jobs still running during database maintenance #694

Open
evansd opened this issue Dec 18, 2023 · 2 comments

evansd commented Dec 18, 2023

There's always going to be a delay between us detecting that the db has entered maintenance mode and bringing down all relevant jobs. This leaves the possibility of spurious errors (or even, possibly, spurious results) from jobs which run during this window.

Having not seen this type of error in practice before, we've had two in quite close succession recently:

We've raised with TPP the idea of building in a delay on their end between announcing the start of maintenance mode and actually making changes to the database, to give us time to shut down gracefully:
https://bennettoxford.slack.com/archives/C010SJ89SA3/p1701786382187729

bloodearnest self-assigned this Dec 18, 2023
bloodearnest commented Dec 18, 2023

Found two more this year where the CodedEvent_SNOMED table disappeared, one in July and one in August.

https://jobs.opensafely.org/brit-antibiotic-research/adverse-event-prediction-model/20070/x7lsis3ix4qafzn4/
https://jobs.opensafely.org/effect-of-covid-19-on-prescribing-of-dependence-forming-medicines-and-the-associated-health-utilisation/the_effects_of_covid-19_on_dfms_prescribing/18897/ytxekbf4zrzpy752/

However, both of these were cohortextractor jobs, which explicitly handle this case by exiting with error code 4. job-runner reports that to the user (so it doesn't bubble up to INTERNAL_ERROR).

Perhaps this is left-over defence in depth from before we had maintenance mode? Or

I didn't find any other instances of the 2nd type of failure, which presumably only ehrql would generate anyway.

bloodearnest commented Dec 18, 2023

One thing we could do is this:

  1. Grab a list of all maintenance mode timings (it's an append-only table) from the TPP db.
  2. Grab the timestamps of job-runner detecting maintenance mode (from the job-runner logs).
  3. Compare the lag between the two.
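
A minimal sketch of step 3, assuming steps 1 and 2 have already produced two lists of `datetime` values. The function name `detection_lags` and the one-hour `max_lag` cut-off are hypothetical choices for illustration, not anything that exists in job-runner, and the timestamps in the usage example are made up:

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def detection_lags(maintenance_starts, detections, max_lag=timedelta(hours=1)):
    """Pair each maintenance start with the first detection at or after it.

    Returns a list of (start, lag) tuples. lag is None when no detection
    was seen within max_lag of that start (hypothetical cut-off).
    """
    detections = sorted(detections)
    lags = []
    for start in sorted(maintenance_starts):
        # Index of the first detection timestamp >= start
        i = bisect_left(detections, start)
        if i < len(detections) and detections[i] - start <= max_lag:
            lags.append((start, detections[i] - start))
        else:
            lags.append((start, None))
    return lags


if __name__ == "__main__":
    # Made-up timestamps, for illustration only
    starts = [datetime(2023, 12, 1, 2, 0), datetime(2023, 12, 8, 2, 0)]
    seen = [datetime(2023, 12, 1, 2, 3), datetime(2023, 12, 9, 5, 0)]
    for start, lag in detection_lags(starts, seen):
        print(start.isoformat(), lag)
```

A `None` lag would flag a maintenance window that job-runner apparently never noticed, which is probably the more interesting case to chase up.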

Worth pursuing?
