Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert database notebooks to use SQL Runner #20

Closed
6 tasks done
iaindillingham opened this issue Oct 3, 2022 · 9 comments
Closed
6 tasks done

Convert database notebooks to use SQL Runner #20

iaindillingham opened this issue Oct 3, 2022 · 9 comments
Assignees

Comments

@iaindillingham
Copy link
Member

iaindillingham commented Oct 3, 2022

opensafely/database-notebooks contains four database notebooks; three are published on OpenSAFELY Reports (builds, history, schema). This playbook describes how we update them.

We should convert each database notebook to use SQL Runner. We should start with the schema notebook (#9).

Considerations

Create or update

Why create a new repo? If we created a new repo from opensafely-core/repo-template, then we would have a more developer-friendly setup. This would make it easier to incorporate some SQL linting (e.g. SQLFluff has a pre-commit hook).

Why update the existing repo? The existing repo, opensafely/database-notebooks, has seen nearly two years of development; it's not immediately clear which parts should be retained and which parts should be discarded, so updating it would be safer than creating a new repo; opensafely-core/reports and https://reports.opensafely.org/ may have references to the existing repo.

Repos and notebooks

Should there be one repo for all notebooks or one repo for each notebook?

@inglesp
Copy link
Contributor

inglesp commented Oct 3, 2022

If possible, can we do the schema one first?

@iaindillingham
Copy link
Member Author

I'm leaning towards creating a new repo for each report (starting with the schema report, as per #9 and Peter's comment).

Rationale:

  • The developer-friendly setup would be useful. The existing repo contains requirements.txt, which is built from requirements.in. Whilst we can't change the production dependencies much (i.e. those on python-docker), we'd like to specify some development dependencies. For example, the existing repo requires nbval. And as I mentioned, we could incorporate some SQL linting, which could help us find and fix errors faster.
  • The new repo would be very similar to that of a reusable action: a study repo with developer tools. See, for example, opensafely-actions/cohort-joiner.
  • The one-repo-per-report model would make a report more like a study, and would make it easier to determine when the schema report was generated: OpenSAFELY Jobs would record when the schema report was generated, rather than when all reports were generated.
  • Much of the existing repo contains code that is unrelated to the schema report, or is for configuring/running Jupyter Notebook. opensafely jupyter may have made this code redundant.

@iaindillingham
Copy link
Member Author

It looks as though pointing OpenSAFELY Reports to a new repo would be straightforward; the existing repo isn't hard-coded; the fields on the admin page in OpenSAFELY Reports are clear.

@iaindillingham iaindillingham moved this from Next to In Progress in Data Team Oct 3, 2022
@wjchulme
Copy link

wjchulme commented Oct 4, 2022

Starting with a new repo instead of updating the existing repo is absolutely the right thing to do. Start afresh, cut the chaff.

I'm not convinced of the benefit of splitting each notebook into different repos. It feels like part of the same suite of information that will likely all be updated on the same schedule, and so can be accessed and updated all in one place. Although if they're set to be updated automatically and viewed on the reports site, then maybe having them all in one place matters less. I don't have a strong preference so if there are developery reasons that I'm not fully grasping then go for it.

@wjchulme
Copy link

wjchulme commented Oct 4, 2022

Thinking a bit more -- the "history" and "builds" notebooks should be combined, as raised in this issue. So maybe one repo for that combined report, and one for the schema report does make sense after all!

@wjchulme
Copy link

wjchulme commented Oct 4, 2022

Another thing in passing. There is currently no sych between the repo on the L2/3 server and on github because git pushing out from L2/3 is no longer possible for mere mortals like me. So when we convert we need to check for any server side updates that haven't made it to github. Happy to flag these when needed

@iaindillingham
Copy link
Member Author

Thanks for your comments @wjchulme.

I'm still leaning towards a one-to-one mapping between repos and reports. As I understand it, each report (notebook) queries the database, transforms the results, and displays the transformed results. There's one action in project.yaml for each report,1 so regenerating a report means rerunning that action.

When we convert each report to use SQL Runner, there will be separate actions in project.yaml for querying (sqlrunner), transforming (python), and displaying (python). Regenerating a report will mean rerunning three actions. If there were a one-to-many mapping between repos and reports, then we'd need to run the actions we want to run, without running the actions we don't want to run. If there were a one-to-one mapping between repos and reports, however, then we'd not need to be as careful.

Footnotes

  1. This isn't quite correct; there's one action for each output format for each report 🙂.

@wjchulme
Copy link

wjchulme commented Oct 4, 2022

Thanks @iaindillingham that makes sense. Often in research repos you often may only want to run a subset of actions at a time -- to run a specific subgroup analysis for example -- so that feels like less of a problem to me. Particularly in this case as I think all these reports will likely end up always being run at the same time (= after each database update).

But as long as there's no inelegant duplication across reports then there is certainly a logic to a one-repo-per-report approach.

@iaindillingham iaindillingham moved this from In Progress to Next in Data Team Oct 26, 2022
@rebkwok rebkwok removed the status in Data Team Dec 8, 2022
@inglesp inglesp moved this to Next in Data Team Mar 13, 2023
@iaindillingham iaindillingham moved this from Next to In Progress in Data Team Mar 14, 2023
@iaindillingham
Copy link
Member Author

Phew, so all check boxes checked. We're good to close this issue, right?

Yes and no.

There are two open issues that give reports parity with notebooks:

And three that make reports a bit more user-friendly:

And three that remove a potential error, but have a dependency on SQL Runner:

There's also the issue of running the TPP database reports after L2 access is clarified.

Nevertheless, these issues are captured elsewhere; breaking them out into smaller issues means I think we are good to close this (larger, long-running) issue.

🎉

@github-project-automation github-project-automation bot moved this from In Progress to Done in Data Team May 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

No branches or pull requests

4 participants