Draw the owl #1633

Closed
iaindillingham opened this issue Sep 29, 2023 · 7 comments · Fixed by #1679

@iaindillingham
Member

iaindillingham commented Sep 29, 2023

As @inglesp has pointed out, the ehrQL tutorial is similar to How to draw an owl.

[Image: How to draw an owl]

Upon conclusion of the ehrQL tutorial, the reader has created a repo, created (and deleted) a codespace, interacted with the sandbox, created a minimal dataset definition, and generated a dummy dataset that is displayed in the terminal (i.e. it is not written to a file).

To become a competent user of ehrQL, however, the reader should also:

  • Expand the dataset definition
  • Write a dummy dataset to a file
  • Commit the dataset definition to main

Expand the dataset definition

I'd like to check with a couple of researchers about what "expand" most usefully means,[1] but based on this dataset definition, which @alschaffer said was written by her pilots without her help,[2] I think "expand" probably means:

  • Combining Boolean series to define the population (e.g. was born on or before a date and is alive and is either male or female; was registered with a practice on a date; was registered with a practice for a minimum of k days)
  • Adding some simple demographic variables, such as age and sex
  • Adding a complex demographic variable, such as ethnicity (codelist_from_csv)
  • Adding a complex socioeconomic variable, such as IMD quintile (case)
  • Deriving a variable, such as counting the number of medications within the last 30 days (.is_in, .is_on_or_between, days, .count_for_patient)
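
To make the logic in these bullets concrete, here is a minimal plain-Python sketch of three of them: combining Boolean conditions into a population, banding an IMD rank into quintiles, and counting codelist-matched medications in the 30 days up to an index date. It deliberately isn't ehrQL. `Patient`, `in_population`, `imd_quintile`, and `count_recent_medications` are hypothetical names, the index date and codes are made up, and the 32,844-area IMD range with equal-width bands is an assumption.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional, Set, Tuple

INDEX_DATE = date(2023, 10, 1)  # illustrative index date (assumption)


@dataclass
class Patient:
    """Hypothetical stand-in for one patient's records; not an ehrQL table."""

    date_of_birth: date
    sex: str
    date_of_death: Optional[date] = None
    imd_rank: Optional[int] = None
    # (code, date) pairs, one per medication event
    medications: List[Tuple[str, date]] = field(default_factory=list)


def in_population(p: Patient, born_on_or_before: date = date(2005, 10, 1)) -> bool:
    """Combine three Boolean conditions with 'and', as in the first bullet."""
    was_born_in_time = p.date_of_birth <= born_on_or_before
    was_alive = p.date_of_death is None or p.date_of_death > INDEX_DATE
    is_female_or_male = p.sex in ("female", "male")
    return was_born_in_time and was_alive and is_female_or_male


def imd_quintile(imd_rank: Optional[int], n_areas: int = 32844) -> str:
    """Band a rank into five equal-width groups; the first matching band wins."""
    if imd_rank is None or imd_rank < 0:
        return "unknown"
    for quintile in range(1, 6):
        if imd_rank <= n_areas * quintile / 5:
            return str(quintile)
    return "unknown"  # rank above the assumed maximum


def count_recent_medications(
    p: Patient, codelist: Set[str], index_date: date = INDEX_DATE
) -> int:
    """Count events whose code is in the codelist and whose date falls in
    the inclusive window [index_date - 30 days, index_date]."""
    window_start = index_date - timedelta(days=30)
    return sum(
        1
        for code, event_date in p.medications
        if code in codelist and window_start <= event_date <= index_date
    )
```

The three functions loosely mirror the shape of a Boolean population expression, a `case` expression, and a filter-then-count chain (`.is_in`, `.is_on_or_between`, `.count_for_patient`), but they operate on one in-memory record rather than on series.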

Write a dummy dataset to a file

The reader should add an associated action to project.yaml, which they will run with opensafely run [action]. They should compare and contrast run with exec, noticing that exec is good for eyeballing the data but run is good for developing downstream actions, especially when the dummy dataset isn't written to a CSV file.

Commit the dataset definition to main

Upon conclusion of the ehrQL tutorial, the reader will be at "Initial commit" and be ready to run the associated action on OpenSAFELY Jobs. (Creating a project and workspace, and using OpenSAFELY Jobs, are out of scope.) Also, they will have created an artefact inside the codespace that persists outside the codespace.

The reader shouldn't commit the dataset definition to a feature branch and open a pull request, because different projects and different organizations have different guidelines about feature branches and pull requests.

Footnotes

  1. https://bennettoxford.slack.com/archives/C02HJTL065A/p1696001906535959

  2. https://bennettoxford.slack.com/archives/C31D62X5X/p1695306487454799

@iaindillingham iaindillingham added the documentation Improvements or additions to documentation label Sep 29, 2023
@iaindillingham iaindillingham self-assigned this Sep 29, 2023
@iaindillingham iaindillingham moved this to In Progress in Data Team Sep 29, 2023
@sebbacon
Contributor

sebbacon commented Oct 2, 2023

Regarding "Expand the dataset definition": this reminds me of background research I've been doing in preparation for some Great Variables Library Thinking.

I've asked around a few times (example) what the most common variables are; and I've cross-referenced them with a bit of grep-foo, and I came up with this tentative list:

  • age bands (see Andrea docs for example)
  • ethnicity (of different flavours) (see Colm’s data report work)
  • IMD
  • NHS region
  • sex
  • bmi (raw number and categories)
  • smoking
  • covid infection/hospitalisation/vaccination (at the moment at least)
  • date of death (patients table vs ONSDeath table)
  • equivalent of patients.registered_as_of() and patients.registered_with_one_practice_between()
  • deregistration date
  • for service analytics we often have practice id
  • care home residence (how often is the care home variable updated?)
  • cause of death, ICD-10
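
For the registration items in this list, a minimal plain-Python sketch of the interval logic that such variables usually boil down to. It is not cohort-extractor or ehrQL code. `Registration`, `registered_on`, `registered_throughout`, and `registered_for_at_least` are hypothetical names, and treating the end date as exclusive and a missing end date as "still registered" are assumptions.

```python
from datetime import date, timedelta
from typing import Iterable, NamedTuple, Optional


class Registration(NamedTuple):
    """Hypothetical record of one spell at one practice."""

    start_date: date
    end_date: Optional[date]  # None is assumed to mean "still registered"


def registered_on(registrations: Iterable[Registration], on: date) -> bool:
    """True if any spell covers the given date."""
    return any(
        r.start_date <= on and (r.end_date is None or r.end_date > on)
        for r in registrations
    )


def registered_throughout(
    registrations: Iterable[Registration], start: date, end: date
) -> bool:
    """True if a single spell covers the whole interval from start to end."""
    return any(
        r.start_date <= start and (r.end_date is None or r.end_date > end)
        for r in registrations
    )


def registered_for_at_least(
    registrations: Iterable[Registration], on: date, k_days: int
) -> bool:
    """True if one spell covers the k days leading up to and including `on`."""
    return registered_throughout(registrations, on - timedelta(days=k_days), on)
```

Requiring one spell to cover the whole interval is what separates "registered with one practice between two dates" from "registered somewhere on each date": a patient who changes practice mid-interval passes the second test but not the first.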

Fundamentally, a peer-reviewed and agreed common set of things like this, in the research template, is the core of a variables library. So I'm excited to see this happening!

@iaindillingham
Member Author

I'm putting together an extended dataset definition in this gist, with feedback in Slack.[1]

Footnotes

  1. https://bennettoxford.slack.com/archives/C02HJTL065A/p1696001906535959

@iaindillingham
Member Author

Thanks, @sebbacon. At the moment, the expanded dataset definition hits several of those. I don't think it can hit them all, but hitting several suggests that it will be useful.

@sebbacon
Contributor

sebbacon commented Oct 3, 2023

> I don't think it can hit them all

Devil's advocate: why not? If nearly every study includes all of them anyway:

  • It's didactically useful as it covers all common cases
  • It's pragmatically useful for the same reason
  • It helps extend our "best-practice" reach deeper into people's code

@iaindillingham
Member Author

Because it's a tutorial and not a how-to. Hitting all of them will make the tutorial longer, which means it will take more time to complete and more time to maintain. I think a more effective use of time would be to incorporate several into the tutorial and the remainder into how-tos, or, indeed, reusable variables.

@sebbacon
Contributor

sebbacon commented Oct 3, 2023

Fair, I think I'm conflating our tutorial with our research template.

It leads me to ask if this part of the tutorial content might also live in the research template?

The familiarity when moving on from the tutorial could be helpful.

@iaindillingham
Member Author

It could, but I think that's a separate issue, so I've created opensafely/research-template#108.

@iaindillingham iaindillingham changed the title from "Draw the rest of the owl" to "Draw the owl" Oct 14, 2023
@inglesp inglesp added this to the Deprecate cohort-extractor milestone Oct 24, 2023
@iaindillingham iaindillingham moved this from In Progress to Under Review in Data Team Oct 24, 2023
@github-project-automation github-project-automation bot moved this from Under Review to Done in Data Team Oct 31, 2023