Local-run mode should check for patient-level data mislabelled as moderately sensitive #315

Closed
Tracked by #406
evansd opened this issue Nov 5, 2021 · 1 comment

Comments

@evansd
Contributor

evansd commented Nov 5, 2021

Files which contain patient-level data should always be labelled as highly_sensitive. We want to guard against users accidentally mislabelling such files or, more likely, accidentally including patient-level data in files which they did not intend to contain such data (which is easy to do with e.g. .rds files).

We should add checks which are active in "local run" mode which inspect all moderately_sensitive files for anything that looks like patient data and which fail loudly if it is found.
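As a rough illustration of the first step (finding the files to inspect), here is a minimal sketch that collects the moderately_sensitive output paths from an already-parsed project.yaml. The function name is hypothetical, and the layout it expects (`actions` → `outputs` → `moderately_sensitive` → name/path pairs) is an assumption about the project file rather than something specified in this issue.

```python
def moderately_sensitive_outputs(project):
    """Return the sorted paths of all outputs labelled moderately_sensitive.

    `project` is the project.yaml content already parsed into a dict.
    ASSUMPTION: outputs are declared per action under
    actions.<name>.outputs.<privacy_level> as {output_name: path}.
    """
    paths = []
    for action in (project.get("actions") or {}).values():
        outputs = (action or {}).get("outputs") or {}
        paths.extend((outputs.get("moderately_sensitive") or {}).values())
    return sorted(paths)


# Hypothetical example project, for illustration only.
example_project = {
    "actions": {
        "generate_cohort": {
            "outputs": {"highly_sensitive": {"cohort": "output/input.csv"}},
        },
        "make_table": {
            "outputs": {"moderately_sensitive": {"table": "output/table.csv"}},
        },
    }
}
files_to_check = moderately_sensitive_outputs(example_project)
```

Paths in a real project may be glob patterns, so a real check would need to expand them against the workspace before inspecting anything.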

By design, the new abstracted job-runner execution model doesn't give us direct access to the output files, so we can't run these checks in production. Applying them in development and test should still be sufficient to catch the kind of accidental issues we're anticipating.

Exactly how these checks should be implemented is still an open question. The most basic check would be, does this data file have a patient_id column? More sophisticated would be:

  • fetch the list of patient IDs from the dummy data (which we should be able to identify by inspecting the project.yaml)
  • check whether the data files contain any columns with "significant" overlap with the patient_id column.
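To make the two candidate checks concrete, here is a minimal sketch for CSV outputs only, using just the standard library. The function names, the 0.5 threshold for "significant" overlap, and the choice to compare distinct values are all assumptions for illustration; none of them is decided in this issue, and other formats such as .rds would need their own readers.

```python
import csv
import io


def has_patient_id_column(csv_text, column="patient_id"):
    """Basic check: does the header row contain a patient_id column?"""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return column in (name.strip().lower() for name in header)


def overlapping_columns(csv_text, patient_ids, threshold=0.5):
    """Return the columns whose distinct non-empty values overlap with the
    known patient IDs by at least `threshold` (an ASSUMED cut-off).
    """
    known = {str(pid) for pid in patient_ids}
    distinct_values = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name, value in row.items():
            # DictReader uses a None key for surplus fields and a None
            # value for missing ones; skip both.
            if name is None or value is None:
                continue
            value = value.strip()
            if value:
                distinct_values.setdefault(name, set()).add(value)
    flagged = []
    for name, values in distinct_values.items():
        share = len(values & known) / len(values)
        if share >= threshold:
            flagged.append(name)
    return flagged


# Hypothetical dummy-data IDs and output file, for illustration only.
dummy_patient_ids = [101, 102, 103, 104]
output_csv = "id,age\n101,30\n102,40\n103,50\n"
flagged_columns = overlapping_columns(output_csv, dummy_patient_ids)
```

Here the `id` column would be flagged even though it is not named patient_id, which is the case the basic header check misses. A column of small counts could also happen to match patient IDs, which is one way false positives would arise.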

Inevitably there will be some false negatives and probably some false positives and we'll have to iterate on the checks as we discover them.

I initially thought we should build in some escape hatch for false positives so a researcher can say "no, this file really doesn't contain patient level data". But if we do that and document it, eventually someone will use it because they really honestly believe the file is OK, but in fact it has embedded patient data they didn't know about. So I think the process has to be "ask the dev team and we'll fix our check for this specific instance". This means we'll need to be careful that the false positive rate stays pretty low or it will get overwhelming for us and infuriating for researchers. I'm reasonably confident it's do-able though.

@iaindillingham iaindillingham changed the title Proposal: local run mode should check for files which look like patient level data mislabelled as moderately_sensitive Local-run mode should check for patient-level data mislabelled as *moderately sensitive* Apr 19, 2022
@iaindillingham iaindillingham changed the title Local-run mode should check for patient-level data mislabelled as *moderately sensitive* Local-run mode should check for patient-level data mislabelled as moderately sensitive Apr 19, 2022
@benbc
Contributor

benbc commented Jun 23, 2022

Closing in favour of this option in our pipeline.

@benbc benbc closed this as completed Jun 23, 2022