Files which contain patient-level data should always be labelled as `highly_sensitive`. We want to guard against users accidentally mislabelling such files or, more likely, accidentally including patient-level data in files which they did not intend to contain such data (which is easy to do with e.g. `.rds` files).
We should add checks, active in "local run" mode, which inspect all `moderately_sensitive` files for anything that looks like patient data and fail loudly if any is found.
By design, the new abstracted job-runner execution model doesn't give us direct access to the output files, so we can't run these checks in production, but applying them in development and test should be sufficient to catch the kind of accidental issues we're anticipating.
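As a rough sketch of the shape this could take (the function names and the `output_paths` argument are illustrative assumptions, not the actual job-runner API, and only CSV files are handled here):

```python
import csv
import sys
from pathlib import Path


def looks_like_patient_data(path):
    """Most basic heuristic: does a CSV output have a patient_id column?

    Other formats (e.g. .rds) would need their own readers; this sketch
    only handles CSV.
    """
    if path.suffix != ".csv":
        return False
    with path.open(newline="") as f:
        header = next(csv.reader(f), [])
    return "patient_id" in header


def check_moderately_sensitive_outputs(output_paths):
    """Fail loudly if any moderately_sensitive output looks like patient data.

    `output_paths` is assumed to be the list of file paths declared as
    moderately_sensitive in project.yaml.
    """
    failures = [
        str(path)
        for path in map(Path, output_paths)
        if path.exists() and looks_like_patient_data(path)
    ]
    if failures:
        sys.exit(
            "These moderately_sensitive outputs look like patient-level data "
            "and should be labelled highly_sensitive:\n  "
            + "\n  ".join(failures)
        )
```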
Exactly how these checks should be implemented is still an open question. The most basic check would be the one sketched above: does this data file have a `patient_id` column? More sophisticated (see the sketch after this list) would be:
- fetch the list of patient IDs from the dummy data (which we should be able to identify by inspecting the `project.yaml`)
- check whether the data files contain any columns with "significant" overlap with the `patient_id` column
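A sketch of that overlap check, again assuming CSV inputs; the `column_values` helper, the argument names, and the 50% threshold are placeholders for whatever "significant" overlap ends up meaning in practice:

```python
import csv
from pathlib import Path


def column_values(path):
    """Read a CSV file and return {column name: set of values}."""
    with Path(path).open(newline="") as f:
        reader = csv.DictReader(f)
        columns = {name: set() for name in reader.fieldnames or []}
        for row in reader:
            for name, value in row.items():
                columns.setdefault(name, set()).add(value)
    return columns


def suspicious_columns(data_file, dummy_data_file, threshold=0.5):
    """Return columns of data_file whose values overlap "significantly"
    with the patient_id column of the dummy data.

    The 0.5 threshold is a placeholder; tuning it is part of the
    iteration this issue anticipates.
    """
    patient_ids = column_values(dummy_data_file).get("patient_id", set())
    if not patient_ids:
        return []
    return [
        name
        for name, values in column_values(data_file).items()
        if len(values & patient_ids) / len(patient_ids) >= threshold
    ]
```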
Inevitably there will be some false negatives and probably some false positives, and we'll have to iterate on the checks as we discover them.
I initially thought we should build in some escape hatch for false positives so a researcher can say "no, this file really doesn't contain patient level data". But if we do that and document it, eventually someone will use it because they really honestly believe the file is OK, but in fact it has embedded patient data they didn't know about. So I think the process has to be "ask the dev team and we'll fix our check for this specific instance". This means we'll need to be careful that the false positive rate stays pretty low or it will get overwhelming for us and infuriating for researchers. I'm reasonably confident it's do-able though.