Files which contain patient-level data should always be labelled as `highly_sensitive`. We want to guard against users accidentally mislabelling such files or, more likely, accidentally including patient-level data in files which they did not intend to contain such data (which is easy to do with e.g. `.rds` files).
We should add checks, active in "local run" mode, which inspect all `moderately_sensitive` files for anything that looks like patient data and fail loudly if any is found.
By design, the new abstracted job-runner execution model doesn't give us direct access to the output files, so we can't run these checks in production, but applying them in development and test should be sufficient to catch the kind of accidental issues we're anticipating.
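As a rough sketch of the shape this could take (the function names and the `output_paths` argument are illustrative assumptions, not the actual job-runner API, and only CSV files are handled here):

```python
import csv
import sys
from pathlib import Path


def looks_like_patient_data(path):
    """Most basic heuristic: does a CSV output have a patient_id column?

    Other formats (e.g. .rds) would need their own readers; this sketch
    only handles CSV.
    """
    if path.suffix != ".csv":
        return False
    with path.open(newline="") as f:
        header = next(csv.reader(f), [])
    return "patient_id" in header


def check_moderately_sensitive_outputs(output_paths):
    """Fail loudly if any moderately_sensitive output looks like patient data.

    `output_paths` is assumed to be the list of file paths declared as
    moderately_sensitive in project.yaml.
    """
    failures = [
        str(path)
        for path in map(Path, output_paths)
        if path.exists() and looks_like_patient_data(path)
    ]
    if failures:
        sys.exit(
            "These moderately_sensitive outputs look like patient-level data "
            "and should be labelled highly_sensitive:\n  "
            + "\n  ".join(failures)
        )
```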
Exactly how these checks should be implemented is still an open question. The most basic check would be the one sketched above: does this data file have a `patient_id` column? More sophisticated (see the sketch after this list) would be:
- fetch the list of patient IDs from the dummy data (which we should be able to identify by inspecting the `project.yaml`)
- check whether the data files contain any columns with "significant" overlap with the `patient_id` column
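A sketch of that overlap check, again assuming CSV inputs; the `column_values` helper, the argument names, and the 50% threshold are placeholders for whatever "significant" overlap ends up meaning in practice:

```python
import csv
from pathlib import Path


def column_values(path):
    """Read a CSV file and return {column name: set of values}."""
    with Path(path).open(newline="") as f:
        reader = csv.DictReader(f)
        columns = {name: set() for name in reader.fieldnames or []}
        for row in reader:
            for name, value in row.items():
                columns.setdefault(name, set()).add(value)
    return columns


def suspicious_columns(data_file, dummy_data_file, threshold=0.5):
    """Return columns of data_file whose values overlap "significantly"
    with the patient_id column of the dummy data.

    The 0.5 threshold is a placeholder; tuning it is part of the
    iteration this issue anticipates.
    """
    patient_ids = column_values(dummy_data_file).get("patient_id", set())
    if not patient_ids:
        return []
    return [
        name
        for name, values in column_values(data_file).items()
        if len(values & patient_ids) / len(patient_ids) >= threshold
    ]
```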
Inevitably there will be some false negatives and probably some false positives, and we'll have to iterate on the checks as we discover them.
I initially thought we should build in some escape hatch for false positives so a researcher can say "no, this file really doesn't contain patient level data". But if we do that and document it, eventually someone will use it because they really honestly believe the file is OK, but in fact it has embedded patient data they didn't know about. So I think the process has to be "ask the dev team and we'll fix our check for this specific instance". This means we'll need to be careful that the false positive rate stays pretty low or it will get overwhelming for us and infuriating for researchers. I'm reasonably confident it's do-able though.