Limit medium_privacy file sizes #298
Comments
Related: we should also implement file size limits in the release-hatch and job-server APIs.
We should also limit the total size of files in medium_privacy.
Whoever picks this up should:
Before proposing individual and total (for the workspace) file size limits, I'd like to compute some summary statistics. However, this would involve running some code within the TPP backend. Python 3.8.10 is available, so something like:

```python
from collections import namedtuple
import glob
import itertools
import pathlib

root_dir = pathlib.Path("/srv/medium_privacy")
output_files = [
    pathlib.Path(x) for x in glob.glob(f"{root_dir}/**/output/*.*", recursive=True)
]


def workspace_of(path):
    # The workspace is the top-level directory beneath root_dir.
    return path.relative_to(root_dir).parts[0]


# itertools.groupby only groups consecutive items, so sort by workspace first.
output_files.sort(key=workspace_of)

Record = namedtuple("Record", ["workspace", "max_st_size", "sum_st_size"])
records = []
for workspace, grouped_paths in itertools.groupby(output_files, key=workspace_of):
    st_sizes = [p.stat().st_size for p in grouped_paths]
    records.append(Record(workspace, max(st_sizes), sum(st_sizes)))
```

This groups output files by workspace, and then computes some summary statistics. I think the individual file size limit should probably be a round number of megabytes above the size of the largest file.

But... running some code within the TPP backend? I think this is covered by our policy, and this code is run against medium privacy (i.e. released) files. Could I check, though, @sebbacon?
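For reference, a minimal follow-up sketch (assuming the `records` list built above) that prints each workspace's sizes in MB, largest total first:

```python
# Sketch only: assumes the `records` list built above.
MB = 1024 * 1024
for record in sorted(records, key=lambda r: r.sum_st_size, reverse=True):
    print(
        f"{record.workspace}: largest file {record.max_st_size / MB:.1f} MB, "
        f"total {record.sum_st_size / MB:.1f} MB"
    )
```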
Getting arbitrary code on to the backend requires (by design) some hoop-jumping, so you might be better off with bash. This will give you the size of each workspace in MB:

```sh
cd /srv/medium_privacy
du -ms * | sort -n
```

And this will give you the size of individual files in KB (annoyingly, printf doesn't support MB):

```sh
find . -printf '%k %p\n' | sort -n
```
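If per-file sizes in MB are easier to read, a small Python alternative (my own sketch, assuming the same /srv/medium_privacy layout as above) would be:

```python
# Sketch only: print individual file sizes in MB, smallest first,
# mirroring the `find ... | sort -n` output above.
import pathlib

root_dir = pathlib.Path("/srv/medium_privacy")
files = sorted(
    (p.stat().st_size, p) for p in root_dir.rglob("*") if p.is_file()
)
for st_size, path in files:
    print(f"{st_size / (1024 * 1024):8.1f} {path}")
```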
Thanks Dave 🙂 My Bash is at the Stack Overflow level. Entering three lines of your Bash is better than over a dozen of my Python, though 👍🏻
Largest five workspaces:
Largest ten outputs:
Issue: measure files can be larger than input files. If we set a per-file threshold high enough to allow measure files, then we would also allow input files. Could the measure files be split into multiple, smaller measure files?

Issue: to what extent is it possible to check an SVG file? (#159)
Of the large measure files:
However, scanning the largest 100 outputs, most (~80%) are measure files or graphics files (svg, tiff). As an aside, of the accidentally released input files:
This is very useful, Iain. Thanks. I'd be interested to know what makes those measure files so huge. Is it that they're grouping on something with high cardinality?
I think it's because sro-pulse-oximetry and sro-smr group by … That said, a large proportion of the largest 100 outputs are measure files; surely they can't all group by …
Ah right, yes. So I guess these are really a different sort of beast from other measure outputs and arguably should be considered high sensitivity.
For the archaeologists, the accidentally released input files were discussed in #393.
It's not easy to determine a reasonable file size limit for medium privacy files, so as not to release cohort files accidentally. This is because the largest files are images or measure tables.

We could discount images and ungrouped measure tables when setting the file size limit. However, even then, the largest file (325.3 MB) is still a grouped measure table: setting a file size limit at a round number of MB above the size of the largest file would probably still be too large. And limiting the number of grouping variables, to reduce the size of the largest file, seems like a separate issue. Why is the largest file a grouped measure table? Possibly because the number of grouping variables is large. The largest file comes from the …

We could set per-file-type file size limits. However, most files are CSV files. Consequently, we could determine a reasonable file size limit for the majority of file types; but these do not represent the majority of files.

We could prevent files that match a pattern from being released; cohort files start with … (a rough sketch of these two ideas follows this comment). However, it's possible that addressing the following would make this issue redundant:
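To make the two ideas above concrete, here is a minimal sketch; the size limits and the `input` filename prefix are illustrative assumptions, not agreed values:

```python
# Sketch only: combine a per-file-type size limit with a filename pattern
# check. The limits and the "input" prefix are illustrative assumptions.
import pathlib

MB = 1024 * 1024
DEFAULT_LIMIT = 32 * MB
PER_TYPE_LIMITS = {".svg": 64 * MB, ".csv": 16 * MB}  # illustrative only


def may_release(path, st_size):
    path = pathlib.Path(path)
    if path.name.startswith("input"):  # assumed cohort-file naming pattern
        return False
    return st_size <= PER_TYPE_LIMITS.get(path.suffix.lower(), DEFAULT_LIMIT)


print(may_release("output/table_1.csv", 2 * MB))         # True
print(may_release("output/input.csv", 2 * MB))           # False
print(may_release("output/measure_all.csv", 400 * MB))   # False
```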
How large are cohort files, typically? The largest cohort file is roughly 5 GB. However, about 60% of the 2,532 cohort files are less than 325.3 MB, the size of the largest grouped measure table. So, if we were to set a file size limit that allowed the largest grouped measure table to be released, this would allow roughly 60% of the cohort files to be released, too.
A couple of thoughts:

Why are users exposing 325 MB of CSV to level 4? They cannot ever release it, so are they debugging/checking their code? Is this really necessary? Could we provide alternate ways? I don't believe they are manually reviewing 325 MB of CSVs, so perhaps producing an output of the first 1,000 lines in moderate_privacy is enough?
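A minimal sketch of that "first 1,000 lines" idea; the paths and function name are hypothetical, not an existing convention:

```python
# Sketch only: write a 1,000-line preview of a large CSV for checking,
# instead of exposing the full file. Paths and names here are hypothetical.
import itertools


def write_preview(src_path, dest_path, n_lines=1000):
    with open(src_path) as src, open(dest_path, "w") as dest:
        dest.writelines(itertools.islice(src, n_lines))


write_preview("output/cohort.csv", "output/cohort_preview.csv")
```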
I also think that we could block certain types, regardless of size: e.g. `.dta`, `.feather`.
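For illustration, such a blocklist could be as simple as the following sketch (the suffix set is an example, not an agreed policy):

```python
# Sketch only: reject files whose extensions suggest patient-level data.
import pathlib

BLOCKED_SUFFIXES = {".dta", ".feather"}  # example values, not agreed policy


def is_blocked(path):
    return pathlib.Path(path).suffix.lower() in BLOCKED_SUFFIXES


assert is_blocked("output/cohort.dta")
assert not is_blocked("output/table_1.csv")
```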
Looking at the files that have been released to Job Server so far (and using Dave's helpful bash commands), the largest file is a 29 MB SVG, closely followed by a 25 MB CSV. So, I think we could reasonably have quite a low limit on the size of files released by release-hatch (perhaps 45 MB or less) and tweak it as needed. EDIT: this is now covered in this ticket.
I would argue that those example files were anomalies, and we shouldn't have allowed them to be released, and thus we could set the threshold lower. It's worth noting that there are also two maximum size values we could implement:
I'm not sure how an output checker can meaningfully review and check a 25 MB CSV file. And SVG has its own challenges; should we even allow that format at all?
This adds a global config, `config.MAX_LEVEL4_FILESIZE` (which defaults to 16 MB), which limits the size of files that will be copied to level 4 storage.

The goal here is to prevent accidental copying of files to level 4 that should not be there. Level 4 files should be aggregate data, potentially reviewable/releasable by output checkers. Datasets including pseudonymized patient-level data should not be marked as `moderately_sensitive`, and this is one way to prevent that being done by mistake.

When triggered, it communicates the fact that files have not been copied to users in two ways:

1) It includes the list of files not copied in the job status_message, which will be visible at jobs.opensafely.org.
2) It writes a file with the same name plus a `.txt` suffix containing a message, so it is discoverable in level 4. It deletes this file if the files are successfully copied.

Fixes #298
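As a rough illustration of the behaviour described above (not the actual job-runner implementation; the function and variable names here are assumptions):

```python
# Sketch only: not the actual job-runner code. Illustrates the described
# behaviour: skip oversized files and leave a .txt marker in their place.
import shutil
from pathlib import Path

MAX_LEVEL4_FILESIZE = 16 * 1024 * 1024  # 16 MB default, per the PR description


def copy_to_level4(src: Path, dest: Path):
    """Copy src to dest unless it exceeds the limit; return True if copied."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    marker = dest.with_name(dest.name + ".txt")
    if src.stat().st_size > MAX_LEVEL4_FILESIZE:
        marker.write_text(
            f"{src.name} was not copied: it exceeds the "
            f"{MAX_LEVEL4_FILESIZE} byte limit for level 4 files.\n"
        )
        return False
    shutil.copyfile(src, dest)
    marker.unlink(missing_ok=True)  # remove any stale marker from earlier runs
    return True
```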
We have cases where researchers accidentally release cohort files to level 4.
#159 doesn't help here, as CSV is a valid type of file to release.
We could instead limit the file size, as most cohort files are very large.
Currently, the largest file on TPP is 18 MB, so perhaps a limit of 32 MB or similar would work?