
Limit medium_privacy files sizes #298

Closed
bloodearnest opened this issue Oct 18, 2021 · 19 comments

@bloodearnest
Member

bloodearnest commented Oct 18, 2021

We have cases where researchers accidentally release cohort files to level 4.

#159 doesn't help here, as CSV is a valid type of file to release.

We could instead limit the file size, as most cohort files are very large.

Currently, the largest file on TPP is 18MB, so perhaps a limit of 32MB or similar would work?

@bloodearnest
Member Author

Related: we should also implement file size limits in release-hatch and job-server APIs

@inglesp
Contributor

inglesp commented Apr 5, 2022

We should also limit the total size of files in medium_privacy.

@sebbacon
Contributor

sebbacon commented Apr 6, 2022

Whoever picks this up should:

  • Propose individual and total file sizes and get agreement from a couple of other people
  • Consider how to conveniently communicate total file size errors to users so they can self-serve fixing it
  • Remember to update the documentation describing the new limits and the rationale
  • Remember to update the news page in the documentation

@inglesp inglesp self-assigned this Apr 6, 2022
@inglesp inglesp removed their assignment Apr 12, 2022
@iaindillingham iaindillingham self-assigned this Apr 26, 2022
@iaindillingham
Member

Before proposing individual and total (for the workspace) file size limits, I'd like to compute some summary statistics. However, this would involve running some code within the TPP backend. Python 3.8.10 is available, so something like:

from collections import namedtuple
import glob
import itertools
import pathlib

root_dir = pathlib.Path("/srv/medium_privacy")

# All output files, e.g. /srv/medium_privacy/<workspace>/output/<file>
output_files = [
    pathlib.Path(x) for x in glob.glob(f"{root_dir}/**/output/*.*", recursive=True)
]

Record = namedtuple("Record", ["workspace", "max_st_size", "sum_st_size"])


def workspace_for(path):
    # The workspace is the first component of the path relative to root_dir.
    return str(list(path.relative_to(root_dir).parents)[-2])


records = []

# itertools.groupby only groups consecutive items, so sort by workspace first.
for workspace, grouped_paths in itertools.groupby(
    sorted(output_files, key=workspace_for), key=workspace_for
):
    st_size = [p.stat().st_size for p in grouped_paths]
    records.append(Record(workspace, max(st_size), sum(st_size)))

This groups output files by workspace, and then computes some summary statistics.

I think the individual file size limit should probably be a round number of megabytes above max(max_st_size); similarly, the total file size limit should be a round number of megabytes above max(sum_st_size). Clearly, this assumes that current workspaces are representative of all possible workspaces; and that no current workspace includes a cohort file. So, I'd like to 👀 the distribution of max_st_size and investigate the outliers.
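For example (purely illustrative; the 8MB step and the helper name are my own), the limit could be derived by rounding the observed maximum up to the next round number of megabytes:

import math

def round_up_mb(n_bytes, step_mb=8):
    # Round a size in bytes up to the next multiple of step_mb megabytes.
    mb = n_bytes / (1024 * 1024)
    return math.ceil(mb / step_mb) * step_mb

# e.g., using the records computed above:
# file_limit_mb = round_up_mb(max(r.max_st_size for r in records))
# total_limit_mb = round_up_mb(max(r.sum_st_size for r in records))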

But... running some code within the TPP backend? I think this is covered by our policy; and this code is run against medium privacy (i.e. released) files. Could I check, though, @sebbacon?

@evansd
Contributor

evansd commented Apr 27, 2022

Getting arbitrary code onto the backend requires (by design) some hoop-jumping, so you might be better off with bash.

This will give you the size of each workspace in MB:

cd /srv/medium_privacy
du -ms * | sort -n

And this will give you the size of individual files in KB (annoyingly printf doesn't support MB):

find . -printf '%k %p\n' | sort -n

@iaindillingham
Member

Thanks Dave 🙂 My Bash is at the Stack Overflow level. Entering three lines of your Bash is better than over a dozen of my Python, though 👍🏻

@iaindillingham
Member

iaindillingham commented May 3, 2022

Largest five workspaces:

| GB   | Workspace                                   |
| ---- | ------------------------------------------- |
| 18.1 | ethnicity-master                            |
| 9.0  | long-covid-descriptive                      |
| 8.5  | sro-smr                                     |
| 6.3  | the_effects_of_covid-19_on_doac_prescribing |
| 5.5  | service_eval_work                           |

Largest ten outputs:

| GB       | Workspace                                   | Type    | Purpose |
| -------- | ------------------------------------------- | ------- | ------- |
| 18.5     | ethnicity-master                            | svg     |         |
| 9.3      | long-covid-descriptive                      | csv     | measure |
| 3.6      | sro-pulse-oximetry                          | csv     | measure |
| 2.6      | sro-smr                                     | csv     | measure |
| 2.5      | smr-sro¹                                    | csv     | measure |
| 0.8 (x8) | the_effects_of_covid-19_on_doac_prescribing | feather | input   |
| 0.8 (x3) | lockdown_and_vulnerable_groups              | csv     | input   |
| 0.6      | highcostdrugs_adalimumab_master             | csv     | input   |
| 0.6      | highcostdrugs_v3                            | csv     | input   |
| 0.4      | covid-ve-change-over-time-main              | feather |         |

Issue: Measure files can be larger than input files. If we set a per-file threshold high enough to allow measure files, then we would also allow input files. Could the measure files be split into multiple, smaller measure files?

Issue: to what extent is it possible to check an SVG file? (#159)

Footnotes

  1. This isn't a typo 🙂 I think the repo was renamed from smr-sro to sro-smr, but the workspace directory wasn't renamed.

@iaindillingham
Member

iaindillingham commented May 3, 2022

Of the large measure files:

  • long-covid-descriptive no longer generates measure files
  • sro-pulse-oximetry does but is unlikely to be re-run
  • sro-smr does but is unlikely to be re-run

However, scanning the largest 100 outputs, most (~80%) are measure files or graphics files (svg, tiff).


As an aside, of the accidentally released input files:

  • the_effects_of_covid-19_on_doac_prescribing seems to have been fixed
  • lockdown_and_vulnerable_groups seems to have been fixed
  • highcostdrugs_adalimumab_master seems to be the result of a greedy glob in a downstream action; I think this would be fixed by Allowlist of file extentions copied into medium privacy outputs #159
  • The branch associated with highcostdrugs_v3 no longer exists

@evansd
Contributor

evansd commented May 3, 2022

This is very useful, Iain. Thanks. I'd be interested to know what makes those measure files so huge? Is it that they're grouping on something with high cardinality?

@iaindillingham
Member

I think it's because sro-pulse-oximetry and sro-smr group by None -- "calculated on the individual patient level", which I think means "ungrouped". Possibly long-covid-descriptive and smr-sro did too, but I couldn't easily interrogate their study definitions.

That said, a large proportion of the largest 100 outputs are measure files; surely they can't all group by None? I'll keep digging.
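For context, here's a rough sketch of the two cases using the measures API (the IDs and column names are invented):

from cohortextractor import Measure

measures = [
    # group_by=None: per the docs, "calculated on the individual patient
    # level", so the measure table can approach the size of the cohort itself.
    Measure(
        id="event_rate_per_patient",
        numerator="had_event",
        denominator="population",
        group_by=None,
    ),
    # Grouping by a low-cardinality variable keeps the table small.
    Measure(
        id="event_rate_by_region",
        numerator="had_event",
        denominator="population",
        group_by="region",
    ),
]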

@evansd
Contributor

evansd commented May 3, 2022

> I think it's because sro-pulse-oximetry and sro-smr group by None -- "calculated on the individual patient level", which I think means "ungrouped"

Ah right, yes. So I guess these are really a different sort of beast from other measure outputs and arguably should be considered high sensitivity.

@iaindillingham
Member

For the archaeologists, the accidentally released input files were discussed in #393.

@iaindillingham
Member

It's not easy to determine a file size limit for medium privacy files that would prevent cohort files from being released accidentally, because the largest files are images or measure tables. We could discount images and ungrouped measure tables¹ when setting the file size limit. However, even then, the largest remaining file (325.3 MB) is a grouped measure table: setting a file size limit at a round number of MB above the size of the largest file would probably still be too large. And limiting the number of grouping variables, to reduce the size of the largest file, seems like a separate issue.


Why is the largest file a grouped measure table? Possibly because the number of grouping variables is large. The largest file comes from the service_eval_work workspace and has five grouping variables (infection_repeat_antibiotics_lrti). If I understand the measures framework correctly, then as the number of grouping variables increases, the likelihood that the measure table will approximate the cohort also increases. So, this measure table will approximate the cohort.


We could set per-file-type file size limits.

| File type | Largest file (MB) | Number of files |
| --------- | ----------------- | --------------- |
| .csv      | 325.3             | 33,434          |
| .csv.gz   | 0.1               | 1               |
| .dta      | 0.7               | 339             |
| .dta.gz   | -                 | 0               |
| .feather  | 484.6²            | 12              |

However, most files are CSV files. Consequently, we could determine a reasonable file size limit for the majority of file types; but these do not represent the majority of files.

We could prevent files that match a pattern from being released. Cohort files start with input. However, at present, there is only one cohort file that matches this pattern (from the postopcovid workspace). The associated study definition seems to have been fixed.
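If we went down that route, the check could be as simple as the following sketch (the helper is hypothetical; the pattern just encodes the naming convention above):

import fnmatch
import pathlib

def looks_like_cohort_file(path):
    # Cohort extracts conventionally have filenames starting with "input".
    return fnmatch.fnmatch(pathlib.Path(path).name, "input*")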

However, it's possible that addressing the following would make this issue redundant:

Footnotes

  1. Measure(..., group_by=None).

  2. All feather files are from covid-ve-change-over-time-* workspaces and appear to contain dummy data.

@iaindillingham
Member

> setting a file size limit at a round number of MB above the size of the largest file would probably still be too large.

How large are cohort files, typically? The largest cohort file is roughly 5GB. However, about 60% of the 2,532 cohort files are less than 325.3MB -- the size of the largest grouped measure table. So, if we were to set a file size limit that allowed the largest grouped measure table to be released, this would allow roughly 60% of the cohort files to be released, too.

@bloodearnest
Member Author

bloodearnest commented May 12, 2022

A couple of thoughts:

  1. I think the idea of limiting file sizes is more of a "prevent accidental large files" measure than a serious security barrier. So we should definitely set some limit, even if it's large enough to allow smaller cohorts. In the case of large measure CSVs, the user could potentially gzip them if needed.

  2. I'm not 100% sure we should base all our decisions here on existing files; we should also define a policy based on the goals we want.

Why are users exposing 325MB of CSV to level 4? They can't ever release it, so are they debugging/checking their code? Is this really necessary? Could we provide alternate ways? I don't believe anyone is manually reviewing 325MB of CSVs, so perhaps producing an output of the first 1000 lines in medium_privacy is enough?

@bloodearnest
Member Author

I also think that we could block certain types, regardless of size: e.g. .dta, .feather.
This could be based on extension, and we could also use https://pypi.org/project/python-magic/
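Something like this rough sketch, say (the blocklist and the MIME check are illustrative, not proposed values):

import pathlib

import magic  # https://pypi.org/project/python-magic/

BLOCKED_SUFFIXES = {".dta", ".feather"}  # illustrative blocklist

def is_blocked(path):
    path = pathlib.Path(path)
    # Cheap check on the extension first.
    if path.suffix.lower() in BLOCKED_SUFFIXES:
        return True
    # Optionally sniff the content too, to catch renamed files; exactly which
    # MIME types to block would need checking against real outputs.
    return magic.from_file(str(path), mime=True) == "application/octet-stream"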

@lucyb
Contributor

lucyb commented Sep 12, 2023

Looking at the files that have been released to Job Server so far (and using Dave's helpful bash commands), the largest file is a 29MB SVG, closely followed by a 25MB CSV.

So, I think we could reasonably set quite a low limit (perhaps 45MB or less) on the size of files released by release-hatch, and tweak it as needed.

EDIT: this is now covered in this ticket.

@bloodearnest
Member Author

I would argue that those example files were anomalies, and we shouldn't have allowed them to be released, and thus we could set the threshold lower.

It's worth noting that there are also two maximum size values we could implement (sketched roughly below):

  • the maximum size of file that job-runner is willing to copy into level 4. The goal of this is to limit accidentally putting level 3 data into level 4.

  • the maximum size of a file that can be released. This should probably be less than the above.
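To illustrate (names and values here are placeholders, not agreed limits):

# job-runner: refuse to copy outputs above this size into level 4
MAX_LEVEL4_COPY_SIZE = 32 * 1024 * 1024

# release-hatch / job-server: refuse to release files above this (smaller) size
MAX_RELEASE_FILE_SIZE = 16 * 1024 * 1024

def can_copy_to_level4(size_bytes):
    return size_bytes <= MAX_LEVEL4_COPY_SIZE

def can_release(size_bytes):
    return size_bytes <= MAX_RELEASE_FILE_SIZE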

@bloodearnest
Member Author

I'm not sure how an output-checker can meaningfully review and check a 25MB CSV file.

And SVG has its own challenges: should we even allow that format at all?

bloodearnest added a commit that referenced this issue Sep 28, 2023
This adds a global config, `config.MAX_LEVEL4_FILESIZE` (which defaults
to 16MB), which limits the size of files that will be copied to level
4 storage.

The goal here is to prevent accidental copying of files to level 4 that
should not be there. Level 4 files should be aggregate data, potentially
reviewable/releasable by output checkers. Datasets including
pseudonymized patient-level data should not be marked as
`moderately_sensitive`, and this is one measure to prevent that being
done by mistake.

When triggered, it communicates the fact that files have not been
copied to users in two ways:

1) It includes the list of files not copied in the job status_message,
   which will be visible at jobs.opensafely.org

2) It writes a file with the same name plus a `.txt` suffix containing a
   message, so it is discoverable in level 4. It deletes this file if the
   files are successfully copied.

Fixes #298
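A rough sketch of the behaviour described above (not the actual job-runner code; everything besides `MAX_LEVEL4_FILESIZE` is invented for illustration):

import shutil
from pathlib import Path

MAX_LEVEL4_FILESIZE = 16 * 1024 * 1024  # 16MB default

def copy_to_level4(src: Path, dst: Path) -> bool:
    marker = dst.with_name(dst.name + ".txt")
    dst.parent.mkdir(parents=True, exist_ok=True)
    if src.stat().st_size > MAX_LEVEL4_FILESIZE:
        # Leave a discoverable note in place of the file.
        marker.write_text(
            f"{src.name} was not copied to level 4: it exceeds the "
            f"{MAX_LEVEL4_FILESIZE} byte limit for moderately_sensitive outputs."
        )
        return False
    shutil.copy(src, dst)
    # Remove any stale note left by a previous oversized run.
    marker.unlink(missing_ok=True)
    return True

# The names of any files not copied would also be added to the job's
# status_message, visible at jobs.opensafely.org.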