
Limit medium_privacy files sizes #298

Closed
bloodearnest opened this issue Oct 18, 2021 · 19 comments

@bloodearnest
Member

bloodearnest commented Oct 18, 2021

We have cases where researchers accidentally release cohort files to level 4.

#159 doesn't help here, as CSV is a valid type of file to release.

We could instead limit the file size, as most cohort files are very large.

Currently, the largest file on TPP is 18MB, so perhaps a limit of 32MB or similar would work?

@bloodearnest
Member Author

Related: we should also implement file size limits in release-hatch and job-server APIs

@inglesp
Contributor

inglesp commented Apr 5, 2022

We should also limit the total size of files in medium_privacy.

@sebbacon
Contributor

sebbacon commented Apr 6, 2022

Whoever picks this up should:

  • Propose individual and total file sizes and get agreement from a couple of other people
  • Consider how to conveniently communicate total file size errors to users so they can self-serve fixing it
  • Remember to update the documentation describing the new limits and the rationale
  • Remember to update the news page in the documentation

@inglesp inglesp self-assigned this Apr 6, 2022
@inglesp inglesp removed their assignment Apr 12, 2022
@iaindillingham iaindillingham self-assigned this Apr 26, 2022
@iaindillingham
Member

Before proposing individual and total (for the workspace) file size limits, I'd like to compute some summary statistics. However, this would involve running some code within the TPP backend. Python 3.8.10 is available, so something like:

from collections import namedtuple
import glob
import itertools
import pathlib

root_dir = pathlib.Path("/srv/medium_privacy")

# All output files, e.g. /srv/medium_privacy/<workspace>/output/<file>
output_files = [
    pathlib.Path(x) for x in glob.glob(f"{root_dir}/**/output/*.*", recursive=True)
]

Record = namedtuple("Record", ["workspace", "max_st_size", "sum_st_size"])


def workspace_for(path):
    # The workspace is the first component of the path relative to root_dir.
    return str(list(path.relative_to(root_dir).parents)[-2])


records = []

# itertools.groupby only groups consecutive items, so sort by workspace first.
for workspace, grouped_paths in itertools.groupby(
    sorted(output_files, key=workspace_for), key=workspace_for
):
    st_size = [p.stat().st_size for p in grouped_paths]
    records.append(Record(workspace, max(st_size), sum(st_size)))

This groups output files by workspace, and then computes some summary statistics.

I think the individual file size limit should probably be a round number of megabytes above max(max_st_size); similarly, the total file size limit should be a round number of megabytes above max(sum_st_size). Clearly, this assumes that current workspaces are representative of all possible workspaces; and that no current workspace includes a cohort file. So, I'd like to 👀 the distribution of max_st_size and investigate the outliers.
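For example (purely illustrative; the 8MB step and the helper name are my own), the limit could be derived by rounding the observed maximum up to the next round number of megabytes:

import math

def round_up_mb(n_bytes, step_mb=8):
    # Round a size in bytes up to the next multiple of step_mb megabytes.
    mb = n_bytes / (1024 * 1024)
    return math.ceil(mb / step_mb) * step_mb

# e.g., using the records computed above:
# file_limit_mb = round_up_mb(max(r.max_st_size for r in records))
# total_limit_mb = round_up_mb(max(r.sum_st_size for r in records))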

But... running some code within the TPP backend? I think this is covered by our policy; and this code is run against medium privacy (i.e. released) files. Could I check, though, @sebbacon?

@evansd
Contributor

evansd commented Apr 27, 2022

Getting arbitrary code onto the backend requires (by design) some hoop-jumping, so you might be better off with bash.

This will give you the size of each workspace in MB:

cd /srv/medium_privacy
du -ms * | sort -n

And this will give you the size of individual files in KB (annoyingly printf doesn't support MB):

find . -printf '%k %p\n' | sort -n

@iaindillingham
Member

Thanks Dave 🙂 My Bash is at the Stack Overflow level. Entering three lines of your Bash is better than over a dozen of my Python, though 👍🏻

@iaindillingham
Member

iaindillingham commented May 3, 2022

Largest five workspaces:

| GB   | Workspace                                   |
| ---- | ------------------------------------------- |
| 18.1 | ethnicity-master                            |
| 9.0  | long-covid-descriptive                      |
| 8.5  | sro-smr                                     |
| 6.3  | the_effects_of_covid-19_on_doac_prescribing |
| 5.5  | service_eval_work                           |

Largest ten outputs:

| GB       | Workspace                                   | Type    | Purpose |
| -------- | ------------------------------------------- | ------- | ------- |
| 18.5     | ethnicity-master                            | svg     |         |
| 9.3      | long-covid-descriptive                      | csv     | measure |
| 3.6      | sro-pulse-oximetry                          | csv     | measure |
| 2.6      | sro-smr                                     | csv     | measure |
| 2.5      | smr-sro¹                                    | csv     | measure |
| 0.8 (x8) | the_effects_of_covid-19_on_doac_prescribing | feather | input   |
| 0.8 (x3) | lockdown_and_vulnerable_groups              | csv     | input   |
| 0.6      | highcostdrugs_adalimumab_master             | csv     | input   |
| 0.6      | highcostdrugs_v3                            | csv     | input   |
| 0.4      | covid-ve-change-over-time-main              | feather |         |

Issue: Measure files can be larger than input files. If we set a per-file threshold high enough to allow measure files, then we would also allow input files. Could the measure files be split into multiple, smaller measure files?

Issue: to what extent is it possible to check an SVG file? (#159)

Footnotes

  1. This isn't a typo 🙂 I think the repo was renamed from smr-sro to sro-smr, but the workspace directory wasn't renamed.

@iaindillingham
Member

iaindillingham commented May 3, 2022

Of the large measure files:

  • long-covid-descriptive no longer generates measure files
  • sro-pulse-oximetry does but is unlikely to be re-run
  • sro-smr does but is unlikely to be re-run

However, scanning the largest 100 outputs, most (~80%) are measure files or graphics files (svg, tiff).


As an aside, of the accidentally released input files:

  • the_effects_of_covid-19_on_doac_prescribing seems to have been fixed
  • lockdown_and_vulnerable_groups seems to have been fixed
  • highcostdrugs_adalimumab_master seems to be the result of a greedy glob in a downstream action; I think this would be fixed by Allowlist of file extentions copied into medium privacy outputs #159
  • The branch associated with highcostdrugs_v3 no longer exists

@evansd
Contributor

evansd commented May 3, 2022

This is very useful, Iain. Thanks. I'd be interested to know what makes those measure files so huge? Is it that they're grouping on something with high cardinality?

@iaindillingham
Member

I think it's because sro-pulse-oximetry and sro-smr group by None -- "calculated on the individual patient level", which I think means "ungrouped". Possibly long-covid-descriptive and smr-sro did too, but I couldn't easily interrogate their study definitions.

That said, a large proportion of the largest 100 outputs are measure files; surely they can't all group by None? I'll keep digging.
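For context, here's a rough sketch of the two cases using the measures API (the IDs and column names are invented):

from cohortextractor import Measure

measures = [
    # group_by=None: per the docs, "calculated on the individual patient
    # level", so the measure table can approach the size of the cohort itself.
    Measure(
        id="event_rate_per_patient",
        numerator="had_event",
        denominator="population",
        group_by=None,
    ),
    # Grouping by a low-cardinality variable keeps the table small.
    Measure(
        id="event_rate_by_region",
        numerator="had_event",
        denominator="population",
        group_by="region",
    ),
]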

@evansd
Contributor

evansd commented May 3, 2022

> I think it's because sro-pulse-oximetry and sro-smr group by None -- "calculated on the individual patient level", which I think means "ungrouped"

Ah right, yes. So I guess these are really a different sort of beast from other measure outputs and arguably should be considered high sensitivity.

@iaindillingham
Member

For the archaeologists, the accidentally released input files were discussed in #393.

@iaindillingham
Member

It's not easy to determine a file size limit for medium privacy files that would prevent cohort files from being released accidentally, because the largest files are images or measure tables. We could discount images and ungrouped measure tables¹ when setting the file size limit. However, even then, the largest remaining file (325.3 MB) is a grouped measure table: setting a file size limit at a round number of MB above the size of the largest file would probably still be too large. And limiting the number of grouping variables, to reduce the size of the largest file, seems like a separate issue.


Why is the largest file a grouped measure table? Possibly because the number of grouping variables is large. The largest file comes from the service_eval_work workspace and has five grouping variables (infection_repeat_antibiotics_lrti). If I understand the measures framework correctly, then as the number of grouping variables increases, the likelihood that the measure table will approximate the cohort also increases. So, this measure table will approximate the cohort.


We could set per-file-type file size limits.

| File type | Largest file (MB) | Number of files |
| --------- | ----------------- | --------------- |
| .csv      | 325.3             | 33,434          |
| .csv.gz   | 0.1               | 1               |
| .dta      | 0.7               | 339             |
| .dta.gz   | -                 | 0               |
| .feather  | 484.6²            | 12              |

However, most files are CSV files. Consequently, we could determine a reasonable file size limit for the majority of file types; but these do not represent the majority of files.

We could prevent files that match a pattern from being released. Cohort files start with input. However, at present, there is only one cohort file that matches this pattern (from the postopcovid workspace). The associated study definition seems to have been fixed.
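If we went down that route, the check could be as simple as the following sketch (the helper is hypothetical; the pattern just encodes the naming convention above):

import fnmatch
import pathlib

def looks_like_cohort_file(path):
    # Cohort extracts conventionally have filenames starting with "input".
    return fnmatch.fnmatch(pathlib.Path(path).name, "input*")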

However, it's possible that addressing the following would make this issue redundant:

Footnotes

  1. Measure(..., group_by=None).

  2. All feather files are from covid-ve-change-over-time-* workspaces and appear to contain dummy data.

@iaindillingham
Member

> setting a file size limit at a round number of MB above the size of the largest file would probably still be too large.

How large are cohort files, typically? The largest cohort file is roughly 5GB. However, about 60% of the 2,532 cohort files are less than 325.3MB -- the size of the largest grouped measure table. So, if we were to set a file size limit that allowed the largest grouped measure table to be released, this would allow roughly 60% of the cohort files to be released, too.

@bloodearnest
Member Author

bloodearnest commented May 12, 2022

A couple of thoughts:

  1. I think the idea of limiting file sizes is more of a "prevent accidental large files" measure than a serious security barrier. So we should definitely set some limit, even if it's large enough to allow smaller cohorts. In the case of large measure CSVs, the user could potentially gzip them if needed.

  2. I'm not 100% sure we should base all our decisions here on existing files; we should also define a policy based on the goals we want.

Why are users exposing 325MB of CSV to level 4? They can't ever release it, so are they debugging/checking their code? Is this really necessary? Could we provide alternate ways? I don't believe anyone is manually reviewing 325MB of CSVs, so perhaps producing an output of the first 1000 lines in medium_privacy is enough?

@bloodearnest
Member Author

I also think that we could block certain types, regardless of size: e.g. .dta, .feather.
This could be based on extension, and we could also use https://pypi.org/project/python-magic/
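Something like this rough sketch, say (the blocklist and the MIME check are illustrative, not proposed values):

import pathlib

import magic  # https://pypi.org/project/python-magic/

BLOCKED_SUFFIXES = {".dta", ".feather"}  # illustrative blocklist

def is_blocked(path):
    path = pathlib.Path(path)
    # Cheap check on the extension first.
    if path.suffix.lower() in BLOCKED_SUFFIXES:
        return True
    # Optionally sniff the content too, to catch renamed files; exactly which
    # MIME types to block would need checking against real outputs.
    return magic.from_file(str(path), mime=True) == "application/octet-stream"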

@lucyb
Contributor

lucyb commented Sep 12, 2023

Looking at the files that have been released to Job Server so far (and using Dave's helpful bash commands), the largest file is a 29MB SVG, closely followed by a 25MB CSV.

So, I think we could reasonably set quite a low limit (perhaps 45MB or less) on the size of files released by release-hatch, and tweak it as needed.

EDIT: this is now covered in this ticket.

@bloodearnest
Member Author

I would argue that those example files were anomalies, and we shouldn't have allowed them to be released, and thus we could set the threshold lower.

It's worth noting that there are also two maximum size values we could implement (sketched roughly below):

  • the maximum size of file that job-runner is willing to copy into level 4. The goal of this is to limit accidentally putting level 3 data into level 4.

  • the maximum size of a file that can be released. This should probably be less than the above.
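To illustrate (names and values here are placeholders, not agreed limits):

# job-runner: refuse to copy outputs above this size into level 4
MAX_LEVEL4_COPY_SIZE = 32 * 1024 * 1024

# release-hatch / job-server: refuse to release files above this (smaller) size
MAX_RELEASE_FILE_SIZE = 16 * 1024 * 1024

def can_copy_to_level4(size_bytes):
    return size_bytes <= MAX_LEVEL4_COPY_SIZE

def can_release(size_bytes):
    return size_bytes <= MAX_RELEASE_FILE_SIZE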

@bloodearnest
Member Author

I'm not sure how an output-checker can meaningfully review and check a 25MB CSV file.

And SVG has its own challenges: should we even allow that format at all?

bloodearnest added a commit that referenced this issue Sep 28, 2023
This adds a global config, `config.MAX_LEVEL4_FILESIZE` (which defaults
to 16MB), which limits the size of files that will be copied to level
4 storage.

The goal here is to prevent accidental copying of files to level 4 that
should not be there. Level 4 files should be aggregate data, potentially
reviewable/releasable by output checkers. Datasets including
pseudonymized patient-level data should not be marked as
`moderately_sensitive`, and this is one measure to prevent that being
done by mistake.

When triggered, it communicates the fact that files have not been
copied to users in two ways:

1) It includes the list of files not copied in the job status_message,
   which will be visible at jobs.opensafely.org

2) It writes a file with the same name plus a `.txt` suffix containing a
   message, so it is discoverable in level 4. It deletes this file if the
   files are successfully copied.

Fixes #298
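A rough sketch of the behaviour described above (not the actual job-runner code; everything besides `MAX_LEVEL4_FILESIZE` is invented for illustration):

import shutil
from pathlib import Path

MAX_LEVEL4_FILESIZE = 16 * 1024 * 1024  # 16MB default

def copy_to_level4(src: Path, dst: Path) -> bool:
    marker = dst.with_name(dst.name + ".txt")
    dst.parent.mkdir(parents=True, exist_ok=True)
    if src.stat().st_size > MAX_LEVEL4_FILESIZE:
        # Leave a discoverable note in place of the file.
        marker.write_text(
            f"{src.name} was not copied to level 4: it exceeds the "
            f"{MAX_LEVEL4_FILESIZE} byte limit for moderately_sensitive outputs."
        )
        return False
    shutil.copy(src, dst)
    # Remove any stale note left by a previous oversized run.
    marker.unlink(missing_ok=True)
    return True

# The names of any files not copied would also be added to the job's
# status_message, visible at jobs.opensafely.org.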