Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Project Detail page: remove concurrency, improve performance by making a single request #4634

Open
mikerkelly opened this issue Oct 1, 2024 · 1 comment
Labels
deck-scrubbing Tech debt or other between-initiative tidy-up work

Comments

@mikerkelly
Copy link
Contributor

mikerkelly commented Oct 1, 2024

The ProjectDetailView uses the concurrent Python module to parallelise Github API requests through github.get_repo_is_private(). This is unnecessary and over-complicated and makes reasoning about and maintaining the code more complicated than necessary (ref #4591). It's also inefficient to open a separate connection for each item as the connection and API endpoint overheads are higher than a bulk request. #3393 noted this code is inefficient.

The more typical approach would be for the endpoint to provide a method to do a bulk request with a single API access (possibly paginated/chunked if needed due to a max size limit). Doing this would allow us to eliminate this idiosyncratic use of concurrency. Per #4685 is a simple request to https://api.github.com/orgs/opensafely/repos sufficient? If we look in the github module (our thin-wrapper around requests and the GitHub API), we can see that functions like get_repos_with_branches and get_repos_with_dates follow a cursor query approach with GraphQL API accesses. get_repos_with_status_and_url already exists and returns repos with their privacy.

Generally, we should avoid using concurrency unless we need it to avoid over-complication. This is the only use of the concurrent module across the entire repo. It looks like #1562 introduced it to parallelize the requests to GitHub but did not consider restructuring to make a single request for many repos rather than many requests each to a single repo. The other two clients of github.get_repo_is_private() (WorkspaceDetail and SignOffRepo views) use this function for a single request, which is reasonable.

We might consider refactoring the commonality of the GraphQL code in github.py alongside doing this. Or put that in a separate issue. That might take inspiration from the metrics codebase approach but note this is a CLI-tool so there may be some differences in desired behavior. It may also be that the GraphQL code in github.py is written inefficiently.

If you work on this ticket probably start with quick prototyping of the different ways of getting the info we need from the API to compare performance. Also consider restructuring the view.

#3401 added some telemetry for diagnosis of #3393. That could be removed if this ticket is successful in improving performance.

@mikerkelly mikerkelly added the deck-scrubbing Tech debt or other between-initiative tidy-up work label Oct 1, 2024
@mikerkelly mikerkelly changed the title Remove unnecessary concurrency on Project Detail page Project Detail page: remove concurrency, improve performance by making a single request Oct 23, 2024
@bloodearnest
Copy link
Member

bloodearnest commented Oct 23, 2024

It's also inefficient to open a separate connection for each item as the connection and API endpoint overheads are higher than a bulk request. #3393 noted this code is inefficient.

I don't think this section is quite right, as the client uses a persistent requests Session object to maintain a per process global connection pool: https://github.com/opensafely-core/job-server/blob/main/jobserver/github.py#L19

So TLS connections to api.github.com should be long lived and reused. It may be that the default pool size needs increasing, or other config tweaks, to handle projects with lots of workspaces, perhaps.

Unrelated to the larger question of prefering to use bulk API endpoints, but possibly a useful detail.
This is a pattern we use in other service clients too, so I think worth calling out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deck-scrubbing Tech debt or other between-initiative tidy-up work
Projects
None yet
Development

No branches or pull requests

2 participants