The ProjectDetailView uses the concurrent Python module to parallelise GitHub API requests through github.get_repo_is_private(). This is unnecessary complexity that makes the code harder to reason about and maintain (ref #4591). It's also inefficient to open a separate connection for each repo, as the connection and API-call overheads are higher than those of a bulk request. #3393 noted this code is inefficient.
The more typical approach would be for the github module to provide a method that fetches everything we need in a single bulk API call (paginated/chunked if a maximum page size requires it). That would let us eliminate this idiosyncratic use of concurrency. Per #4685, would a simple request to https://api.github.com/orgs/opensafely/repos be sufficient? Looking at the github module (our thin wrapper around requests and the GitHub API), functions like get_repos_with_branches and get_repos_with_dates already follow a cursor-query approach against the GraphQL API, and get_repos_with_status_and_url already exists and returns repos with their privacy.
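As a starting point for prototyping, here is a minimal sketch of the bulk REST approach, assuming a token with enough scope to see the org's private repos; the helper name and shape are illustrative, not part of the existing github module.

```python
import requests

def list_org_repo_privacy(org, token):
    """Return {repo_name: is_private} for every repo in the org (hypothetical helper)."""
    session = requests.Session()
    session.headers.update({
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    })
    privacy = {}
    url = f"https://api.github.com/orgs/{org}/repos"
    params = {"per_page": 100}  # REST API maximum page size
    while url:
        response = session.get(url, params=params, timeout=30)
        response.raise_for_status()
        for repo in response.json():
            privacy[repo["name"]] = repo["private"]
        # GitHub signals further pages via the Link header, which requests parses for us.
        url = response.links.get("next", {}).get("url")
        params = None  # the "next" URL already embeds the query string
    return privacy
```

Even with a few hundred repos this is only a handful of sequential, paginated requests rather than one request per workspace, so the concurrency becomes unnecessary.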
Generally, we should avoid concurrency unless we actually need it, to keep the code simple. This is the only use of the concurrent module across the entire repo. It looks like #1562 introduced it to parallelise the requests to GitHub but did not consider restructuring to make a single request covering many repos rather than many requests each covering a single repo. The other two callers of github.get_repo_is_private() (the WorkspaceDetail and SignOffRepo views) each make a single request, which is reasonable.
We might refactor the common GraphQL code in github.py alongside this work, or split that into a separate issue. Such a refactor could take inspiration from the metrics codebase, though note that is a CLI tool, so the desired behaviour may differ. It may also be that the existing GraphQL code in github.py is written inefficiently.
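For illustration, the shared GraphQL plumbing could end up looking something like this generic cursor-pagination helper. This is only a sketch; the names, query shape, and endpoint handling are assumptions rather than the current github.py API.

```python
import requests

GRAPHQL_URL = "https://api.github.com/graphql"

def iter_graphql_pages(query, variables, token, get_page_info):
    """Yield each page of data for a cursor-paginated GraphQL query (hypothetical helper).

    `get_page_info` extracts the {"hasNextPage": ..., "endCursor": ...} dict
    from a page of results, so each caller owns its own query shape.
    """
    session = requests.Session()
    session.headers.update({"Authorization": f"Bearer {token}"})
    cursor = None
    while True:
        payload = {"query": query, "variables": {**variables, "cursor": cursor}}
        response = session.post(GRAPHQL_URL, json=payload, timeout=30)
        response.raise_for_status()
        data = response.json()["data"]
        yield data
        page_info = get_page_info(data)
        if not page_info["hasNextPage"]:
            return
        cursor = page_info["endCursor"]
```

Functions like get_repos_with_branches and get_repos_with_dates could then each supply just their own query string and page-info extractor.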
If you work on this ticket, probably start by quickly prototyping the different ways of getting the information we need from the API and comparing their performance (see the timing sketch below). Also consider restructuring the view.
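A throwaway timing harness is probably enough for that comparison; the calls being timed below are stand-ins for whatever prototypes get built (bulk REST, GraphQL cursor, the current per-repo path), not real names.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Print wall-clock time for whichever prototype runs inside the block.
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.2f}s")

# Example usage with hypothetical prototype functions:
# with timed("bulk REST /orgs/opensafely/repos"):
#     list_org_repo_privacy("opensafely", token)
# with timed("current per-repo approach"):
#     ...  # exercise the existing get_repo_is_private-in-a-loop code path
```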
#3401 added some telemetry to help diagnose #3393. That could be removed if this ticket succeeds in improving performance.
mikerkelly changed the title from "Remove unnecessary concurrency on Project Detail page" to "Project Detail page: remove concurrency, improve performance by making a single request" on Oct 23, 2024.
> It's also inefficient to open a separate connection for each repo, as the connection and API-call overheads are higher than those of a bulk request. #3393 noted this code is inefficient.
TLS connections to api.github.com should be long-lived and reused, though. It may be that the default connection-pool size needs increasing, or other configuration tweaks, to handle projects with lots of workspaces (see the sketch after this comment).
This is unrelated to the larger question of preferring bulk API endpoints, but it is possibly a useful detail.
This is a pattern we use in other service clients too, so I think it's worth calling out.
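If the per-repo calls stay around in any form, the pool tweak might look roughly like this. It assumes the github module keeps (or could keep) a module-level requests.Session, which is an assumption about its structure rather than a description of the current code.

```python
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# HTTPAdapter's pool_maxsize defaults to 10, so more than 10 concurrent
# requests to the same host will open extra, non-reused connections.
session.mount("https://api.github.com/", HTTPAdapter(pool_connections=10, pool_maxsize=50))
```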