Scale scorecard from 2K to 100K to a million repos #318

Closed
inferno-chromium opened this issue Mar 30, 2021 · 14 comments
Labels: priority/must-do Upcoming release

Comments

@inferno-chromium
Contributor

No description provided.

@naveensrinivasan
Member

I agree with the goal. But I think it should be initially 10k repositories.

@azeemshaikh38
Contributor

azeemshaikh38 commented May 5, 2021

Outcome of the initial internal discussion: we'll start by focusing on horizontal scalability of the cron job itself. That is, instead of a single worker machine processing all repositories, we will redesign the architecture so that more workers can be added to improve the performance of the job. To that end, the proposed solution is a PubSub architecture: all the repositories to be processed will be pushed to a Topic, and multiple subscribers/workers will read from it and write JSON-formatted output to GCS.

The JSON output will be sharded, and BQ will import this sharded JSON using BQ data transfer service.

The BQ table will be date-partitioned for better scalability. The detailed design discussion of how this sharded JSON output will interact with the BQ table is in #366.

PS: GCS - Google Cloud Storage, BQ - BigQuery.
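
For illustration, the fan-out described above could look roughly like the sketch below. It is a minimal in-process stand-in, not the scorecard implementation: a `queue.Queue` plays the role of the PubSub Topic, a dictionary plays the role of the GCS bucket, and the names `run_workers`, `process`, and the `<date>/shard-NNNNN.json` layout are assumptions made up for this example.

```python
import json
import queue
import threading
from collections import defaultdict
from datetime import date


def run_workers(repos, num_workers, process, run_date):
    """Fan repos out to workers; return {shard_name: [json_line, ...]}."""
    topic = queue.Queue()  # stand-in for the PubSub Topic
    for repo in repos:
        topic.put(repo)

    shards = defaultdict(list)  # stand-in for objects in a GCS bucket
    lock = threading.Lock()

    def worker(worker_id):
        # Hypothetical layout: one shard per worker, grouped by run date,
        # so a date-partitioned table can load a whole prefix at once.
        shard = f"{run_date.isoformat()}/shard-{worker_id:05d}.json"
        while True:
            try:
                repo = topic.get_nowait()
            except queue.Empty:
                return  # nothing left to process
            line = json.dumps(
                {
                    "repo": repo,
                    "date": run_date.isoformat(),
                    "checks": process(repo),
                },
                sort_keys=True,
            )
            with lock:
                shards[shard].append(line)

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(shards)


if __name__ == "__main__":
    repos = [f"github.com/example/repo{i}" for i in range(8)]
    result = run_workers(repos, 3, lambda r: {"ok": True}, date(2021, 5, 5))
    for name, lines in sorted(result.items()):
        print(name, len(lines))
```

Adding capacity then amounts to raising `num_workers`; in a real deployment the workers would be separate machines subscribed to the same Topic rather than threads.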

@azeemshaikh38 azeemshaikh38 added this to the milestone-q2 milestone May 17, 2021
@azeemshaikh38
Contributor

@inferno-chromium @oliverchang @naveensrinivasan

Update on the GitHub token issue: my thoughts on it so far are that we should consider using GitHub conditional requests as a starter. Basically, we store the entire HTTP response from GitHub (along with ETag) indexed by the requestURL. Our subsequent HTTP calls will be backed by this "ETag Cache".

A simple implementation of the "ETag Cache" could be backed by a blob store. Filenames will be the requestURLs and file content will be the HTTP response. This is probably terrible in terms of IO performance, but since our concern here is reducing GitHub token usage, it might be a good starting point, and we can look into improving the implementation of the ETag cache in the future. What do you think? Feedback welcome.
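
A rough sketch of the "ETag Cache" idea, under stated assumptions: the class name `ETagCache`, the `fetch(url, headers)` callable, and the in-memory dictionary standing in for the blob store are all hypothetical and exist only for this example. The one part taken from HTTP itself is the mechanism: send the stored ETag in `If-None-Match`, and reuse the stored body when the server answers `304 Not Modified`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# fetch(url, request_headers) -> (status_code, response_headers, body)
Fetch = Callable[[str, Dict[str, str]], Tuple[int, Dict[str, str], bytes]]


@dataclass
class CachedResponse:
    etag: str
    body: bytes


class ETagCache:
    """Conditional-request cache keyed by request URL."""

    def __init__(
        self, fetch: Fetch, store: Optional[Dict[str, CachedResponse]] = None
    ):
        self._fetch = fetch
        # A dict stands in for the blob store (URL -> stored response).
        self._store = {} if store is None else store

    def get(self, url: str) -> bytes:
        cached = self._store.get(url)
        headers = {"If-None-Match": cached.etag} if cached else {}
        status, resp_headers, body = self._fetch(url, headers)
        if status == 304 and cached:
            # Resource unchanged: serve the stored body.
            return cached.body
        etag = resp_headers.get("ETag")
        if status == 200 and etag:
            self._store[url] = CachedResponse(etag, body)
        return body


if __name__ == "__main__":
    def fake_fetch(url, headers):
        # Pretend server whose resource never changes.
        if headers.get("If-None-Match") == '"v1"':
            return 304, {}, b""
        return 200, {"ETag": '"v1"'}, b'{"stars": 1}'

    cache = ETagCache(fake_fetch)
    print(cache.get("https://api.github.com/repos/o/r"))
    print(cache.get("https://api.github.com/repos/o/r"))
```

Injecting `fetch` keeps the sketch independent of any particular HTTP client and makes it easy to swap the dictionary for a blob-store-backed mapping later.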

@jeffmendoza
Member

I've been doing some conditional request testing. Unfortunately the ETag changes for each process invocation (each new auth token), even if the resource didn't change. I was using a GitHub App (https://docs.github.com/en/developers/apps/authenticating-with-github-apps#authenticating-as-a-github-app) though, so I'm not sure how it works with a PAT or OAuth app.

@azeemshaikh38
Contributor

Oh interesting. Did you also try the "Last-Modified" and "If-Modified-Since" options? I would expect that they shouldn't be process dependent.

@jeffmendoza
Member

I only tried If-None-Match

@azeemshaikh38
Contributor

I'll investigate the If-Modified-Since option. The only drawback I see so far is that more API endpoints support If-None-Match than support If-Modified-Since.

Also, AFAICT, the If-None-Match option is probably token-dependent rather than process-dependent. I tried using an ETag that I received on my gLinux machine from my Mac, and it seems to work as long as I provide the same token value.

@azeemshaikh38
Contributor

Just an update here - I plan on adding some monitoring to the code. It'll help us understand bottlenecks in our code and figure out whether incoming PRs are making performance better or worse.

Will look into improving the GitHub token efficiency after this.

@azeemshaikh38
Contributor

Update:

  1. We now have the end-to-end scaled architecture set up and running.
  2. The cron job runs weekly instead of daily to handle the large number of repos.
  3. Data for 30k repos was published last week, and this week's run will publish 50k when it completes.
  4. We removed BQ data transfer service and set up a custom data transfer job which runs 2x/week.

Next steps:

  1. CheckIfFileExists and CheckFileContent need to be optimized so that they don't run tarball decompression on every invocation.
  2. HTTP-heavy checks like Code-Review and Pull-Request can be improved by using a single GraphQL call instead of multiple REST API calls.
  3. Look into GitHub conditional requests and using the Last-Modified header to improve token efficiency (this allows us to use more PubSub worker instances).
  4. Set up a SecretServer which will be shared by multiple instances of the PubSub worker for effectively sharing GitHub tokens.
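
As an illustration of the first next step, one way to avoid repeated decompression is to unpack the tarball once per repository and let every file check read from the cached result. The sketch below is an assumption about how that could be structured, not the actual scorecard code: `RepoArchive`, `file_exists`, `file_content`, and `make_tarball` are hypothetical names, and holding all file contents in memory is a simplification that may not suit very large repositories.

```python
import io
import tarfile
from typing import Callable, Dict, Optional


class RepoArchive:
    """Decompresses a repo tarball at most once and serves checks from memory."""

    def __init__(self, tarball: bytes):
        self._tarball = tarball
        self._files: Optional[Dict[str, bytes]] = None
        self.decompressions = 0  # exposed only to show the caching effect

    def _load(self) -> Dict[str, bytes]:
        if self._files is None:
            self.decompressions += 1
            files: Dict[str, bytes] = {}
            with tarfile.open(
                fileobj=io.BytesIO(self._tarball), mode="r:gz"
            ) as tar:
                for member in tar:
                    if member.isfile():
                        files[member.name] = tar.extractfile(member).read()
            self._files = files
        return self._files

    def file_exists(self, predicate: Callable[[str], bool]) -> bool:
        return any(predicate(name) for name in self._load())

    def file_content(self, name: str) -> Optional[bytes]:
        return self._load().get(name)


def make_tarball(files: Dict[str, bytes]) -> bytes:
    """Helper for the example: build a small .tar.gz in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


if __name__ == "__main__":
    archive = RepoArchive(
        make_tarball({"repo/SECURITY.md": b"policy", "repo/go.mod": b"module x"})
    )
    print(archive.file_exists(lambda n: n.endswith("SECURITY.md")))
    print(archive.file_content("repo/go.mod"))
    print(archive.decompressions)
```

However many checks run against the same `RepoArchive`, the decompression counter only increments on the first call.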

@azeemshaikh38
Contributor

Improvements in HTTP requests and performance after #640:

image

@oliverchang
Contributor

That's amazing! Should hopefully reduce a fair bit of our quota consumption for REST API requests.

@inferno-chromium
Contributor Author

Improvements in HTTP requests and performance after #640:

image

Wow wow wow! That big of a win. Great job, Azeem, on locating this bottleneck!

@azeemshaikh38
Contributor

FYI, I'll soon start work on adding a secret server to the cron job. So far, we have had only 1 PubSub worker. To add more workers which process requests in parallel, it is necessary to ensure that our GitHub tokens do not get consumed concurrently. Concurrent requests trigger the secondary rate limit/abuse detection in GitHub servers. To avoid this, every worker which requires a GitHub token will check out a token from the secret server. A checked-out token will be inaccessible to other workers until either (i) it is released back to the secret server or (ii) a wait time of X has passed. With this in place we can increase our GitHub token count, to hopefully scale to 1M.

Feedback/comments welcome.
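
To make the checkout/release semantics above concrete, here is a minimal single-process sketch. It is not the secret server design itself: `TokenPool`, `checkout`, `release`, and `lease_seconds` (standing in for the wait time X) are hypothetical names, and a real server would expose this over the network to multiple worker machines.

```python
import threading
import time
from typing import Callable, Dict, Iterable, Optional


class TokenPool:
    """Hands each token to at most one holder until release or lease expiry."""

    def __init__(
        self,
        tokens: Iterable[str],
        lease_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._lease = lease_seconds  # the "wait time of X"
        self._clock = clock
        self._lock = threading.Lock()
        # token -> lease expiry time, or None when the token is free
        self._expiry: Dict[str, Optional[float]] = {t: None for t in tokens}

    def checkout(self) -> Optional[str]:
        """Return a free (or expired) token, or None if all are in use."""
        now = self._clock()
        with self._lock:
            for token, expiry in self._expiry.items():
                if expiry is None or expiry <= now:
                    self._expiry[token] = now + self._lease
                    return token
        return None

    def release(self, token: str) -> None:
        """Make the token available to other workers again."""
        with self._lock:
            if token in self._expiry:
                self._expiry[token] = None


if __name__ == "__main__":
    pool = TokenPool(["token-a", "token-b"], lease_seconds=60)
    first = pool.checkout()
    second = pool.checkout()
    print(first, second, pool.checkout())  # third checkout finds none free
    pool.release(first)
    print(pool.checkout())
```

One simplification to be aware of: a worker that releases a token after its lease has already expired could free a token that was since handed to someone else. Returning a lease ID from `checkout` and requiring it in `release` would close that gap.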

@laurentsimon laurentsimon removed this from the milestone-q2 milestone Oct 7, 2021
@azeemshaikh38 azeemshaikh38 added this to the milestone v4 milestone Oct 20, 2021
@azeemshaikh38
Contributor

Update here: we have now implemented a token server, which is being used by the production cron workers. This allows us to scale the number of workers easily and use the GitHub tokens efficiently.

Next steps: increase the number of workers and slowly roll out the number of repos to 1M.
