Scale scorecard from 2K to 100K to a million repos #318
Comments
I agree with the goal. But I think it should initially be 10k repositories.
Outcome of the initial internal discussion - we'll start by focusing on horizontal scalability of the cron job itself. That is, instead of a single worker machine processing all repositories, we will re-design the architecture so that more workers can be added to improve the performance of the job. To that end, the proposed solution is to use a PubSub architecture - all the repositories to be processed will be pushed to a Topic, and multiple subscribers/workers read from it and write JSON-formatted output to GCS. The JSON output will be sharded, and BQ will import the sharded JSON using the BQ Data Transfer Service. The BQ table will be date-partitioned for better scalability. Detailed design discussion about how this sharded JSON output will interact with the BQ table is here - #366. PS: GCS - Google Cloud Storage, BQ - BigQuery.
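To make the proposal concrete, here is a minimal sketch of what one worker in that design could look like: it pulls repo names from a Pub/Sub subscription, runs the checks, and writes one JSON shard per repo to GCS for BQ to import. The project ID, topic/subscription and bucket names, and the runChecks helper are all hypothetical placeholders, not the actual cron implementation.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/pubsub"
	"cloud.google.com/go/storage"
)

// runChecks is a stand-in for the real scorecard check execution.
func runChecks(repo string) []byte {
	return []byte(fmt.Sprintf(`{"repo":%q,"score":0}`, repo))
}

func main() {
	ctx := context.Background()

	psClient, err := pubsub.NewClient(ctx, "my-gcp-project") // hypothetical project
	if err != nil {
		log.Fatal(err)
	}
	gcs, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}

	sub := psClient.Subscription("scorecard-repos-sub") // hypothetical subscription
	err = sub.Receive(ctx, func(ctx context.Context, msg *pubsub.Message) {
		repo := string(msg.Data) // e.g. "github.com/ossf/scorecard"

		// Run the checks and write the result as one JSON shard; the BQ Data
		// Transfer Service would later import the sharded files from the bucket.
		result := runChecks(repo)
		obj := gcs.Bucket("scorecard-results").Object(fmt.Sprintf("shards/%s.json", repo))
		w := obj.NewWriter(ctx)
		if _, err := w.Write(result); err != nil {
			w.Close()
			msg.Nack() // another worker will retry this repo
			return
		}
		if err := w.Close(); err != nil {
			msg.Nack()
			return
		}
		msg.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

Adding capacity then amounts to starting more copies of this worker against the same subscription; Pub/Sub distributes the repos across them.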
@inferno-chromium @oliverchang @naveensrinivasan Update on the GitHub token issue: my thoughts so far are that we should consider using GitHub conditional requests as a starting point. Basically, we store the entire HTTP response from GitHub (along with its ETag) indexed by the request URL. Our subsequent HTTP calls will be backed by this "ETag cache". A simple implementation of the ETag cache could be backed by a blob store: filenames would be the request URLs and file contents would be the HTTP responses. This is probably terrible in terms of IO performance, but since our concern here is reducing GitHub token usage, it might be a good starting point, and we can look into improving the implementation of the ETag cache in the future. What do you guys think? Thoughts/feedback?
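A minimal sketch of that ETag cache idea, using the local filesystem in place of a blob store and a deliberately naive filename scheme; conditionalGet, the cache layout, and the GITHUB_TOKEN env var are illustrative assumptions, not the proposed production design.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"path/filepath"
)

// cachePath derives a cache filename from the request URL (URL-escaping the
// whole URL). Purely illustrative; a blob store key scheme would differ.
func cachePath(dir, rawURL, suffix string) string {
	return filepath.Join(dir, url.QueryEscape(rawURL)+suffix)
}

// conditionalGet sends If-None-Match when a cached ETag exists; on a
// 304 Not Modified response it serves the cached body instead of re-downloading.
func conditionalGet(client *http.Client, rawURL, cacheDir, token string) ([]byte, error) {
	if err := os.MkdirAll(cacheDir, 0o700); err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodGet, rawURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "token "+token)

	etagFile := cachePath(cacheDir, rawURL, ".etag")
	bodyFile := cachePath(cacheDir, rawURL, ".body")
	if etag, err := os.ReadFile(etagFile); err == nil {
		req.Header.Set("If-None-Match", string(etag))
	}

	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusNotModified {
		return os.ReadFile(bodyFile) // cache hit: no fresh download needed
	}

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	// Remember the new ETag and body for the next run.
	if etag := resp.Header.Get("ETag"); etag != "" {
		_ = os.WriteFile(etagFile, []byte(etag), 0o600)
		_ = os.WriteFile(bodyFile, body, 0o600)
	}
	return body, nil
}

func main() {
	body, err := conditionalGet(http.DefaultClient,
		"https://api.github.com/repos/ossf/scorecard", ".etag-cache", os.Getenv("GITHUB_TOKEN"))
	if err != nil {
		panic(err)
	}
	fmt.Println(len(body), "bytes")
}
```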
I've been doing some conditional request testing. Unfortunately the ETag changes for each process invocation (each new auth token), even if the resource didn't change. I was using a GitHub App (https://docs.github.com/en/developers/apps/authenticating-with-github-apps#authenticating-as-a-github-app) though, not sure how it works with a PAT or OAuth app.
Oh interesting. Did you also try the "Last-Modified" and "If-Modified-Since" options? I would expect that they shouldn't be process dependent.
I only tried the ETag option so far.
I'll investigate the "Last-Modified"/"If-Modified-Since" option. Also, AFAICT, the …
Just an update here - I plan on adding some monitoring to the code. It'll help us understand bottlenecks in our code and help figure out whether incoming PRs are making performance better or worse. Will look into improving GitHub token efficiency after this.
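For illustration, one way such monitoring could look, using the Prometheus client library; the metric names, the processRepo placeholder, and the choice of Prometheus are assumptions here, not necessarily what was added to the repo.

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Total outgoing GitHub API requests, labeled by status code.
	githubRequests = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "github_requests_total", Help: "Outgoing GitHub API requests."},
		[]string{"code"},
	)
	// Time spent processing a single repository.
	repoDuration = prometheus.NewHistogram(
		prometheus.HistogramOpts{Name: "repo_process_seconds", Help: "Per-repo processing time."},
	)
)

func init() {
	prometheus.MustRegister(githubRequests, repoDuration)
}

// processRepo is a placeholder for the per-repo work done by the cron job.
func processRepo(repo string) {
	start := time.Now()
	defer func() { repoDuration.Observe(time.Since(start).Seconds()) }()
	// ... run checks, recording each GitHub call, e.g.:
	githubRequests.WithLabelValues("200").Inc()
}

func main() {
	go func() {
		// Expose /metrics for scraping so bottlenecks and PR regressions are visible.
		http.Handle("/metrics", promhttp.Handler())
		_ = http.ListenAndServe(":9090", nil)
	}()
	processRepo("github.com/ossf/scorecard")
	select {} // keep serving metrics
}
```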
Update:
Next steps:
|
Improvements in HTTP requests and performance after #640:
That's amazing! Should hopefully reduce a fair bit of our quota consumption for REST API requests.
wow wow wow! that big of a win, great job Azeem in locating this bottleneck!
FYI, I'll soon start work on adding a token server. Feedback/comments welcome.
Update here: we have now implemented a token server, which is being used by the production cron workers. This allows us to scale the number of workers easily and use the GitHub tokens efficiently. Next steps: increase the number of workers and slowly roll out the number of repos to 1M.
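As a much-simplified illustration of the token-server idea: a small HTTP service hands out GitHub tokens from a shared pool so any number of workers can request one instead of each holding its own. The /token endpoint, the GITHUB_AUTH_TOKENS env var, and the round-robin selection are assumptions for this sketch; the real server presumably tracks remaining quota per token rather than rotating blindly.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
	"strings"
	"sync/atomic"
)

// tokenServer hands out GitHub tokens from a shared pool, round-robin,
// so horizontally scaled workers can share a fixed set of PATs.
type tokenServer struct {
	tokens []string
	next   uint64
}

func (s *tokenServer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	n := atomic.AddUint64(&s.next, 1)
	fmt.Fprint(w, s.tokens[n%uint64(len(s.tokens))])
}

func main() {
	// GITHUB_AUTH_TOKENS is a hypothetical comma-separated list of PATs.
	pool := strings.Split(os.Getenv("GITHUB_AUTH_TOKENS"), ",")
	http.Handle("/token", &tokenServer{tokens: pool})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

A worker would then GET /token before each batch of GitHub API calls, which is what lets the worker count grow independently of how many tokens exist.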