Scale scorecard from 2K to 100K to a million repos #318
Comments
I agree with the goal. But I think it should initially be 10k repositories.
Outcome of the initial internal discussion - we'll start by focusing on horizontal scalability of the cron job itself. That is, instead of a single worker machine processing all repositories, we will re-design the architecture so that more workers can be added to improve the performance of the job. To that end, the proposed solution is to use a PubSub architecture - all the repositories to be processed will be pushed to a Topic, and multiple subscribers/workers read from it and write JSON-formatted output to GCS. The JSON output will be sharded, and BQ will import the sharded JSON using the BQ Data Transfer Service. The BQ table will be date-partitioned for better scalability. Detailed design discussion about how this sharded JSON output will interact with the BQ table is here - #366. PS: GCS - Google Cloud Storage, BQ - BigQuery.
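To make the proposal concrete, here is a minimal sketch of what one worker in that design could look like: it pulls repo names from a Pub/Sub subscription, runs the checks, and writes one JSON shard per repo to GCS for BQ to import. The project ID, topic/subscription and bucket names, and the runChecks helper are all hypothetical placeholders, not the actual cron implementation.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/pubsub"
	"cloud.google.com/go/storage"
)

// runChecks is a stand-in for the real scorecard check execution.
func runChecks(repo string) []byte {
	return []byte(fmt.Sprintf(`{"repo":%q,"score":0}`, repo))
}

func main() {
	ctx := context.Background()

	psClient, err := pubsub.NewClient(ctx, "my-gcp-project") // hypothetical project
	if err != nil {
		log.Fatal(err)
	}
	gcs, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}

	sub := psClient.Subscription("scorecard-repos-sub") // hypothetical subscription
	err = sub.Receive(ctx, func(ctx context.Context, msg *pubsub.Message) {
		repo := string(msg.Data) // e.g. "github.com/ossf/scorecard"

		// Run the checks and write the result as one JSON shard; the BQ Data
		// Transfer Service would later import the sharded files from the bucket.
		result := runChecks(repo)
		obj := gcs.Bucket("scorecard-results").Object(fmt.Sprintf("shards/%s.json", repo))
		w := obj.NewWriter(ctx)
		if _, err := w.Write(result); err != nil {
			w.Close()
			msg.Nack() // another worker will retry this repo
			return
		}
		if err := w.Close(); err != nil {
			msg.Nack()
			return
		}
		msg.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

Adding capacity then amounts to starting more copies of this worker against the same subscription; Pub/Sub distributes the repos across them.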
@inferno-chromium @oliverchang @naveensrinivasan Update on the GitHub token issue: my thoughts so far are that we should consider using GitHub conditional requests as a starting point. Basically, we store the entire HTTP response from GitHub (along with its ETag) indexed by the request URL. Our subsequent HTTP calls will be backed by this "ETag cache". A simple implementation of the ETag cache could be backed by a blob store: filenames would be the request URLs and file contents would be the HTTP responses. This is probably terrible in terms of IO performance, but since our concern here is reducing GitHub token usage, it might be a good starting point, and we can look into improving the implementation of the ETag cache in the future. What do you guys think? Thoughts/feedback?
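A minimal sketch of that ETag cache idea, using the local filesystem in place of a blob store and a deliberately naive filename scheme; conditionalGet, the cache layout, and the GITHUB_TOKEN env var are illustrative assumptions, not the proposed production design.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"path/filepath"
)

// cachePath derives a cache filename from the request URL (URL-escaping the
// whole URL). Purely illustrative; a blob store key scheme would differ.
func cachePath(dir, rawURL, suffix string) string {
	return filepath.Join(dir, url.QueryEscape(rawURL)+suffix)
}

// conditionalGet sends If-None-Match when a cached ETag exists; on a
// 304 Not Modified response it serves the cached body instead of re-downloading.
func conditionalGet(client *http.Client, rawURL, cacheDir, token string) ([]byte, error) {
	if err := os.MkdirAll(cacheDir, 0o700); err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodGet, rawURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "token "+token)

	etagFile := cachePath(cacheDir, rawURL, ".etag")
	bodyFile := cachePath(cacheDir, rawURL, ".body")
	if etag, err := os.ReadFile(etagFile); err == nil {
		req.Header.Set("If-None-Match", string(etag))
	}

	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusNotModified {
		return os.ReadFile(bodyFile) // cache hit: no fresh download needed
	}

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	// Remember the new ETag and body for the next run.
	if etag := resp.Header.Get("ETag"); etag != "" {
		_ = os.WriteFile(etagFile, []byte(etag), 0o600)
		_ = os.WriteFile(bodyFile, body, 0o600)
	}
	return body, nil
}

func main() {
	body, err := conditionalGet(http.DefaultClient,
		"https://api.github.com/repos/ossf/scorecard", ".etag-cache", os.Getenv("GITHUB_TOKEN"))
	if err != nil {
		panic(err)
	}
	fmt.Println(len(body), "bytes")
}
```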
I've been doing some conditional request testing. Unfortunately the ETag changes for each process invocation (each new auth token), even if the resource didn't change. I was using a GitHub App (https://docs.github.com/en/developers/apps/authenticating-with-github-apps#authenticating-as-a-github-app) though, not sure how it works with a PAT or OAuth app.
Oh interesting. Did you also try the "Last-Modified" and "If-Modified-Since" options? I would expect that they shouldn't be process dependent.
I only tried the ETag option so far.
I'll investigate the "Last-Modified"/"If-Modified-Since" option. Also, AFAICT, the …
Just an update here - I plan on adding some monitoring to the code. It'll help us understand bottlenecks in our code and help figure out whether incoming PRs are making performance better or worse. Will look into improving GitHub token efficiency after this.
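For illustration, one way such monitoring could look, using the Prometheus client library; the metric names, the processRepo placeholder, and the choice of Prometheus are assumptions here, not necessarily what was added to the repo.

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Total outgoing GitHub API requests, labeled by status code.
	githubRequests = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "github_requests_total", Help: "Outgoing GitHub API requests."},
		[]string{"code"},
	)
	// Time spent processing a single repository.
	repoDuration = prometheus.NewHistogram(
		prometheus.HistogramOpts{Name: "repo_process_seconds", Help: "Per-repo processing time."},
	)
)

func init() {
	prometheus.MustRegister(githubRequests, repoDuration)
}

// processRepo is a placeholder for the per-repo work done by the cron job.
func processRepo(repo string) {
	start := time.Now()
	defer func() { repoDuration.Observe(time.Since(start).Seconds()) }()
	// ... run checks, recording each GitHub call, e.g.:
	githubRequests.WithLabelValues("200").Inc()
}

func main() {
	go func() {
		// Expose /metrics for scraping so bottlenecks and PR regressions are visible.
		http.Handle("/metrics", promhttp.Handler())
		_ = http.ListenAndServe(":9090", nil)
	}()
	processRepo("github.com/ossf/scorecard")
	select {} // keep serving metrics
}
```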
Update:
Next steps:
|
Improvements in HTTP requests and performance after #640:
That's amazing! Should hopefully reduce a fair bit of our quota consumption for REST API requests.
wow wow wow! that big of a win, great job Azeem in locating this bottleneck!
FYI, I'll soon start work on adding a token server. Feedback/comments welcome.
Update here: we have now implemented a token server, which is being used by the production cron workers. This allows us to scale the number of workers easily and use the GitHub tokens efficiently. Next steps: increase the number of workers and slowly roll out the number of repos to 1M.
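As a much-simplified illustration of the token-server idea: a small HTTP service hands out GitHub tokens from a shared pool so any number of workers can request one instead of each holding its own. The /token endpoint, the GITHUB_AUTH_TOKENS env var, and the round-robin selection are assumptions for this sketch; the real server presumably tracks remaining quota per token rather than rotating blindly.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
	"strings"
	"sync/atomic"
)

// tokenServer hands out GitHub tokens from a shared pool, round-robin,
// so horizontally scaled workers can share a fixed set of PATs.
type tokenServer struct {
	tokens []string
	next   uint64
}

func (s *tokenServer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	n := atomic.AddUint64(&s.next, 1)
	fmt.Fprint(w, s.tokens[n%uint64(len(s.tokens))])
}

func main() {
	// GITHUB_AUTH_TOKENS is a hypothetical comma-separated list of PATs.
	pool := strings.Split(os.Getenv("GITHUB_AUTH_TOKENS"), ",")
	http.Handle("/token", &tokenServer{tokens: pool})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

A worker would then GET /token before each batch of GitHub API calls, which is what lets the worker count grow independently of how many tokens exist.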