Prevent link checker from getting rate limited #1972
Comments
Finding all the links and grouping them might cost a lot of time, but this task runs every hour in the background... nothing is waiting for it to complete, and I'm sure we can finish the operation within the hour.
I'm not sure about this; I would rather have all of the links checked in one report, even if it takes longer as a result. What if we tried reducing the request rate? Though I guess that would only work if we were polling one URL every 3.6 s or more. Why not batch the queries as you suggested, but keep them all in the same action with some artificial sleeps in between batches?
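
A minimal sketch of that "same action, sleeps between batches" idea, assuming the links have already been collected into a list; `check_links`, `BATCH_SIZE`, and `SLEEP_SECONDS` are illustrative names, not anything that exists in the repo:

```python
import time

BATCH_SIZE = 1000          # roughly the assumed per-hour request allowance
SLEEP_SECONDS = 60 * 60    # wait out one rate-limit window between batches

def check_links(urls):
    """Placeholder: invoke the real link checker over this batch of URLs."""
    raise NotImplementedError

def check_in_batches(all_urls):
    # Split the full URL list into batches and pause between them so the
    # whole run stays under the hourly rate limit.
    batches = [all_urls[i:i + BATCH_SIZE]
               for i in range(0, len(all_urls), BATCH_SIZE)]
    for n, batch in enumerate(batches):
        check_links(batch)
        if n < len(batches) - 1:     # no need to sleep after the last batch
            time.sleep(SLEEP_SECONDS)
```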
Related PR that we may want to partially revert: #1964
As long as it doesn't exceed the maximum job time of whatever CI system you're using 😉
GitHub Actions jobs are limited to 6 hours, so the sleep strategy limits us to 6,000 links per execution. If we have 6,001 links, how do we check the remainder? We'd have to batch anyway. I think batching is more robust, since it increases our throughput to 24,000 links every 24 hours, and if we get rate limited on the first bucket of URLs we won't be on the next (not guaranteed, but a good guess). I won't be surprised, though, if after batching we still have to decrease max-concurrency because sites don't like us sending 500 requests in one second.
At the moment, the checker gets 400+ errors if 429 is not added as an accepted status code. Check out my issue here: lycheeverse/lychee#634
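
For reference, a hedged sketch of accepting 429 via lychee's `--accept` flag; the exact status list and the `links.txt` file name are illustrative, and the flag syntax should be checked against the lychee docs:

```python
import subprocess

# Tell lychee to treat 429 (Too Many Requests) as an accepted status so
# rate-limited responses are not reported as broken links.
subprocess.run(
    ["lychee", "--accept", "200,204,429", "links.txt"],
    check=True,
)
```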
Since the link checker looks up a lot of URLs at once, it is getting rate limited by Docker Hub, and it could also get rate limited by GitHub.
GitHub allows 1,000 requests/hour: https://docs.github.com/en/rest/overview/resources-in-the-rest-api#rate-limiting
Docker Hub does not publish its rate limit: https://docs.docker.com/docker-hub/download-rate-limit/#other-limits
A really simple solution is to have the checker run once every hour. Before the link-checker step, we collect all the links in the docs repo and divide them into groups of 1,000. Based on the UTC hour, we select that hour's group and store it in a links file. When lychee runs, we tell it to check only that links file.
Example: if we have 10,000 links to check, 10,000 / 1,000 (max requests) = 10 groups. At 00:00 we check the first group of 1,000; at 01:00 we check the second group, and so on. At 10:00 there are no more groups, so we cycle back to the first group (the 00:00 group). This would let us safely check up to 24,000 links per day within GitHub's 1,000 requests/hour limit. Docker Hub's limit might make that lower, and we'll only find out by experimenting (locally).
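
A minimal sketch of the hourly grouping, assuming the links have already been collected into a list; `MAX_REQUESTS` and `links.txt` are illustrative names, not an existing script:

```python
from datetime import datetime, timezone

MAX_REQUESTS = 1000  # links per hourly group (assumed per-hour budget)

def write_hourly_group(all_links, out_path="links.txt"):
    # Split the full link list into groups of at most MAX_REQUESTS.
    groups = [all_links[i:i + MAX_REQUESTS]
              for i in range(0, len(all_links), MAX_REQUESTS)]
    # Pick this hour's group; the modulo makes the schedule cycle back to
    # the first group once every group has had its turn.
    hour = datetime.now(timezone.utc).hour
    group = groups[hour % len(groups)]
    # Write the group to the file that lychee is then pointed at.
    with open(out_path, "w") as f:
        f.write("\n".join(group))
    return group
```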
Inspired by #1962.