
Prevent link checker from getting rate limited #1972

Closed · 20k-ultra opened this issue Jul 8, 2021 · 6 comments

Comments

@20k-ultra (Contributor) commented Jul 8, 2021

Since the link checker looks up a lot of URLs at once, it is getting rate limited by Docker Hub and could also get rate limited by GitHub.

GitHub: 1,000 requests/hour: https://docs.github.com/en/rest/overview/resources-in-the-rest-api#rate-limiting
Docker Hub does not state its rate limit: https://docs.docker.com/docker-hub/download-rate-limit/#other-limits

A really simple solution is to have the checker run once every hour. Before the link checker step, we collect all the links in the docs repo and divide them into groups of 1,000. Based on the current UTC hour, we select that hour's group and write it to a links file. When lychee runs, we tell it to check only that file.

Example: if we have 10,000 links to check, 10,000 / 1,000 (max requests per hour) = 10 groups. At 00:00 we check the first group of 1,000, at 01:00 the second group, and so on. At 10:00 there are no more groups, so we cycle back to the first one (same as at 00:00). This would let us safely check up to 24,000 links per day within GitHub's 1,000 requests/hour limit. Docker Hub might force us lower, and we'll only find out by experimenting (locally).
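A minimal sketch of that grouping step, assuming the repo's links have already been extracted into a flat `all_links.txt` (the file names, group size, and helper logic here are illustrative, not part of the existing workflow):

```python
# Sketch: select this hour's group of links and write it to a file for lychee.
# Assumes all_links.txt already contains one URL per line; the file names and
# the 1,000-link group size are illustrative.
from datetime import datetime, timezone

GROUP_SIZE = 1000  # stay under the assumed 1,000 requests/hour limit

with open("all_links.txt") as f:
    links = [line.strip() for line in f if line.strip()]

# Split the full list into groups of at most GROUP_SIZE links.
groups = [links[i:i + GROUP_SIZE] for i in range(0, len(links), GROUP_SIZE)]

# Pick the group for the current UTC hour, cycling back once we run out.
hour = datetime.now(timezone.utc).hour
selected = groups[hour % len(groups)] if groups else []

# lychee is then pointed at this file instead of the whole repo.
with open("links_to_check.txt", "w") as f:
    f.write("\n".join(selected) + "\n")
```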

Inspired by #1962

@20k-ultra (Contributor, Author)
Finding all the links and grouping them might take a while, but this task runs every hour in the background... nothing is waiting for it to complete, and I'm sure we can finish the operation within the hour.

@klutchell (Contributor) commented Jul 8, 2021

I'm not sure about this; I would rather have all of the links checked in one report, even if it takes longer as a result.

What if we tried reducing max-concurrency and/or threads to very low values to see if we can avoid hitting the limits?

Though I guess that would only work if we were polling one URL every 3.6 s or slower (3,600 seconds / 1,000 requests per hour).

Why not batch the queries as you suggested, but keep them all in the same action with some artificial sleeps in between batches?
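A rough sketch of that single-job variant, shelling out to lychee once per batch; the batch size, sleep length, and exact lychee arguments are assumptions rather than a tested setup:

```python
# Sketch: check every link in one job, but in batches with sleeps in between,
# so we stay near 1,000 requests/hour. Batch size, sleep length, and the
# lychee invocation are assumptions for illustration.
import subprocess
import time
from pathlib import Path

BATCH_SIZE = 1000
SLEEP_BETWEEN_BATCHES = 3600  # seconds; roughly one batch per hour

links = [l for l in Path("all_links.txt").read_text().splitlines() if l.strip()]
batches = [links[i:i + BATCH_SIZE] for i in range(0, len(links), BATCH_SIZE)]

for n, batch in enumerate(batches):
    batch_file = Path(f"batch_{n}.txt")
    batch_file.write_text("\n".join(batch) + "\n")
    # check=False so a batch with broken links doesn't abort the loop early.
    subprocess.run(["lychee", "--no-progress", str(batch_file)], check=False)
    if n < len(batches) - 1:
        time.sleep(SLEEP_BETWEEN_BATCHES)
```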

@klutchell (Contributor)
Related PR that we may want to partially revert: #1964

@lurch (Contributor) commented Jul 8, 2021

> even if it takes longer as a result.

As long as it doesn't exceed the maximum job-time of whatever CI system you're using 😉

@20k-ultra
Copy link
Contributor Author

GitHub Actions jobs are limited to 6 hours, so the sleep strategy limits us to 6,000 links per execution. If we have 6,001 links, how do we check the remainder beyond 6,000? We'll have to batch anyway.

I think batching is more robust since it increases our throughput to 24,000 links every 24 hours, and if we get rate limited on the first bucket of URLs we won't be on the next (not guaranteed, but a good guess). I wouldn't be surprised, though, if after batching we still have to decrease max-concurrency, since sites won't like us sending 500 requests in one second.

@vipulgupta2048 (Member) commented Oct 31, 2022

At the moment, the checker gets about 400+ errors if 429 is not added as an accepted status code. Check out my issue here: lycheeverse/lychee#634
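For reference, a minimal sketch of what accepting 429 could look like when lychee is driven from a script; treat the exact `--accept` arguments as an assumption rather than the workflow's actual configuration:

```python
# Sketch: ask lychee to treat 429 (Too Many Requests) as an accepted status
# so rate-limited links are not reported as broken. Arguments are assumptions.
import subprocess

subprocess.run(
    ["lychee", "--accept", "200,204,429", "--no-progress", "links_to_check.txt"],
    check=False,
)
```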
Closing
