
Prevent link checker from getting rate limited #1972

Closed · 20k-ultra opened this issue Jul 8, 2021 · 6 comments

Comments

@20k-ultra (Contributor) commented Jul 8, 2021

Since the link checker looks up a lot of URLs at once, it is getting rate limited by Docker Hub and could also get rate limited by GitHub.

GitHub: 1,000 requests/hour: https://docs.github.com/en/rest/overview/resources-in-the-rest-api#rate-limiting
Docker Hub does not state its rate limit: https://docs.docker.com/docker-hub/download-rate-limit/#other-limits

A really simple solution is to have the checker run once every hour. Before the link checker step, we collect all the links in the docs repo and divide them into groups of 1,000. Based on the current UTC hour, we select that hour's group and write it to a links file. When lychee runs, we tell it to check only that file.

Example: if we have 10,000 links to check, 10,000 / 1,000 (max requests per hour) = 10 groups. At 00:00 we check the first group of 1,000, at 01:00 the second group, and so on. At 10:00 there are no more groups, so we cycle back to the first one (same as at 00:00). This would let us safely check up to 24,000 links per day within GitHub's 1,000 requests/hour limit. Docker Hub might force us lower, and we'll only find out by experimenting (locally).
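A minimal sketch of that grouping step, assuming the repo's links have already been extracted into a flat `all_links.txt` (the file names, group size, and helper logic here are illustrative, not part of the existing workflow):

```python
# Sketch: select this hour's group of links and write it to a file for lychee.
# Assumes all_links.txt already contains one URL per line; the file names and
# the 1,000-link group size are illustrative.
from datetime import datetime, timezone

GROUP_SIZE = 1000  # stay under the assumed 1,000 requests/hour limit

with open("all_links.txt") as f:
    links = [line.strip() for line in f if line.strip()]

# Split the full list into groups of at most GROUP_SIZE links.
groups = [links[i:i + GROUP_SIZE] for i in range(0, len(links), GROUP_SIZE)]

# Pick the group for the current UTC hour, cycling back once we run out.
hour = datetime.now(timezone.utc).hour
selected = groups[hour % len(groups)] if groups else []

# lychee is then pointed at this file instead of the whole repo.
with open("links_to_check.txt", "w") as f:
    f.write("\n".join(selected) + "\n")
```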

Inspired by #1962

@20k-ultra (Contributor, Author)
Finding all the links and grouping them might take a while, but this task runs every hour in the background... nothing is waiting for it to complete, and I'm sure we can finish the operation within the hour.

@klutchell (Contributor) commented Jul 8, 2021

I'm not sure about this; I would rather have all of the links checked in one report, even if it takes longer as a result.

What if we tried reducing max-concurrency and/or threads to very low values to see if we can avoid hitting the limits?

Though I guess that would only work if we were polling one URL every 3.6 s or slower (3,600 seconds / 1,000 requests per hour).

Why not batch the queries as you suggested, but keep them all in the same action with some artificial sleeps in between batches?
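A rough sketch of that single-job variant, shelling out to lychee once per batch; the batch size, sleep length, and exact lychee arguments are assumptions rather than a tested setup:

```python
# Sketch: check every link in one job, but in batches with sleeps in between,
# so we stay near 1,000 requests/hour. Batch size, sleep length, and the
# lychee invocation are assumptions for illustration.
import subprocess
import time
from pathlib import Path

BATCH_SIZE = 1000
SLEEP_BETWEEN_BATCHES = 3600  # seconds; roughly one batch per hour

links = [l for l in Path("all_links.txt").read_text().splitlines() if l.strip()]
batches = [links[i:i + BATCH_SIZE] for i in range(0, len(links), BATCH_SIZE)]

for n, batch in enumerate(batches):
    batch_file = Path(f"batch_{n}.txt")
    batch_file.write_text("\n".join(batch) + "\n")
    # check=False so a batch with broken links doesn't abort the loop early.
    subprocess.run(["lychee", "--no-progress", str(batch_file)], check=False)
    if n < len(batches) - 1:
        time.sleep(SLEEP_BETWEEN_BATCHES)
```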

@klutchell (Contributor)
Related PR that we may want to partially revert: #1964

@lurch (Contributor) commented Jul 8, 2021

> even if it takes longer as a result.

As long as it doesn't exceed the maximum job-time of whatever CI system you're using 😉

@20k-ultra
Copy link
Contributor Author

GitHub Actions jobs are limited to 6 hours, so the sleep strategy limits us to 6,000 links per execution. If we have 6,001 links, how do we check the remainder beyond 6,000? We'll have to batch anyway.

I think batching is more robust since it increases our throughput to 24,000 links every 24 hours, and if we get rate limited on the first bucket of URLs we won't be on the next (not guaranteed, but a good guess). I wouldn't be surprised, though, if after batching we still have to decrease max-concurrency, since sites won't like us sending 500 requests in one second.

@vipulgupta2048 (Member) commented Oct 31, 2022

At the moment, the checker gets about 400+ errors if 429 is not added as an accepted status code. Check out my issue here: lycheeverse/lychee#634
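For reference, a minimal sketch of what accepting 429 could look like when lychee is driven from a script; treat the exact `--accept` arguments as an assumption rather than the workflow's actual configuration:

```python
# Sketch: ask lychee to treat 429 (Too Many Requests) as an accepted status
# so rate-limited links are not reported as broken. Arguments are assumptions.
import subprocess

subprocess.run(
    ["lychee", "--accept", "200,204,429", "--no-progress", "links_to_check.txt"],
    check=False,
)
```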
Closing
