Handle HTTP 429 errors + add failure limit #393

Open · wants to merge 5 commits into main
Conversation

benoit74 (Contributor)
Fixes #392 (mostly; see the NB below, but this is OK for me).

Changes

  • add a failedLimit CLI argument, which interrupts the crawler if the number of failed pages is greater than or equal to this limit
  • add pageLoadAttempts and defaultRetryPause CLI arguments. In case of an HTTP 429 error (see the sketch after this list):
    • the crawler will retry up to pageLoadAttempts times
    • the pause between retries is based on the Retry-After HTTP response header
      • both absolute (HTTP-date) and relative (seconds) formats are supported
    • if the header is not provided, the pause defaults to defaultRetryPause seconds
    • for now, other statuses are not retried (HTTP 503 could theoretically be retried as well, but with even larger pauses, which is not very practical for a crawler, especially since the chances of getting a good response are typically lower than with HTTP 429)
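
Below is a minimal sketch, in plain JavaScript, of how the Retry-After handling described above could work. This is an illustration only, not the code from this PR: the helper name getRetryDelaySecs and the 60-second fallback are assumptions standing in for the defaultRetryPause flag.

```js
// Sketch only: turn a Retry-After response header into a pause in seconds.
// Falls back to a default (standing in for defaultRetryPause) when the header
// is missing or cannot be parsed.
function getRetryDelaySecs(retryAfterHeader, defaultRetryPause = 60) {
  if (retryAfterHeader) {
    // relative format: an integer number of seconds, e.g. "120"
    const secs = Number(retryAfterHeader);
    if (Number.isFinite(secs) && secs >= 0) {
      return secs;
    }
    // absolute format: an HTTP-date, e.g. "Wed, 04 Oct 2023 07:28:00 GMT"
    const dateMs = Date.parse(retryAfterHeader);
    if (!Number.isNaN(dateMs)) {
      return Math.max(0, (dateMs - Date.now()) / 1000);
    }
  }
  // header missing or unparsable: use the CLI default
  return defaultRetryPause;
}
```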

NB: the failedLimit argument is not based on a count of consecutive failures as originally suggested in the issue, because that is considerably more complex and potentially not that useful (e.g. if there are many failures but occasional random successes, the limit might never apply; if there are random failures on a limited number of pages, the limit might never apply but the result could still be pretty bad).
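
A tiny illustration of the difference (invented names, not code from this PR): with a total count, occasional successes never reset the counter, whereas a consecutive count is reset by every success.

```js
// Illustration only: total vs. consecutive failure counting.
let totalFailed = 0;
let consecutiveFailed = 0;

function recordPageResult(ok) {
  if (ok) {
    consecutiveFailed = 0;   // a single success resets a consecutive counter...
  } else {
    totalFailed += 1;        // ...but the total keeps growing regardless
    consecutiveFailed += 1;
  }
}

// failedLimit, as described above, compares against the total count
function shouldInterrupt(failedLimit) {
  return failedLimit > 0 && totalFailed >= failedLimit;
}
```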

benoit74 (Contributor, Author)
BTW, I did not add any tests; this seems pretty hard to do.

benoit74 (Contributor, Author)
I just converted this PR to a draft because my manual tests are showing that my code is wrong; I will submit a fix as soon as the code is OK.

benoit74 marked this pull request as draft on September 26, 2023 at 13:52
benoit74 marked this pull request as ready for review on September 27, 2023 at 06:34
benoit74 (Contributor, Author)
The code is now ready for review; my tests were not behaving as expected only because Cloudflare suddenly stopped returning the "Retry-After" header for the failing requests.

ikreymer (Member)
Thanks for this PR! We should be able to add it to the next feature release 0.12.0.

One potential issue is the overall timeout for the page, which is calculated here:
https://github.com/webrecorder/browsertrix-crawler/blob/main/crawler.js#L94
Do you think the 429 timeouts should extend this timeout, or just make sure it is set high enough?

> NB: the failedLimit argument is not based on a count of consecutive failures as originally suggested in the issue, because that is considerably more complex and potentially not that useful (e.g. if there are many failures but occasional random successes, the limit might never apply; if there are random failures on a limited number of pages, the limit might never apply but the result could still be pretty bad).

I actually think the consecutive limit might make more sense, especially for multiple crawler instances, e.g. if one instance is having a lot of issues, it should be interrupted while others continue. The state would track total failures across all instances, but maybe for your use case that doesn't matter as much. Were you seeing worse results with consecutive failures? Will think about this a bit more.

benoit74 (Contributor, Author) commented Oct 2, 2023

Great, thank you! Both points are very valid.

I think that a 429 should "pause" the overall timeout, because 429s are not really timeouts; they are a request from the server to slow down our requests. So from my perspective it does not mean that the server or crawler is malfunctioning (which is what I consider the overall timeout tries to capture), only that we are too "aggressive" for the server. How to implement this seems a bit complex, because I think it should "pause" the overall timeout only while 429 errors are actually being handled; we should not increase the overall timeout if no 429 errors are returned for the current page. I will have a look into it, but if you have any suggestions, they are welcome.

Regarding the consecutive limit, I still don't think it makes sense. You could easily get into situations where the crawler won't stop but the result is garbage. For instance, if only one page out of 10 is good, and you have set the limit to 50 because you are crawling a website with thousands of pages, you might never hit the limit, yet 90% of the website is garbage. Maybe we should track an individual limit per crawler instance (is an instance what is controlled by the --workers parameter?), but this is not what we want to capture.

There are two scenarios we encounter (for now):

  • suddenly, all pages start failing with timeouts (but we are usually using only 1 "worker", if that is what you mean by crawler instances). As a side question, have you already encountered this situation where a crawler suddenly fails to load all pages? We were wondering whether it is a networking issue or a webserver issue, but never thought it might be a browser issue.
  • more rarely, we see situations where many pages are failing (something like 30-50% or even more) but the crawler continues; these failures are sometimes random; after 100 failures, we already know that the final result will not be valuable and we should not keep wasting compute time

benoit74 (Contributor, Author) commented Oct 2, 2023

I think I've implemented the retry on 429 errors in the wrong place.

I suggest that I change it this way:

  • create a custom RetryWithPauseError class with information on how long the crawler should pause (see the sketch below)
  • in the loadPage function of crawler.js, keep only the code detecting the 429 and extracting the Retry-After header if present; raise a RetryWithPauseError with the proper information
  • handle retries in
    • probably with a nested while loop for retries and a new try block catching the RetryWithPauseError raised above

With this solution:

  • we do not have to modify maxPageTime and make assumptions about how long the webserver might ask us to pause
  • we continue to be able to retry as many times as wanted
  • we restart the whole page processing, which is probably even better than what I did at a lower level
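
A rough sketch of what that could look like (hypothetical names and structure, not actual browsertrix-crawler code):

```js
// A custom error carries the pause duration out of the page-loading code; a retry
// loop one level up catches it and restarts the whole page processing.
class RetryWithPauseError extends Error {
  constructor(pauseSecs) {
    super(`HTTP 429 received, retry after ${pauseSecs}s`);
    this.name = "RetryWithPauseError";
    this.pauseSecs = pauseSecs;
  }
}

const sleep = (secs) => new Promise((resolve) => setTimeout(resolve, secs * 1000));

// `loadPage` is assumed to throw RetryWithPauseError when it detects an HTTP 429.
async function processPageWithRetries(loadPage, pageLoadAttempts = 3) {
  for (let attempt = 1; attempt <= pageLoadAttempts; attempt++) {
    try {
      return await loadPage();
    } catch (e) {
      if (e instanceof RetryWithPauseError && attempt < pageLoadAttempts) {
        await sleep(e.pauseSecs); // the pause happens outside the per-page timeout
        continue;
      }
      throw e; // out of attempts, or an unrelated error
    }
  }
}
```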

WDYT?

ikreymer added a commit that referenced this pull request Oct 3, 2023
- logger.fatal() also sets crawl status to 'failed'
- add 'failOnFailedLimit' to set crawl status to 'failed' if number of failed pages exceeds limit, refactored from #393
ikreymer (Member) commented Oct 3, 2023

Regarding the failures, I refactored that into a separate PR with additional cleanup (#402). In this case, I think we want to mark the crawl as 'failed' rather than merely interrupt it (interrupting means the crawler will wait for other workers to finish and possibly upload the WACZ file). I think the desired behavior here is to fail the crawl, which would also prevent it from being restarted again in our use case.

Let's focus this PR on just the 429 handling perhaps?

benoit74 (Contributor, Author) commented Oct 3, 2023

Yes, perfect, let's focus on 429 handling in this PR!
Thank you for the refactoring around failures, and yes, you are right regarding failing the crawl.

ikreymer added a commit that referenced this pull request Oct 4, 2023
- logger.fatal() also sets crawl status to 'failed' and adds endTime before exiting
- add 'failOnFailedLimit' to set crawl status to 'failed' if number of failed pages exceeds limit, refactored from #393 to now use logger.fatal() to end crawl.
benoit74 (Contributor, Author) commented Oct 9, 2023

@ikreymer do you have any more thoughts to share?
Does my idea above about the alternative way to handle those 429 errors (without impacting the maximum page time) make sense to you?

Resolved review comments (outdated): README.md, util/argParser.js
ikreymer (Member)
> @ikreymer do you have any more thoughts to share? Does my idea above about the alternative way to handle those 429 errors (without impacting the maximum page time) make sense to you?

Sorry for the delay, just catching up. Yes, this is a better approach, as it allows retrying without having to wait for the within-page counter. I think that could work. Some caveats:

  • This probably only works with one worker; otherwise, multiple workers will still retry at a more frequent interval, right?
  • Another option could be to continue on to the next page and put this page back into the queue; however, that probably only makes sense when crawling across multiple domains, otherwise the next pages will likely all be 429s.
  • We also have the pageExtraDelay flag (which probably should be moved into the worker), which could also be updated to reflect the 429 limit. Maybe it is also set to the maximum retry time?

I wonder if some sort of per-domain retry is needed; that could be complicated. For your use case, you're just using this with one worker and one domain, right? So that would be the most important option to get working.

benoit74 (Contributor, Author)
No worries, I know what it is like to have too many things on your plate.

Your caveats are very valid.

I will have a look at how to implement the pause per domain and for all workers; you are probably right that this would make even more sense.

benoit74 (Contributor, Author)
I had a look at the code and explored a bit what could be done.

The logic handling the pause in case of 429 errors could be moved into timedCrawlPage in crawler.js, and this is probably sufficient to make it work in the Kiwix scenario (only one worker, mostly only one domain). It would also work with more workers, but each worker would have to receive its own 429 error before pausing. It also means we could "lose" some time in a multi-domain scenario, where we would pause instead of moving on to another domain. Those two limitations are not a problem for us, of course.

Moving this logic further up (typically into runLoop in worker.js) would be meaningful, since it would allow all workers to pause processing a given domain as soon as one page in one worker returned a 429, and to continue processing other domains. However, this is too complex for me to implement (we would need to push the page back into the queue, ensure we do not affect the computation of retries at this level, probably consider the impact on pending-page computations, inform other workers of the domain which has to be paused, etc.). Since this is not needed in our scenario, I really don't see a strong interest in moving in this direction alone.

@ikreymer what do you think about this? Should we join efforts and try to tackle the second solution above, or should I start making some progress by implementing the easy solution, which is sufficient in our scenario (and probably many others)?

Successfully merging this pull request may close these issues.

Slow down + retry on HTTP 429 errors
2 participants