
Make a distinction between soft and hard limits #304

Open
benoit74 opened this issue May 27, 2024 · 2 comments

@benoit74
Collaborator

We have three limits that can stop the crawler in the middle of a run:

  • --sizeLimit: the maximum WARC size
  • --timeLimit: the maximum duration of the crawl
  • --diskUtilization: the maximum disk usage (as a percentage); the crawler stops if the threshold is reached OR expected to be reached

While the first two limits are used by zimit.kiwix.org to enforce fair usage of the system, the third is usually set to 90% and just ensures we do not fill the disk (45% would in fact make more sense, since we need to double the crawler's disk usage to have enough space to create the ZIM, and this number does not even take into account that other tasks might be running at the same time and sharing the disk).

When a limit is reached, the crawler returns exit code 11; zimit nevertheless continues and creates the ZIM, probably because limits are generally hit in the zimit.kiwix.org scenario, where we want to provide a ZIM (even an incomplete one) to the user.
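To illustrate, here is roughly what that behavior amounts to (a simplified sketch, not zimit's actual code; only exit code 11 and the crawler flag names come from above, the command line and helper are hypothetical):

```python
import subprocess

CRAWLER_LIMIT_HIT = 11  # exit code the crawler returns when a size/time/disk limit stops it


def build_zim() -> None:
    """Placeholder for the ZIM creation step (hypothetical)."""
    print("creating ZIM from whatever was crawled...")


# Simplified view of today's behavior: a limit-related exit code is treated
# like success, so the (possibly partial) ZIM is still created.
crawl = subprocess.run(["crawl", "--sizeLimit", "4294967296", "--timeLimit", "7200"])
if crawl.returncode not in (0, CRAWLER_LIMIT_HIT):
    raise SystemExit(crawl.returncode)  # genuine crawler failures still abort
build_zim()
```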

This is a problem for tasks running on farm.openzim.org, where we expect the ZIM not to be cut short by a limit.

I suggest reversing the logic for safety:

  • by default, zimit stops whenever a limit is reached
  • a new flag --continue-on-crawler-limits is added to keep the current behavior (see the sketch below)
    • to be set only on zimit.kiwix.org (probably by zimit-frontend)
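
A minimal sketch of what the proposed default could look like (the flag name is taken from the list above; the exit-code handling, command line, and argument parsing are illustrative assumptions, not an existing implementation):

```python
import argparse
import subprocess
import sys

CRAWLER_LIMIT_HIT = 11  # exit code the crawler returns when a limit was reached

parser = argparse.ArgumentParser()
parser.add_argument(
    "--continue-on-crawler-limits",
    action="store_true",
    help="keep today's behavior: build a (partial) ZIM even if a crawler limit was hit",
)
args, crawler_args = parser.parse_known_args()

crawl = subprocess.run(["crawl", *crawler_args])
if crawl.returncode == CRAWLER_LIMIT_HIT and not args.continue_on_crawler_limits:
    # new default: a reached limit fails the scrape instead of producing a truncated ZIM
    sys.exit("crawler limit reached, aborting the scrape")
if crawl.returncode not in (0, CRAWLER_LIMIT_HIT):
    sys.exit(crawl.returncode)
# otherwise continue with ZIM creation as today
```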

Ideally we should push this into the 2.0 milestone since it is a breaking change.

@benoit74 benoit74 added this to the 2.0.0 milestone May 27, 2024
@benoit74 benoit74 changed the title from "Add option to fail scrape when crawler limit is reached" to "By default, fail scrape when crawler limit is reached" May 27, 2024
@rgaudin
Member

rgaudin commented May 27, 2024

> a new flag --continue-on-crawler-limits is added to keep the current behavior

This makes no sense to me.

As you can see from the flag names, the disk one is not called a "Limit", which shows that it is different. I understand the size and time limits as requests by the user to stop (crawling) when that point is reached.

I understand the diskUtilization one as a technical safety net (actually I don't think that flag should be in zimit nor in browsertrix; it's a lazy and dirty way to cope with bad practices).

If we agree that diskUtilization is different from the limits, we can have it fail with a different exit code; that would make sense.

Now, I'm not sure we currently have scenarios for failing when a limit is reached, but I can certainly imagine users wanting to. I'd prefer the clarity of something like sizeSoftLimit and sizeHardLimit (better labels can be found), which would behave differently: the soft one only stops the crawler, while the hard one stops the whole run. It would even allow setting both. WDYT?
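
To make the idea concrete, a rough sketch of how the two outcomes could be told apart downstream (the soft/hard naming is from above; only exit code 11 exists today, the other value is made up for illustration):

```python
# Hypothetical exit codes for the crawler; 11 is the current limit code,
# the hard-limit value is invented to illustrate the soft/hard distinction.
EXIT_OK = 0
EXIT_SOFT_LIMIT = 11   # soft limit reached: stop crawling, downstream may still produce a ZIM
EXIT_HARD_LIMIT = 12   # hard limit reached: the whole run should fail


def handle_crawler_exit(returncode: int, *, build_partial_zim) -> None:
    """Sketch of how zimit could react to the two kinds of limits."""
    if returncode in (EXIT_OK, EXIT_SOFT_LIMIT):
        build_partial_zim()           # soft limit: produce a (possibly partial) ZIM
    elif returncode == EXIT_HARD_LIMIT:
        raise SystemExit("hard limit reached, aborting the scrape")
    else:
        raise SystemExit(returncode)  # any other crawler failure
```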

@benoit74
Collaborator Author

I can only agree, your points are right (but the labels are poor, even if I can't find better ones for now).

I've opened webrecorder/browsertrix-crawler#584 since it is a prerequisite.

@benoit74 benoit74 changed the title from "By default, fail scrape when crawler limit is reached" to "Make a distinction between soft and hard limits" May 27, 2024
@kelson42 kelson42 modified the milestone from 2.0.0 to 2.1.0 May 31, 2024
@benoit74 benoit74 modified the milestone from 2.1.0 to 2.2.0 Jun 18, 2024
@benoit74 benoit74 modified the milestone from 2.2.0 to later Aug 7, 2024