
Make a distinction between soft and hard limits #304

Open
benoit74 opened this issue May 27, 2024 · 2 comments

@benoit74
Collaborator

We have three limits that can stop the crawler in the middle of a run:

  • --sizeLimit: the maximum WARC size
  • --timeLimit: the maximum duration of the crawl
  • --diskUtilization: the maximum disk usage (as a percentage); the crawler stops if the threshold is reached OR expected to be reached

While the first two limits are used by zimit.kiwix.org to enforce fair usage of the system, the third is usually set to 90% and just ensures we do not fill the disk (45% would in fact make more sense, since we need to double the crawler's disk usage to have enough space to create the ZIM, and this number does not even take into account that other tasks might be running at the same time and sharing the disk).

When a limit is reached, the crawler returns exit code 11; zimit nevertheless continues and creates the ZIM, probably because limits are generally hit in the zimit.kiwix.org scenario, where we want to provide a ZIM (even an incomplete one) to the user.
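To illustrate, here is roughly what that behavior amounts to (a simplified sketch, not zimit's actual code; only exit code 11 and the crawler flag names come from above, the command line and helper are hypothetical):

```python
import subprocess

CRAWLER_LIMIT_HIT = 11  # exit code the crawler returns when a size/time/disk limit stops it


def build_zim() -> None:
    """Placeholder for the ZIM creation step (hypothetical)."""
    print("creating ZIM from whatever was crawled...")


# Simplified view of today's behavior: a limit-related exit code is treated
# like success, so the (possibly partial) ZIM is still created.
crawl = subprocess.run(["crawl", "--sizeLimit", "4294967296", "--timeLimit", "7200"])
if crawl.returncode not in (0, CRAWLER_LIMIT_HIT):
    raise SystemExit(crawl.returncode)  # genuine crawler failures still abort
build_zim()
```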

This is a problem for tasks running on farm.openzim.org, where we expect the ZIM not to be cut short by a limit.

I suggest reversing the logic for safety:

  • by default, zimit stops whenever a limit is reached
  • a new flag --continue-on-crawler-limits is added to keep the current behavior (see the sketch below)
    • to be set only on zimit.kiwix.org (probably by zimit-frontend)
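
A minimal sketch of what the proposed default could look like (the flag name is taken from the list above; the exit-code handling, command line, and argument parsing are illustrative assumptions, not an existing implementation):

```python
import argparse
import subprocess
import sys

CRAWLER_LIMIT_HIT = 11  # exit code the crawler returns when a limit was reached

parser = argparse.ArgumentParser()
parser.add_argument(
    "--continue-on-crawler-limits",
    action="store_true",
    help="keep today's behavior: build a (partial) ZIM even if a crawler limit was hit",
)
args, crawler_args = parser.parse_known_args()

crawl = subprocess.run(["crawl", *crawler_args])
if crawl.returncode == CRAWLER_LIMIT_HIT and not args.continue_on_crawler_limits:
    # new default: a reached limit fails the scrape instead of producing a truncated ZIM
    sys.exit("crawler limit reached, aborting the scrape")
if crawl.returncode not in (0, CRAWLER_LIMIT_HIT):
    sys.exit(crawl.returncode)
# otherwise continue with ZIM creation as today
```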

Ideally we should push this into the 2.0 milestone since it is a breaking change.

@benoit74 benoit74 added this to the 2.0.0 milestone May 27, 2024
@benoit74 benoit74 changed the title from "Add option to fail scrape when crawler limit is reached" to "By default, fail scrape when crawler limit is reached" May 27, 2024
@rgaudin
Member

rgaudin commented May 27, 2024

> a new flag --continue-on-crawler-limits is added to keep the current behavior

This makes no sense to me.

As you can see from the flag names, the disk one is not called a "Limit", which shows that it is different. I understand the size and time limits as requests by the user to stop (crawling) when that point is reached.

I understand the diskUtilization one as a technical safety net (actually I don't think that flag should be in zimit nor in browsertrix; it's a lazy and dirty way to cope with bad practices).

If we agree that diskUtilization is different from the limits, we can have it fail with a different exit code; that would make sense.

Now, I'm not sure we currently have scenarios for failing when a limit is reached, but I can certainly imagine users wanting to. I'd prefer the clarity of something like sizeSoftLimit and sizeHardLimit (better labels can be found), which would behave differently: the soft one only stops the crawler, while the hard one stops the whole run. It would even allow setting both. WDYT?
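
To make the idea concrete, a rough sketch of how the two outcomes could be told apart downstream (the soft/hard naming is from above; only exit code 11 exists today, the other value is made up for illustration):

```python
# Hypothetical exit codes for the crawler; 11 is the current limit code,
# the hard-limit value is invented to illustrate the soft/hard distinction.
EXIT_OK = 0
EXIT_SOFT_LIMIT = 11   # soft limit reached: stop crawling, downstream may still produce a ZIM
EXIT_HARD_LIMIT = 12   # hard limit reached: the whole run should fail


def handle_crawler_exit(returncode: int, *, build_partial_zim) -> None:
    """Sketch of how zimit could react to the two kinds of limits."""
    if returncode in (EXIT_OK, EXIT_SOFT_LIMIT):
        build_partial_zim()           # soft limit: produce a (possibly partial) ZIM
    elif returncode == EXIT_HARD_LIMIT:
        raise SystemExit("hard limit reached, aborting the scrape")
    else:
        raise SystemExit(returncode)  # any other crawler failure
```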

@benoit74
Collaborator Author

I can only agree, your points are right (but the labels are poor, even if I can't find better ones for now).

I've opened webrecorder/browsertrix-crawler#584 since it is a prerequisite.

@benoit74 benoit74 changed the title from "By default, fail scrape when crawler limit is reached" to "Make a distinction between soft and hard limits" May 27, 2024
@kelson42 kelson42 modified the milestone from 2.0.0 to 2.1.0 May 31, 2024
@benoit74 benoit74 modified the milestone from 2.1.0 to 2.2.0 Jun 18, 2024
@benoit74 benoit74 modified the milestone from 2.2.0 to later Aug 7, 2024