We have three limits which can stop the crawler in the middle of a run:
--sizeLimit: the maximum WARC size
--timeLimit: the maximum duration of the crawl
--diskUtilization: the maximum disk usage (as a percentage); the crawler stops if the threshold is reached OR is expected to be reached
While the first two limits are used by zimit.kiwix.org to enforce fair usage of the system, the third is usually set to 90% and just ensures we do not fill the disk (45% would in fact make more sense, since we need to double the crawler's disk usage to have enough space to create the ZIM, and this number does not even take into account that other tasks might be running at the same time and sharing the disk).
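For illustration, the 45% figure follows from simple arithmetic. This is a minimal sketch; the function name and the `zim_overhead_factor` parameter are hypothetical, and the factor of 2 is just the "double the crawler's disk usage" assumption stated above. It ignores other tasks sharing the disk.

```python
def max_crawl_utilization(disk_ceiling_pct: float, zim_overhead_factor: float = 2.0) -> float:
    """Return the crawl-time disk usage (in percent) that still leaves room for the ZIM.

    Assumption: creating the ZIM needs roughly as much space again as the
    crawl data, so total usage is crawl usage times zim_overhead_factor.
    """
    if zim_overhead_factor <= 0:
        raise ValueError("zim_overhead_factor must be positive")
    return disk_ceiling_pct / zim_overhead_factor


# With a 90% ceiling and a factor of 2, the crawl should stop at 45%.
print(max_crawl_utilization(90))
```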
When a limit is reached, the crawler returns code 11; zimit then continues to create the ZIM, probably because limits are generally hit in the zimit.kiwix.org scenario, where we want to provide a ZIM (even an incomplete one) to the user.
This is a problem for tasks running on farm.openzim.org where we expect the ZIM to not be interrupted by a limit.
I suggest reversing the logic for "safety":
by default, zimit stops whenever a limit is reached
a new flag --continue-on-crawler-limits is added to keep current behavior
to be set only on zimit.kiwix.org (probably by zimit-frontend)
Ideally we should put this in the 2.0 milestone since it is a breaking change.
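To make the proposal concrete, here is a rough sketch of the decision it implies. Only the exit code 11 and the flag name --continue-on-crawler-limits come from this issue; `should_build_zim`, `parse_args` and the constant name are hypothetical and do not reflect zimit's actual code.

```python
import argparse

# Exit code the crawler returns when a limit interrupted it (per this issue).
CRAWLER_LIMIT_HIT = 11


def parse_args(argv):
    """Parse only the proposed flag; a real CLI would have many more options."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--continue-on-crawler-limits",
        action="store_true",
        help="still create the ZIM when the crawler was stopped by a limit",
    )
    return parser.parse_args(argv)


def should_build_zim(crawler_returncode: int, continue_on_limits: bool) -> bool:
    """Decide whether ZIM creation should proceed after the crawler exits."""
    if crawler_returncode == 0:
        return True  # crawl completed normally
    if crawler_returncode == CRAWLER_LIMIT_HIT:
        # Proposed default: stop. The flag restores the current behavior.
        return continue_on_limits
    return False  # any other non-zero code is treated as a failure here


args = parse_args(["--continue-on-crawler-limits"])
print(should_build_zim(CRAWLER_LIMIT_HIT, args.continue_on_crawler_limits))
```

Under this sketch, farm.openzim.org would simply omit the flag and get a failure on code 11, while zimit.kiwix.org would pass it to keep receiving incomplete ZIMs.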
> a new flag --continue-on-crawler-limits is added to keep current behavior
This makes no sense to me.
As you can see in the flag names, the disk one is not named Limit and this shows that it's different. I understand the size and time limits as requests by the user to stop (crawling) when reaching that point.
I understand the diskUtilization one as a technical safety net (actually I don't think that flag should be in zimit or in browsertrix; it's a lazy and dirty way to cope with bad practices).
If we agree that diskUtilization is different from the limits, we can have it fail with a different exit code; that would make sense.
Now I'm not sure we currently have scenarios for failing when a limit is reached, but I can certainly imagine users wanting that. I'd prefer clarity, with something like sizeSoftLimit and sizeHardLimit (better labels can be found) which would behave differently: the soft one only stops the crawler, and the hard one stops everything. It would even allow setting both. WDYT?
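A possible reading of the soft/hard idea, as a sketch only: sizeSoftLimit and sizeHardLimit are the labels floated above, while the `Action` enum, `check_size` and the byte values are hypothetical. The sketch assumes the hard limit takes precedence when both are exceeded, which the comment does not specify.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    KEEP_CRAWLING = "keep_crawling"
    STOP_CRAWL_BUILD_ZIM = "stop_crawl_build_zim"  # soft limit: stop crawler only
    ABORT = "abort"  # hard limit: stop everything, no ZIM


def check_size(
    current_bytes: int,
    soft_limit: Optional[int] = None,
    hard_limit: Optional[int] = None,
) -> Action:
    """Map the current WARC size to an action; None means the limit is unset."""
    # Assumption: the hard limit wins if both thresholds are exceeded.
    if hard_limit is not None and current_bytes >= hard_limit:
        return Action.ABORT
    if soft_limit is not None and current_bytes >= soft_limit:
        return Action.STOP_CRAWL_BUILD_ZIM
    return Action.KEEP_CRAWLING


# Both limits set: 1 GiB soft, 2 GiB hard.
GIB = 1024 ** 3
print(check_size(GIB + 1, soft_limit=GIB, hard_limit=2 * GIB))
```

The same shape would apply to a time limit. With both limits set on the same dimension, the hard one only matters if the crawler overshoots the soft one before stopping.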