Better indicate the interruption reason #584

Open
benoit74 opened this issue May 27, 2024 · 3 comments

Comments

@benoit74
Contributor

We have three things which can stop the crawler in the middle of a run:

  • --sizeLimit: the maximum WARC size
  • --timeLimit: the maximum duration of the crawl
  • --diskUtilization: the maximum disk usage (as a percentage); the crawler stops if the threshold is reached or is expected to be reached

As the flag names suggest, the disk one is not named "Limit", which reflects that it serves a different purpose.

We understand the size and time limits as requests by the user to stop (crawling) when reaching that point.

We understand the diskUtilization one as a technical safety net.

Currently, the two limits, the technical safety net, and a browser disconnection all lead to exit code 11, which makes it hard for users to diagnose or automate around (especially zimit ^^)

Would it make sense from your PoV to implement a different return code for each limit / technical safety net / browser disconnection?
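
To make the question concrete, here is a minimal sketch of how a wrapper such as zimit might branch on distinct exit codes. The numeric values and the names `InterruptReason` and `should_build_zim` are hypothetical placeholders, not codes the crawler returns; today all four cases return 11.

```python
from enum import IntEnum


class InterruptReason(IntEnum):
    # Hypothetical placeholder values, chosen only for illustration.
    # The crawler currently returns 11 for all of these cases.
    SIZE_LIMIT = 101
    TIME_LIMIT = 102
    DISK_UTILIZATION = 103
    BROWSER_DISCONNECTED = 104


# Limits the user explicitly asked for, as opposed to technical failures.
USER_REQUESTED = {InterruptReason.SIZE_LIMIT, InterruptReason.TIME_LIMIT}


def should_build_zim(exit_code: int) -> bool:
    """Decide whether a wrapper should go on to build a ZIM."""
    if exit_code == 0:
        return True  # crawl completed normally
    try:
        reason = InterruptReason(exit_code)
    except ValueError:
        return False  # unknown exit code: treat as a failure
    return reason in USER_REQUESTED
```

With a split like this, a size or time limit would still allow a ZIM to be built, while a disk or browser problem would not.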

I can work on this issue if that's OK with you.

@tw4l
Member

tw4l commented May 27, 2024

Hi @benoit74, will follow up further tomorrow but some of the rationale for the 11 exit code is here: #549.

Essentially, it's useful to have exit codes that Browsertrix can pick up on to know whether or not to restart crawler pods. Of course, this could be done through looking for several exit codes and in general we could use a better rationalization of what exit code is given when, so I think you're right that there is room for improvement here!

@benoit74
Contributor Author

Yep, using the exit code in zimit is also our goal, but we realize we need finer-grained detail than a single "general" exit code 11, especially since exit code 11 is now returned for far more than the original --timeLimit and --sizeLimit. I'm not sure this was entirely intentional; at least, this is what causes us some trouble (we shouldn't try to create a ZIM when --diskUtilization is already above the limit or when the browser connection has been lost).

Issue #549 makes me realize that this part of the documentation seems to have been lost in the transition to MkDocs; this issue should probably also add it back somewhere.

All that being said, no rush: better to define the plan well than to rush into something that will not make it in the end.

@benoit74
Contributor Author

After some thought, I propose that:

  • we do not modify the exit code (and maybe document more clearly that exit code 11 is meant to indicate the task can be restarted, though I don't know where this should go)
  • we add more details to the stats file (which we already use at Kiwix) about the other limits hit / browser disconnection (currently we only populate it with --limit details), so that anyone can make whatever decision they want based on fine-grained details of what happened

Proposed new stats format:

{
  "crawled": xx,
  "total": xx,
  "pending": xx,
  "failed": xx,
  "limit": {
    "max": xx,
    "hit": true/false
  },
  "sizeLimit": {
    "max": xx,
    "hit": true/false
  },
  "timeLimit": {
    "max": xx,
    "hit": true/false
  },
  "diskUtilization": {
    "max": xx,
    "hit": true/false
  },
  "browser_disconnected": true/false,
  "final_status": "done"/"canceled"/"interrupted"/"failed",
  "pendingPages": [
    ...
  ]
}
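
As a rough sketch of how a consumer could use such a file, the Python below reads a stats document in the proposed shape and lists what was hit. The key names come from the proposal above; the sample values, the helper names `interruption_reasons` and `safe_to_build_zim`, and the policy they encode are assumptions for illustration only.

```python
import json

# Keys taken from the proposed stats format above.
LIMIT_KEYS = ("limit", "sizeLimit", "timeLimit", "diskUtilization")


def interruption_reasons(stats: dict) -> list[str]:
    """Return the name of every limit or safety net reported as hit."""
    reasons = [key for key in LIMIT_KEYS if stats.get(key, {}).get("hit")]
    if stats.get("browser_disconnected"):
        reasons.append("browser_disconnected")
    return reasons


def safe_to_build_zim(stats: dict) -> bool:
    """Hypothetical policy: user-requested limits are fine, technical failures are not."""
    technical = {"diskUtilization", "browser_disconnected"}
    return not technical.intersection(interruption_reasons(stats))


# Invented sample values, only to exercise the helpers.
sample = json.loads("""
{
  "crawled": 120, "total": 300, "pending": 2, "failed": 1,
  "limit": {"max": 0, "hit": false},
  "sizeLimit": {"max": 1000000, "hit": false},
  "timeLimit": {"max": 3600, "hit": true},
  "diskUtilization": {"max": 90, "hit": false},
  "browser_disconnected": false,
  "final_status": "interrupted",
  "pendingPages": []
}
""")

print(interruption_reasons(sample))
print(safe_to_build_zim(sample))
```

The point of the design is that the exit code stays coarse (restartable or not), while the stats file carries the detail each consumer needs for its own decision.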

Are you OK with this idea? May I propose a PR?
