
Crawler getting stuck on Page Crashed #391

Closed
benoit74 opened this issue Sep 22, 2023 · 5 comments

Comments

@benoit74 (Contributor)

Kiwix has a crawler that got stuck without returning, with 0.11.1 (i.e. with #385 merged). A last log line is output, and then the process stays up but nothing more seems to happen.

Launch command (note that I modified the userAgentSuffix):

Running browsertrix-crawler crawl: crawl --failOnFailedSeed --waitUntil load --title Plotly Documentation --depth -3 --timeout 90 --scopeType domain --behaviors autoplay,autofetch,siteSpecific --behaviorTimeout 90 --sizeLimit 4294967296 --diskUtilization 90 --timeLimit 7200 --url https://plotly.com/python/ --userAgentSuffix [email protected] --cwd /output/.tmpq4knpe8p --statsFilename /output/crawl.json

Version log line:

{"timestamp":"2023-09-20T01:40:03.312Z","logLevel":"info","context":"general","message":"Browsertrix-Crawler 0.11.1 (with warcio.js 1.6.2 pywb 2.7.4)","details":{}}

Last log line is:

{"timestamp":"2023-09-20T03:01:59.247Z","logLevel":"error","context":"worker","message":"Page Crashed","details":{"type":"exception","message":"Page crashed!","stack":"Error: Page crashed!\n    at #onTargetCrashed (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:284:28)\n    at file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:153:41\n    at file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:248\n    at Array.map (<anonymous>)\n    at Object.emit (file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:232)\n    at CDPSessionImpl.emit (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/EventEmitter.js:82:22)\n    at CDPSessionImpl._onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:425:18)\n    at Connection.onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:255:25)\n    at WebSocket.<anonymous> (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/NodeWebSocketTransport.js:46:32)\n    at callListener (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:290:14)","page":"https://plotly.com/python/3d-surface-plots/","workerid":0}}
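The crawler emits one JSON object per log line, so crash events like the one above can be pulled out of a log file with a small script (a sketch; the field names match the log lines quoted in this report, but the helper itself is not part of browsertrix-crawler):

```python
import json

def crash_events(lines):
    """Yield (timestamp, page, workerid) for every 'Page Crashed' error entry."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines
        if entry.get("logLevel") == "error" and entry.get("message") == "Page Crashed":
            details = entry.get("details", {})
            yield entry["timestamp"], details.get("page"), details.get("workerid")

# Usage: with open("crawl.log") as f: crashes = list(crash_events(f))
```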

Do not hesitate to ask if more info is needed.

@benoit74 benoit74 changed the title Crawler getting stuck Crawler getting stuck on Page Crashed Sep 22, 2023
@ikreymer (Member)

Thanks for the report, will try to repro. It should have been able to continue after the page crash.

@ikreymer (Member) commented Oct 4, 2023

This has hopefully been fixed in 0.11.2. It is very hard to be 100% sure, but hopefully it won't happen again.

@benoit74 (Contributor, Author)

Some good and some bad news on this topic.

I confirm the crawler now continues after a page crash. That's great.

However, it looks like we are seeing new problems around page crashes (with 0.12.3 and 0.12.4).

Details are present in openzim/zimit#266 and openzim/zimit#283

  • it looks like once a page has crashed, crashes keep happening for other pages (some pages succeed, but we keep hitting new crashes)
  • it usually gets so bad that we hit a further issue (a timeout at page initialization, if I read the log correctly) which stops the crawler (though the crawler now exits properly)
  • the issue happens only on a few websites; most of our recipes still run fine. For unknown reasons, https://solar.lowtechmagazine.com/ seems to be a good test website for reproducing these crashes
  • on two occasions (which led to the creation of "solar.lowtechmagazine.com is very unstable", openzim/zimit#283), the final symptoms were slightly different, but after investigation I believe the root cause may be identical

Help or any suggestions on what to test to make progress on this would be welcome. The most important issue for us is probably the new situation in openzim/zimit#283, where the crawler seems to return code 11 even though it actually faced a critical failure, not a limit. This is a problem for us because we treat hitting a limit as "normal" and continue processing by creating our ZIM. It is more serious than a real crawler crash, because we are not alerted to the issue. If it is easy to identify and fix whatever leads the crawler to "believe" it hit a limit, that would be a great enhancement.
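For context, a wrapper such as zimit typically branches on the crawler's exit status. A minimal sketch of the handling described above (assuming, per this report, that exit code 11 means a size/time limit was reached; the function name and return strings are illustrative, not from either codebase):

```python
def classify_exit(status: int) -> str:
    """Map a crawler exit status to the wrapper's decision (sketch)."""
    if status == 0:
        return "finished"       # normal completion: build the ZIM
    if status == 11:
        return "limit-reached"  # limit hit: treated as "normal", still build the ZIM
    return "failed"             # anything else: alert, do not build the ZIM
```

With this scheme, a crash that exits with code 11 is indistinguishable from a genuine limit and silently takes the "build the ZIM" branch, which is exactly the problem described above.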

One side question: is it possible to ask the crawler to stop on the first page crash (instead of trying to continue)?

@benoit74 (Contributor, Author)

I confirm that crawler 1.x seems to have solved this issue.

Thank you all for the great work that has gone into the 1.x release(s)!

@github-project-automation github-project-automation bot moved this from Triage to Done! in Webrecorder Projects May 28, 2024