Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

solar.lowtechmagazine.com is very unstable #283

Closed
benoit74 opened this issue Feb 19, 2024 · 5 comments
Closed

solar.lowtechmagazine.com is very unstable #283

benoit74 opened this issue Feb 19, 2024 · 5 comments

Comments

@benoit74
Copy link
Collaborator

In addition to #266 issue which has often been encountered on solar.lowtechmagazine.com recipe, the situation became clear the situation is even worse with last ZIM update I tried to perform which failed in very various manners.

All executions mentioned below ran with

  • zimit 2.0.0-dev1
  • warc2zim 2.0.0-dev4 (i.e. with last fixes for document decoding enhancements / fallbacks)
  • Browsertrix-Crawler 0.12.4 (with warcio.js 1.6.2 pywb 2.7.4)

First execution (which ran on athena18) succeeded from Zimfarm perspective, but the ZIM was only 53.61 MB instead of about 300-310 MB.

Second execution (which ran on athena18) failed with a new occurrence of #266 (see #266 (comment) for new details I found out).

Third execution (which ran on athena18) succeeded from Zimfarm perspective, but the ZIM was only 63.57 MB instead of about 300-310 MB.

Fourth execution (which ran on ondemand) succeeded to produce a ZIM of about 300-310 MB. We still have 1388 failed pages (out of 2779), but we do not have any "Page crashed" message in the log. Looking at few failed pages, it looks like real issues (invalid href links) in the source website.

@benoit74 benoit74 added the bug label Feb 19, 2024
@benoit74
Copy link
Collaborator Author

benoit74 commented Feb 19, 2024

Diving a bit deeper into the third execution failure above, I've found that it is probably only a slightly different occurrence of #266

  • we have encountered 12 Page crashed
  • some pages have been crawled successfully between the Page crashed
  • the ZIM was produced because Browsertrix considered it had to stop and probably returned code 11, which zimit interprets as crawl interupted by a limit (we do not have the return code in the log but we have this zimit message which is displayed only when return code is 11.
  • for the last page before the crawler stops, we have 4 times New Window Timed Out message (never occurred before in this crawl session) and what looks like the final message Browser disconnected (crashed?), interrupting crawl
{"timestamp":"2024-02-17T04:16:43.909Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://solar.lowtechmagazine.com/2021/10/how-to-build-a-low-tech-solar-panel/","workerid":0}}
{"timestamp":"2024-02-17T04:16:44.358Z","logLevel":"error","context":"worker","message":"Page Crashed","details":{"type":"exception","message":"Page crashed!","stack":"Error: Page crashed!\n    at #onTargetCrashed (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:284:28)\n    at file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:153:41\n    at file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:248\n    at Array.map (<anonymous>)\n    at Object.emit (file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:232)\n    at CDPSessionImpl.emit (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/EventEmitter.js:82:22)\n    at CDPSessionImpl._onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:425:18)\n    at Connection.onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:255:25)\n    at WebSocket.<anonymous> (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/NodeWebSocketTransport.js:46:32)\n    at callListener (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:290:14)","page":"https://solar.lowtechmagazine.com/2021/10/how-to-build-a-low-tech-solar-panel/","workerid":0}}
{"timestamp":"2024-02-17T04:16:44.371Z","logLevel":"warn","context":"pageStatus","message":"Page Load Failed","details":{"loadState":0,"page":"https://solar.lowtechmagazine.com/2021/10/how-to-build-a-low-tech-solar-panel/","workerid":0}}
{"timestamp":"2024-02-17T04:16:44.729Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/"}}
{"timestamp":"2024-02-17T04:16:44.730Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":234,"total":1337,"pending":1,"failed":34,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2024-02-17T04:16:44.422Z\",\"url\":\"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/\",\"added\":\"2024-02-17T04:06:40.040Z\",\"depth\":2}"]}}
{"timestamp":"2024-02-17T04:16:45.061Z","logLevel":"warn","context":"general","message":"Link Extraction failed in frame","details":{"reason":{"name":"TargetCloseError"},"page":"https://solar.lowtechmagazine.com/2021/10/how-to-build-a-low-tech-solar-panel/","workerid":0}}
{"timestamp":"2024-02-17T04:16:45.151Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:16:45.499Z","logLevel":"error","context":"worker","message":"Page Crashed","details":{"type":"exception","message":"Page crashed!","stack":"Error: Page crashed!\n    at #onTargetCrashed (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:284:28)\n    at file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:153:41\n    at file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:248\n    at Array.map (<anonymous>)\n    at Object.emit (file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:232)\n    at CDPSessionImpl.emit (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/EventEmitter.js:82:22)\n    at CDPSessionImpl._onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:425:18)\n    at Connection.onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:255:25)\n    at WebSocket.<anonymous> (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/NodeWebSocketTransport.js:46:32)\n    at callListener (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:290:14)","page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:16:45.500Z","logLevel":"info","context":"pageStatus","message":"Page Finished","details":{"loadState":2,"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:16:46.046Z","logLevel":"warn","context":"general","message":"Link Extraction failed in frame","details":{"reason":{"name":"TargetCloseError"},"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:17:05.550Z","logLevel":"error","context":"worker","message":"New Window Timed Out","details":{"seconds":20,"workerid":0}}
{"timestamp":"2024-02-17T04:17:05.550Z","logLevel":"warn","context":"worker","message":"Error getting new page","details":{"workerid":0,"type":"exception","message":"timed out","stack":"Error: timed out\n    at PageWorker.initPage (file:///app/util/worker.js:135:17)\n    at async PageWorker.runLoop (file:///app/util/worker.js:254:22)\n    at async PageWorker.run (file:///app/util/worker.js:227:7)\n    at async Promise.allSettled (index 0)\n    at async Crawler.crawl (file:///app/crawler.js:887:5)\n    at async Crawler.run (file:///app/crawler.js:323:7)"}}
{"timestamp":"2024-02-17T04:17:06.051Z","logLevel":"warn","context":"worker","message":"Retrying getting new page","details":{"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:17:26.053Z","logLevel":"error","context":"worker","message":"New Window Timed Out","details":{"seconds":20,"workerid":0}}
{"timestamp":"2024-02-17T04:17:26.054Z","logLevel":"warn","context":"worker","message":"Error getting new page","details":{"workerid":0,"type":"exception","message":"timed out","stack":"Error: timed out\n    at PageWorker.initPage (file:///app/util/worker.js:135:17)\n    at async PageWorker.runLoop (file:///app/util/worker.js:254:22)\n    at async PageWorker.run (file:///app/util/worker.js:227:7)\n    at async Promise.allSettled (index 0)\n    at async Crawler.crawl (file:///app/crawler.js:887:5)\n    at async Crawler.run (file:///app/crawler.js:323:7)"}}
{"timestamp":"2024-02-17T04:17:26.554Z","logLevel":"warn","context":"worker","message":"Retrying getting new page","details":{"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:17:46.556Z","logLevel":"error","context":"worker","message":"New Window Timed Out","details":{"seconds":20,"workerid":0}}
{"timestamp":"2024-02-17T04:17:46.556Z","logLevel":"warn","context":"worker","message":"Error getting new page","details":{"workerid":0,"type":"exception","message":"timed out","stack":"Error: timed out\n    at PageWorker.initPage (file:///app/util/worker.js:135:17)\n    at async PageWorker.runLoop (file:///app/util/worker.js:254:22)\n    at async PageWorker.run (file:///app/util/worker.js:227:7)\n    at async Promise.allSettled (index 0)\n    at async Crawler.crawl (file:///app/crawler.js:887:5)\n    at async Crawler.run (file:///app/crawler.js:323:7)"}}
{"timestamp":"2024-02-17T04:17:47.057Z","logLevel":"warn","context":"worker","message":"Retrying getting new page","details":{"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:18:07.057Z","logLevel":"error","context":"worker","message":"New Window Timed Out","details":{"seconds":20,"workerid":0}}
{"timestamp":"2024-02-17T04:18:07.058Z","logLevel":"warn","context":"worker","message":"Error getting new page","details":{"workerid":0,"type":"exception","message":"timed out","stack":"Error: timed out\n    at PageWorker.initPage (file:///app/util/worker.js:135:17)\n    at async PageWorker.runLoop (file:///app/util/worker.js:254:22)\n    at async PageWorker.run (file:///app/util/worker.js:227:7)\n    at async Promise.allSettled (index 0)\n    at async Crawler.crawl (file:///app/crawler.js:887:5)\n    at async Crawler.run (file:///app/crawler.js:323:7)"}}
{"timestamp":"2024-02-17T04:18:07.557Z","logLevel":"warn","context":"worker","message":"Retrying getting new page","details":{"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:18:09.084Z","logLevel":"error","context":"browser","message":"Browser disconnected (crashed?), interrupting crawl","details":{}}
{"timestamp":"2024-02-17T04:18:09.088Z","logLevel":"warn","context":"worker","message":"Error getting new page","details":{"workerid":0,"type":"exception","message":"Protocol error (Network.setUserAgentOverride): Session closed. Most likely the page has been closed.","stack":"TargetCloseError: Protocol error (Network.setUserAgentOverride): Session closed. Most likely the page has been closed.\n    at CDPSessionImpl.send (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:405:35)\n    at NetworkManager.setUserAgent (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/NetworkManager.js:144:28)\n    at CDPPage.setUserAgent (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:451:50)\n    at Browser.newWindowPageWithCDP (file:///app/util/browser.js:288:20)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async PageWorker.initPage (file:///app/util/worker.js:126:24)\n    at async PageWorker.runLoop (file:///app/util/worker.js:254:22)\n    at async PageWorker.run (file:///app/util/worker.js:227:7)\n    at async Promise.allSettled (index 0)\n    at async Crawler.crawl (file:///app/crawler.js:887:5)"}}
{"timestamp":"2024-02-17T04:18:09.088Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://solar.lowtechmagazine.com/2022/03/how-to-build-a-practical-household-bike-generator/"}}
{"timestamp":"2024-02-17T04:18:09.090Z","logLevel":"warn","context":"pageStatus","message":"Page Load Failed","details":{"loadState":0}}
{"timestamp":"2024-02-17T04:18:09.091Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":235,"total":1337,"pending":1,"failed":35,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2024-02-17T04:16:45.517Z\",\"url\":\"https://solar.lowtechmagazine.com/2022/03/how-to-build-a-practical-household-bike-generator/\",\"added\":\"2024-02-17T04:06:40.040Z\",\"depth\":2}"]}}
{"timestamp":"2024-02-17T04:18:09.112Z","logLevel":"info","context":"worker","message":"Worker exiting, all tasks complete","details":{"workerid":0}}
{"timestamp":"2024-02-17T04:18:09.153Z","logLevel":"info","context":"general","message":"Saving crawl state to: /output/.tmper_31529/collections/crawl-20240217040548965/crawls/crawl-20240217041809-5a6f97c85870.yaml","details":{}}
{"timestamp":"2024-02-17T04:18:09.176Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":236,"total":1337,"pending":0,"failed":35,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"timestamp":"2024-02-17T04:18:09.178Z","logLevel":"info","context":"general","message":"Waiting to ensure pending data is written to WARCs...","details":{}}
{"timestamp":"2024-02-17T04:18:09.203Z","logLevel":"info","context":"general","message":"Crawling done","details":{}}
{"timestamp":"2024-02-17T04:18:09.204Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: interrupted","details":{}}

@Jaifroid
Copy link

Remember this server is a solar-powered installation with a backup battery. I noticed that our scrapes always seem to be running when it's around 25% battery (coincidence?). Maybe this is irrelevant, but it's not a fast server. Any way to slow the crawl down so it doesn't overwhelm the server? Not sure if this would help, but it's a guess...

@benoit74
Copy link
Collaborator Author

Thank you for the suggestion, I had this in mind as well.

I doubt there is a link because here the logs says that the problem is not a timeout while loading the page (which would indicate an upstream server issue) but a page crash or a new window timeout, meaning an issue with the brave browser on the scraper.

I will keep this in mind anyway.

@benoit74
Copy link
Collaborator Author

Looks like Browsertrix crawler 1.0 might have enhanced the situation

@benoit74
Copy link
Collaborator Author

I confirm that issue seems to be gone, I ran the recipe many times recently without any problem. Last runs produced quite a smaller ZIM but there is no significant message in the logs about pages which might be missing (143 links are broken on the solar.lowtechmagazine.com domain, but the ones I randomly checked are really not available online, so this is "normal").

@benoit74 benoit74 closed this as not planned Won't fix, can't repro, duplicate, stale Jun 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants