Releases: webrecorder/browsertrix-crawler

Browsertrix Crawler 0.7.0 Beta 5

21 Sep 01:30
65933c6
Pre-release

What's Changed

  • Interrupt Handling Fixes by @ikreymer in #167
  • Update to Browsertrix Behaviors 0.3.4 - Fix for lazy-loaded images #165

Full Changelog: 0.7.0-beta.4...0.7.0-beta.5

Browsertrix Crawler 0.7.0 Beta 4

09 Sep 06:57
314ee3f
Pre-release

Fixes related to wait times, including:

  • Better netIdleWait defaults: if not set, it defaults to 15 seconds for page/page-spa scope, otherwise to 2 seconds
  • Default behaviors: autoscroll is now included in the default behaviors as well
  • Restart: if the crawl is already done, don't attempt to crawl further. If 'waitOnDone' is set, wait for a signal before exiting.
  • bump to puppeteer-core 17.1.2
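
The netIdleWait default rule above can be sketched as a small function. This is only an illustration of the rule as stated in this note: the function name and signature are hypothetical, and the crawler itself is not written in Python.

```python
from typing import Optional


def resolve_net_idle_wait(scope_type: str, net_idle_wait: Optional[int] = None) -> int:
    """Return the network-idle wait in seconds (hypothetical helper).

    Mirrors the rule in the release note: an explicit value wins,
    otherwise 15 seconds for page/page-spa scope and 2 seconds for
    every other scope.
    """
    if net_idle_wait is not None:
        # The user set --netIdleWait explicitly, so no default applies.
        return net_idle_wait
    if scope_type in ("page", "page-spa"):
        return 15
    return 2
```

For example, `resolve_net_idle_wait("page-spa")` gives 15, while `resolve_net_idle_wait("prefix")` gives 2.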

Full Changelog: 0.7.0-beta.3...0.7.0-beta.4

Browsertrix Crawler 0.7.0 Beta 3

03 Sep 01:06
Pre-release

What's Changed

  • Overhaul of page concurrency system: better detection of stuck windows, and each window is reused for only 25 pages #157
  • Logging improvements: pywb.log written with --logging pywb, JS errors logged with --logging jserrors #158
  • Avoid getting stuck on pending requests at end of crawl #161
  • Update to Browsertrix Behaviors 0.3.3: better crawling of Twitter and autoplay of videos
  • Update to pywb 2.6.8: includes better rewriting of embedded Twitter videos

Full Changelog: 0.7.0-beta.2...0.7.0-beta.3

Browsertrix Crawler 0.7.0 Beta 2

18 Aug 05:23
Pre-release

Fixes include:

  • Default --waitUntil set to load instead of load,networkidle2, due to occasional hanging waiting for both
  • Add --netIdleWait to specify wait for network idle after load (defaults to 10 seconds)
  • Update to puppeteer 16.1.0
  • Logging: if pywb logging is enabled, write logs to collection dir ./logs/pywb.log and ./logs/redis.log
  • Logging: reduce logging by not printing duplicate behavior status logs
  • pywb/openssl: allow 'unsafe legacy renegotiation' to avoid errors when capturing sites that use older SSL

Full Changelog: 0.7.0-beta.1...0.7.0-beta.2

Browsertrix Crawler 0.7.0 Beta 1

03 Jul 18:12
Pre-release

Update to Chrome/Chromium 101, using new browsertrix-browser-base image with browser installed. Also includes update to Ubuntu 22.04 and additional fonts.

Browsertrix Crawler 0.6.0

17 Jun 20:12
cf90304

This release features additional improvements to support parallel crawls in Browsertrix Cloud:

  • Add a --waitOnDone option, which makes Browsertrix Crawler wait when finished (for use with Browsertrix Cloud)
  • When running with redis shared state, set the :status field to running, failing/failed or done to let the job controller know when the crawl is finished.
  • Set redis state to failing in case of exception, and to failed in case of >3 or more failed exits within 60 seconds (but don't mark as failed if all pages are finished and >0 pages).
  • When receiving a SIGUSR1, don't wait on done (assume final exit due to scale down).
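
The failing/failed rule above can be illustrated with a rough sketch. Everything here is hypothetical: the function name, its arguments and the way exit times are tracked are assumptions, the ">3 or more" threshold is read as "more than 3", and a crawl whose pages are all finished is assumed to stay at failing. The crawler's own implementation may differ.

```python
from typing import List

FAIL_WINDOW_SECS = 60  # window stated in the release note
MAX_FAILED_EXITS = 3   # assumption: ">3 or more" is read here as "more than 3"


def status_on_exception(exit_times: List[float], now: float,
                        pages_done: int, pages_total: int) -> str:
    """Pick the status to store after an exception (hypothetical helper).

    exit_times holds the timestamps (in seconds) of failed exits so far,
    including the current one.
    """
    # Count only the failed exits that fall inside the 60-second window.
    recent = [t for t in exit_times if 0 <= now - t <= FAIL_WINDOW_SECS]
    # Per the note, a crawl with all pages finished (and more than 0 pages)
    # is not marked as failed.
    all_pages_finished = pages_done > 0 and pages_done >= pages_total
    if len(recent) > MAX_FAILED_EXITS and not all_pages_finished:
        return "failed"
    return "failing"
```

Under these assumptions, four failed exits within 40 seconds on an unfinished crawl yield failed, while the same exits on a crawl with all pages finished yield failing.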

Screencasting Fixes:

  • More efficient screencasting, don't end screencasting when page ends, only when target is destroyed!
  • Keep the same screencasting connection from one page to the next, as the targets are reused in 'window' concurrency mode

Crawl Limits (from 0.6.0 beta)

  • Size limit (in bytes) via --sizeLimit
  • Total time limit (in seconds) via --timeLimit
  • Overwrite collection (delete existing) via --overwrite
  • Fixes to interrupting a single instance in a shared state crawl

Profile Creation (from 0.6.0 beta)

  • force all cookies, including session cookies, to fixed duration in days, configurable via --cookieDays

Browsertrix Crawler 0.6.0-beta.1

19 May 06:45
Pre-release

Additional crawl limits:

  • Size limit (in bytes) via --sizeLimit
  • Total time limit (in seconds) via --timeLimit
  • Overwrite collection (delete existing) via --overwrite
  • Fixes to interrupting a single instance in a shared state crawl

Improved profile creation:

  • Additional API
  • ability to shutdown profile browser if no incoming pings (via --shutdownWait)
  • force all cookies to fixed duration (via --cookieDays)

Browsertrix Crawler 0.5.1

16 Apr 03:14
5dfbfbe

Changes

Dependency Update Release:

  • update pywb to 2.6.7, fix possible error in CDX indexing via --generateCDX
  • update wacz to 0.4.6, ensure WACZ file is closed, with better and more error-resilient text extraction
  • update browsertrix-behaviors to 0.3.0, support for Telegram auto-scrolling behavior (and generic 'autoscroll up' support)

Full Changelog: 0.5.0...0.5.1

Browsertrix Crawler 0.5.0

11 Apr 22:27

Changes and Features

  • Scope: support for scopeType: domain, which includes all subdomains and ignores 'www.' if specified in the seed.
  • Profiles: support loading remote profile from URL as well as local file
  • Non-HTML Pages: Load non-200 responses in browser, even if non-HTML, fix waiting issues with non-HTML pages (e.g. PDFs)
  • Config options: Fix setting user-agent
  • Page behavior: latest browsertrix-behaviors, also add experimental Cloudflare interstitial wait.
  • Error handling: better error handling for redis errors
  • State: Support loading of crawl state from config.yaml
  • State: Support serialization of crawl state to crawls subdirectory, both while running (keeping last N states) and on exit.
  • State: Graceful saving of crawl state on ctrl+c interrupt
  • State: Memory or Redis based crawl state
  • Config: Support additional options via CRAWL_ARGS environment variable
  • WACZ Upload: Support for S3 upload of WACZ upon crawl completion
  • WACZ Upload: HTTP/Redis webhook to notify of upload completion
  • Crawl Scope: Support for extraHops to optionally crawl an extra hop beyond scope
  • Signing: Support for optional signing of WACZ
  • Dependencies: update to latest pywb, wacz and browsertrix-behaviors packages
  • Params: Support custom browser args
  • Docs: Improve customization documentation

Full Changelog: 0.4.4...0.5.0

Browsertrix Crawler 0.5.0 Beta 8

23 Mar 01:08
Pre-release

This release includes fixes for:

  • Improved capture of non-HTML pages, fixes #129
  • For scopeType: domain, if specified URL starts with www., include the non-www version.
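
The www. handling for scopeType: domain can be sketched roughly as follows. This is a minimal illustration of the behavior described in these notes, not the crawler's implementation: both function names are hypothetical, and the suffix-based subdomain match is an assumption.

```python
from urllib.parse import urlsplit


def domain_for_seed(seed_url: str) -> str:
    """Return the seed host with a leading 'www.' dropped (hypothetical helper)."""
    host = urlsplit(seed_url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def in_domain_scope(seed_url: str, candidate_url: str) -> bool:
    """Check whether candidate_url is on the seed's domain or one of its subdomains."""
    domain = domain_for_seed(seed_url)
    host = urlsplit(candidate_url).hostname or ""
    # Match the bare domain itself, or any host ending in ".<domain>".
    return bool(domain) and (host == domain or host.endswith("." + domain))
```

With a seed of https://www.example.com/, this sketch treats both https://example.com/page and https://blog.example.com/ as in scope, and https://notexample.com/ as out of scope.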