Releases: webrecorder/browsertrix-crawler

Browsertrix Crawler 0.12.1

03 Nov 22:18
dd7b926

Fixes

  • Optimize exclusion removal, follow-up to #408
  • Fix regression where --text false was rejected, even though Browsertrix Cloud passes it (see: webrecorder/browsertrix#1334)

What's Changed

  • Exclusion Filtering Optimizations: check exclusion before loading new page + additional improvements @ikreymer in #423
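The idea of checking exclusions before a page is loaded can be sketched as follows. This is a minimal illustration only, not the crawler's actual implementation: the function names, the in-memory queue, and the sample patterns are all hypothetical.

```python
import re
from collections import deque


def is_excluded(url, exclusions):
    """Return True if any exclusion regex matches anywhere in the URL."""
    return any(rx.search(url) for rx in exclusions)


def next_page(queue, exclusions):
    """Pop queued URLs until one passes the exclusion check.

    Excluded URLs are dropped without being loaded. Returns None
    once the queue is empty.
    """
    while queue:
        url = queue.popleft()
        if not is_excluded(url, exclusions):
            return url
    return None


# Hypothetical exclusion patterns and queue contents.
exclusions = [re.compile(r"/login"), re.compile(r"\?sort=")]
queue = deque([
    "https://example.com/login?next=/",
    "https://example.com/list?sort=asc",
    "https://example.com/about",
])

first = next_page(queue, exclusions)
print(first)
```

Filtering at dequeue time means an exclusion added mid-crawl takes effect on URLs that were queued before it existed, without a browser page ever being opened for them.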

Full Changelog: v0.12.0...v0.12.1

Browsertrix Crawler 0.12.0

02 Nov 18:55
15661eb

Major Changes

  • Use the same version of Brave for the base image, instead of slightly different Chrome (amd64) and Chromium (arm64) builds
  • Support for faster cancelation of crawl via Redis key + signal
  • Include CRC32 in storage webhook for nested WACZ support
  • Dynamic exclusion addition, removal, and queue filtering via Redis message queue
  • Text extraction stored in WARC records (both initial and final page after behaviors) with new --text options
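To illustrate the CRC32 item above: every entry in a ZIP file carries a CRC-32 of its data in its header, so a checksum computed at upload time can plausibly be reused when a WACZ is later nested inside another ZIP-based package. The sketch below shows a streaming CRC-32 computation using Python's standard `zlib` module. The helper names are hypothetical, and the release notes do not specify the webhook payload format, so none is shown.

```python
import zlib


def crc32_of_chunks(chunks):
    """Compute a CRC-32 incrementally over an iterable of byte chunks."""
    crc = 0
    for chunk in chunks:
        # zlib.crc32 accepts the running checksum as its second argument.
        crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def crc32_of_file(path, chunk_size=1 << 20):
    """Compute the CRC-32 of a file without reading it into memory at once."""
    with open(path, "rb") as f:
        return crc32_of_chunks(iter(lambda: f.read(chunk_size), b""))


# Chunked and one-shot computation agree.
print(crc32_of_chunks([b"hello ", b"world"]) == zlib.crc32(b"hello world"))
```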

What's Changed

  • Switch to Brave Base Image by @ikreymer in #400
  • Store crawler start and end times in Redis lists by @tw4l in #397
  • additional failure logic: by @ikreymer in #402
  • tests: disable ad-block tests: seeing inconsistent ci behavior by @ikreymer in #407
  • Fast cancelation + remove time counter by @ikreymer in #406
  • disable component updates by setting --component-updater to invalid URL by @ikreymer in #413
  • storage: also compute crc32 as part of storage webhook when uploading… by @ikreymer in #414
  • Support adding/removing exclusions without restarting the crawler by @ikreymer in #408
  • load saved state fixes + redis tests by @ikreymer in #415
  • Return User-Agent on all code path to set headers appropriately by @benoit74 in #420
  • improved text extraction: (addresses #403) by @ikreymer in #404
  • More flexible multi value arg parsing + README update for 0.12.0 by @ikreymer in #422

Full Changelog: v0.11.2...v0.12.0

Browsertrix Crawler 0.12.0 Beta 2

28 Oct 01:36
Pre-release

What's Changed

  • disable component updates by setting --component-updater to invalid URL by @ikreymer in #413
  • storage: also compute crc32 as part of storage webhook when uploading… by @ikreymer in #414
  • Support adding/removing exclusions without restarting the crawler by @ikreymer in #408
  • load saved state fixes + redis tests by @ikreymer in #415
  • Return User-Agent on all code path to set headers appropriately by @benoit74 in #420

Full Changelog: v0.12.0-beta.1...v0.12.0-beta.2

Browsertrix Crawler 0.12.0 Beta 1

09 Oct 21:05
Pre-release

What's Changed

  • Store crawler start and end times in Redis lists by @tw4l in #397
  • additional failure logic: by @ikreymer in #402
  • tests: disable ad-block tests: seeing inconsistent ci behavior by @ikreymer in #407
  • Fast cancelation + remove time counter by @ikreymer in #406

Full Changelog: v0.12.0-beta.0...v0.12.0-beta.1

Browsertrix Crawler 0.12.0 Beta 0

02 Oct 21:40
f453dbf
Pre-release

Switching to Brave from Chrome/Chromium!

What's Changed

Full Changelog: v0.11.2...v0.12.0-beta.0

Browsertrix Crawler 0.11.2

29 Sep 18:54

What's Changed

New Contributors

Full Changelog: v0.11.1...v0.11.2

Browsertrix Crawler 0.11.1

19 Sep 03:45
c6cbbc1

Bug Fix Release

Should fix several issues where crawls get stuck and stop progressing, or where the screencast stops after a while. Changes include:

  • Detecting 'page crash' events and logging them
  • Detecting 'browser crash' events and interrupting crawl (after saving state / ensuring data is written to WARCs)

What's Changed

Full Changelog: v0.11.0...v0.11.1

Browsertrix Crawler 0.11.0

15 Sep 18:28

New Features

  • Store favicon URLs as favIconUrl in pages.jsonl
  • Support for filtering sitemap by date (from specified date)
  • Link extraction optimizations
  • Behaviors now run only after the page is fully loaded and link extraction has finished; previously, autoplay/autofetch would start right away.
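As a small illustration of the favIconUrl item above, the following sketch reads a JSON Lines page list and collects the favicon URL for each page that has one. Only the `favIconUrl` field name comes from the release notes; the helper name and the sample records (including the header line and the other fields) are made up for this example and may not match real crawler output exactly.

```python
import io
import json


def favicon_urls(lines):
    """Map page URL -> favicon URL for every record that has both fields."""
    result = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Records without a page URL (such as a header line) are skipped.
        if "url" in record and "favIconUrl" in record:
            result[record["url"]] = record["favIconUrl"]
    return result


# Made-up sample data standing in for a pages.jsonl file.
sample = io.StringIO(
    '{"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}\n'
    '{"id": "1", "url": "https://example.com/", "title": "Example", '
    '"favIconUrl": "https://example.com/favicon.ico"}\n'
    '{"id": "2", "url": "https://example.com/about", "title": "About"}\n'
)

icons = favicon_urls(sample)
print(icons)
```

With a real file, the same function could be called as `favicon_urls(open(path, encoding="utf-8"))`, where `path` points at the crawl's pages.jsonl.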

What's Changed

New Contributors

Full Changelog: v0.10.4...v0.11.0

Browsertrix Crawler 0.10.4

23 Aug 00:22
cf404ef

Bug fix release

What's Changed

  • args parsing: fix parseRx() for inclusions/exclusions to deal with no… by @ikreymer in #353
  • mark for upload-and-delete when crawl is interrupted for any limit: by @ikreymer in #354
  • improve crawl stopped check with unified isCrawlRunning() check with … by @ikreymer in #356

Full Changelog: v0.10.3...v0.10.4

Browsertrix Crawler 0.10.3

08 Aug 17:24

What's Changed

  • Fix for sizeLimit: only delete local data if a WACZ has been uploaded by @ikreymer in #347
  • seed parsing: return null if invalid url encountered in parseUrl to a… by @ikreymer in #349

Full Changelog: 0.10.2...v0.10.3