Releases · webrecorder/browsertrix-crawler

03 Nov 22:18

ikreymer

v0.12.1

dd7b926

Browsertrix Crawler 0.12.1

Fixes

Optimize exclusion removal, follow-up to #408
Fix regression with --text false being rejected, while in use with Browsertrix Cloud (see: webrecorder/browsertrix#1334)

What's Changed

Exclusion Filtering Optimizations: check exclusion before loading new page + additional improvements @ikreymer in #423
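
The idea of checking exclusions before a page is loaded, and of dropping already-queued URLs when an exclusion is added, can be sketched roughly as follows. This is an illustrative Python sketch, not the crawler's actual code: the function names `is_excluded` and `filter_queue`, the in-memory `deque`, and the sample patterns are all assumptions made for the example.

```python
import re
from collections import deque

def is_excluded(url, exclusions):
    """Return True if any exclusion regex matches anywhere in the URL."""
    return any(rx.search(url) for rx in exclusions)

def filter_queue(queue, exclusions):
    """Drop already-queued URLs that match any exclusion (hypothetical helper)."""
    return deque(u for u in queue if not is_excluded(u, exclusions))

# Hypothetical exclusion patterns and queue contents, for illustration only.
exclusions = [re.compile(r"/login"), re.compile(r"\?sort=")]
queue = deque([
    "https://example.com/",
    "https://example.com/login",
    "https://example.com/list?sort=asc",
    "https://example.com/about",
])

remaining = filter_queue(queue, exclusions)

# Checking each URL again right before loading avoids fetching a page
# that was excluded after it had been queued.
to_load = [u for u in remaining if not is_excluded(u, exclusions)]
```

Doing the check at dequeue time as well as at filter time keeps the behavior correct even if the queue filter has not finished running when a worker picks up the next URL.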

Full Changelog: v0.12.0...v0.12.1

Contributors

ikreymer


02 Nov 18:55

ikreymer

v0.12.0

15661eb

Browsertrix Crawler 0.12.0

Major Changes

Use the same version of Brave for the base image, instead of slightly different Chrome (amd64) and Chromium (arm64) builds
Support for faster cancelation of crawl via Redis key + signal
Include CRC32 in storage webhook for nested WACZ support
Dynamic exclusion addition/queue filter/removal via redis message queue
Text extraction stored in WARC records (both initial and final page after behaviors) with new --text options
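
The release notes do not say how the CRC32 is computed. As a general illustration only, a CRC32 can be accumulated chunk by chunk while data is being streamed to storage, so the file never needs to be read a second time. The sketch below uses Python's standard `zlib.crc32`; the function name `crc32_of_chunks` is hypothetical and is not part of the crawler.

```python
import zlib

def crc32_of_chunks(chunks):
    """Accumulate a CRC32 over an iterable of byte chunks (hypothetical helper)."""
    crc = 0
    for chunk in chunks:
        # Passing the running value continues the checksum across chunks.
        crc = zlib.crc32(chunk, crc)
    # Mask to an unsigned 32-bit value for portability.
    return crc & 0xFFFFFFFF

# The chunked result matches the checksum of the concatenated bytes.
streamed = crc32_of_chunks([b"hello ", b"world"])
whole = zlib.crc32(b"hello world") & 0xFFFFFFFF
```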

What's Changed

Switch to Brave Base Image by @ikreymer in #400
Store crawler start and end times in Redis lists by @tw4l in #397
additional failure logic: by @ikreymer in #402
tests: disable ad-block tests: seeing inconsistent ci behavior by @ikreymer in #407
Fast cancelation + remove time counter by @ikreymer in #406
disable component updates by setting --component-updater to invalid URL by @ikreymer in #413
storage: also compute crc32 as part of storage webhook when uploading… by @ikreymer in #414
Support adding/removing exclusions without restarting the crawler by @ikreymer in #408
load saved state fixes + redis tests by @ikreymer in #415
Return User-Agent on all code path to set headers appropriately by @benoit74 in #420
improved text extraction: (addresses #403) by @ikreymer in #404
More flexible multi value arg parsing + README update for 0.12.0 by @ikreymer in #422

Full Changelog: v0.11.2...v0.12.0

Contributors

ikreymer, tw4l, and benoit74


28 Oct 01:36

ikreymer

v0.12.0-beta.2

064db52

Browsertrix Crawler 0.12.0 Beta 2

Pre-release

What's Changed

disable component updates by setting --component-updater to invalid URL by @ikreymer in #413
storage: also compute crc32 as part of storage webhook when uploading… by @ikreymer in #414
Support adding/removing exclusions without restarting the crawler by @ikreymer in #408
load saved state fixes + redis tests by @ikreymer in #415
Return User-Agent on all code path to set headers appropriately by @benoit74 in #420

Full Changelog: v0.12.0-beta.1...v0.12.0-beta.2

Contributors

ikreymer and benoit74


09 Oct 21:05

ikreymer

v0.12.0-beta.1

9ae297c

Browsertrix Crawler 0.12.0 Beta 1

Pre-release

What's Changed

Store crawler start and end times in Redis lists by @tw4l in #397
additional failure logic: by @ikreymer in #402
tests: disable ad-block tests: seeing inconsistent ci behavior by @ikreymer in #407
Fast cancelation + remove time counter by @ikreymer in #406

Full Changelog: v0.12.0-beta.0...v0.12.0-beta.1

Contributors

ikreymer and tw4l


02 Oct 21:40

ikreymer

v0.12.0-beta.0

f453dbf

Browsertrix Crawler 0.12.0 Beta 0

Pre-release

Switching to Brave from Chrome/Chromium!

What's Changed

Switch to Brave Base Image by @ikreymer in #400

Full Changelog: v0.11.2...v0.12.0-beta.0

Contributors

ikreymer


29 Sep 18:54

ikreymer

v0.11.2

4c7ebf1

Browsertrix Crawler 0.11.2

What's Changed

more logging improvements by @ikreymer in #389
additional fixes for worker getting stuck by @ikreymer in #396
Update README.md by @gitreich in #390
Set new logic for invalid seeds by @tw4l in #395

New Contributors

@gitreich made their first contribution in #390

Full Changelog: v0.11.1...v0.11.2

Contributors

ikreymer, tw4l, and gitreich


19 Sep 03:45

ikreymer

v0.11.1

c6cbbc1

Browsertrix Crawler 0.11.1

Bug Fix Release

Should fix a few issues related to crawls getting stuck and not continuing and/or screencast stopping after a while, including:

Detecting 'page crash' events and logging them
Detecting 'browser crash' events and interrupting crawl (after saving state / ensuring data is written to WARCs)

What's Changed

favicon: use 127.0.0.1 instead of localhost by @ikreymer in #384
Error handling fixes to avoid crawler getting stuck. by @ikreymer in #385
Update CI Release Action by @ikreymer in #386

Full Changelog: v0.11.0...v0.11.1

Contributors

ikreymer


15 Sep 18:28

ikreymer

v0.11.0

debfe89

Browsertrix Crawler 0.11.0

New Features

Store favicon urls as favIconUrl in pages.jsonl
Support for filtering sitemap by date (from specified date)
Link extraction optimizations
Behaviors now run only after the page is fully loaded and link extraction has finished; previously, autoplay/autofetch would start right away.
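
As a rough illustration of how the new `favIconUrl` field might be consumed, the sketch below reads `pages.jsonl` content line by line and collects the favicon URL for each page. Only the `favIconUrl` field name comes from the notes above. The `url` field, the presence of a header record on the first line, and the function name `favicons_by_page` are assumptions made for this example.

```python
import json

def favicons_by_page(jsonl_text):
    """Map page URL to favicon URL for records that carry both fields."""
    result = {}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumed field name "url"; records without it (e.g. a header) are skipped.
        url = record.get("url")
        favicon = record.get("favIconUrl")
        if url and favicon:
            result[url] = favicon
    return result

# Made-up sample data standing in for a real pages.jsonl file.
sample = "\n".join([
    json.dumps({"format": "json-pages-1.0", "id": "pages"}),
    json.dumps({"url": "https://example.com/",
                "favIconUrl": "https://example.com/favicon.ico"}),
    json.dumps({"url": "https://example.com/no-icon"}),
])

icons = favicons_by_page(sample)
```

Using `.get()` for both fields means pages crawled before this release, which lack `favIconUrl`, are skipped rather than raising an error.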

What's Changed

link extraction optimization: for scopeType page, set depth == extraH… by @ikreymer in #364
improve exit features: individual instance exit + exit code for interrupt by @ikreymer in #366
feat: precommit by @Chickensoupwithrice in #363
Capture Favicon by @Chickensoupwithrice in #362
logging: resolve confusion with 'crawl done' not being written to log… by @ikreymer in #375
logging fixes: avoid duplicate logging for same error by @ikreymer in #377
Surface lastmod option for sitemap parser by @ghukill in #367
Add example of mounting custom behaviours by @Chickensoupwithrice in #369
various fixes regarding state restart: by @ikreymer in #370
status: fix typo setting status to log message by @ikreymer in #379
Add option to output stats file live, i.e. after each page crawled by @benoit74 in #374
behavior logging tweaks, add netIdle by @ikreymer in #381
Update tldextract cache for pywb during build by @vnznznz in #383
Enhance file stats test to detect file modification by @benoit74 in #382
optimize link extraction: (fixes #376) by @ikreymer in #380

New Contributors

@Chickensoupwithrice made their first contribution in #363
@ghukill made their first contribution in #367
@benoit74 made their first contribution in #374
@vnznznz made their first contribution in #383

Full Changelog: v0.10.4...v0.11.0

Contributors

ikreymer, ghukill, and 3 other contributors


23 Aug 00:22

ikreymer

v0.10.4

cf404ef

Browsertrix Crawler 0.10.4

Bug fix release

What's Changed

args parsing: fix parseRx() for inclusions/exclusions to deal with no… by @ikreymer in #353
mark for upload-and-delete when crawl is interrupted for any limit: by @ikreymer in #354
improve crawl stopped check with unified isCrawlRunning() check with … by @ikreymer in #356

Full Changelog: v0.10.3...v0.10.4

Contributors

ikreymer


08 Aug 17:24

ikreymer

v0.10.3

16751de

Browsertrix Crawler 0.10.3

What's Changed

Fix for sizeLimit: only delete local data if a WACZ has been uploaded by @ikreymer in #347
seed parsing: return null if invalid url encountered in parseUrl to a… by @ikreymer in #349

Full Changelog: 0.10.2...v0.10.3

Contributors

ikreymer

