Releases: webrecorder/browsertrix-crawler
Browsertrix Crawler 0.12.1
Fixes
- Optimize exclusion removal, follow-up to #408
- Fix regression with --text false being rejected while in use with Browsertrix Cloud (see: webrecorder/browsertrix#1334)
What's Changed
- Exclusion Filtering Optimizations: check exclusion before loading new page + additional improvements by @ikreymer in #423
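To illustrate the idea of checking exclusions before a page is loaded, here is a minimal sketch in Python. It is not the crawler's actual implementation: the function names `is_excluded` and `next_allowed_url`, the in-memory queue, and the sample patterns are all hypothetical, and the real crawler keeps its queue in Redis.

```python
import re
from collections import deque

def is_excluded(url, exclusions):
    """Return True if any exclusion regex matches the URL."""
    return any(rx.search(url) for rx in exclusions)

def next_allowed_url(queue, exclusions):
    """Pop URLs until one passes the exclusion check.

    Excluded URLs are dropped without ever being loaded in the browser.
    Returns None when the queue is exhausted.
    """
    while queue:
        url = queue.popleft()
        if not is_excluded(url, exclusions):
            return url
    return None

# Hypothetical exclusion patterns and queued URLs, for illustration only.
exclusions = [re.compile(r"/login"), re.compile(r"\?sort=")]
queue = deque([
    "https://example.com/login",
    "https://example.com/list?sort=asc",
    "https://example.com/about",
])
print(next_allowed_url(queue, exclusions))
```

Because the check runs against the current exclusion list each time a URL is dequeued, an exclusion added while the crawl is running would take effect on the next dequeue in this sketch.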
Full Changelog: v0.12.0...v0.12.1
Browsertrix Crawler 0.12.0
Major Changes
- Use the same version of Brave for the base image, instead of slightly different Chrome (amd64) and Chromium (arm64)
- Support for faster cancelation of crawl via Redis key + signal
- Include CRC32 in storage webhook for nested WACZ support
- Dynamic exclusion addition/queue filter/removal via redis message queue
- Text extraction stored in WARC records (both initial and final page after behaviors) with new --text options
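For readers who want to verify the CRC32 value reported for an uploaded WACZ, the following is a minimal sketch of computing a CRC32 over a file in chunks using Python's standard `zlib` module. The function name `file_crc32` and the chunk size are assumptions made for illustration; the exact format in which the crawler reports the value in the webhook is not specified here.

```python
import zlib

def file_crc32(path, chunk_size=1 << 20):
    """Return the CRC32 of a file as an unsigned 32-bit integer.

    The file is read in chunks so large archives do not need to fit in memory.
    """
    crc = 0
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            # zlib.crc32 accepts a running value, so chunks can be chained.
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF
```

A value computed this way could then be compared against the one received in the webhook payload, assuming both sides use the standard CRC32 polynomial.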
What's Changed
- Switch to Brave Base Image by @ikreymer in #400
- Store crawler start and end times in Redis lists by @tw4l in #397
- additional failure logic: by @ikreymer in #402
- tests: disable ad-block tests: seeing inconsistent ci behavior by @ikreymer in #407
- Fast cancelation + remove time counter by @ikreymer in #406
- disable component updates by setting --component-updater to invalid URL by @ikreymer in #413
- storage: also compute crc32 as part of storage webhook when uploading… by @ikreymer in #414
- Support adding/removing exclusions without restarting the crawler by @ikreymer in #408
- load saved state fixes + redis tests by @ikreymer in #415
- Return User-Agent on all code path to set headers appropriately by @benoit74 in #420
- improved text extraction: (addresses #403) by @ikreymer in #404
- More flexible multi value arg parsing + README update for 0.12.0 by @ikreymer in #422
Full Changelog: v0.11.2...v0.12.0
Browsertrix Crawler 0.12.0 Beta 2
What's Changed
- disable component updates by setting --component-updater to invalid URL by @ikreymer in #413
- storage: also compute crc32 as part of storage webhook when uploading… by @ikreymer in #414
- Support adding/removing exclusions without restarting the crawler by @ikreymer in #408
- load saved state fixes + redis tests by @ikreymer in #415
- Return User-Agent on all code path to set headers appropriately by @benoit74 in #420
Full Changelog: v0.12.0-beta.1...v0.12.0-beta.2
Browsertrix Crawler 0.12.0 Beta 1
What's Changed
- Store crawler start and end times in Redis lists by @tw4l in #397
- additional failure logic: by @ikreymer in #402
- tests: disable ad-block tests: seeing inconsistent ci behavior by @ikreymer in #407
- Fast cancelation + remove time counter by @ikreymer in #406
Full Changelog: v0.12.0-beta.0...v0.12.0-beta.1
Browsertrix Crawler 0.12.0 Beta 0
Browsertrix Crawler 0.11.2
Browsertrix Crawler 0.11.1
Bug Fix Release
Should fix a few issues related to crawls getting stuck and not continuing and/or screencast stopping after a while, including:
- Detecting 'page crash' events and logging them
- Detecting 'browser crash' events and interrupting crawl (after saving state / ensuring data is written to WARCs)
What's Changed
- favicon: use 127.0.0.1 instead of localhost by @ikreymer in #384
- Error handling fixes to avoid crawler getting stuck. by @ikreymer in #385
- Update CI Release Action by @ikreymer in #386
Full Changelog: v0.11.0...v0.11.1
Browsertrix Crawler 0.11.0
New Features
- Store favicon urls as favIconUrl in pages.jsonl
- Support for filtering sitemap by date (from specified date)
- Link extraction optimizations
- Behaviors only run after page is fully loaded and links extraction has finished, previously autoplay/autofetch would start right away.
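As a rough illustration of how the new favIconUrl field might be consumed, the sketch below reads JSON Lines records and collects the favicon URL for each page. Only the `favIconUrl` key comes from the notes above; the `url` key, the helper name `favicon_urls`, and the sample records are assumptions, and records without a favicon (such as a header line) are skipped.

```python
import json

def favicon_urls(lines):
    """Yield (page url, favicon url) for each JSONL record that has a favIconUrl."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "favIconUrl" in record:
            # "url" is an assumed field name for the page address.
            yield record.get("url"), record["favIconUrl"]

# Hypothetical sample records, for illustration only.
sample = [
    '{"format": "json-pages-1.0"}',
    '{"url": "https://example.com/", "favIconUrl": "https://example.com/favicon.ico"}',
    '{"url": "https://example.com/no-icon"}',
]
for page, icon in favicon_urls(sample):
    print(page, icon)
```

To read a real crawl output, the same function can be passed an open file object, since iterating a text file yields one line at a time.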
What's Changed
- link extraction optimization: for scopeType page, set depth == extraH… by @ikreymer in #364
- improve exit features: individual instance exit + exit code for interrupt by @ikreymer in #366
- feat: precommit by @Chickensoupwithrice in #363
- Capture Favicon by @Chickensoupwithrice in #362
- logging: resolve confusion with 'crawl done' not being written to log… by @ikreymer in #375
- logging fixes: avoid duplicate logging for same error by @ikreymer in #377
- Surface lastmod option for sitemap parser by @ghukill in #367
- Add example of mounting custom behaviours by @Chickensoupwithrice in #369
- various fixes regarding state restart: by @ikreymer in #370
- status: fix typo setting status to log message by @ikreymer in #379
- Add option to output stats file live, i.e. after each page crawled by @benoit74 in #374
- behavior logging tweaks, add netIdle by @ikreymer in #381
- Update tldextract cache for pywb during build by @vnznznz in #383
- Enhance file stats test to detect file modification by @benoit74 in #382
- optimize link extraction: (fixes #376) by @ikreymer in #380
New Contributors
- @Chickensoupwithrice made their first contribution in #363
- @ghukill made their first contribution in #367
- @benoit74 made their first contribution in #374
- @vnznznz made their first contribution in #383
Full Changelog: v0.10.4...v0.11.0
Browsertrix Crawler 0.10.4
Bug fix release
What's Changed
- args parsing: fix parseRx() for inclusions/exclusions to deal with no… by @ikreymer in #353
- mark for upload-and-delete when crawl is interrupted for any limit: by @ikreymer in #354
- improve crawl stopped check with unified isCrawlRunning() check with … by @ikreymer in #356
Full Changelog: v0.10.3...v0.10.4
Browsertrix Crawler 0.10.3
What's Changed
- Fix for sizeLimit: only delete local data if a WACZ has been uploaded by @ikreymer in #347
- seed parsing: return null if invalid url encountered in parseUrl to a… by @ikreymer in #349
Full Changelog: 0.10.2...v0.10.3