Releases: webrecorder/browsertrix-crawler
Browsertrix Crawler 0.7.0 Beta 5
What's Changed
- Interrupt Handling Fixes by @ikreymer in #167
- Update to Browsertrix Behaviors 0.3.4 - Fix for lazy-loaded images #165
Full Changelog: 0.7.0-beta.4...0.7.0-beta.5
Browsertrix Crawler 0.7.0 Beta 4
Fixes related to wait times, including:
- netIdleWait better defaults: if not set, set to 15 seconds for page/page-spa scope, otherwise to 2 seconds
- default behaviors: include autoscroll in default behavior as well
- restart: if crawl already done, don't attempt to crawl further. if 'waitOnDone' set, wait for signal before exiting.
- bump to puppeteer-core 17.1.2
Full Changelog: 0.7.0-beta.3...0.7.0-beta.4
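The netIdleWait default described above can be sketched as a small function. This is an illustrative sketch only, not the crawler's actual code: the function name resolve_net_idle_wait and the SINGLE_PAGE_SCOPES constant are hypothetical, and only the 15-second and 2-second values come from the notes.

```python
from typing import Optional

# Scope types that get the longer default, per the release note.
SINGLE_PAGE_SCOPES = {"page", "page-spa"}


def resolve_net_idle_wait(net_idle_wait: Optional[int], scope_type: str) -> int:
    """Return the network-idle wait in seconds.

    Hypothetical helper: an explicitly set value always wins; otherwise
    page/page-spa scopes default to 15 seconds and all other scopes to 2.
    """
    if net_idle_wait is not None:
        return net_idle_wait
    return 15 if scope_type in SINGLE_PAGE_SCOPES else 2
```

An explicit value of 0 is kept as-is in this sketch, since only an unset option falls back to the defaults.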
Browsertrix Crawler 0.7.0 Beta 3
What's Changed
- Overhaul of page concurrency system: better detection of windows that are stuck, only reuse same window for every 25 pages, #157
- Logging improvements: pywb.log written with --logging pywb, JS errors logged with --logging jserrors, #158
- Avoid getting stuck on pending requests at end of crawl: #161
- Update to Browsertrix Behaviors 0.3.3: Better Crawling of twitter and autoplay of videos
- Update to pywb 2.6.8: Includes better rewriting of embedded twitter videos.
Full Changelog: 0.7.0-beta.2...0.7.0-beta.3
Browsertrix Crawler 0.7.0 Beta 2
Fixes include:
- Default --waitUntil set to 'load' instead of 'load,networkidle2', due to occasional hanging waiting for both
- Add --netIdleWait to specify wait for network idle after load (defaults to 10 seconds)
- Update to puppeteer 16.1.0
- Logging: if pywb logging is enabled, write logs to collection dir ./logs/pywb.log and ./logs/redis.log
- Logging: reduce logging by not printing duplicate behavior status logs
- pywb/openssl: allow 'unsafe legacy renegotiation' to avoid errors capturing sites that use older ssl
Full Changelog: 0.7.0-beta.1...0.7.0-beta.2
Browsertrix Crawler 0.7.0 Beta 1
Update to Chrome/Chromium 101, using new browsertrix-browser-base image with browser installed. Also includes update to Ubuntu 22.04 and additional fonts.
Browsertrix Crawler 0.6.0
This release features additional improvements to support parallel crawls in Browsertrix Cloud:
- Add a --waitOnDone option, which has browsertrix crawler wait when finished (for use with Browsertrix Cloud)
- When running with redis shared state, set the :status field to running, failing/failed or done to let job controller know crawl is finished.
- Set redis state to failing in case of exception, set to failed in case of 3 or more failed exits within 60 seconds (but don't mark as failed if all pages are finished and >0 pages).
- When receiving a SIGUSR1, don't wait on done (assume final exit due to scale down).
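The failing/failed rule above can be sketched as a sliding-window check. This is a sketch under stated assumptions, not the crawler's implementation: the function and constant names are hypothetical, the threshold is read as "3 or more" failed exits, and exit timestamps are passed in as a plain list rather than read from redis.

```python
from typing import List

FAIL_WINDOW_SECONDS = 60  # window length from the release note
MAX_FAILED_EXITS = 3      # assumption: "3 or more" failed exits


def status_after_failure(exit_times: List[float], now: float,
                         pages_finished: int, pages_total: int) -> str:
    """Return 'failed' or 'failing' after the crawler exits with an error.

    exit_times holds the timestamps (in seconds) of failed exits,
    including the current one.
    """
    # Only failed exits inside the last 60 seconds count.
    recent = [t for t in exit_times if now - t <= FAIL_WINDOW_SECONDS]
    # If every page is finished and at least one page exists, never mark as failed.
    all_pages_done = pages_total > 0 and pages_finished == pages_total
    if len(recent) >= MAX_FAILED_EXITS and not all_pages_done:
        return "failed"
    return "failing"
```

For example, three failed exits at 0, 10 and 20 seconds with unfinished pages yield failed, while the same exits with all pages finished stay at failing.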
Screencasting Fixes:
- More efficient screencasting, don't end screencasting when page ends, only when target is destroyed!
- Keep same screencasting connection from one page to next, as targets are reused in 'window' concurrency mode
Crawl Limits (from 0.6.0 beta)
- Size limit (in bytes) via --sizeLimit
- Total time limit (in seconds) via --timeLimit
- Overwrite collection (delete existing) via --overwrite
- Fixes to interrupting a single instance in a shared state crawl
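As a rough illustration of how the size and time limits combine, a crawl loop could check both before fetching the next page. The helper below is a hypothetical sketch: the name limit_reached is not part of the crawler, a limit of 0 or None is assumed to mean "no limit", and the time limit is assumed to be measured in seconds.

```python
from typing import Optional


def limit_reached(bytes_written: int, elapsed_seconds: float,
                  size_limit: Optional[int] = None,
                  time_limit: Optional[float] = None) -> bool:
    """Return True once either configured crawl limit has been hit.

    Assumption: a limit of 0 or None disables that check.
    """
    if size_limit and bytes_written >= size_limit:
        return True
    if time_limit and elapsed_seconds >= time_limit:
        return True
    return False
```

Either limit alone is enough to stop the crawl in this sketch; the two are not required to be reached together.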
Profile Creation (from 0.6.0 beta)
- force all cookies, including session cookies, to fixed duration in days, configurable via --cookieDays
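The effect of the cookieDays option can be pictured as rewriting every cookie's expiry to a fixed offset from the current time. The sketch below is illustrative only: the function name, the dict-based cookie shape with an expires field in Unix seconds, and the use of -1 for session cookies are all assumptions, not the crawler's actual data model.

```python
import time
from typing import List, Optional

SECONDS_PER_DAY = 86400


def force_cookie_duration(cookies: List[dict], cookie_days: int,
                          now: Optional[float] = None) -> List[dict]:
    """Return copies of the cookies that all expire cookie_days from now.

    Session cookies (assumed here to carry expires == -1) receive the same
    fixed expiry as persistent ones. The input list is not modified.
    """
    now = time.time() if now is None else now
    expires = now + cookie_days * SECONDS_PER_DAY
    return [{**cookie, "expires": expires} for cookie in cookies]
```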
Browsertrix Crawler 0.6.0-beta.1
Additional crawl limits:
- Size limit (in bytes) via --sizeLimit
- Total time limit (in seconds) via --timeLimit
- Overwrite collection (delete existing) via --overwrite
- Fixes to interrupting a single instance in a shared state crawl
Improved profile creation:
- Additional API
- ability to shutdown profile browser if no incoming pings (via --shutdownWait)
- force all cookies to fixed duration (via --cookieDays)
Browsertrix Crawler 0.5.1
Changes
Dependency Update Release:
- update pywb to 2.6.7, fix possible CDX indexing error via --generateCDX
- update wacz to 0.4.6, ensure wacz file is closed, with better and more error-resilient text extraction
- update browsertrix-behaviors to 0.3.0, support for Telegram auto-scrolling behavior (and generic 'autoscroll up' support)
Full Changelog: 0.5.0...0.5.1
Browsertrix Crawler 0.5.0
Changes and Features
- Scope: support for scopeType: domain to include all subdomains and ignoring 'www.' if specified in the seed.
- Profiles: support loading remote profile from URL as well as local file
- Non-HTML Pages: Load non-200 responses in browser, even if non-html, fix waiting issues with non-HTML pages (eg. PDFs)
- Config options: Fix setting user-agent
- Page behavior: latest browsertrix-behaviors, also add experimental Cloudflare interstitial wait.
- Error handling: better error handling for redis errors
- State: Support loading of crawl state from config.yaml
- State: Support serialization of crawl state to crawls subdirectory, both while running (keeping last N states) and on exit.
- State: Graceful saving of crawl state on ctrl+c interrupt
- State: Memory or Redis based crawl state
- Config: Support additional options via CRAWL_ARGS environment variable
- WACZ Upload: Support for S3 upload of WACZ upon crawl completion
- WACZ Upload: HTTP/Redis webhook to notify of upload completion
- Crawl Scope: Support for extraHops to optionally crawl an extra hop beyond scope
- Signing: Support for optional signing of WACZ
- Dependencies: update to latest pywb, wacz and browsertrix-behaviors packages
- Params: Support custom browser args
- Docs: Improve customization documentation
New Contributors
- @CreativeCactus made their first contribution in #96
- @asameshimae made their first contribution in #121
- @simonwiles made their first contribution in #122
- @phiresky made their first contribution in #120
Full Changelog: 0.4.4...0.5.0
Browsertrix Crawler 0.5.0 Beta 8
This release includes fixes for:
- Improved capture of non-HTML pages, fixes #129
- For scopeType: domain, if specified URL starts with www., include the non-www version.
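The domain scope behavior can be illustrated with a simple host comparison. This is a minimal sketch, not how the crawler builds its scope rules: the function names are hypothetical, and matching is done on hostnames only.

```python
from urllib.parse import urlsplit


def domain_scope_host(seed_url: str) -> str:
    """Return the seed's hostname with a leading 'www.' removed."""
    host = urlsplit(seed_url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def in_domain_scope(seed_url: str, candidate_url: str) -> bool:
    """Check whether candidate_url is on the seed's domain or any subdomain."""
    base = domain_scope_host(seed_url)
    host = urlsplit(candidate_url).hostname or ""
    # Match the bare domain itself, or any host ending in ".<domain>".
    return bool(base) and (host == base or host.endswith("." + base))
```

With a seed of https://www.example.com/, both https://example.com/a and https://blog.example.com/ are in scope under this sketch, while https://notexample.com/ is not.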