Releases · webrecorder/browsertrix-crawler

netIdleWait: better defaults. If not set, it defaults to 15 seconds for page/page-spa scope and to 2 seconds otherwise.
Default behaviors: autoscroll is now included in the default behaviors as well.
Restart: if the crawl is already done, don't attempt to crawl further. If 'waitOnDone' is set, wait for a signal before exiting.
bump to puppeteer-core 17.1.2
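
The netIdleWait default rule above can be sketched as a small function. This is only an illustration of the rule as stated in the note, not the crawler's actual code; the function name, parameter names, and the `LONG_WAIT_SCOPES` constant are hypothetical.

```python
from typing import Optional

# Scope types that get the longer default, per the release note.
LONG_WAIT_SCOPES = {"page", "page-spa"}


def default_net_idle_wait(scope_type: str, net_idle_wait: Optional[int] = None) -> int:
    """Return the network-idle wait in seconds (hypothetical helper)."""
    if net_idle_wait is not None:
        # An explicitly configured value always wins over the defaults.
        return net_idle_wait
    # Single-page scopes get 15 seconds; every other scope gets 2 seconds.
    return 15 if scope_type in LONG_WAIT_SCOPES else 2
```

An explicit value of 0 is treated as "set" here, which is why the check is against `None` rather than truthiness.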

Full Changelog: 0.7.0-beta.3...0.7.0-beta.4

03 Sep 01:06

ikreymer

0.7.0-beta.3

5c93127

Browsertrix Crawler 0.7.0 Beta 3 Pre-release

What's Changed

Overhaul of page concurrency system: better detection of stuck windows, and the same window is reused for only 25 pages, #157
Logging improvements: pywb.log written with --logging pywb, JS errors logged with --logging jserrors #158
Avoid getting stuck on pending requests at end of crawl: #161
Update to Browsertrix Behaviors 0.3.3: better crawling of Twitter and autoplay of videos
Update to pywb 2.6.8: includes better rewriting of embedded Twitter videos.

Full Changelog: 0.7.0-beta.2...0.7.0-beta.3

18 Aug 05:23

ikreymer

0.7.0-beta.2

827c153

Browsertrix Crawler 0.7.0 Beta 2 Pre-release

Fixes include:

Default --waitUntil is now load instead of load,networkidle2, because waiting for both occasionally hung
Add --netIdleWait to specify wait for network idle after load (defaults to 10 seconds)
Update to puppeteer 16.1.0
Logging: if pywb logging is enabled, write logs to collection dir ./logs/pywb.log and ./logs/redis.log
Logging: reduce logging by not printing duplicate behavior status logs
pywb/openssl: allow 'unsafe legacy renegotiation' to avoid errors when capturing sites that use older SSL

Full Changelog: 0.7.0-beta.1...0.7.0-beta.2

03 Jul 18:12

ikreymer

0.7.0-beta.1

bd10f1a

Browsertrix Crawler 0.7.0 Beta 1 Pre-release

Update to Chrome/Chromium 101, using new browsertrix-browser-base image with browser installed. Also includes update to Ubuntu 22.04 and additional fonts.

17 Jun 20:12

ikreymer

0.6.0

cf90304

Browsertrix Crawler 0.6.0

This release features additional improvements to support parallel crawls in Browsertrix Cloud:

Add a --waitOnDone option, which has browsertrix crawler wait when finished (for use with Browsertrix Cloud)
When running with Redis shared state, set the :status field to running, failing/failed or done to let the job controller know the crawl is finished.
Set the Redis state to failing in case of an exception, and to failed in case of 3 or more failed exits within 60 seconds (but don't mark as failed if all pages are finished and there are more than 0 pages).
When receiving a SIGUSR1, don't wait on done (assume final exit due to scale down).
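
The failing/failed rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the crawler's implementation: the function and parameter names are hypothetical, and the threshold is assumed to mean "3 or more" failed exits, since the note's wording leaves that ambiguous.

```python
from typing import List


def crawl_status(failed_exit_times: List[float], now: float,
                 pages_done: int, pages_total: int,
                 max_failures: int = 3, window_secs: float = 60.0) -> str:
    """Pick the status to report after an exception (hypothetical helper)."""
    # Count only the failed exits that fall inside the time window.
    recent = [t for t in failed_exit_times if now - t <= window_secs]
    # "All pages are finished and >0 pages": a non-empty crawl that completed.
    all_pages_finished = pages_total > 0 and pages_done >= pages_total
    if len(recent) >= max_failures and not all_pages_finished:
        return "failed"
    # Otherwise the crawl is only marked as failing and may be retried.
    return "failing"
```

Timestamps are plain seconds here; in a shared-state crawl they would presumably live in Redis alongside the :status field, which this sketch leaves out.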

Screencasting Fixes:

More efficient screencasting: don't end screencasting when the page ends, only when the target is destroyed.
Keep the same screencasting connection from one page to the next, as targets are reused in 'window' concurrency mode

Crawl Limits (from 0.6.0 beta)

Size limit (in bytes) via --sizeLimit
Total time limit (in seconds) via --timeLimit
Overwrite collection (delete existing) via --overwrite
Fixes to interrupting a single instance in a shared state crawl

Profile Creation (from 0.6.0 beta)

force all cookies, including session cookies, to fixed duration in days, configurable via --cookieDays

19 May 06:45

ikreymer

0.6.0-beta.1

70ba924

Browsertrix Crawler 0.6.0-beta.1 Pre-release

Additional crawl limits:

Size limit (in bytes) via --sizeLimit
Total time limit (in seconds) via --timeLimit
Overwrite collection (delete existing) via --overwrite
Fixes to interrupting a single instance in a shared state crawl

Improved profile creation:

Additional API
ability to shutdown profile browser if no incoming pings (via --shutdownWait)
force all cookies to fixed duration (via --cookieDays)

16 Apr 03:14

ikreymer

0.5.1

5dfbfbe

Browsertrix Crawler 0.5.1

Changes

Dependency Update Release:

update pywb to 2.6.7, fix possible error in CDX indexing via --generateCDX
update wacz to 0.4.6, ensure the WACZ file is closed, with better and more error-resilient text extraction
update browsertrix-behaviors to 0.3.0, support for Telegram auto-scrolling behavior (and generic 'autoscroll up' support)

Full Changelog: 0.5.0...0.5.1

11 Apr 22:27

ikreymer

0.5.0

9b93830

Browsertrix Crawler 0.5.0

Changes and Features

Scope: support for scopeType: domain, which includes all subdomains and ignores 'www.' if specified in the seed.
Profiles: support loading remote profile from URL as well as local file
Non-HTML Pages: Load non-200 responses in browser, even if non-HTML; fix waiting issues with non-HTML pages (e.g. PDFs)
Config options: Fix setting user-agent
Page behavior: latest browsertrix-behaviors, also add experimental Cloudflare interstitial wait.
Error handling: better error handling for redis errors
State: Support loading of crawl state from config.yaml
State: Support serialization of crawl state to crawls subdirectory, both while running (keeping last N states) and on exit.
State: Graceful saving of crawl state on ctrl+c interrupt
State: Memory or Redis based crawl state
Config: Support additional options via CRAWL_ARGS environment variable
WACZ Upload: Support for S3 upload of WACZ upon crawl completion
WACZ Upload: HTTP/Redis webhook to notify of upload completion
Crawl Scope: Support for extraHops to optionally crawl an extra hop beyond scope
Signing: Support for optional signing of WACZ
Dependencies: update to latest pywb, wacz and browsertrix-behaviors packages
Params: Support custom browser args
Docs: Improve customization documentation

New Contributors

@CreativeCactus made their first contribution in #96
@asameshimae made their first contribution in #121
@simonwiles made their first contribution in #122
@phiresky made their first contribution in #120

Full Changelog: 0.4.4...0.5.0

Contributors

simonwiles, phiresky, and 2 other contributors

23 Mar 01:08

ikreymer

0.5.0-beta.8

7ed5586

Browsertrix Crawler 0.5.0 Beta 8 Pre-release

Pre-release

This release includes the following fixes:

Improved capture of non-HTML pages, fixes #129
For scopeType: domain, if specified URL starts with www., include the non-www version.
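
The www. handling for scopeType: domain can be illustrated with a short host-matching sketch. It assumes the scope check reduces to comparing hostnames and is not taken from the crawler's source; `base_domain` and `in_domain_scope` are hypothetical names.

```python
from urllib.parse import urlsplit


def base_domain(seed_url: str) -> str:
    """Return the seed's hostname with a leading 'www.' removed."""
    host = (urlsplit(seed_url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def in_domain_scope(seed_url: str, url: str) -> bool:
    """True if url is on the seed's domain or any of its subdomains."""
    base = base_domain(seed_url)
    host = (urlsplit(url).hostname or "").lower()
    # Match the bare domain itself, or any host ending in ".<domain>".
    return bool(base) and (host == base or host.endswith("." + base))
```

With a seed of https://www.example.com/, this accepts both example.com and blog.example.com, while a look-alike host such as notexample.com is rejected because the suffix check requires a leading dot.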
