Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WARC filename prefix + rollover size + improved 'livestream' / truncated response support. #440

Merged
merged 4 commits into from
Dec 8, 2023

Conversation

ikreymer
Copy link
Member

@ikreymer ikreymer commented Nov 16, 2023

Support for rollover size and custom WARC prefix templates:

  • reenable --rolloverSize (default to 1GB) for when a new WARC is created
  • support custom WARC prefix via --warcPrefix, prepended to new WARC filename
  • filename template for new files is: ${prefix}-${crawlId}-$ts-${this.workerid}.warc${his.gzip ? ".gz" : ""} with $ts replaced at new file creation time with current timestamp

Improved support for long (non-terminating) responses, such as from live-streaming:

  • add a size to CDP takeStream to ensure data is streamed in fixed chunks, defaulting to 64k
  • change shutdown order: first close browser, then finish writing all WARCs to ensure any truncated responses can be captured.
  • ensure WARC is not rewritten after it is done, skip writing records if stream already flushed
  • add timeout to final fetch tasks to avoid never hanging on finish
  • fix adding WARC-Truncated header, need to set after stream is finished to determine if its been truncated
  • move temp download tmp-dl dir to main temp folder, outside of collection (no need to be there).

- add a size to takeStream to ensure data is streamed in fixed chunks
- fix adding WARC-Truncated header, need to check after stream is finished
- ensure WARC is not rewritten after it is done, skip writing records if stream already flushed
- first close browser, then finish writing WARC to ensure any truncated responses are captured
- add timeout to fetcherQ to avoid never hanging on finish
- create new filename from template, interpolating '$ts'
- add --warcPrefix arg for setting WARC filename prefix (default to 'rec')
- create new WARC file when offset exceeds rollover size, configured via --rolloverSize
@ikreymer
Copy link
Member Author

Will add some tests for rollover and prefix, truncated / livestream response a bit more tricky.

@ikreymer ikreymer requested a review from tw4l November 16, 2023 07:09
Copy link
Member

@tw4l tw4l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments (I feel most strongly about making the stream size configurable, others could take or leave) but overall LGTM! Tested against a few sites and working beautifully

src/util/recorder.ts Show resolved Hide resolved
src/util/recorder.ts Show resolved Hide resolved
@@ -30,6 +30,8 @@ const MAX_BROWSER_TEXT_FETCH_SIZE = 25_000_000;

const MAX_NETWORK_LOAD_SIZE = 200_000_000;

const TAKE_STREAM_BUFF_SIZE = 1024 * 64;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to make this configurable via args so people can fetch larger streams if they want?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this isn't a limit, just size of buffer that is fetched at a time, streaming in 64k chunks. It is then either stored in memory or flushed to disk (if the the MAX_BROWSER_* total size is reached)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, BUFF_SIZE! So the limit is just the appropriate timeouts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those limits are for how much could be kept in memory before being buffered to disk. However, only content that is buffered in memory can be processed through the rewriting system and served back to browser. The limits for text are higher because it might be rewritten. There is no limit (beyond the disk volume size of course) for response that is streamed and buffered to disk, however, it is not served back to the browser. In case of a live-stream audio/video, the data is not actually played in the browser as a result during capture, but is written to disk, which should be fine (only issues might be for screenshot comparison...)

Probably should explain this better somewhere!

@ikreymer ikreymer merged commit 3323262 into dev-1.0.0 Dec 8, 2023
4 checks passed
@ikreymer ikreymer deleted the stream-warc-and-rollover branch December 8, 2023 07:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants