WARC filename prefix + rollover size + improved 'livestream' / truncated response support. #440
Conversation
- add a size to takeStream to ensure data is streamed in fixed chunks
- fix adding WARC-Truncated header; need to check after stream is finished
- ensure WARC is not rewritten after it is done; skip writing records if stream already flushed
- first close browser, then finish writing WARC to ensure any truncated responses are captured
- add timeout to fetcherQ to avoid hanging forever on finish

- create new filename from template, interpolating '$ts'
- add --warcPrefix arg for setting WARC filename prefix (defaults to 'rec')
- create new WARC file when offset exceeds rollover size, configured via --rolloverSize
Will add some tests for rollover and prefix; truncated / livestream responses are a bit trickier to test.
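As a rough illustration of the filename templating described in the bullets above, here is a minimal TypeScript sketch; the helper names (createWARCFilename, timestampNow) and the default prefix handling are illustrative assumptions, not the PR's exact code:

```ts
// Hypothetical sketch of the "$ts" filename templating; names are illustrative.
function timestampNow(): string {
  // 14-digit WARC-style timestamp, e.g. "20230615123045"
  return new Date().toISOString().replace(/[^\d]/g, "").slice(0, 14);
}

function createWARCFilename(
  warcPrefix: string,  // e.g. value of --warcPrefix, defaulting to "rec"
  crawlId: string,
  workerId: number,
  gzip: boolean,
): string {
  // "$ts" is interpolated at file-creation time with the current timestamp
  const template = `${warcPrefix}-${crawlId}-$ts-${workerId}.warc${gzip ? ".gz" : ""}`;
  return template.replace("$ts", timestampNow());
}

// e.g. createWARCFilename("rec", "crawl-1", 0, true)
//   -> "rec-crawl-1-20230615123045-0.warc.gz"
```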
A few comments (I feel most strongly about making the stream size configurable; the others you could take or leave), but overall LGTM! Tested against a few sites and it's working beautifully.
@@ -30,6 +30,8 @@ const MAX_BROWSER_TEXT_FETCH_SIZE = 25_000_000;

const MAX_NETWORK_LOAD_SIZE = 200_000_000;

const TAKE_STREAM_BUFF_SIZE = 1024 * 64;
Might be good to make this configurable via args so people can fetch larger streams if they want?
Oh, this isn't a limit, just the size of the buffer that is fetched at a time, streaming in 64k chunks. It is then either stored in memory or flushed to disk (if the MAX_BROWSER_* total size is reached).
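For illustration, a minimal sketch of reading a response body in fixed-size chunks as described above; takeChunks and BUFF_SIZE are illustrative names, not the PR's actual takeStream API:

```ts
// Sketch only: stream a Node.js Readable in fixed 64k chunks.
import { Readable } from "stream";

const BUFF_SIZE = 1024 * 64; // stream in 64k chunks

async function* takeChunks(source: Readable, size = BUFF_SIZE): AsyncGenerator<Buffer> {
  let pending = Buffer.alloc(0);
  for await (const data of source) {
    pending = Buffer.concat([pending, Buffer.from(data)]);
    // emit full-size chunks as soon as enough bytes have accumulated
    while (pending.length >= size) {
      yield pending.subarray(0, size);
      pending = pending.subarray(size);
    }
  }
  // flush whatever remains at end of stream
  if (pending.length > 0) {
    yield pending;
  }
}
```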
Right, BUFF_SIZE! So the limit is just the appropriate timeouts?
Those limits are for how much can be kept in memory before being buffered to disk. However, only content that is buffered in memory can be processed through the rewriting system and served back to the browser. The limits for text are higher because it might be rewritten. There is no limit (beyond the disk volume size, of course) for a response that is streamed and buffered to disk; however, such a response is not served back to the browser. In the case of live-streamed audio/video, this means the data is not actually played in the browser during capture, but it is written to disk, which should be fine (the only issues might be with screenshot comparison...).
Probably should explain this better somewhere!
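A hedged sketch of the buffering policy described in the comment above: keep small bodies in memory so they can be rewritten and served back, and spill large ones to disk. The threshold constant and temp-file layout here are assumptions for illustration, not the crawler's actual implementation:

```ts
// Sketch only: buffer small responses in memory, spill large ones to a temp file.
import fs from "fs";
import os from "os";
import path from "path";

const MAX_IN_MEMORY_SIZE = 200_000_000; // assumed threshold, akin to MAX_NETWORK_LOAD_SIZE

async function bufferResponse(chunks: AsyncIterable<Buffer>) {
  const inMemory: Buffer[] = [];
  let total = 0;
  let tmpFile: string | null = null;
  let fileStream: fs.WriteStream | null = null;

  for await (const chunk of chunks) {
    total += chunk.length;
    if (!fileStream && total <= MAX_IN_MEMORY_SIZE) {
      inMemory.push(chunk); // small enough: keep in memory (can be rewritten/served back)
    } else {
      if (!fileStream) {
        // threshold exceeded: flush what we have to a temp file and keep streaming
        tmpFile = path.join(os.tmpdir(), `dl-${Date.now()}.tmp`);
        fileStream = fs.createWriteStream(tmpFile);
        for (const buffered of inMemory) fileStream.write(buffered);
        inMemory.length = 0;
      }
      fileStream.write(chunk);
    }
  }
  fileStream?.end();
  // disk-buffered bodies are only written to the WARC, not served back to the browser
  return tmpFile ? { file: tmpFile, size: total } : { body: Buffer.concat(inMemory), size: total };
}
```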
Support for rollover size and custom WARC prefix templates:
- WARC filenames are generated from the template ${prefix}-${crawlId}-$ts-${this.workerid}.warc${this.gzip ? ".gz" : ""}, with $ts replaced at new-file creation time with the current timestamp

Improved support for long (non-terminating) responses, such as from live-streaming:
- WARC-Truncated header: need to set after the stream is finished to determine if it's been truncated
- tmp-dl dir moved to main temp folder, outside of the collection (no need for it to be there)
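A minimal sketch of the rollover behavior described above, assuming an illustrative WARCRollover wrapper rather than the PR's actual writer class; the constructor parameters stand in for --rolloverSize and the filename templating:

```ts
// Sketch only: open a new WARC file once the current one exceeds the rollover size.
import fs from "fs";

class WARCRollover {
  private offset = 0;
  private stream: fs.WriteStream;

  constructor(
    private rolloverSize: number,       // e.g. value of --rolloverSize
    private nextFilename: () => string, // interpolates "$ts" etc. at creation time
  ) {
    this.stream = fs.createWriteStream(this.nextFilename());
  }

  writeRecord(record: Buffer) {
    // start a fresh WARC before writing if the previous one grew past the limit
    if (this.offset >= this.rolloverSize) {
      this.stream.end();
      this.stream = fs.createWriteStream(this.nextFilename());
      this.offset = 0;
    }
    this.stream.write(record);
    this.offset += record.length;
  }
}
```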