Could there be a way to create warcs with certain size after one RUN (combinewarc / rolloversize...) #617

ssairanen · 2024-06-19T12:29:07Z

--combineWARC seems to combine all the WARCs in the folder after a run. Is there no way to create size-limited WARCs from a single run only?

For example: if one has a crawl running daily and all of its WARCs total 20TB, then setting combineWARC: true suddenly makes browsertrix create another 20TB of WARCs next to the original crawl folder. The WARCs are read here:

browsertrix-crawler/src/crawler.ts

Line 2329 in 6329b19

const warcLists = await fsp.readdir(this.archivesDir);

Could there be a way to combine all of the worker outputs into WARCs of a fixed size, but only for one run? rolloverSize: 100000000 does not do this, as the WARCs (the worker outputs) might be anything from 1MB to 100MB. I want every WARC to be 100MB, except the last one, which will obviously be smaller.


ikreymer · 2024-06-19T16:54:00Z

Not quite sure what you mean - using --rolloverSize + --combineWARC together should work as you describe. The combineWARC operation combines all the WARCs in a collection folder after each crawl, up to the rollover size. The rollover size is also applied to the individual WARCs.
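To make "combines up to the rollover size" concrete, here is a minimal sketch of one way such grouping could work. It is not the crawler's actual implementation: `WarcFile` and `planCombinedWarcs` are hypothetical names, and the sketch assumes whole input files are appended to a combined WARC until the next one would push it past the limit.

```typescript
// Hypothetical helper, NOT Browsertrix Crawler's real code: greedily groups
// per-worker WARC files into combined outputs capped at rolloverSize bytes.
interface WarcFile {
  name: string;
  size: number; // size in bytes
}

function planCombinedWarcs(
  files: WarcFile[],
  rolloverSize: number,
): WarcFile[][] {
  const groups: WarcFile[][] = [];
  let current: WarcFile[] = [];
  let currentSize = 0;

  for (const file of files) {
    // Start a new combined WARC when adding this file would exceed the cap.
    if (current.length > 0 && currentSize + file.size > rolloverSize) {
      groups.push(current);
      current = [];
      currentSize = 0;
    }
    // Assumption: input files are never split, so a single file larger
    // than the cap still ends up in a group of its own.
    current.push(file);
    currentSize += file.size;
  }

  if (current.length > 0) {
    groups.push(current);
  }
  return groups;
}
```

Under this assumption the combined WARCs come out at or below the rollover size rather than at exactly that size, because inputs are only ever appended whole.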

steph-nb · 2024-06-21T08:34:42Z

Same question for the use of several crawlers in Browsertrix:
How could an overall max size be configured, so that the output is x slices of max size and only one smaller WARC?
(To my understanding, --rolloverSize + --combineWARC are applied to each crawler individually.)

ikreymer · 2024-06-25T02:24:33Z

> same question for the use of several crawlers in browsertrix: How could an overall max size be configured, which generates x slices of max size and only one smaller WARC? (to my understanding --rolloverSize + --combineWARC are used for crawlers individually)

I believe that this is how it should work if you use both of those flags. The --rolloverSize applies to the individual WARCs, and --combineWARC then combines them so that they are all up to the rollover size, with only one smaller WARC. Is this not working correctly?

ssairanen · 2024-06-25T07:31:35Z

What I basically meant was: the crawler writes WARCs of whatever size to the archive/ folder, then goes up one level (/..) and creates WARCs of a fixed size in that folder. Now we have the archive/ folder with the original WARCs, plus the combineWARC output one level up, which means 2x the space.

If, for example, one turns on the combineWARC option for a daily crawl that has been creating WARCs for a while, combineWARC will take all of the past WARCs into account when combining (it's fun when you have 10TB of WARCs in the archive/ folder ...). There is no option to get neatly sized WARCs from one run only, in the same folder next to the output of another run of the same crawl.

ikreymer · 2024-06-25T18:31:43Z

> What I basically meant was: the crawler writes WARCs of whatever size to the archive/ folder, then goes up one level (/..) and creates WARCs of a fixed size in that folder. Now we have the archive/ folder with the original WARCs, plus the combineWARC output one level up, which means 2x the space.
>
> If, for example, one turns on the combineWARC option for a daily crawl that has been creating WARCs for a while, combineWARC will take all of the past WARCs into account when combining (it's fun when you have 10TB of WARCs in the archive/ folder ...). There is no option to get neatly sized WARCs from one run only, in the same folder next to the output of another run of the same crawl.

There is no concept of distinct crawl 'runs' in Browsertrix Crawler - it is assumed that repeated crawls may be part of the same crawl, e.g. if a crawl is interrupted/restarted. If you want to separate crawls by day, my suggestion would be to use --collection my-crawl-YYYY-MM-DD and crawl into a new directory for each day, and use --combineWARC and --rolloverSize with these crawls. Or, to put it another way, all WARCs in ./collections/<name>/archive are assumed to be part of the same crawl - having different directories allows you to isolate and group the WARCs from that crawl only.
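As an illustration of the per-day collection suggestion, the following sketch builds the three flags mentioned above for a given date. `dailyCrawlArgs` is a hypothetical helper and not part of Browsertrix Crawler. It assumes the date should be taken in UTC, and the caller still has to supply the rest of the crawl command.

```typescript
// Hypothetical helper: builds the per-day flags discussed in this thread.
// Only --collection, --combineWARC and --rolloverSize come from the thread;
// everything else (function name, UTC dates) is an assumption.
function dailyCrawlArgs(
  baseName: string,
  date: Date,
  rolloverSize: number,
): string[] {
  // toISOString() always reports UTC, so the first 10 characters
  // are the date in YYYY-MM-DD form.
  const day = date.toISOString().slice(0, 10);
  return [
    "--collection",
    `${baseName}-${day}`,
    "--combineWARC",
    "--rolloverSize",
    String(rolloverSize),
  ];
}

// Example: flags for a crawl on 25 June 2024 with a 100 MB rollover size.
const exampleArgs = dailyCrawlArgs(
  "my-crawl",
  new Date(Date.UTC(2024, 5, 25)),
  100_000_000,
);
console.log(exampleArgs.join(" "));
```

Because each day gets its own collection name, each day's WARCs land in their own ./collections/<name>/archive directory, so combining only ever sees that day's files.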

github-project-automation bot added this to Webrecorder Projects Jun 19, 2024

github-project-automation bot moved this to Triage in Webrecorder Projects Jun 19, 2024
