
Could there be a way to create warcs with certain size after one RUN (combinewarc / rolloversize...) #617

Open · ssairanen opened this issue Jun 19, 2024 · 5 comments


@ssairanen

combineWARC seems to create WARCs from all the WARCs in the folder after one run, but is there no way to create limited-size WARCs from a single run?

For example: if one crawl runs daily and the WARCs total 20 TB, then turning on combineWARC: true suddenly makes Browsertrix create another 20 TB of WARCs next to the original crawl folder. The WARCs are read here:

const warcLists = await fsp.readdir(this.archivesDir);

Could there be a way to combine all of the worker outputs into consistently sized WARCs, but only for one run? rolloverSize: 100000000 does not do this, because the worker-output WARCs might be anything from 1 MB to 100 MB; I want WARCs that are always 100 MB, except obviously the last one.

@ikreymer
Member

Not quite sure what you mean - using --rolloverSize + --combineWARC together should work as you describe. The combineWARC operation combines all the WARCs in a collection folder after each crawl, up to the rollover size. The rollover size is applied to individual WARCs as well.
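
For example, something along these lines (a sketch - the volume mount and URL are placeholders):

docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --combineWARC \
  --rolloverSize 100000000

This should leave combined WARCs of up to ~100 MB each in the collection folder.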

@steph-nb

Same question for the use of several crawlers in Browsertrix:
How could an overall max size be configured, such that it generates x slices of the max size and only one smaller WARC?
(To my understanding, --rolloverSize + --combineWARC apply to each crawler individually.)

@ikreymer
Member

ikreymer commented Jun 25, 2024

> Same question for the use of several crawlers in Browsertrix: How could an overall max size be configured, such that it generates x slices of the max size and only one smaller WARC? (To my understanding, --rolloverSize + --combineWARC apply to each crawler individually.)

I believe that this is how it should work if you use both of those flags. The --rolloverSize applies to the individual WARCs; --combineWARC then combines them so they are all up to the rollover size, with only one smaller WARC. Is this not working correctly?

@ssairanen
Author

ssairanen commented Jun 25, 2024

What I basically meant was: the crawler writes WARCs of whatever size to the archive/ folder, then does /.. and creates the consistently sized WARCs in that parent folder. Now we have the archive/ folder with the original WARCs, and one level up the combineWARC creations, which means 2x the space.

If, for example, one turns on the combineWARC option for a daily crawl that has been creating WARCs for a while, combineWARC will take all of the past WARCs into consideration when combining (it's fun when you have 10 TB of WARCs in the archive/ folder...). There is no option to get neatly sized WARCs from one run only, placed in the same folder next to the output of another run of the same crawl.
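
For now I'm working around this outside the crawler by concatenating each run's gzipped WARCs myself - whole .warc.gz files can be concatenated, since the result is still a valid series of gzip members. A rough sketch (the collection path, output names, and the 100 MB limit are placeholders):

limit=100000000   # target chunk size in bytes (~100 MB)
chunk=0
size=0
for f in collections/my-crawl/archive/*.warc.gz; do
  fsize=$(stat -c%s "$f")   # GNU stat; use `stat -f%z` on macOS/BSD
  # start a new chunk once adding this file would exceed the limit
  if [ "$size" -gt 0 ] && [ $((size + fsize)) -gt "$limit" ]; then
    chunk=$((chunk + 1))
    size=0
  fi
  cat "$f" >> "combined_${chunk}.warc.gz"
  size=$((size + fsize))
done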

@ikreymer
Member

ikreymer commented Jun 25, 2024

> What I basically meant was: the crawler writes WARCs of whatever size to the archive/ folder, then does /.. and creates the consistently sized WARCs in that parent folder. Now we have the archive/ folder with the original WARCs, and one level up the combineWARC creations, which means 2x the space.
>
> If, for example, one turns on the combineWARC option for a daily crawl that has been creating WARCs for a while, combineWARC will take all of the past WARCs into consideration when combining (it's fun when you have 10 TB of WARCs in the archive/ folder...). There is no option to get neatly sized WARCs from one run only, placed in the same folder next to the output of another run of the same crawl.

There is no concept of distinct crawl 'runs' in Browsertrix Crawler - it is assumed that repeated crawls may be part of the same crawl, e.g. if a crawl is interrupted and restarted. If you want to separate crawls by day, my suggestion would be to use --collection my-crawl-YYYY-MM-DD to crawl into a new directory each day, and use --combineWARC and --rolloverSize with those crawls. Or, to put it another way: all WARCs in ./collections/<name>/archive are assumed to be part of the same crawl - having different directories allows you to isolate and group the WARCs from that crawl only.
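
For example, a daily invocation along these lines (a sketch - the volume mount, URL, and size are placeholders):

docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --collection "my-crawl-$(date +%Y-%m-%d)" \
  --combineWARC --rolloverSize 100000000

Each day's combineWARC pass then only sees that day's ./collections/my-crawl-<date>/archive directory.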
