Could there be a way to create warcs with certain size after one RUN (combinewarc / rolloversize...) #617
Comments
Not quite sure what you mean - using |
same question for the use of several crawlers in browsertrix: |
I believe this is how it should work if you use both of those flags. --rolloverSize applies to the individual WARCs, and --combineWARC then combines them so that they are all up to the rollover size, with only one WARC smaller. Is this not working correctly? |
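To make the size question concrete, here is a minimal sketch of one way whole WARC files could be grouped under a size limit. It is not the crawler's actual implementation: `plan_combined_warcs` and its parameters are hypothetical names, and it assumes files are only ever combined whole, in order.

```python
from typing import List


def plan_combined_warcs(sizes: List[int], rollover_size: int) -> List[List[int]]:
    """Group input files (by index) so each group stays within rollover_size.

    Hypothetical helper: files are kept in order and never split, so a
    group is closed as soon as the next file would push it over the limit.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for index, size in enumerate(sizes):
        # Close the current group if adding this file would exceed the limit.
        if current and current_size + size > rollover_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Six worker outputs of mixed size, limit of 100 (units are arbitrary here).
print(plan_combined_warcs([40, 30, 50, 20, 90, 10], 100))
```

With whole-file grouping like this, each combined file is at most the limit but is often smaller, which may explain why outputs do not all land exactly on the rollover size.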
What I basically meant was: the crawler writes WARCs of whatever size to the archive/ folder, and then, in /.., creates fixed-size WARCs. Now we have the archive/ folder with the original WARCs plus the combineWARC output in /.., which means 2x the space. If, for example, one turns on the combineWARC option for a daily crawl that has been creating WARCs for a while, combineWARC will take all of the past WARCs into consideration when combining (it's fun when you have 10 TB of WARCs in the archive/ folder ...). There is no option to get neatly sized WARCs from one run only, in the same folder next to the output of another run of the same crawl. |
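As a possible workaround outside the crawler, one could pick out only the WARCs written since a given run started and post-process just those. The sketch below assumes that modification time is a good enough marker for "this run" and that the files end in `.warc.gz`. Both are assumptions, and `warcs_since` is a hypothetical helper, not a crawler feature.

```python
from pathlib import Path
from typing import List


def warcs_since(archive_dir: str, run_start: float) -> List[Path]:
    """Return WARC files under archive_dir modified at or after run_start.

    run_start is a Unix timestamp in seconds. Using mtime to identify a
    run is an assumption; the crawler itself has no notion of runs.
    """
    candidates = [
        p
        for p in Path(archive_dir).rglob("*.warc.gz")
        if p.is_file() and p.stat().st_mtime >= run_start
    ]
    # Oldest first, with the name as a tie-breaker for a stable order.
    return sorted(candidates, key=lambda p: (p.stat().st_mtime, p.name))
```

Recording a timestamp just before starting the daily crawl and passing it as `run_start` would leave the older WARCs in the folder untouched.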
There is no concept of distinct crawl 'runs' in Browsertrix Crawler - it is assumed that repeated crawls may be part of the same crawl, eg. if a crawl is interrupted/restarted. If you want to separate crawls by day, my suggestion would be to use |
combineWARC seems to create WARCs from all the WARCs in the folder after one run, but is there no way to create size-limited WARCs out of one run only?
For example: if one crawl runs daily and the total size of all its WARCs is 20 TB, then turning on combineWARC: true suddenly makes browsertrix create another 20 TB of WARCs next to the original crawl folder. The WARCs are read here:
browsertrix-crawler/src/crawler.ts
Line 2329 in 6329b19
Could there be a way to combine all of the worker outputs into fixed-size WARCs, but only for one run?
rolloversize:100000000
does not work, as the WARCs (the worker outputs) might be anything from 1 MB to 100 MB. I want the WARCs to always be 100 MB, until the last one, which obviously isn't.
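For illustration, a rough sketch of combining a chosen set of worker outputs by plain byte concatenation. It assumes the inputs are record-compressed `.warc.gz` files (each record its own gzip member), for which concatenating whole files yields a valid WARC. `concat_warcs`, the `combined` prefix, and the output naming are hypothetical and not taken from the crawler.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def concat_warcs(
    inputs: Iterable[Path],
    out_dir: Path,
    rollover_size: int,
    prefix: str = "combined",
) -> List[Path]:
    """Concatenate whole WARC files into outputs of at most rollover_size bytes.

    A single input larger than rollover_size still gets written, alone,
    because this sketch never splits a file.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs: List[Path] = []
    out = None
    written = 0
    try:
        for src in inputs:
            size = src.stat().st_size
            # Start a new output if none is open or this file would overflow it.
            if out is None or (written > 0 and written + size > rollover_size):
                if out is not None:
                    out.close()
                path = out_dir / f"{prefix}-{len(outputs):05d}.warc.gz"
                out = path.open("wb")
                outputs.append(path)
                written = 0
            with src.open("rb") as fh:
                shutil.copyfileobj(fh, out)
            written += size
    finally:
        if out is not None:
            out.close()
    return outputs
```

Because inputs are appended whole, the outputs come out close to, but usually a bit under, the target size. Hitting exactly 100 MB every time would need splitting at record boundaries, and even then a file can only end where a record ends.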