DRAINTASKER - clears filling disks
================================================================

  * Draintasker wiki: https://webarchive.jira.com/wiki/display/WEBOPS/Draintasker
  * Draintasker bugs: https://launchpad.net/archivewidecrawl/+bugs

supports "draining" a running crawler along two paths:

  1) dtmon:    IAS3-to-petabox (paired storage)
  2) th-dtmon: catalog-direct-to-thumpers (Santa Clara MD)

run like this:

  $ ssh home
  $ screen
  $ ssh -A crawler
  $ cd /path/draintasker
  $ svn up
  $ emacs dtmon.yml, save-as: /path/drain.yml
  $ dtmon.py /path/drain.yml | tee -a /path/drain.log

PROCESSING

monitor job and drain (with dtmon.py) - while the DRAINME file exists,
pack warcs (PACKED), make manifests (MANIFEST), launch transfers (TASK),
verify transfers (TOMBSTONE), and finally delete verified warcs, then
sleep before trying again.

under-the-hood:

  dtmon.py config
   |
   '-> s3-drain-job job_dir xfer_job_dir max_size warc_naming
        |
        '-> pack-warcs.sh            => PACKED
        '-> make-manifests.sh        => MANIFEST
        '-> s3-launch-transfers.sh   => LAUNCH TASK [RETRY] SUCCESS TOMBSTONE
        '-> delete-verified-warcs.sh => poof, no w/arcs!

  th-dtmon.sh config
   |
   '-> drain-job job_dir xfer_job_dir thumper max_size warc_naming
        |
        '-> pack-warcs.sh            => PACKED
        '-> make-manifests.sh        => MANIFEST
        '-> launch-transfers.sh      => LAUNCH, TASK
        '-> verify-transfers.sh      => SUCCESS, TOMBSTONE
        '-> delete-verified-warcs.sh => poof, no w/arcs!

get status of prerequisites and disk capacity like this:

  $ get-status.sh crawldata_dir xfer_dir

some advice:

  1) if there are old draintasker procs, kill them.
  2) if files are in the way, investigate and move them aside, e.g.
     mv LAUNCH.open LAUNCH.1, mv ERROR ERROR.1 - it is good to number
     each failure/error file
  3) check the status of your disks
       ./get-status.sh
  4) (optional) test the petabox-to-thumper path on a single series
       ./launch-transfers.sh
  5) log into home and open a screen session
  6) in screen, ssh crawler, cd /path/draintasker/, svn up
  7) run dtmon.py to continuously drain each job+disk
       [screen] cd /path/draintasker; ./dtmon.py /path/disk1.yml
       [screen] cd /path/draintasker; ./dtmon.py /path/disk3.yml

CONFIGURATION

directory structure:

  crawldata     /{1,3}/crawling
  rsync_path    /{1,3}/incoming
  job_dir       /{crawldata}/{job_name}
  xfer_job_dir  /{rsync_path}/{job_name}
  warc_series   {xfer_job_dir}/{warc_series}

depending on config, your warcs might be written in e.g.

  /1/crawling/{crawljob}/warcs
  /3/crawling/{crawljob}/warcs

and be "packed" into

  /1/incoming/{crawljob}/{warc_series}/MANIFEST
  /3/incoming/{crawljob}/{warc_series}/MANIFEST
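to make the configuration above concrete, here is a minimal sketch of
what a drain.yml might contain. the key names below are illustrative
assumptions taken from the drain-job arguments (job_dir, xfer_job_dir,
max_size, warc_naming); the dtmon.yml template in this repository is
the authoritative starting point.

  # hypothetical drain.yml - in practice, copy dtmon.yml and edit that
  job_dir:      /1/crawling/MYCRAWL   # where the crawler writes w/arcs (assumed key)
  xfer_job_dir: /1/incoming/MYCRAWL   # where packed warc_series are staged (assumed key)
  max_size:     10                    # target size of a warc_series (assumed key and units)
  warc_naming:  1                     # warc naming convention (assumed key)
  sleep_time:   300                   # seconds between drain cycles (assumed key)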
DEPENDENCIES

dtmon.py (IAS3-to-petabox)

  + $HOME/.ias3cfg (when using dtmon.py)
  + add [incoming_x] stanzas to /etc/rsyncd.conf (see wiki)

th-dtmon.sh (catalog-direct-to-thumper)

  + ~/.wgetrc with your archive.org user cookies (see wiki)
  + ensure the petabox user exists: /home-local/petabox
  + PETABOX_HOME=/home/user/petabox (codebase from svn)
  + get petabox authorized_keys from a "draintasking" crawler, e.g.

      @crawling08:~$ scp /home-local/petabox/.ssh/authorized_keys \
          root@ia400131:/home-local/petabox/.ssh/authorized_keys

  + add [incoming_x] stanzas to /etc/rsyncd.conf (see wiki)

PREREQUISITES

  DRAINME       {job_dir}/DRAINME
  FINISH_DRAIN  {job_dir}/FINISH_DRAIN
  PACKED        {warc_series}/PACKED
  MANIFEST      {warc_series}/MANIFEST
  LAUNCH        {warc_series}/LAUNCH
  TASK          {warc_series}/TASK
  TOMBSTONE     {warc_series}/TOMBSTONE

if you see a RETRY file, e.g. RETRY.1284217446, the suffix is the
epoch time when a non-blocking retry was scheduled. if this file
exists, then the retry was attempted at some time after that. you can
get the human-readable form of that time with the date cmd, like so:

  $ date -d @1284217446
  Sat Sep 11 15:04:06 UTC 2010

DRAIN DAEMON

  dtmon.py      run s3-drain-job periodically
  th-dtmon.sh   run drain-job periodically
  drain-job.sh  run draintasker processes in single mode

DRAIN PROCESSING

  delete-verified-warcs.sh  delete original (verified) w/arcs from each series
  get-remote-warc-urls.sh   report remote md5 and url for all files.xml in series
  item-submit-task.sh       submit catalog task for series
  item-verify-download.sh   wget remote w/arc and verify checksum for series
  item-verify-size.sh       verify remote size of w/arc series
  launch-transfers.sh       submit transfer tasks for series
  make-manifests.sh         compute md5s into series MANIFEST
  pack-warcs.sh             create warc series when available
  s3-launch-transfers.sh    invoke curl for series
  task-check-success.sh     check and report task success by task_id
  verify-transfers.sh       run task-check-success and item-verify for series

UTILS

  get-status.sh              report dtmons, prerequisites and disk usage
  addup-warcs.sh             report count and total size of warcs
  bundle-crawl-artifacts.sh  make tarball of crawldata for permastorage
  check-crawldata-staged.sh  report staged crawldata file count+size
  check-crawldata.sh         report source crawldata file count+size
  copy-crawldata.sh          copy all crawldata preserving dir structure
  make-and-store-bundle.sh   make bundles and scp to staging

---- siznax 2010
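addendum: both drain paths depend on [incoming_x] stanzas in
/etc/rsyncd.conf on the receiving side (see DEPENDENCIES and the
wiki). a minimal sketch using standard rsyncd module syntax; the
module name and path are examples matching the /{1,3}/incoming layout
above, not the canonical stanzas from the wiki:

  [incoming_1]
      path = /1/incoming
      comment = draintasker incoming (example)
      read only = no
      list = yes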