Partition (W)ARC Files by MIME Type and Year
Updated Feb 13, 2017 - Java
This system evaluates a series of mementos (archived web pages) to determine which are off topic. The series can be part of an Archive-It collection, a single TimeMap, or stored in a WARC file.
ArchiveSpark DataSpec to analyze the Internet Archive's Web archive through temporal search results returned by Tempas (v2)
Wget-compatible web downloader and crawler.
A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz
Data for testing the Offtopic detection software
This module builds our Waybacks in the various configurations we require.
A Tumblr Blog Backup Application
Command-line program to download videos from YouTube.com and other video sites
Download images from Pixiv and more!
🗄 Save an archived copy of websites from Pocket/Pinboard/Bookmarks/RSS. Outputs HTML, PDFs, and more...
Download pictures (or videos) along with their captions and other metadata from Instagram.
Discord Media Loader - Simply download all attachments
🐋 One-Click User Instigated Preservation
A prototype server to swarm multiple DATs for Webrecorder
Archiveror will help you preserve the webpages you love. 💾
Support for writing WARC files with Scrapy
HTTPreserve Analysis of Million Dollar Web Page