This package installs a CLI tool named ia for using archive.org from the command line.
It also installs the internetarchive Python module for programmatic access to archive.org.
Please report all bugs and issues on GitHub.
You can install this module via pip:
pip install internetarchive
Alternatively, you can install a few extra dependencies to help speed things up a bit:
pip install "internetarchive[speedups]"
This will install ujson for faster JSON parsing, and gevent for concurrent downloads.
If you want to install this module globally on your system instead of inside a virtualenv, use sudo:
sudo pip install internetarchive
Help is available by typing ia --help. You can also get help on a command: ia <command> --help.
Available subcommands are configure, metadata, upload, download, search, mine, delete, list, and catalog.
To download the entire TripDown1905 item:
$ ia download TripDown1905
ia download usage examples:
#download just the mp4 files using ``--glob``
$ ia download TripDown1905 --glob='*.mp4'
#download all the mp4 files using ``--format``:
$ ia download TripDown1905 --format='512Kb MPEG4'
#download multiple formats from an item:
$ ia download TripDown1905 --format='512Kb MPEG4' --format='Ogg Video'
#list all the formats in an item:
$ ia metadata --formats TripDown1905
#download a single file from an item:
$ ia download TripDown1905 TripDown1905_512kb.mp4
#download multiple files from an item:
$ ia download TripDown1905 TripDown1905_512kb.mp4 TripDown1905.ogv
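When scripting around the ia tool, the same download commands can be assembled from Python. The following is a minimal sketch, not part of this package: build_download_command and run_download are hypothetical helpers that use only the --glob and --format options shown above, and run_download assumes the ia executable is on your PATH.

```python
import subprocess


def build_download_command(identifier, files=(), glob=None, formats=()):
    """Build an argument list for ``ia download`` (hypothetical helper)."""
    cmd = ["ia", "download", identifier]
    cmd.extend(files)  # optional explicit file names
    if glob is not None:
        cmd.append("--glob=" + glob)
    for fmt in formats:  # --format may be given more than once
        cmd.append("--format=" + fmt)
    return cmd


def run_download(identifier, **options):
    """Run ``ia download`` and return its exit code; assumes ``ia`` is installed."""
    return subprocess.call(build_download_command(identifier, **options))
```

Passing the arguments as a list avoids shell quoting problems with values such as '512Kb MPEG4' or '*.mp4'.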
You can use the provided ia command-line tool to upload items. You need to supply your IAS3 credentials in environment variables in order to upload. You can retrieve S3 keys from https://archive.org/account/s3.php
$ export IAS3_ACCESS_KEY='xxx'
$ export IAS3_SECRET_KEY='yyy'
#upload files:
$ ia upload <identifier> file1 file2 --metadata="title:foo" --metadata="blah:arg"
#upload from `stdin`:
$ curl http://dumps.wikimedia.org/kywiki/20130927/kywiki-20130927-pages-logging.xml.gz |
ia upload <identifier> - --remote-name=kywiki-20130927-pages-logging.xml.gz --metadata="title:Uploaded from stdin."
You can use the ia command-line tool to download item metadata in JSON format:
$ ia metadata TripDown1905
You can also modify metadata. Be sure that the IAS3_ACCESS_KEY and IAS3_SECRET_KEY environment variables are set.
$ ia metadata <identifier> --modify="foo:bar" --modify="baz:foooo"
IA Mine can be used for data mining Archive.org metadata and search results: https://github.com/jjjake/iamine
You can search using the provided ia command-line script:
$ ia search 'subject:"market street" collection:prelinger'
If you have the GNU parallel tool installed, then you can combine ia search and ia metadata to quickly retrieve data for many items in parallel:
$ ia search 'subject:"market street" collection:prelinger' | parallel -j40 'ia metadata {} > {}_meta.json'
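If GNU parallel is not available, a similar fan-out can be approximated in Python with a thread pool. This is only a sketch under stated assumptions: save_metadata and fetch_metadata are hypothetical helpers, and fetch_metadata assumes that get_item(identifier).metadata (as used in the library overview below) returns JSON-serializable data, which may not match the exact output of ia metadata.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor


def fetch_metadata(identifier):
    """Fetch one item's metadata (assumes the internetarchive package is installed)."""
    from internetarchive import get_item  # imported lazily so the sketch loads without it
    return get_item(identifier).metadata


def save_metadata(identifiers, out_dir=".", fetch=fetch_metadata, workers=8):
    """Fetch metadata for many items concurrently, writing one JSON file per item."""
    identifiers = list(identifiers)
    paths = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input identifiers
        for identifier, metadata in zip(identifiers, pool.map(fetch, identifiers)):
            path = os.path.join(out_dir, identifier + "_meta.json")
            with open(path, "w") as fh:
                json.dump(metadata, fh)
            paths.append(path)
    return paths
```

The fetch parameter is injectable so the helper can be exercised without network access.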
Below is a brief overview of the internetarchive Python library.
Please refer to the API documentation for more specific details.
The Internet Archive stores data in items. You can query the archive using an item identifier:
>>> from internetarchive import get_item
>>> item = get_item('stairs')
>>> print(item.metadata)
Items contain files. You can download the entire item:
>>> item.download()
or you can download just a particular file:
>>> f = item.get_file('glogo.png')
>>> f.download() #writes to disk
>>> f.download('/foo/bar/some_other_name.png')
You can iterate over files:
>>> for f in item.iter_files():
... print(f.name, f.sha1)
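The sha1 attribute printed above can be used to check a downloaded file for corruption. The sketch below is not part of the library: sha1_of_file and is_intact are hypothetical helpers, and they assume that f.sha1 is the hexadecimal SHA-1 digest of the file's contents.

```python
import hashlib


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns b"" at end of file
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected_sha1):
    """Compare a local file against an expected hex digest, ignoring case."""
    return sha1_of_file(path) == expected_sha1.lower()
```

For example, after f.download() writes a file to disk, is_intact(local_path, f.sha1) would report whether the local copy matches, where local_path is wherever the file was saved.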
You can use the IA's S3-like interface to upload files to an item. You need to supply your IAS3 credentials in environment variables in order to upload. You can retrieve S3 keys from https://archive.org/account/s3.php
>>> from internetarchive import get_item
>>> item = get_item('new_identifier')
>>> md = dict(mediatype='image', creator='Jake Johnson')
>>> item.upload('/path/to/image.jpg', metadata=md, access_key='xxx', secret_key='yyy')
Item-level metadata must be supplied with the first file uploaded to an item.
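Rather than hard-coding keys in a script, you may prefer to read them from the same environment variables the ia tool uses. This is a minimal sketch: ias3_credentials is a hypothetical helper, not a function provided by the library.

```python
import os


def ias3_credentials(environ=os.environ):
    """Return (access_key, secret_key) from the environment, or raise RuntimeError."""
    try:
        return environ["IAS3_ACCESS_KEY"], environ["IAS3_SECRET_KEY"]
    except KeyError as exc:
        # exc.args[0] is the name of the missing variable
        raise RuntimeError("missing environment variable: %s" % exc.args[0])
```

The result can then be passed on, for example access_key, secret_key = ias3_credentials() followed by item.upload('/path/to/image.jpg', metadata=md, access_key=access_key, secret_key=secret_key).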
You can upload additional files to an existing item:
>>> item = get_item('existing_identifier')
>>> item.upload(['/path/to/image2.jpg', '/path/to/image3.jpg'])
You can also upload file-like objects:
>>> import io
>>> fh = io.BytesIO(b'hello world')
>>> fh.name = 'hello_world.txt'
>>> item.upload(fh)
You can modify metadata for existing items using the item.modify_metadata() function. This uses the IA Metadata API under the hood and requires your IAS3 credentials.
>>> from internetarchive import get_item
>>> item = get_item('my_identifier')
>>> md = dict(blah='one', foo=['two', 'three'])
>>> item.modify_metadata(md, access_key='xxx', secret_key='yyy')
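If your metadata arrives as "key:value" strings, like the values given to --modify on the command line, it can be converted to a dict of the shape shown above. The helper below is a hypothetical illustration and is not how the ia tool itself parses its arguments; it assumes the key ends at the first colon and that repeated keys should be collected into a list.

```python
def pairs_to_metadata(pairs):
    """Convert 'key:value' strings to a dict; repeated keys become lists."""
    metadata = {}
    for pair in pairs:
        # split on the first colon only, so values may contain colons
        key, sep, value = pair.partition(":")
        if not sep or not key:
            raise ValueError("expected 'key:value', got %r" % pair)
        if key in metadata:
            existing = metadata[key]
            if not isinstance(existing, list):
                existing = metadata[key] = [existing]
            existing.append(value)
        else:
            metadata[key] = value
    return metadata
```

With this helper, pairs_to_metadata(['blah:one', 'foo:two', 'foo:three']) produces the same dict as the md example above.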
You can search for items using the archive.org advanced search engine:
>>> from internetarchive import search_items
>>> search = search_items('collection:nasa')
>>> print(search.num_found)
186911
You can iterate over your results:
>>> for result in search:
... print(result['identifier'])
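A search can match a very large number of items, so it is often useful to stop after the first few results. The sketch below assumes only what the loop above shows, namely that each result behaves like a dict with an 'identifier' key; first_identifiers is a hypothetical helper, not part of the library.

```python
from itertools import islice


def first_identifiers(results, limit=10):
    """Return the identifiers of at most ``limit`` results."""
    # islice stops consuming the iterable once ``limit`` results have been taken
    return [result["identifier"] for result in islice(results, limit)]
```

For example, first_identifiers(search_items('collection:nasa'), limit=5) would collect up to five identifiers.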