Akash Mahanty edited this page Mar 28, 2022 · 60 revisions

You are currently reading the waybackpy docs for using it as a CLI tool. If you want to use waybackpy as a Python library by importing it in a Python module/file, visit the Python package docs.

Installation

PyPi

webpage: https://pypi.python.org/project/waybackpy/

pip install waybackpy -U

Snap

webpage: https://snapcraft.io/waybackpy

Use the containerized snap package to run waybackpy as a CLI tool across many different Linux distributions.

Arch User Repository (AUR)

webpage: https://aur.archlinux.org/packages/waybackpy


Save

This feature uses Wayback Machine's Save API.

Often when saving a link on the Wayback Machine, the link returned is a cached capture rather than a freshly saved one. If Cached save is False, a new archive was created because of our save request; if Cached save is True, the Wayback Machine returned an older archive that was saved before we made the request.

Waybackpy checks the timestamp of the returned archive to determine the cache status.

The archive URL is parsed from the response headers of the SavePageNow API, or may be the response URL itself; waybackpy employs three pattern-matching checks to find the archive.
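The two steps above can be sketched in a few lines of stdlib Python. This is an illustrative sketch only, not waybackpy's actual implementation: the regex, the helper names, and the 180-second tolerance are all assumptions made for the example.

```python
import re
from datetime import datetime, timezone

# Hypothetical sketch of the checks described above, not waybackpy's
# actual implementation.
ARCHIVE_RE = re.compile(r"https?://web\.archive\.org/web/(\d{14})/")

def parse_archive_timestamp(archive_url):
    """Extract the 14-digit capture timestamp from a Wayback archive URL."""
    match = ARCHIVE_RE.search(archive_url)
    if match is None:
        raise ValueError("not a Wayback Machine archive URL")
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S").replace(
        tzinfo=timezone.utc
    )

def is_cached_save(archive_url, request_time, tolerance_seconds=180):
    """Treat a capture noticeably older than the save request as a cached save."""
    captured_at = parse_archive_timestamp(archive_url)
    return (request_time - captured_at).total_seconds() > tolerance_seconds

archive = ("https://web.archive.org/web/20220101114012/"
           "https://en.wikipedia.org/wiki/Social_media")
request_time = datetime(2022, 1, 2, 10, 54, 9, tzinfo=timezone.utc)
print(is_cached_save(archive, request_time))  # True: the capture predates the request by a day
```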

The following example does not print the Save API response headers; to output the headers, use the --headers flag.

waybackpy --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent" --save
Archive URL:
https://web.archive.org/web/20220101114012/https://en.wikipedia.org/wiki/Social_media
Cached save:
False

Headers (--headers) flag in action

waybackpy --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent" --save --headers
Archive URL:
https://web.archive.org/web/20220102104608/https://en.wikipedia.org/wiki/Social_media
Cached save:
True
Save API headers:
{'Server': 'nginx/1.19.10', 'Date': 'Sun, 02 Jan 2022 10:54:09 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'x-archive-orig-date': 'Sun, 02 Jan 2022 10:46:06 GMT', 'x-archive-orig-server': 'mw1385.eqiad.wmnet', 'x-archive-orig-x-content-type-options': 'nosniff', 'x-archive-orig-p3p': 'CP="See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."', 'x-archive-orig-content-language': 'en', 'x-archive-orig-vary': 'Accept-Encoding,Cookie,Authorization', 'x-archive-orig-last-modified': 'Sun, 02 Jan 2022 09:30:45 GMT', 'x-archive-orig-content-encoding': 'gzip', 'x-archive-orig-age': '2', 'x-archive-orig-x-cache': 'cp4030 miss, cp4027 hit/1', 'x-archive-orig-x-cache-status': 'hit-front', 'x-archive-orig-server-timing': 'cache;desc="hit-front", host;desc="cp4027"', 'x-archive-orig-strict-transport-security': 'max-age=106384710; includeSubDomains; preload', 'x-archive-orig-report-to': '{ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }', 'x-archive-orig-nel': '{ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}', 'x-archive-orig-permissions-policy': 'interest-cohort=()', 'x-archive-orig-x-client-ip': '207.241.232.35', 'x-archive-orig-cache-control': 'private, s-maxage=0, max-age=0, must-revalidate', 'x-archive-orig-accept-ranges': 'bytes', 'x-archive-orig-content-length': '164995', 'x-archive-orig-connection': 'keep-alive', 'x-archive-guessed-content-type': 'text/html', 'x-archive-guessed-charset': 'utf-8', 'memento-datetime': 'Sun, 02 Jan 2022 10:46:08 GMT', 'link': '<https://en.wikipedia.org/wiki/Social_media>; rel="original", <https://web.archive.org/web/timemap/link/https://en.wikipedia.org/wiki/Social_media>; rel="timemap"; type="application/link-format", 
<https://web.archive.org/web/https://en.wikipedia.org/wiki/Social_media>; rel="timegate", <https://web.archive.org/web/20051215000000/http://en.wikipedia.org/wiki/Social_media>; rel="first memento"; datetime="Thu, 15 Dec 2005 00:00:00 GMT", <https://web.archive.org/web/20220101114012/https://en.wikipedia.org/wiki/Social_media>; rel="prev memento"; datetime="Sat, 01 Jan 2022 11:40:12 GMT", <https://web.archive.org/web/20220102104608/https://en.wikipedia.org/wiki/Social_media>; rel="memento"; datetime="Sun, 02 Jan 2022 10:46:08 GMT", <https://web.archive.org/web/20220102104608/https://en.wikipedia.org/wiki/Social_media>; rel="last memento"; datetime="Sun, 02 Jan 2022 10:46:08 GMT"', 'content-security-policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org", 'x-archive-src': 'spn2-20220102093111-wwwb-spn10.us.archive.org-8000.warc.gz', 'server-timing': 'captures_list;dur=275.334598, exclusion.robots;dur=0.096415, exclusion.robots.policy;dur=0.088356, RedisCDXSource;dur=1.634125, esindex;dur=0.008082, LoadShardBlock;dur=81.607259, PetaboxLoader3.datanode;dur=51.631773, CDXLines.iter;dur=18.885269, load_resource;dur=19.971806', 'x-app-server': 'wwwb-app204', 'x-ts': '200', 'x-tr': '910', 'X-location': 'All', 'X-Cache-Key': 'httpsweb.archive.org/web/20220102104608/https://en.wikipedia.org/wiki/Social_mediaIN', 'X-RL': '0', 'X-NA': '0', 'X-Page-Cache': 'MISS', 'X-NID': '-', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Permissions-Policy': 'interest-cohort=()', 'Content-Encoding': 'gzip'}

Try this out in your browser @ https://repl.it/@akamhy/WaybackPyBashSave


Oldest archive

This feature uses Wayback Machine's Availability API.

The oldest archive for a webpage can be very useful; to get it, use the --oldest flag.

waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --oldest
Archive URL:
https://web.archive.org/web/20040803000845/http://en.wikipedia.org:80/wiki/SpaceX

Try this out in your browser @ https://repl.it/@akamhy/WaybackPyBashOldest


Newest archive

This feature uses Wayback Machine's Availability API.

Get the latest (most recent) archive for a URL. Flag: --newest

waybackpy --url "https://en.wikipedia.org/wiki/YouTube" --user_agent "my-unique-user-agent" --newest
Archive URL:
https://web.archive.org/web/20220101184323/https://en.wikipedia.org/wiki/YouTube

Try this out in your browser @ https://repl.it/@akamhy/WaybackPyBashNewest


Archive near time

This feature uses Wayback Machine's Availability API.

The --near flag returns the archive closest to a specified time. Note that the Internet Archive's Wayback Machine uses UTC, so pass the --year, --month, --day, and --hour values in UTC.

waybackpy --url google.com --user_agent "my-unique-user-agent" --near --year 2008 --month 8 --day 8 --hour 8
Archive URL:
https://web.archive.org/web/20080808014003/http://www.google.com:80/

Try this out in your browser @ https://repl.it/@akamhy/WaybackPyBashNear
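Since the flags are interpreted in UTC, a local time must be converted before it is passed to --near. A minimal stdlib sketch, assuming the moment of interest is 8 AM in IST (UTC+05:30) — the chosen instant is arbitrary:

```python
from datetime import datetime, timezone, timedelta

# 8 AM on 2008-08-08 in IST (UTC+05:30); the --near flags must carry
# the equivalent UTC components.
ist = timezone(timedelta(hours=5, minutes=30))
local = datetime(2008, 8, 8, 8, 0, tzinfo=ist)
utc = local.astimezone(timezone.utc)
flags = f"--year {utc.year} --month {utc.month} --day {utc.day} --hour {utc.hour}"
print(flags)  # --year 2008 --month 8 --day 8 --hour 2
```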


Fetch all the URLs that the Wayback Machine knows for a domain

  • Add the --subdomain flag to include subdomains.
  • All links will be saved in a file, and the file will be created in the current working directory.

waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls
# Prints all known URLs under akamhy.github.io


waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --subdomain
# Prints all known URLs under akamhy.github.io including subdomain

Try this out in your browser @ https://repl.it/@akamhy/WaybackpyKnownUrlsFromWaybackMachine#main.sh


CDX Server API

This CDX Server API doc is derived from https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md.

Basic usage

The following command should print all archives whose URLs have https://github.com/akamhy/ as a prefix, since we are using the wildcard "*".

waybackpy --url "https://github.com/akamhy/*" --user-agent "Your-user-agent" --cdx
com,github)/akamhy/akamhy/waybackpy 20220210225324 https://github.com/akamhy/akamhy/waybackpy text/html 404 7NTMXPAOO2NTAH3EDOYQOGQBBS7YTZVM 113680
com,github)/akamhy/antispam 20210113054521 https://github.com/akamhy/antispam text/html 404 DOVRV3NM56PCPIQ2IH2RUINLRDDFXXZO 17318
com,github)/akamhy/dhashpy 20211001180207 https://github.com/akamhy/dhashpy text/html 200 56W6EQISXHZ4PXBCRN7G7ZGWPV2YEMQG 37087
.
. # Many URLs redacted for readability
.
com,github)/akamhy/waybackpy/workflows/tests/badge.svg 20220310220909 https://github.com/akamhy/waybackpy/workflows/Tests/badge.svg image/svg+xml 200 YQ7L3MX5WXNUY4BZIL4INNDVZF4JXZXJ 2459
com,github)/akamhy/waybackpy/workflows/tests/badge.svg 20220315150044 https://github.com/akamhy/waybackpy/workflows/Tests/badge.svg warc/revisit - YQ7L3MX5WXNUY4BZIL4INNDVZF4JXZXJ 1375
com,github)/akamhy/waybackpy/workflows/tests/badge.svg 20220315194257 https://github.com/akamhy/waybackpy/workflows/Tests/badge.svg warc/revisit - YQ7L3MX5WXNUY4BZIL4INNDVZF4JXZXJ 1374

Try this out in your browser @ https://repl.it/@akamhy/CDX-Basic-usage#main.py
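Each line of the default CDX output above is space-delimited with seven fields: urlkey, timestamp, original, mimetype, statuscode, digest, and length. A small parsing sketch (the helper name is hypothetical, not part of waybackpy):

```python
# Split one line of default CDX server output into its seven named fields.
FIELDS = ("urlkey", "timestamp", "original", "mimetype",
          "statuscode", "digest", "length")

def parse_cdx_line(line):
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError("unexpected CDX line shape")
    return dict(zip(FIELDS, values))

line = ("com,github)/akamhy/dhashpy 20211001180207 "
        "https://github.com/akamhy/dhashpy text/html 200 "
        "56W6EQISXHZ4PXBCRN7G7ZGWPV2YEMQG 37087")
record = parse_cdx_line(line)
print(record["statuscode"], record["mimetype"])  # 200 text/html
```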

Url Match Scope

The default behavior is to return matches for an exact URL. However, the CDX server can also return results matching a certain prefix, a certain host, or all sub-hosts by using the --match-type param.

  • --match-type exact (default if omitted) will return results matching exactly archive.org/about/
  • --match-type prefix will return results for all results under the path archive.org/about/
  • --match-type host will return results from host archive.org
  • --match-type domain will return results from host archive.org and all sub-hosts *.archive.org
waybackpy --url "archive.org/about/" --user-agent "your-user-agent" --cdx --match-type "prefix" --cdx-print "archiveurl"

Try this out in your browser @ https://repl.it/@akamhy/CDX-UrlMatchScope#main.py
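The urlkey column in the CDX output is a SURT (Sort-friendly URI Reordering Transform) of the original URL, and the match scopes operate on these keys. A deliberately simplified sketch of the transform, for intuition only — the real SURT canonicalization also lowercases paths, strips ports, drops default filenames, and handles many more edge cases:

```python
from urllib.parse import urlsplit

def simplified_surt(url):
    """Very simplified SURT sketch: reverse the host labels, join them
    with commas, then append ')' and the path. Illustrative only."""
    parts = urlsplit(url if "://" in url else "https://" + url)
    host = parts.hostname.lower()
    if host.startswith("www."):
        host = host[4:]
    reversed_host = ",".join(reversed(host.split(".")))
    return f"{reversed_host}){parts.path or '/'}"

print(simplified_surt("https://github.com/akamhy/dhashpy"))  # com,github)/akamhy/dhashpy
```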

Filtering

Date Range

Results may be filtered by timestamp using the --to and --from params. The ranges are inclusive and are specified in the same 1-to-14-digit format used for Wayback captures: yyyyMMddhhmmss

waybackpy --url google.com --user-agent Your-apps-user-agent --cdx --from 1998 --to 2000 --cdx-print archiveurl

Try this out in your browser @ https://repl.it/@akamhy/CDX-Filtering-Date-Range#main.py
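The inclusive behavior can be pictured as padding a partial timestamp out to full 14-digit endpoints, so --from 1998 --to 2000 covers 19980101000000 through 20001231235959. This sketch illustrates the format, not the CDX server's actual code; comparison is lexicographic on digit strings, so even a calendar-invalid upper bound (e.g. a "day 31" in February) sorts correctly:

```python
# Pad a 1-to-14-digit timestamp toward the earliest or latest instant
# it could denote (illustration of the inclusive-range format only).
LOWER = "00000101000000"  # pad --from toward the earliest instant
UPPER = "99991231235959"  # pad --to toward the latest instant

def expand(ts, template):
    return ts + template[len(ts):]

print(expand("1998", LOWER), expand("2000", UPPER))
# 19980101000000 20001231235959
```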

Regex filtering
  • It is possible to filter on a specific field or on the entire CDX line (which is space-delimited). Filtering by a specific field is often simpler. Any number of filter params of the form filters=["[!]field:regex"] may be specified.

    • field is one of the named cdx fields (listed in the JSON query) or an index of the field. It is often useful to filter by mimetype or statuscode

    • Optional: ! before the query inverts the match, that is, will return results that do NOT match the regex.

    • regex is any standard Java regex pattern (http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html)

  • Ex: Query for 2 capture results with a non-200 status code:

from waybackpy import Cdx
url = "archive.org"
user_agent = "Your-apps-user-agent"

cdx = Cdx(url=url, user_agent=user_agent, filters=["!statuscode:200"])
snapshots = cdx.snapshots()

i = 0
for snapshot in snapshots:
    print(snapshot.statuscode, snapshot.archive_url)
    i += 1
    if i == 2:
        break

Try this out in your browser @ https://repl.it/@akamhy/filtering1#main.py

  • Ex: Query for 10 capture results with a non-200 status code and non text/html mime type matching a specific digest:
from waybackpy import Cdx
url = "archive.org"
user_agent = "Your-apps-user-agent"

cdx = Cdx(url=url, user_agent=user_agent, filters=["!statuscode:200", "!mimetype:text/html", "digest:2WAXX5NUWNNCS2BDKCO5OVDQBJVNKIVV"])
snapshots = cdx.snapshots()

i = 0
for snapshot in snapshots:
    print(snapshot.digest, snapshot.statuscode, snapshot.archive_url)
    i += 1
    if i == 10:
        break

Try this out in your browser @ https://repl.it/@akamhy/filtering2#main.py

Collapsing

A new form of filtering is the option to 'collapse' results based on a field, or a substring of a field. Collapsing is done on adjacent CDX lines: every capture after the first whose collapse key duplicates the previous line's is filtered out. This is useful for thinning captures that are 'too dense' or for finding unique captures.

To use collapsing, pass one or more field or field:N values to the --collapse flag (or, in the Python API, to collapses=[]), where field is one of urlkey, timestamp, original, mimetype, statuscode, digest, or length, and N restricts the comparison to the first N characters of the field.

  • Ex: Only show at most 1 capture per hour (compare the first 10 digits of the timestamp field). Given 2 captures 20130226010000 and 20130226010800, since the first 10 digits 2013022601 matches, the 2nd capture will be filtered out.
waybackpy --url "google.com" --user-agent "Your-apps-user-agent" --cdx --collapse "timestamp:10"

Try this out in your browser @ https://repl.it/@akamhy/Cdx-collapsing-first#main.py
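The adjacent-duplicate rule can be sketched in plain Python. Using the two example timestamps above, the 20130226010800 capture is dropped because its first 10 digits match the previous line's; this is an illustrative sketch, not waybackpy's or the CDX server's code, and the sample capture lines are invented:

```python
def collapse(lines, field_index, prefix_len=None):
    """Keep a line only if its collapse key differs from the previous
    line's key, mirroring the adjacent-duplicate rule (sketch only)."""
    kept, previous_key = [], object()  # sentinel that matches nothing
    for line in lines:
        value = line.split()[field_index]
        key = value[:prefix_len] if prefix_len else value
        if key != previous_key:
            kept.append(line)
            previous_key = key
    return kept

captures = [
    "com,google)/ 20130226010000 http://google.com/ text/html 200 AAAA 100",
    "com,google)/ 20130226010800 http://google.com/ text/html 200 BBBB 100",
    "com,google)/ 20130226020000 http://google.com/ text/html 200 CCCC 100",
]
# Collapse on the first 10 digits of the timestamp (field index 1):
survivors = collapse(captures, field_index=1, prefix_len=10)
print(len(survivors))  # 2: the 20130226010800 capture is dropped
```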

  • Ex: Only show unique captures by digest (note that only adjacent digests are collapsed; duplicates elsewhere in the CDX are not affected)
waybackpy --url "google.com" --user-agent "Your-apps-user-agent" --cdx --collapse "digest" --cdx-print "archiveurl"

Try this out in your browser @ https://repl.it/@akamhy/Cdx-collapsing-second#main.py

  • Ex: Only show unique URLs in a prefix query (filtering out captures except for the first capture of a given URL). This is similar to the old prefix query in wayback (note: this query may be slow at the moment):
waybackpy --url archive.org --user-agent "i'm-user-agent" --cdx --match-type prefix --collapse urlkey --cdx-print archiveurl

Try this out in your browser @ https://repl.it/@akamhy/Cdx-collapsing-last#main.py
