Akash Mahanty edited this page Feb 15, 2022 · 60 revisions

You are reading the waybackpy docs for using it as a CLI tool. If you want to use waybackpy as a Python library by importing it in a Python module or file, see the Python package docs.

Installation

PyPI

webpage: https://pypi.python.org/project/waybackpy/

pip install waybackpy -U

Snap

webpage: https://snapcraft.io/waybackpy

If you only want to use waybackpy as a CLI tool on a Linux machine, snap is the best and officially recommended way to do so.

Arch User Repository (AUR)

webpage: https://aur.archlinux.org/packages/waybackpy

Save

This feature uses Wayback Machine's Save API.

Often, when saving a link on the Wayback Machine, the returned link is a cached archive rather than a freshly saved one. If cached save is False, a new archive was created because of our save request; if cached save is True, the Wayback Machine returned an older archive that was saved before we made the request.

Waybackpy checks the timestamp of the returned archive to determine the cache status.
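
As a rough illustration of that timestamp check, the sketch below pulls the 14-digit timestamp out of an archive URL and compares it with a reference time. The function names and the 180-second default threshold are assumptions made for this example, not values taken from waybackpy, and the conversion relies on a date command that supports -d (GNU date, for instance).

```shell
# Hypothetical helpers; names and the default threshold are illustrative only.

# Print the 14-digit timestamp (YYYYMMDDhhmmss) found in an archive URL.
archive_timestamp() {
  printf '%s\n' "$1" | sed -n 's#.*/web/\([0-9]\{14\}\)/.*#\1#p'
}

# Convert a 14-digit UTC timestamp to epoch seconds (needs GNU-style date -d).
timestamp_to_epoch() {
  date -u -d "$(printf '%s\n' "$1" |
    sed 's/^\(....\)\(..\)\(..\)\(..\)\(..\)\(..\)$/\1-\2-\3 \4:\5:\6/')" +%s
}

# cached_save ARCHIVE_URL NOW_EPOCH [MAX_AGE_SECONDS]
# Prints True when the archive is older than MAX_AGE_SECONDS at NOW_EPOCH.
cached_save() {
  age=$(( $2 - $(timestamp_to_epoch "$(archive_timestamp "$1")") ))
  if [ "$age" -gt "${3:-180}" ]; then echo True; else echo False; fi
}
```

Passing the reference time as an argument keeps the check reproducible; in a live script it could be `$(date +%s)`.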

The archive URL is either parsed from the response headers of the SavePageNow API or taken from the response URL itself. We employ three pattern-matching checks to find the archive.
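
To make the idea of layered checks concrete, here is a minimal sketch that looks for an archive URL first in the headers and then in the response URL. The helper name and the particular patterns (a Content-Location header, the rel="memento" entry of a Link header, and the response URL) are assumptions chosen for illustration; they are not necessarily the checks waybackpy performs.

```shell
# find_archive_url HEADERS RESPONSE_URL
# Hypothetical helper: the three checks are illustrative, not waybackpy's own.
find_archive_url() {
  headers=$1
  response_url=$2

  # Check 1: a Content-Location header holding a /web/<14 digits>/ path.
  found=$(printf '%s\n' "$headers" |
    sed -n 's#^[Cc]ontent-[Ll]ocation: *\(/web/[0-9]\{14\}/.*\)$#https://web.archive.org\1#p' |
    head -n 1)

  # Check 2: the rel="memento" entry of a Link header.
  if [ -z "$found" ]; then
    found=$(printf '%s\n' "$headers" |
      grep -o '<[^<>]*/web/[0-9]\{14\}/[^<>]*>; rel="memento"' |
      head -n 1 | sed 's#^<\([^>]*\)>.*#\1#')
  fi

  # Check 3: the response URL itself, if it already looks like an archive URL.
  if [ -z "$found" ] &&
     printf '%s\n' "$response_url" | grep -q '/web/[0-9]\{14\}/'; then
    found=$response_url
  fi

  # Print the result; the function returns non-zero when nothing matched.
  [ -n "$found" ] && printf '%s\n' "$found"
}
```

Ordering the checks from most to least specific means the response URL is only trusted when the headers give no answer.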

The following example does not print the save API response headers. To output the headers, use the --headers flag.

waybackpy --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent" --save
Archive URL:
https://web.archive.org/web/20220101114012/https://en.wikipedia.org/wiki/Social_media
Cached save:
False
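
When scripting around this output, the value of interest sits on the line after its label. The sketch below is one way to pick it out; `field_after` is a hypothetical helper name, and it assumes the label-then-value layout shown above.

```shell
# field_after LABEL
# Reads waybackpy output on stdin and prints the line that follows LABEL.
# Hypothetical helper, assuming the two-line "label, value" layout shown above.
field_after() {
  awk -v label="$1" '$0 == label { if ((getline value) > 0) print value; exit }'
}
```

For example, piping the save command above into `field_after "Archive URL:"` would print only the archive link, and `field_after "Cached save:"` only the cache status.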

Headers (--headers) flag in action

waybackpy --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent" --save --headers
Archive URL:
https://web.archive.org/web/20220102104608/https://en.wikipedia.org/wiki/Social_media
Cached save:
True
Save API headers:
{'Server': 'nginx/1.19.10', 'Date': 'Sun, 02 Jan 2022 10:54:09 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'x-archive-orig-date': 'Sun, 02 Jan 2022 10:46:06 GMT', 'x-archive-orig-server': 'mw1385.eqiad.wmnet', 'x-archive-orig-x-content-type-options': 'nosniff', 'x-archive-orig-p3p': 'CP="See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."', 'x-archive-orig-content-language': 'en', 'x-archive-orig-vary': 'Accept-Encoding,Cookie,Authorization', 'x-archive-orig-last-modified': 'Sun, 02 Jan 2022 09:30:45 GMT', 'x-archive-orig-content-encoding': 'gzip', 'x-archive-orig-age': '2', 'x-archive-orig-x-cache': 'cp4030 miss, cp4027 hit/1', 'x-archive-orig-x-cache-status': 'hit-front', 'x-archive-orig-server-timing': 'cache;desc="hit-front", host;desc="cp4027"', 'x-archive-orig-strict-transport-security': 'max-age=106384710; includeSubDomains; preload', 'x-archive-orig-report-to': '{ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }', 'x-archive-orig-nel': '{ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}', 'x-archive-orig-permissions-policy': 'interest-cohort=()', 'x-archive-orig-x-client-ip': '207.241.232.35', 'x-archive-orig-cache-control': 'private, s-maxage=0, max-age=0, must-revalidate', 'x-archive-orig-accept-ranges': 'bytes', 'x-archive-orig-content-length': '164995', 'x-archive-orig-connection': 'keep-alive', 'x-archive-guessed-content-type': 'text/html', 'x-archive-guessed-charset': 'utf-8', 'memento-datetime': 'Sun, 02 Jan 2022 10:46:08 GMT', 'link': '<https://en.wikipedia.org/wiki/Social_media>; rel="original", <https://web.archive.org/web/timemap/link/https://en.wikipedia.org/wiki/Social_media>; rel="timemap"; type="application/link-format", 
<https://web.archive.org/web/https://en.wikipedia.org/wiki/Social_media>; rel="timegate", <https://web.archive.org/web/20051215000000/http://en.wikipedia.org/wiki/Social_media>; rel="first memento"; datetime="Thu, 15 Dec 2005 00:00:00 GMT", <https://web.archive.org/web/20220101114012/https://en.wikipedia.org/wiki/Social_media>; rel="prev memento"; datetime="Sat, 01 Jan 2022 11:40:12 GMT", <https://web.archive.org/web/20220102104608/https://en.wikipedia.org/wiki/Social_media>; rel="memento"; datetime="Sun, 02 Jan 2022 10:46:08 GMT", <https://web.archive.org/web/20220102104608/https://en.wikipedia.org/wiki/Social_media>; rel="last memento"; datetime="Sun, 02 Jan 2022 10:46:08 GMT"', 'content-security-policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org", 'x-archive-src': 'spn2-20220102093111-wwwb-spn10.us.archive.org-8000.warc.gz', 'server-timing': 'captures_list;dur=275.334598, exclusion.robots;dur=0.096415, exclusion.robots.policy;dur=0.088356, RedisCDXSource;dur=1.634125, esindex;dur=0.008082, LoadShardBlock;dur=81.607259, PetaboxLoader3.datanode;dur=51.631773, CDXLines.iter;dur=18.885269, load_resource;dur=19.971806', 'x-app-server': 'wwwb-app204', 'x-ts': '200', 'x-tr': '910', 'X-location': 'All', 'X-Cache-Key': 'httpsweb.archive.org/web/20220102104608/https://en.wikipedia.org/wiki/Social_mediaIN', 'X-RL': '0', 'X-NA': '0', 'X-Page-Cache': 'MISS', 'X-NID': '-', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Permissions-Policy': 'interest-cohort=()', 'Content-Encoding': 'gzip'}

Try this out in your browser @ https://repl.it/@akamhy/WaybackPyBashSave

Oldest archive

This feature uses Wayback Machine's Availability API.

The oldest archive for a webpage can be very useful. To get it, use the --oldest flag.

waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --oldest
Archive URL:
https://web.archive.org/web/20040803000845/http://en.wikipedia.org:80/wiki/SpaceX

Use the --json flag to get the availability API's JSON response.

waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --oldest --json
Archive URL:
https://web.archive.org/web/20040803000845/http://en.wikipedia.org:80/wiki/SpaceX
JSON response:
{"url": "https://en.wikipedia.org/wiki/SpaceX", "archived_snapshots": {"closest": {"status": "200", "available": true, "url": "http://web.archive.org/web/20040803000845/http://en.wikipedia.org:80/wiki/SpaceX", "timestamp": "20040803000845"}}, "timestamp": "199401021436"}

Try this out in your browser @ https://repl.it/@akamhy/WaybackPyBashOldest
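
The JSON response carries two timestamps: one at the top level and one inside the "closest" snapshot object, and in the example above they differ. The sketch below reads a string field from the "closest" object only. `closest_field` is a hypothetical helper built on grep and sed for quick scripts; a real JSON parser is the safer choice for anything more involved.

```shell
# closest_field FIELD
# Reads an availability API JSON response on stdin and prints the string value
# of FIELD inside the "closest" snapshot object. Hypothetical helper; it only
# handles quoted string values such as "url", "timestamp" and "status".
closest_field() {
  grep -o '"closest": *{[^}]*}' |
    sed -n 's/.*"'"$1"'": *"\([^"]*\)".*/\1/p'
}
```

Fed the JSON response shown above, `closest_field timestamp` prints 20040803000845 rather than the top-level value.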

Newest archive

This feature uses Wayback Machine's Availability API.

Get the latest (most recent) archive for a URL. Flag: --newest

waybackpy --url "https://en.wikipedia.org/wiki/YouTube" --user_agent "my-unique-user-agent" --newest
Archive URL:
https://web.archive.org/web/20220101184323/https://en.wikipedia.org/wiki/YouTube

Use the --json flag to get the availability API's JSON response.

waybackpy --url "https://en.wikipedia.org/wiki/YouTube" --user_agent "my-unique-user-agent" --newest --json
Archive URL:
https://web.archive.org/web/20220102124306/https://en.wikipedia.org/wiki/YouTube
JSON response:
{"url": "https://en.wikipedia.org/wiki/YouTube", "archived_snapshots": {"closest": {"status": "200", "available": true, "url": "http://web.archive.org/web/20220102124306/https://en.wikipedia.org/wiki/YouTube", "timestamp": "20220102124306"}}, "timestamp": "20220102143824"}

Try this out in your browser @ https://repl.it/@akamhy/WaybackPyBashNewest

Archive near time

This feature uses Wayback Machine's Availability API.

Time used by the Internet Archive's Wayback Machine is in UTC.

waybackpy --url google.com --user_agent "my-unique-user-agent" --near --year 2008 --month 8 --day 8 --hour 8
Archive URL:
https://web.archive.org/web/20080808014003/http://www.google.com:80/
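
Because the flags are interpreted in UTC, a local time has to be converted before it is passed to --near. The sketch below builds the --year, --month, --day and --hour flags from an epoch time; `near_flags` is a hypothetical helper, and it assumes a date command that accepts -d @SECONDS (GNU date, for instance).

```shell
# near_flags EPOCH_SECONDS
# Prints --year/--month/--day/--hour flags for that instant in UTC.
# Hypothetical helper; needs a date that accepts -d @SECONDS.
near_flags() {
  date -u -d "@$1" '+--year %Y --month %m --day %d --hour %H' |
    sed 's/ 0\([0-9]\)/ \1/g'   # drop zero padding, e.g. "08" becomes "8"
}
```

The output can be spliced into a command unquoted so the shell splits it into separate arguments, for example `waybackpy --url google.com --user_agent "my-unique-user-agent" --near $(near_flags "$(date +%s)")`.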

Use the --json flag to get the availability API's JSON response.

waybackpy --url google.com --user_agent "my-unique-user-agent" --near --year 2008 --month 8 --day 8 --hour 8 --json
Archive URL:
https://web.archive.org/web/20080808014003/http://www.google.com:80/
JSON response:
{"url": "google.com", "archived_snapshots": {"closest": {"status": "200", "available": true, "url": "http://web.archive.org/web/20080808014003/http://www.google.com:80/", "timestamp": "20080808014003"}}, "timestamp": "200808080840"}

Try this out in your browser @ https://repl.it/@akamhy/WaybackPyBashNear

Fetch all the URLs that the Wayback Machine knows for a domain

  • Add the '--subdomain' flag to include subdomains.
  • All links are saved to a file created in the current working directory.

waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls
# Prints all known URLs under akamhy.github.io


waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --subdomain
# Prints all known URLs under akamhy.github.io, including its subdomains

Try this out in your browser @ https://repl.it/@akamhy/WaybackpyKnownUrlsFromWaybackMachine#main.sh
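
Once the list of known URLs is available, a quick summary per host shows how much the --subdomain flag adds. The name of the file waybackpy writes is not stated here, so the sketch below reads one URL per line from standard input; `count_by_host` is a hypothetical helper.

```shell
# count_by_host
# Reads one URL per line on stdin and prints "host count" pairs, sorted by host.
# Hypothetical helper for inspecting a list of known URLs; the host is taken as
# the text between "//" and the next "/", so a port, if present, is kept.
count_by_host() {
  awk -F/ 'NF >= 3 && $3 != "" { seen[$3]++ }
           END { for (host in seen) print host, seen[host] }' |
    LC_ALL=C sort
}
```

It can be fed from the saved file with input redirection, or directly from the command's printed output through a pipe.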

CDX Server API

Basic usage

Url Match Scope

Filtering

Collapsing
