CLI docs
You are currently reading the waybackpy docs for using it as a CLI tool. To use waybackpy as a Python library by importing it in a Python module, see the Python package docs.
- Installation
- Saving webpage
- Oldest archive URL
- Newest archive URL
- Archive near specified time
- Fetch all the URLs that the Wayback Machine knows for a domain
- CDX Server API
- Basic usage
- Url Match Scope
- Filtering
- Collapsing
Installation
webpage: https://pypi.python.org/project/waybackpy/
pip install waybackpy -U
webpage: https://snapcraft.io/waybackpy
If you only want to use waybackpy as a CLI tool on a Linux machine, snap is the best and officially recommended way to do so.
webpage: https://aur.archlinux.org/packages/waybackpy
Saving webpage
This feature uses the Wayback Machine's Save API.
When you save a link, the Wayback Machine often returns a cached archive instead of creating a new one. If cached save is False, a new archive was created because of our save request. If cached save is True, the Wayback Machine returned an older archive that was saved before we made the request.
Waybackpy checks the timestamp of the returned archive to determine the cache status.
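The timestamp check can be sketched as follows. This is a minimal illustration, not waybackpy's actual implementation: the function names and the `tolerance` parameter are hypothetical, and the three-minute default is an assumption chosen only for the example.

```python
import re
from datetime import datetime, timedelta, timezone

# Wayback archive URLs embed a 14-digit UTC timestamp: YYYYMMDDhhmmss.
TIMESTAMP_RE = re.compile(r"/web/(\d{14})/")


def archive_timestamp(archive_url: str) -> datetime:
    """Return the capture time embedded in a Wayback Machine archive URL."""
    match = TIMESTAMP_RE.search(archive_url)
    if match is None:
        raise ValueError(f"no 14-digit timestamp in {archive_url!r}")
    parsed = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def is_cached_save(
    archive_url: str,
    requested_at: datetime,
    tolerance: timedelta = timedelta(minutes=3),  # hypothetical threshold
) -> bool:
    """Treat the save as cached if the capture predates the request by more than `tolerance`."""
    return requested_at - archive_timestamp(archive_url) > tolerance
```

For the archive `20220102104608` and a request made at 10:54:09 UTC on the same day, the capture is about eight minutes older than the request, so this sketch reports a cached save.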
The archive URL is either parsed from the response headers of the SavePageNow API or taken from the response URL itself. Waybackpy uses three pattern-matching checks to find the archive URL.
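This page does not list the three checks, so the sketch below only illustrates the general idea using two signals: the `rel="memento"` entry of a `link` header and the response URL as a fallback. The function name `find_archive_url` and both regular expressions are assumptions made for this example, not waybackpy's code.

```python
import re
from typing import Mapping, Optional

# An archive URL such as https://web.archive.org/web/20220102104608/https://example.com/
ARCHIVE_URL_RE = re.compile(r"https?://web\.archive\.org/web/\d{14}/[^>\s]+")
# The Link header entry tagged exactly rel="memento" (not "first memento" or "prev memento").
MEMENTO_RE = re.compile(r'<([^>]+)>;\s*rel="memento"')


def find_archive_url(headers: Mapping[str, str], response_url: str) -> Optional[str]:
    """Return the archive URL found in the headers or the response URL, else None."""
    # Check 1: a rel="memento" entry in the Link header.
    # A plain dict is case-sensitive, so try both spellings of the header name.
    link = headers.get("link") or headers.get("Link") or ""
    match = MEMENTO_RE.search(link)
    if match and ARCHIVE_URL_RE.fullmatch(match.group(1)):
        return match.group(1)
    # Check 2: the response URL itself may already be the archive URL.
    if ARCHIVE_URL_RE.fullmatch(response_url):
        return response_url
    return None
```

Trying the header first and the response URL second keeps the most specific signal in front; a real client would likely need further fallbacks.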
The following example does not print the Save API response headers. To output the headers, use the --headers flag.
waybackpy --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent" --save
Archive URL:
https://web.archive.org/web/20220101114012/https://en.wikipedia.org/wiki/Social_media
Cached save:
False
The --headers flag in action:
waybackpy --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent" --save --headers
Archive URL:
https://web.archive.org/web/20220102104608/https://en.wikipedia.org/wiki/Social_media
Cached save:
True
Save API headers:
{'Server': 'nginx/1.19.10', 'Date': 'Sun, 02 Jan 2022 10:54:09 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'x-archive-orig-date': 'Sun, 02 Jan 2022 10:46:06 GMT', 'x-archive-orig-server': 'mw1385.eqiad.wmnet', 'x-archive-orig-x-content-type-options': 'nosniff', 'x-archive-orig-p3p': 'CP="See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."', 'x-archive-orig-content-language': 'en', 'x-archive-orig-vary': 'Accept-Encoding,Cookie,Authorization', 'x-archive-orig-last-modified': 'Sun, 02 Jan 2022 09:30:45 GMT', 'x-archive-orig-content-encoding': 'gzip', 'x-archive-orig-age': '2', 'x-archive-orig-x-cache': 'cp4030 miss, cp4027 hit/1', 'x-archive-orig-x-cache-status': 'hit-front', 'x-archive-orig-server-timing': 'cache;desc="hit-front", host;desc="cp4027"', 'x-archive-orig-strict-transport-security': 'max-age=106384710; includeSubDomains; preload', 'x-archive-orig-report-to': '{ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }', 'x-archive-orig-nel': '{ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}', 'x-archive-orig-permissions-policy': 'interest-cohort=()', 'x-archive-orig-x-client-ip': '207.241.232.35', 'x-archive-orig-cache-control': 'private, s-maxage=0, max-age=0, must-revalidate', 'x-archive-orig-accept-ranges': 'bytes', 'x-archive-orig-content-length': '164995', 'x-archive-orig-connection': 'keep-alive', 'x-archive-guessed-content-type': 'text/html', 'x-archive-guessed-charset': 'utf-8', 'memento-datetime': 'Sun, 02 Jan 2022 10:46:08 GMT', 'link': '<https://en.wikipedia.org/wiki/Social_media>; rel="original", <https://web.archive.org/web/timemap/link/https://en.wikipedia.org/wiki/Social_media>; rel="timemap"; type="application/link-format", 
<https://web.archive.org/web/https://en.wikipedia.org/wiki/Social_media>; rel="timegate", <https://web.archive.org/web/20051215000000/http://en.wikipedia.org/wiki/Social_media>; rel="first memento"; datetime="Thu, 15 Dec 2005 00:00:00 GMT", <https://web.archive.org/web/20220101114012/https://en.wikipedia.org/wiki/Social_media>; rel="prev memento"; datetime="Sat, 01 Jan 2022 11:40:12 GMT", <https://web.archive.org/web/20220102104608/https://en.wikipedia.org/wiki/Social_media>; rel="memento"; datetime="Sun, 02 Jan 2022 10:46:08 GMT", <https://web.archive.org/web/20220102104608/https://en.wikipedia.org/wiki/Social_media>; rel="last memento"; datetime="Sun, 02 Jan 2022 10:46:08 GMT"', 'content-security-policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org", 'x-archive-src': 'spn2-20220102093111-wwwb-spn10.us.archive.org-8000.warc.gz', 'server-timing': 'captures_list;dur=275.334598, exclusion.robots;dur=0.096415, exclusion.robots.policy;dur=0.088356, RedisCDXSource;dur=1.634125, esindex;dur=0.008082, LoadShardBlock;dur=81.607259, PetaboxLoader3.datanode;dur=51.631773, CDXLines.iter;dur=18.885269, load_resource;dur=19.971806', 'x-app-server': 'wwwb-app204', 'x-ts': '200', 'x-tr': '910', 'X-location': 'All', 'X-Cache-Key': 'httpsweb.archive.org/web/20220102104608/https://en.wikipedia.org/wiki/Social_mediaIN', 'X-RL': '0', 'X-NA': '0', 'X-Page-Cache': 'MISS', 'X-NID': '-', 'Referrer-Policy': 'no-referrer-when-downgrade', 'Permissions-Policy': 'interest-cohort=()', 'Content-Encoding': 'gzip'}
Try this out in your browser @ https://repl.it/@akamhy/WaybackPyBashSave
Oldest archive URL
This feature uses the Wayback Machine's Availability API.
The oldest archive of a webpage is often useful. To get it, use the --oldest flag.
waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --oldest
Archive URL:
https://web.archive.org/web/20040803000845/http://en.wikipedia.org:80/wiki/SpaceX
Use the --json flag to get the Availability API's JSON response.
waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --oldest --json
Archive URL:
https://web.archive.org/web/20040803000845/http://en.wikipedia.org:80/wiki/SpaceX
JSON response:
{"url": "https://en.wikipedia.org/wiki/SpaceX", "archived_snapshots": {"closest": {"status": "200", "available": true, "url": "http://web.archive.org/web/20040803000845/http://en.wikipedia.org:80/wiki/SpaceX", "timestamp": "20040803000845"}}, "timestamp": "199401021436"}
Try this out in your browser @ https://repl.it/@akamhy/WaybackPyBashOldest
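The JSON response printed by --json can be post-processed in a script. The sketch below parses a response of the shape shown above and returns the closest snapshot's URL and capture time. The helper name `closest_snapshot` is hypothetical, and treating a missing or unavailable `closest` entry as "no archive" is an assumption.

```python
import json
from datetime import datetime, timezone
from typing import Optional, Tuple


def closest_snapshot(raw_json: str) -> Optional[Tuple[str, datetime]]:
    """Return (archive URL, capture time in UTC) from an Availability API response."""
    data = json.loads(raw_json)
    closest = data.get("archived_snapshots", {}).get("closest")
    # Assumption: no "closest" entry, or available == false, means no archive was found.
    if not closest or not closest.get("available"):
        return None
    captured = datetime.strptime(closest["timestamp"], "%Y%m%d%H%M%S")
    return closest["url"], captured.replace(tzinfo=timezone.utc)
```

Applied to the SpaceX response above, this yields the `20040803000845` snapshot URL and a capture time of 3 August 2004, 00:08:45 UTC.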
Newest archive URL
This feature uses the Wayback Machine's Availability API.
To get the latest (most recent) archive for a URL, use the --newest flag.
waybackpy --url "https://en.wikipedia.org/wiki/YouTube" --user_agent "my-unique-user-agent" --newest
Archive URL:
https://web.archive.org/web/20220101184323/https://en.wikipedia.org/wiki/YouTube
Use the --json flag to get the Availability API's JSON response.
waybackpy --url "https://en.wikipedia.org/wiki/YouTube" --user_agent "my-unique-user-agent" --newest --json
Archive URL:
https://web.archive.org/web/20220102124306/https://en.wikipedia.org/wiki/YouTube
JSON response:
{"url": "https://en.wikipedia.org/wiki/YouTube", "archived_snapshots": {"closest": {"status": "200", "available": true, "url": "http://web.archive.org/web/20220102124306/https://en.wikipedia.org/wiki/YouTube", "timestamp": "20220102124306"}}, "timestamp": "20220102143824"}
Try this out in your browser @ https://repl.it/@akamhy/WaybackPyBashNewest
Archive near specified time
This feature uses the Wayback Machine's Availability API.
Time used by the Internet Archive's Wayback Machine is in UTC.
waybackpy --url google.com --user_agent "my-unique-user-agent" --near --year 2008 --month 8 --day 8 --hour 8
Archive URL:
https://web.archive.org/web/20080808014003/http://www.google.com:80/
Use the --json flag to get the Availability API's JSON response.
waybackpy --url google.com --user_agent "my-unique-user-agent" --near --year 2008 --month 8 --day 8 --hour 8 --json
Archive URL:
https://web.archive.org/web/20080808014003/http://www.google.com:80/
JSON response:
{"url": "google.com", "archived_snapshots": {"closest": {"status": "200", "available": true, "url": "http://web.archive.org/web/20080808014003/http://www.google.com:80/", "timestamp": "20080808014003"}}, "timestamp": "200808080840"}
Try this out in your browser @ https://repl.it/@akamhy/WaybackPyBashNear
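Because the Wayback Machine works in UTC, a time expressed in a local timezone should be converted before it is passed to --near. The sketch below performs that conversion and formats the flags used in the example above. The helper name `near_flags` is hypothetical and covers only the --year, --month, --day and --hour flags shown on this page.

```python
from datetime import datetime, timedelta, timezone


def near_flags(local_time: datetime) -> str:
    """Convert a timezone-aware datetime to UTC and format the --near flags."""
    if local_time.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    utc = local_time.astimezone(timezone.utc)
    return (
        f"--near --year {utc.year} --month {utc.month} "
        f"--day {utc.day} --hour {utc.hour}"
    )


# Example: 13:30 in a UTC+05:30 timezone is 08:00 UTC.
ist = timezone(timedelta(hours=5, minutes=30))
print(near_flags(datetime(2008, 8, 8, 13, 30, tzinfo=ist)))
```

Note that the conversion can change the day as well as the hour, for example when the local time is shortly after midnight in a timezone ahead of UTC.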
Fetch all the URLs that the Wayback Machine knows for a domain
- You can add the --subdomain flag to include subdomains.
- All links will be saved in a file, and the file will be created in the current working directory.
pip install waybackpy
# Ignore the above installation line.
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls
# Prints all known URLs under akamhy.github.io
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --subdomain
# Prints all known URLs under akamhy.github.io, including subdomains
Try this out in your browser @ https://repl.it/@akamhy/WaybackpyKnownUrlsFromWaybackMachine#main.sh