WaybackMachineCDXServerAPI.newest does not return latest snapshot #176

Open
sissbruecker opened this issue Sep 10, 2022 · 1 comment
Labels
bug Something isn't working

sissbruecker commented Sep 10, 2022

Describe the bug

WaybackMachineCDXServerAPI.newest does not return the latest snapshot, only a fairly recent one. For example, for https://openlayers.org/ it returns a snapshot from 2022-06-16 17:20:36, while the latest snapshot (as of today, September 10th 2022) is from 2022-09-10 08:05:37. There are around 380 snapshots between these two.

I've debugged this a bit, and it seems there is an issue with how sort or limit are either configured by the library or interpreted by the CDX server. The method sets sort = 'closest' and limit = 1. If I configure the WaybackMachineCDXServerAPI instance manually and set limit = -1 instead, then I actually get the latest snapshot. #155 (comment) hints that limit = -1 should be used for the latest snapshot.
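To compare the two configurations by hand, the raw CDX requests can be written out and opened in a browser. This is only a sketch: the endpoint is the one used in the URLs further down, `cdx_query_url` is a hypothetical helper (not part of waybackpy), the `closest` value is an illustrative timestamp, and the parameters are the ones described above rather than a dump of what waybackpy actually sends.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the CDX URLs quoted in this issue.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_query_url(url, **params):
    """Build a CDX search URL for `url` (hypothetical helper, not waybackpy API)."""
    query = {"url": url, **params}
    return f"{CDX_ENDPOINT}?{urlencode(query)}"


target = "https://openlayers.org/"
reference = "20220910120000"  # illustrative Wayback timestamp (YYYYMMDDhhmmss)

# Parameters as described for newest(): sort=closest with limit=1.
as_reported = cdx_query_url(target, sort="closest", closest=reference, limit=1)
# Manual variant that returned the latest snapshot: limit=-1.
manual = cdx_query_url(target, sort="closest", closest=reference, limit=-1)

print(as_reported)
print(manual)
```

Opening both URLs side by side should show whether the difference comes from the server's handling of limit or from the library.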

To Reproduce

import waybackpy

url = 'https://openlayers.org/'
cdx_api = waybackpy.WaybackMachineCDXServerAPI(url)
newest_snapshot = cdx_api.newest()
print(newest_snapshot.datetime_timestamp)
# prints 2022-06-16 17:20:36, should be 2022-09-10 08:05:37

Workaround

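import time
import waybackpy

url = 'https://openlayers.org/'
# Use the current time as the reference point for the "closest" sort
unix_timestamp = int(time.time())
timestamp = waybackpy.utils.unix_timestamp_to_wayback_timestamp(unix_timestamp)
cdx_api = waybackpy.WaybackMachineCDXServerAPI(url)
cdx_api.closest = timestamp
cdx_api.sort = 'closest'
cdx_api.limit = -1  # instead of the limit = 1 that newest() sets

# Take only the first snapshot that comes back
for item in cdx_api.snapshots():
    print(item.datetime_timestamp)
    break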

Expected behavior
The newest API should return the newest snapshot.

Version:

  • OS: macOS
  • Version 3.0.6
  • Is latest version? Yes
sissbruecker added the bug label Sep 10, 2022
sissbruecker commented (Author)

Hmm, with limit = -1 sometimes you don't get any result at all from the CDX API. For example:

http://web.archive.org/cdx/search/cdx?url=https://github.com/awslabs/aws-serverless-express&gzip=false&showResumeKey=true&limit=-1

returns an empty response.

However:

http://web.archive.org/cdx/search/cdx?url=https://github.com/awslabs/aws-serverless-express&gzip=false&showResumeKey=true&limit=-5

returns 5 entries.

The CDX API docs are not super clear, but that looks like a bug. A workaround could be to use a larger negative limit (such as -5) for newest, and then only take the first result.
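A minimal sketch of that idea, working on the raw response text: request several trailing captures (for example limit=-5) and pick the newest one locally. It assumes the default plain-text CDX output with seven space-separated fields per row (urlkey, timestamp, original, mimetype, statuscode, digest, length) and compares timestamps instead of relying on row order. `newest_capture` is a hypothetical helper, and the sample rows are made up for illustration, not real CDX output.

```python
def newest_capture(cdx_text):
    """Return (timestamp, original_url) of the newest capture, or None if there is none.

    Hypothetical helper: expects the default plain-text CDX format with
    seven space-separated fields per row.
    """
    newest = None
    for line in cdx_text.splitlines():
        fields = line.split()
        # Skip blank lines and a trailing resume key (showResumeKey=true).
        if len(fields) != 7:
            continue
        timestamp, original = fields[1], fields[2]
        # Capture timestamps are 14 digits (YYYYMMDDhhmmss).
        if len(timestamp) != 14 or not timestamp.isdigit():
            continue
        # Equal-length digit strings compare correctly as plain strings.
        if newest is None or timestamp > newest[0]:
            newest = (timestamp, original)
    return newest


# Made-up sample in the assumed format, followed by a blank line and a resume key.
sample = (
    "com,example)/ 20220616172036 https://example.com/ text/html 200 AAAA 1234\n"
    "com,example)/ 20220910080537 https://example.com/ text/html 200 BBBB 1250\n"
    "\n"
    "com%2Cexample%29%2F+20220910080538\n"
)

print(newest_capture(sample))
print(newest_capture(""))  # an empty response, like the limit=-1 case above
```

Returning None for an empty response lets the caller decide whether to retry with a different limit.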
